Fix eval bug, need to do full hourly eval #28
Merged
This pull request introduces significant improvements to the evaluation and prediction workflow, focusing on enhanced logging for traceability, improved documentation and usage instructions for scripts, and added debugging output for metric computation. Additionally, a new utility script is provided for clearing static predictions in BigQuery. These changes collectively improve transparency, reproducibility, and reliability in the evaluation pipeline.
Logging and Debugging Enhancements
- Added detailed logging to the `get_static_evaluation` and `get_dynamic_evaluation` endpoints in `backend/app/main.py`, including start/end markers, the evaluation period, metric values, and error handling, for better traceability and debugging (see the sketch after this list).
- Added debugging output to `compute_metrics_for_period` in `src/gaca_ews/evaluation/storage.py` to print query parameters, per-horizon metrics, accumulation steps, and final overall metrics, aiding in diagnosing evaluation results.
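A minimal sketch of this endpoint-logging style, assuming the backend is FastAPI (suggested by the `backend/app/main.py` path but not confirmed here); the route path and the `compute_static_metrics` stub are hypothetical stand-ins, not the PR's actual code:

```python
# Sketch only: illustrates start/end markers, period logging, metric
# logging, and error handling on an evaluation endpoint.
import logging

from fastapi import FastAPI, HTTPException

logger = logging.getLogger(__name__)
app = FastAPI()


def compute_static_metrics(start: str, end: str) -> dict:
    # Hypothetical stand-in for the real metric computation.
    return {"rmse": 0.0, "mae": 0.0}


@app.get("/evaluation/static")  # assumed route path
def get_static_evaluation(start: str, end: str) -> dict:
    logger.info("=== get_static_evaluation start ===")
    logger.info("Evaluation period: %s to %s", start, end)
    try:
        metrics = compute_static_metrics(start, end)
        logger.info("Metric values: %s", metrics)
        return metrics
    except Exception:
        logger.exception("get_static_evaluation failed")
        raise HTTPException(status_code=500, detail="Evaluation failed")
    finally:
        logger.info("=== get_static_evaluation end ===")
```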
Script Improvements and New Utilities
- Added `scripts/clear_static_predictions.py` to safely delete static evaluation predictions from BigQuery for a specified period, with row counting, confirmation prompts, and clear user messaging (a sketch of this pattern follows the list).
- Expanded the usage instructions in `scripts/generate_historical_predictions.py`, emphasizing the importance of using hourly intervals (`--interval 1`) to avoid diurnal bias in evaluation metrics.
- Changed the default prediction interval in the CLI (`src/gaca_ews/cli/main.py`) and the script (`scripts/generate_historical_predictions.py`) from 24 hours to 1 hour, with updated help messages and docstrings to guide users toward best practices for unbiased evaluation.
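A sketch of the count-confirm-delete pattern the new utility describes. The table id, the `prediction_time` column, and the CLI flags are assumptions for illustration; only the `google-cloud-bigquery` calls reflect the real library API:

```python
# Sketch only: the actual scripts/clear_static_predictions.py may differ.
import argparse

from google.cloud import bigquery

TABLE = "my_project.my_dataset.static_predictions"  # assumed table id


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Delete static evaluation predictions for a period."
    )
    parser.add_argument("--start", required=True, help="Period start (ISO timestamp)")
    parser.add_argument("--end", required=True, help="Period end (ISO timestamp)")
    args = parser.parse_args()

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start", "TIMESTAMP", args.start),
            bigquery.ScalarQueryParameter("end", "TIMESTAMP", args.end),
        ]
    )
    where = "WHERE prediction_time BETWEEN @start AND @end"

    # Count affected rows first so the user can confirm before deleting.
    count_sql = f"SELECT COUNT(*) AS n FROM `{TABLE}` {where}"
    n = next(iter(client.query(count_sql, job_config=job_config).result())).n
    if n == 0:
        print("No rows in the given period; nothing to delete.")
        return

    if input(f"Delete {n} rows from {TABLE}? [y/N] ").strip().lower() != "y":
        print("Aborted.")
        return

    client.query(f"DELETE FROM `{TABLE}` {where}", job_config=job_config).result()
    print(f"Deleted {n} rows.")


if __name__ == "__main__":
    main()
```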