
Next steps to MVP #6

@Joelius300

Description

(too lazy to use GH project)

Blocking

Without these, we cannot release the Aare Oraku MVP.

  • Season-aware evaluation; winter is almost irrelevant and the start of the season is potentially more important than peak season. You could even consider evaluating summer only.
    • See notes in idea-dump
    • Can implement residuals (see note on name below) yourself on top of the existing backtest call. Just add the err metric and use this snippet to convert to TimeSeries (if you even need it; creating a DataFrame might be easier). If just adding err doesn't work, you may actually be able to reuse the residuals method, because the DPD metric is also per time step while the metrics we currently calculate are not.
      • Note that in most contexts, "residuals" means the in-sample error, so basically the training error. They can be analyzed to see if there's remaining information (e.g. seasonality) that the model is not taking advantage of, or some bias (e.g. always predicting too low). The errors of the validation data (which we're interested in for this model evaluation) wouldn't be called residuals, that's just the raw validation error.
      • Note that there exists plot_residuals_analysis, but it only takes a single ts, so you might want to do some averaging yourself (e.g. of the ACF values) and then create a custom plot. Comparing different seasons for residual analysis might also be interesting!
    • Ask the others for help to determine which months should get which weight. Use the average visitors of the app and webpage per week/month to find the most interesting times.
  • Create a comprehensive report for the current model (make it reusable)
    • think about whether this should be done on the test data (which is still biased in our favour, since the future-covariate air temperature is always correct during training and evaluation, but not in production)
    • horizon and stride should match the one we want to release
    • performance compared to baselines
    • visualizations, maybe interactive?
    • breakdown of errors by week/month
    • detailed analysis of the peak temperature error (try to think like an unassuming user who just wants to know tomorrow's maximum). This also includes looking at the distribution of the raw DPD, not just MADPD, and doing a per-week/month analysis.
    • average error per lag up to the horizon, including the distribution, so we can make an informed choice for the "fake" uncertainty bands (if we decide to add them). Note that these should also be per week/month, or at least per importance season (winter will 100% be lower than summer).
    • a new metric could be the average time by which the peak is missed (e.g. the peak was 2 h earlier than predicted); it complements the peak-height metric nicely.
    • interpretability (SHAP) (if it's easy; otherwise leave this for now)
    • in the end, the report will be used to decide "can we use this model or do we need a better one?" and, if we do use it, how it's framed and communicated to users -> potentially conflicts with the theory behind the test set
  • Monitoring of the app. Not sure what/how yet; I'll have to discuss it with the others, but it would definitely be a good idea.
    • Communication from the prediction service (or, I guess, the dokku-managed cron) -> Loki logs and the forecast_meta table
    • Communication from the API (I also want to know how many requests it gets, just in case) -> only Loki logs, since we don't have a central Prometheus instance atm. If we add one, I'll scrape the metrics and make a Grafana dashboard.
    • Grafana alerts -> used UptimeRobot with a health endpoint instead
  • Expose flow forecasts from BAFU via the API; think about API structuring (internally and in the endpoints)
  • Update the MVP docs to align with the actual solution. Also write about versioning, even though it's not really important (two-component model version?).
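The seasonal weighting and per-week/month error breakdown above could look roughly like this. This is a minimal sketch in plain pandas, assuming the per-timestep errors (raw DPD values from a backtest) are already available as a Series with a DatetimeIndex; the synthetic error data and the 0.1/1.5 weight values are placeholders, the real weights still have to be decided with the team:

```python
import numpy as np
import pandas as pd

# Placeholder for per-timestep absolute errors with a DatetimeIndex,
# e.g. the raw DPD values collected from a backtest run.
rng = pd.date_range("2023-01-01", "2023-12-31", freq="D")
errors = pd.Series(
    np.abs(np.random.default_rng(0).normal(0.5, 0.2, len(rng))), index=rng
)

# Per-month breakdown of the error distribution.
monthly = errors.groupby(errors.index.month).agg(["mean", "std", "max"])

# Illustrative season weights: emphasize the start of the season,
# nearly ignore winter. Actual values TBD with the team.
weights = pd.Series(1.0, index=range(1, 13))
weights[[12, 1, 2]] = 0.1  # winter: almost irrelevant
weights[[5, 6]] = 1.5      # start of season: extra weight

# Single season-aware score: weighted average of the monthly means.
weighted_score = (monthly["mean"] * weights).sum() / weights.sum()
```

The same grouping works per week (`errors.index.isocalendar().week`) for the finer report breakdown.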
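The peak-timing metric mentioned above could be as simple as comparing the timestamps of the two daily maxima. A sketch under assumptions: `peak_time_offset` is a hypothetical helper, and the toy series just illustrate a peak predicted two hours late:

```python
import numpy as np
import pandas as pd

def peak_time_offset(actual: pd.Series, predicted: pd.Series) -> pd.Timedelta:
    """Signed time between predicted and actual peak
    (positive = peak predicted later than it occurred)."""
    return predicted.idxmax() - actual.idxmax()

# Toy hourly temperature curves for one day: the actual peak is at
# 15:00, the predicted peak at 17:00.
idx = pd.date_range("2024-07-01", periods=24, freq="h")
actual = pd.Series(-np.abs(np.arange(24) - 15.0), index=idx)
predicted = pd.Series(-np.abs(np.arange(24) - 17.0), index=idx)

offset = peak_time_offset(actual, predicted)  # 2 hours
```

Averaging the absolute offsets over all backtest days would give the "average time the peak is missed" number, complementing the peak-height error.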

Non-blocking

These should be the next steps, but we can release without them.

  • Backups & external access to the DB: the historical meteotest predictions are very valuable and we don't want to lose them ((((grüess gö use)))).
  • Make the prediction service more resilient, especially regarding external sources. It should fail gracefully if not all required data is available and clearly write the reason into the DB as a failed run. It should try to fetch all external sources, and failure of one should not influence another. This is especially important because I'd like to start archiving the BAFU flow predictions even though I'm not using them in a model yet; Alplakes is next. P.S. you could also write to both of them; maybe they have an archive we can use.
  • Make the prediction service async. Most of the time is spent on library imports (fucking Python), but fetching data from the internet and writing to the DB are clearly IO-bound.
  • HTTP caching for API #5
  • Model improvement ideas #7
  • Update packages, especially Darts (the model should be retrained if you do)
  • Add a test suite. It can be simple, but at least the evaluation and inference data compilation should be tested; the custom metric relies on Darts internals and Darts is still pre-1.0, so this could break with every patch. Has to be done before upgrading to the next Darts version! (The custom Darts metric has since been replaced with a more performant pandas-based batch evaluation.)
  • Similar to testing, add a DVC script that trains the current best model, so we can see when things suddenly produce different results, for example. Fully automated regression tests are probably overkill for this project.
  • Use a src layout with packaging for the service packages too, to avoid the ugly lib conflict.
  • Store more relevant metadata in the metadata column: model name, version and trained_at might be better. I'll back up the models we use anyway. Also, maybe dump the entire origin and hparams into a JSON property, just in case?
  • Make it more agnostic to forecasting different variables (both in the API and the service), since we might later add a model to refine the flow forecasts from BAFU. There should be one table per variable plus a location column (we don't know yet whether different places will use different models). Temp and flow should be treated the same way.
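The "failure of one source should not influence another" requirement from the resilience point above can be sketched with `asyncio.gather(return_exceptions=True)`, which also covers the async point. The fetcher names are hypothetical placeholders for the real meteotest/BAFU clients, not the actual service code:

```python
import asyncio

# Hypothetical fetchers; in the real service these would call the
# meteotest and BAFU endpoints.
async def fetch_meteotest():
    return {"source": "meteotest", "data": [1.2, 1.3]}

async def fetch_bafu_flow():
    raise ConnectionError("BAFU endpoint unreachable")

async def fetch_all():
    # return_exceptions=True keeps one failing source from cancelling
    # the others; each result is inspected individually afterwards.
    results = await asyncio.gather(
        fetch_meteotest(), fetch_bafu_flow(), return_exceptions=True
    )
    ok, failed = {}, {}
    for name, res in zip(["meteotest", "bafu_flow"], results):
        if isinstance(res, Exception):
            failed[name] = repr(res)  # reason to record in the DB as a failed run
        else:
            ok[name] = res
    return ok, failed

ok, failed = asyncio.run(fetch_all())
```

The `failed` dict is what would be written to the DB as the failure reason, while every successfully fetched source (here: meteotest) is still archived.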
