Description
See also idea-dump
- Incorporate the new MeteoSwiss archive and extend the training and validation periods
- Analyze meteotest features more and think about some transforms (e.g. log for ss)
- Try sample weights based on the defined "important" periods. Since the regions are much larger than the horizon, it's important that the sample weights are not normalized per batch, but luckily in darts they are not.
- Try other (torch) models and tune even more, you haven't even tried other optimizers yet
- Try probabilistic models and make sure your eval pipeline etc. works with them (maybe new metrics and report parts?)
- Try minimizing the peak error further, either via optuna (multi-objective) or additional sample weights that emphasize the afternoon slightly more than the rest of the day, e.g. a fancy curve based on the distribution of temperature values throughout the (summer) day.
- Account for glacier melt
- Analyze white-box models like https://hess.copernicus.org/articles/22/2343/2018/
- Don't forget to look at covariates EDA again (https://github.com/Joelius300/aare-forecast/blob/97f2b3c93e91da5513ec3c28902c9f8fa94a915c/notebooks/05_covariantes-eda.ipynb) (also contains even more LLM yap)
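The afternoon-weighting idea above can be prototyped independently of the model. A minimal sketch, assuming an hourly DatetimeIndex; the Gaussian bump shape and the `peak_hour`/`width`/`boost` values are made-up illustrations, not tuned against the data:

```python
import numpy as np
import pandas as pd

def afternoon_weights(index: pd.DatetimeIndex, peak_hour: float = 15.0,
                      width: float = 3.0, boost: float = 0.5) -> np.ndarray:
    """Base weight 1.0 plus a Gaussian bump centered on the afternoon peak.

    peak_hour, width, and boost are illustrative placeholders.
    """
    hours = np.asarray(index.hour) + np.asarray(index.minute) / 60.0
    return 1.0 + boost * np.exp(-0.5 * ((hours - peak_hour) / width) ** 2)

idx = pd.date_range("2024-06-01", periods=48, freq="h")
w = afternoon_weights(idx)
# weights peak at 15:00 and stay near 1.0 at night
```

Since the weights are computed per timestamp (not normalized per batch), they compose naturally with region-based "important period" weights by multiplication.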
LLMs yapping
Claude (Sonnet 4.5)
For hourly river temperature forecasting over ~100 hours, I'd recommend a hybrid approach that combines domain knowledge with modern deep learning architectures. Here's my detailed strategy:
Feature Engineering
Core Features:
- Historical river temperature (lookback: 168-336 hours / 7-14 days)
  - River temperature has strong autocorrelation and daily/weekly patterns
  - Include lag features: t-1, t-24, t-48, t-72, t-168
  - Rolling statistics: 24h, 72h, 168h means and standard deviations
  - Rate of change features (delta temperature over 1h, 6h, 24h)
- Air temperature (lookback: 48-168 hours, forecast: full 100h horizon)
  - Strongest driver of river temperature (typically 0.7-0.9 correlation)
  - Historical: same lag and rolling features as river temp
  - Critical: Use forecasted air temperature from NWP models (GFS, ECMWF, or NOAA) for the prediction horizon
  - Consider air-water temperature difference (thermal deficit/surplus)
- Solar radiation / Daylight hours (lookback: 48-72 hours)
  - Direct heating effect on water
  - Can derive from astronomical calculations (deterministic) or use forecasts
  - Include time-of-day and day-of-year encodings (cyclical: sin/cos transforms)
- River discharge/flow (lookback: 168-336 hours if available)
  - Higher flow → more thermal inertia, slower temperature response
  - Flow regime changes affect mixing and heat transfer
  - If forecasts available, include them; otherwise use persistence or climatology
- Meteorological variables (lookback: 48-72 hours, forecasts when available)
  - Wind speed: affects evaporative cooling
  - Humidity/dewpoint: influences latent heat flux
  - Cloud cover: modulates solar radiation
  - Precipitation: can cause sudden temperature drops
Temporal Features:
- Hour of day (cyclical encoding: sin/cos of 2π*hour/24)
- Day of year (cyclical encoding for seasonality)
- Day of week (weekend/weekday patterns for regulated rivers)
- Month indicators
Spatial Features (if multi-site):
- Upstream station temperatures (lag by flow travel time)
- Elevation, latitude (static features)
Feature Transformations
- Normalization: Z-score standardization per feature (critical for neural networks)
- Cyclical encoding: sin/cos transforms for hour, day-of-year
- Differencing: First-order differences for non-stationary series
- Interaction terms: Air temp × solar radiation, discharge × air temp
- Log transforms: For skewed features like discharge
- Rolling aggregations: Multiple window sizes to capture different temporal scales
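Several of the transformations above (cyclical encoding, lags, rolling statistics, log transform) can be sketched with pandas. Column names and the exact lag/window choices are illustrative, assuming an hourly DataFrame with `water_temp` and `discharge` columns:

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: hourly DataFrame with columns 'water_temp' and 'discharge'."""
    out = df.copy()
    # Cyclical encoding of hour-of-day and day-of-year
    hour = np.asarray(out.index.hour)
    doy = np.asarray(out.index.dayofyear)
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    out["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
    # Lag features at selected offsets
    for lag in (1, 24, 48, 72, 168):
        out[f"water_temp_lag{lag}"] = out["water_temp"].shift(lag)
    # Rolling statistics over multiple windows
    for win in (24, 72, 168):
        out[f"water_temp_mean{win}"] = out["water_temp"].rolling(win).mean()
        out[f"water_temp_std{win}"] = out["water_temp"].rolling(win).std()
    # Log transform for the skewed discharge series
    out["log_discharge"] = np.log1p(out["discharge"])
    return out
```

The early rows of lag/rolling columns are NaN and need to be dropped or handled before training.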
Model Recommendations
Given your hourly, 100-hour horizon requirement, here are my top choices:
Tier 1: Primary Recommendations
1. TiDE (Time-series Dense Encoder) - My top choice
- Why: Excellent for medium-horizon forecasting, computationally efficient, handles covariates well
- Architecture: Dense encoder-decoder with residual connections
- Strengths:
- Superior to Transformer-based models on many benchmarks
- Much faster training than attention mechanisms
- Naturally handles both past and future covariates (like forecasted air temp)
- Good interpolation between lookback and horizon
- Config: Lookback 168h, hidden dims [256, 256], decoder layers 2-3
2. Temporal Fusion Transformer (TFT)
- Why: Designed specifically for multi-horizon forecasting with mixed data types
- Strengths:
- Built-in variable selection network (identifies important features)
- Handles static, known future, and observed inputs elegantly
- Quantile regression for uncertainty quantification
- Interpretable attention weights
- Limitation: Computationally expensive, may be overkill for 100h horizon
- Config: 4-8 attention heads, lookback 168-336h
3. TSMixer
- Why: Recent architecture showing strong performance, computationally efficient
- Strengths:
- MLP-based with separate time and feature mixing
- Very fast training and inference
- Competitive with Transformers at lower computational cost
- Config: 4-8 mixing blocks, lookback 168h
Tier 2: Strong Alternatives
4. N-BEATS with Exogenous Variables (N-BEATSx)
- Pure deep learning, interpretable basis functions
- Excellent for patterns with clear trend and seasonality
- May need modification to fully leverage meteorological forecasts
5. DeepAR (Amazon)
- Probabilistic forecasting with autoregressive RNN
- Good uncertainty estimates
- Handles missing data well
- Older architecture but still competitive
6. LightGBM/XGBoost with engineered lags
- Pragmatic baseline that often outperforms complex models
- Fast training, excellent feature importance
- Use with extensive feature engineering (lags, rolling stats, interactions)
- Multi-step approach: direct or recursive forecasting
Tier 3: Classical/Hybrid
7. SARIMAX + Neural Network residual modeling
- SARIMAX captures linear dynamics and seasonality
- NN model (simple MLP) learns residuals
- Interpretable and often surprisingly effective
8. Prophet + XGBoost
- Prophet for trend/seasonality, XGBoost for complex non-linearities
- Good when you have strong domain knowledge about change points
Recommended Approach
Phase 1: Baseline
- Start with XGBoost/LightGBM with rich features (1-2 days development)
- Establishes performance floor and identifies important features
Phase 2: Deep Learning
- Implement TiDE as primary model (best balance of performance/complexity)
- Implement TFT if you need interpretability and uncertainty quantification
- Compare with TSMixer for potential efficiency gains
Phase 3: Ensemble
- Weighted average of top 2-3 models
- Often yields 5-15% improvement over single models
Training Strategy
- Loss function: MAE or Huber loss (robust to outliers)
- Validation: Time-series cross-validation with expanding window
- Horizon-specific losses: Weight near-term predictions more heavily if needed
- Quantile loss: For probabilistic forecasts (critical for decision-making)
External Data Sources
- Weather forecasts: NOAA GFS, ECMWF (via APIs like Open-Meteo)
- Historical climate: ERA5 reanalysis for gap-filling
- River gauge data: USGS real-time water data
- Satellite data: Land surface temperature if available (MODIS)
Key Success Factors
- Future covariates are crucial: Access to 100h air temperature forecasts is potentially the single most important factor
- Domain knowledge: Physics-informed features (heat transfer principles) boost performance
- Operational constraints: TiDE/TSMixer offer best latency for real-time deployment
- Uncertainty matters: Use quantile regression or ensemble methods for decision support
For a production system, I'd likely deploy TiDE with XGBoost as a fallback, using NWP forecasts for meteorological variables, and retrain weekly on the most recent 2-3 years of data.
ChatGPT (GPT-5.2)
Below is how I would design a short-term, hourly river temperature forecasting system (~100 hours ahead) if the goal is high accuracy, operational robustness, and interpretability—and with modern SOTA time-series models in mind (TSMixer, TFT, TiDE, etc.).
I’ll structure this as:
- Physical intuition first (what actually drives river temperature)
- Feature groups & lookback horizons
- Transformations & representations
- External data & forecast inputs
- Modeling approaches (classical → SOTA deep learning)
- Training & validation strategy
- What I’d deploy in practice
1. Physical drivers of river temperature (guides feature choice)
River temperature is governed by:
- Atmospheric heat exchange (air temp, radiation, wind, humidity)
- Hydrology (discharge, velocity, depth)
- Thermal inertia (water volume, prior temps)
- Solar forcing (diurnal + seasonal cycles)
- Upstream conditions (lagged influence)
- Anthropogenic controls (dams, releases, power plants)
Short-term (4-day) forecasts are dominated by:
- Recent river temperature
- Air temperature & solar radiation
- Discharge / flow regime
- Diurnal cycles
This strongly favors autoregressive + exogenous (ARX) models with weather forecasts.
2. Feature groups & lookback horizons
A. Target history (most important)
| Feature | Lookback | Notes |
|---|---|---|
| River temperature | 48–168 hrs | Thermal inertia is strong; I’d start with 72–120 hrs |
Why:
- Captures diurnal cycles
- Captures heat storage & advection effects
- More than ~7 days rarely helps short-term horizons
B. Air temperature (observed + forecast)
| Feature | Lookback | Forecast horizon |
|---|---|---|
| Air temperature | 48–168 hrs | 0–100 hrs |
Key driver
- Strongly correlated but phase-shifted
- Rivers lag air temp by 2–12 hours depending on depth & flow
👉 I’d explicitly include:
- Lagged air temp (t−3, t−6, t−12, t−24)
- Forecasted air temp as known future covariates
C. Solar radiation / shortwave radiation
| Feature | Lookback | Notes |
|---|---|---|
| Incoming solar radiation | 24–72 hrs | Drives diurnal heating |
If unavailable:
- Proxy via clear-sky radiation × cloud cover
- Or cyclical time encodings
D. Streamflow / discharge (if available)
| Feature | Lookback | Notes |
|---|---|---|
| Discharge / flow | 72–336 hrs | Controls thermal capacity & mixing |
- High flow → dampened temp swings
- Low flow → stronger diurnal response
I’d include:
- Raw discharge
- log(discharge)
- Rolling mean (24–72 hrs)
E. Meteorology (secondary but helpful)
| Feature | Lookback |
|---|---|
| Wind speed | 24–72 hrs |
| Relative humidity | 24–72 hrs |
| Cloud cover | 24–72 hrs |
| Precipitation | 24–72 hrs |
These improve performance especially during:
- Storms
- Heat waves
- Rapid transitions
F. Temporal encodings (critical)
| Feature | Type |
|---|---|
| Hour of day | sin/cos |
| Day of year | sin/cos |
| Day of week | optional |
Diurnal signal is huge at hourly resolution.
3. Transformations & representations
A. Normalization
- Per-site z-score or robust scaling
- Discharge → log transform
B. Rolling statistics
Helpful for shallow models and stability:
- Rolling mean / std of river temp (6h, 24h, 72h)
- Rolling temp-air temp difference
C. Anomaly features
- Temp anomaly vs climatology (same hour, same DOY)
- Improves generalization across seasons
D. Lag selection
Instead of dense lags:
- Explicit lags at [1, 3, 6, 12, 24, 48, 72] hours
- Let deep models learn interactions
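The climatology-anomaly feature from point C can be sketched with a pandas groupby. The column name is illustrative, and note the leakage caveat in the docstring:

```python
import pandas as pd

def add_anomaly(df: pd.DataFrame, col: str = "water_temp") -> pd.DataFrame:
    """Add an anomaly vs. an hour-of-day / day-of-year climatology.

    In practice the climatology should be computed on training data only
    to avoid leakage; here it is computed on df itself for brevity.
    """
    out = df.copy()
    key = [out.index.dayofyear, out.index.hour]
    clim = out.groupby(key)[col].transform("mean")
    out[f"{col}_clim"] = clim
    out[f"{col}_anom"] = out[col] - clim
    return out
```

The anomaly column removes most of the seasonal and diurnal signal, leaving the model to explain deviations driven by weather.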
4. External data sources (high impact)
Meteorological forecasts (essential)
- ECMWF / GFS / HRRR
- Hourly air temperature, radiation, wind, clouds
- Treated as known future covariates
Hydrologic forecasts (if available)
- Flow forecasts from HEC-RAS / NOAA / local hydrologic models
Static features
- River depth, width
- Upstream distance
- Shading / canopy
- Dam presence
Static features are very useful for TFT & TiDE.
5. Model choices (from baseline to SOTA)
1. Strong baselines (always start here)
🔹 SARIMAX
- AR terms on river temp
- Exogenous air temperature + radiation
- Surprisingly strong for short horizons
🔹 Gradient Boosting (XGBoost / LightGBM)
- Lagged features + weather forecasts
- Fast, interpretable, robust
- Often competitive with deep models for <5-day horizons
2. Deep learning models (recommended)
🟢 Temporal Fusion Transformer (TFT) ⭐⭐⭐⭐⭐
My first choice if data volume is sufficient
Why:
- Explicit handling of:
  - Static features
  - Past observed covariates
  - Known future covariates (weather forecasts)
- Attention provides interpretability
- Strong performance on short-term multivariate forecasting
Config:
- Lookback: 72–168 hrs
- Horizon: 100 hrs
- Quantile loss (uncertainty matters in rivers)
🟢 TiDE (Time-series Dense Encoder) ⭐⭐⭐⭐☆
Why:
- Very strong for short horizons
- Simpler than TFT
- Handles future covariates cleanly
- Excellent when physical relationships are smooth
Less interpretable than TFT but often more stable.
🟢 TSMixer / TSMixer+ ⭐⭐⭐⭐☆
Why:
- Extremely efficient
- Strong for high-frequency signals
- Works well with large lag windows
Caveat:
- Weaker handling of static features
- Less physically interpretable
🟡 N-BEATSx
- Works well if exogenous drivers dominate
- Less explicit temporal alignment than TFT
3. Hybrid / physics-aware approaches (advanced)
- Residual model:
  - Physical energy-balance model
  - ML learns residual error
- Improves extrapolation during extremes
6. Training & validation strategy
Data splitting
- Rolling origin evaluation
- Seasonal hold-outs (train on spring/summer → test on heat waves)
Loss functions
- MAE (operational)
- Quantile loss (P10 / P50 / P90)
- Penalize diurnal phase errors
Metrics
- RMSE
- Diurnal amplitude error
- Max daily temperature error (ecologically critical)
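The quantile (pinball) loss and the "max daily temperature error" metric mentioned above are both a few lines of numpy/pandas. This is a generic sketch, not tied to any particular framework's loss API:

```python
import numpy as np
import pandas as pd

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Quantile (pinball) loss: penalizes under-prediction by q and
    over-prediction by (1 - q), so minimizing it targets quantile q."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def max_daily_temp_error(y_true: pd.Series, y_pred: pd.Series) -> float:
    """Mean absolute error of the daily maximum (ecologically critical)."""
    daily_true = y_true.resample("D").max()
    daily_pred = y_pred.resample("D").max()
    return float((daily_true - daily_pred).abs().mean())
```

At q = 0.5 the pinball loss reduces to half the MAE; evaluating at P10/P50/P90 gives the calibration picture for probabilistic models.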
7. What I would deploy
Practical SOTA stack:
- Primary model: Temporal Fusion Transformer
- Backup model: LightGBM with lags
- Ensemble: TFT + GBM (often best)
Inputs:
- 96–168 hrs history
- 100 hrs of weather forecasts
- Static river attributes
Outputs:
- Hourly forecasts
- Prediction intervals
- Feature attribution (TFT attention)
Summary (TL;DR)
- Most influential features: River temp history > air temp (obs + forecast) > solar radiation > discharge
- Lookback: 72–120 hrs for temp, 48–72 hrs for met variables
- Best models:
  - 🥇 Temporal Fusion Transformer
  - 🥈 TiDE
  - 🥉 TSMixer
- Key advantage: known future weather covariates
- Baseline still matters: GBM & SARIMAX remain strong
If you want, I can:
- Sketch a TFT feature schema
- Compare TFT vs TiDE vs TSMixer for river physics
- Provide PyTorch / Darts / GluonTS configs
- Discuss cold-start rivers or ungauged basins
Just tell me where you want to go next.
Perplexity
For short-term (≤4 days, hourly) river temperature forecasting, a multivariate deep time series model (e.g. TFT, TSMixer, or TiDE) using recent river conditions plus weather forecasts as exogenous inputs is a strong choice, often outperforming simpler baselines when you have enough data. The most influential features are typically air temperature, recent river temperature history, and discharge/flow, with auxiliary static features (e.g. elevation, shading) improving generalization across sites. [publish.csiro](https://www.publish.csiro.au/mf/pdf/MF24152)
Influential features
Key dynamic features (hourly or aggregated from finer data):
- River temperature:
  - Recent lags (e.g. 0–72 h) capture thermal inertia and diurnal cycle.
  - Rolling stats (mean, min, max over last 6–24 h) to smooth sensor noise and emphasize thermal memory. [sciencedirect](https://www.sciencedirect.com/science/article/abs/pii/S0022169421000639)
- Air temperature:
  - Strongest single predictor; can use current and lagged air temperature (0–48 h). [iwaponline](https://iwaponline.com/jh/article/23/3/605/81372/Prediction-of-river-water-temperature-using)
  - Separate features for min/max/mean and diurnal amplitude (max–min) improve fit. [iwaponline](https://iwaponline.com/jh/article/23/3/605/81372/Prediction-of-river-water-temperature-using)
- Discharge / flow and water level:
  - Affects residence time and mixing; rolling averages over 6–72 h useful. [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
  - Sudden flow changes (derivative) as a feature help capture cold/high-flow or warm/low-flow events.
- Radiation and meteorology:
  - Global solar radiation or proxy (clear-sky index, cloud cover), humidity, wind speed, precipitation. [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
  - Wind and radiation drive heat exchange at the water surface; precipitation events can rapidly change temperature via runoff.
Important static / slowly-varying features:
- River/segment attributes:
  - Elevation, river width, depth class, distance from source, slope, presence of lakes/reservoirs. [publish.csiro](https://www.publish.csiro.au/mf/pdf/MF24152)
  - Channel orientation (aspect) and riparian shading or % forest cover in near-stream buffer. [frontiersin](https://www.frontiersin.org/journals/earth-science/articles/10.3389/feart.2022.1033673/full)
- Site / catchment identifiers:
  - One-hot or learned embeddings for station ID and basin ID to allow multi-site models to share structure. [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
Lookback windows and feature engineering
For a 1–100 h forecast horizon, a typical design:
- Historical lookback window:
  - 3–7 days of hourly data (72–168 steps) for:
    - River temperature (lags every hour).
    - Air temperature and meteorology.
    - Flow/discharge and water level. [sciencedirect](https://www.sciencedirect.com/science/article/abs/pii/S0022169421000639)
  - Longer windows (up to 30–60 days) only if there are slow processes (reservoirs, snowmelt), but that increases model cost.
- Target horizon:
  - Direct multi-horizon: model outputs the next 1–100 hourly temperatures in one shot (used by TFT, TiDE, TSMixer). [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
Potential transformations:
- Temporal encoding:
  - Sine/cosine encodings for hour-of-day, day-of-year to encode diurnal and seasonal cycles.
  - Categorical month or season features.
- Lags and rolling features:
  - Explicit lagged features at multiples of 24 h (e.g. t−24, t−48) to capture repeated diurnal patterns.
  - Rolling mean/max/min/standard deviation (e.g. 6, 12, 24, 72 h) for air/river temperature, flow, and radiation.
- Nonlinear transforms:
  - Log-transform flow/discharge (often highly skewed).
  - Square or interaction terms if using simpler ML models (e.g. AT², AT×flow) to approximate nonlinear heat processes. [publish.csiro](https://www.publish.csiro.au/mf/pdf/MF24152)
  - Standardization or robust scaling of all continuous features per site or globally.
- Quality & gap handling:
  - Use flags for imputed values and for sensor issues.
  - Optional residual-correction scheme (e.g. data assimilation or post-hoc error model) if operational performance is critical. [iwaponline](https://iwaponline.com/jh/article/23/3/605/81372/Prediction-of-river-water-temperature-using)
External data sources
Useful external predictors (both observations and forecasts):
- Numerical weather prediction (NWP) forecasts:
  - Hourly or 3-hourly forecasts for:
    - Air temperature at 2 m.
    - Incoming shortwave radiation (or global radiation).
    - Cloud cover, humidity, wind speed, precipitation.
  - Include forecasts up to +100 h directly as known future covariates (this is what TFT-style models are designed for). [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
- Hydrological model outputs (if available):
  - Forecasted discharge, water level, snowmelt contribution, reservoir releases.
  - Groundwater/baseflow proxies for buffering effects. [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
- Remote sensing / static geospatial layers:
  - Land cover, tree cover, elevation (DEM), river network attributes, lake/reservoir masks. [frontiersin](https://www.frontiersin.org/journals/earth-science/articles/10.3389/feart.2022.1033673/full)
  - Can be pre-aggregated per reach or site and used as static features.
- Climate indices / large-scale predictors:
  - For 100 h horizon they are usually marginal, but if long history exists, coarse climate indicators (e.g. anomalies) may help with seasonal bias.
Model classes and concrete algorithms
Baseline and traditional approaches:
- Statistical / simpler ML:
- Multiple linear regression or generalized additive models using air temperature, flow, and temporal factors. [publish.csiro](https://www.publish.csiro.au/mf/pdf/MF24152)
- Tree-based models (Random Forest, Gradient Boosting) with engineered lags and rolling features; robust with limited data but weaker at long horizons. [iwaponline](https://iwaponline.com/jh/article/23/3/605/81372/Prediction-of-river-water-temperature-using)
- These provide baselines and interpretability but require manual horizon-specific modelling or recursive forecasting.
Neural sequence models:
- Recurrent and CNN-based:
- LSTM/GRU models for univariate or multivariate sequence-to-sequence forecasting; proven effective for river temperature at daily scale and extendable to hourly. [sciencedirect](https://www.sciencedirect.com/science/article/abs/pii/S0022169421000639)
- Temporal convolutional networks (TCN) providing parallelism and long-range receptive fields.
Modern SOTA architectures for multivariate time series:
- Temporal Fusion Transformer (TFT):
  - Handles:
    - Historic observed inputs (past covariates).
    - Known future inputs (e.g. NWP forecasts, calendar).
    - Static features through embeddings and context vectors.
  - Provides attention-based variable selection and interpretable attributions; has shown excellent performance for multi-horizon water temperature forecasting using meteorological forecasts. [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
- TSMixer:
  - MLP-Mixer-style model designed for time series, mixing along time and feature dimensions.
  - Strong baseline for multivariate forecasting, computationally efficient, and works well with long lookbacks and horizons.
- TiDE:
  - Encoder–decoder MLP model specialized for long-horizon forecasting with separate modules for historical and future-known covariates.
  - Good trade-off between accuracy and speed, often competitive with transformers for structured datasets.
- Others:
  - N-BEATS/N-HiTS: strong general-purpose forecasters; can work well for univariate or low-dimensional settings if exogenous drivers are encoded appropriately.
  - Probabilistic variants (e.g. TFT with quantile loss) to provide prediction intervals, important for operations. [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
For your problem (hourly, 0–100 h, multivariate drivers):
- A suitable setup is:
- Multi-site, multi-variate model.
- 72–168 h lookback, 100 h forecast horizon.
- Known future NWP covariates as inputs.
- Model choice:
- Start with TFT or TiDE if you need interpretability and handle known future meteorology.
- TSMixer or TCN as a lighter, strong alternative where compute is constrained.
Practical design choices
- Training strategy:
  - Use rolling-window sampling: for each time t, input is past history window, output is next 1–100 h.
  - Use time-aware validation (blocked or rolling-origin) to avoid leakage. [iwaponline](https://iwaponline.com/jh/article/23/3/605/81372/Prediction-of-river-water-temperature-using)
- Losses and metrics:
  - Quantile or distributional loss for probabilistic outputs (e.g. CRPS, pinball loss), plus MAE/RMSE for point forecasts. [hess.copernicus](https://hess.copernicus.org/articles/29/1685/2025/)
  - Consider separate metrics by horizon (e.g. 0–24 h vs 24–100 h) and by season.
- Model selection:
  - Start with simple baselines (autoregressive + air temperature + flow) to establish a lower bound.
  - Incrementally add exogenous forecasts and move to TFT/TiDE/TSMixer once the pipeline is stable and enough data (≥ several years hourly) is available. [sciencedirect](https://www.sciencedirect.com/science/article/abs/pii/S0022169421000639)
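The rolling-origin evaluation recommended throughout (all three answers mention it) can be sketched as a small split generator. Fold sizes and step are illustrative:

```python
import numpy as np

def rolling_origin_splits(n: int, min_train: int, horizon: int, step: int):
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Each fold trains on everything before the origin and tests on the
    next `horizon` points; the origin advances by `step` per fold.
    """
    origin = min_train
    while origin + horizon <= n:
        train = np.arange(0, origin)
        test = np.arange(origin, origin + horizon)
        yield train, test
        origin += step

# e.g. 1000 hourly points, at least 500 for training, 100 h test blocks
splits = list(rolling_origin_splits(n=1000, min_train=500, horizon=100, step=100))
```

Because training data always precedes test data, this avoids the leakage a random split would introduce; blocked (non-expanding) windows are a variant where the training start also advances.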