A catalog of loss functions for time series forecasting, temporal prediction, and sequential modeling, ordered roughly chronologically within each category (point, probabilistic, and alignment-based losses).
1. Mean Absolute Error (MAE) / L1 Loss (Classical) – The average absolute difference between predicted and true values; robust to outliers and widely used as a baseline point forecasting loss.
📄 Least Absolute Deviations (Wikipedia) – Classical statistical method
💻 PyTorch torch.nn.L1Loss
2. Mean Squared Error (MSE) / L2 Loss (Classical) – The average squared difference between predicted and true values; penalizes large errors disproportionately, making it sensitive to outliers.
📄 Least Squares (Wikipedia) – Classical statistical method (Gauss, Legendre)
💻 PyTorch torch.nn.MSELoss
3. Huber Loss (1964) – A piecewise loss that behaves as L2 for small errors and L1 for large errors, combining MSE's smoothness with MAE's robustness to outliers.
📄 Robust Estimation of a Location Parameter – Peter J. Huber
💻 PyTorch torch.nn.HuberLoss
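For reference, the three classical point losses above can be sketched in a few lines of plain Python (illustrative helper names, not any library's API; in practice you would use the PyTorch classes linked above):

```python
def mae(y_true, y_pred):
    # Mean absolute error: robust to outliers, constant gradient magnitude.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean squared error: smooth, penalizes large errors quadratically.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it (Huber, 1964).
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = abs(t - p)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y_true)
```

Note how `huber` interpolates between the other two: with a large error of 2 and `delta=1.0` it charges 1.5 (linear regime) rather than MSE's 4.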
4. Quantile Loss / Pinball Loss (1978) – Asymmetric loss that penalizes over- and under-prediction differently based on a chosen quantile, enabling prediction interval estimation and probabilistic forecasting.
📄 Regression Quantiles – Roger Koenker, Gilbert Bassett Jr.
💻 GluonTS QuantileLoss
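A minimal sketch of the pinball loss (illustrative function name, not a library API): under-prediction is weighted by q and over-prediction by 1 − q, so minimizing it recovers the q-th conditional quantile.

```python
def pinball_loss(y_true, y_pred, q=0.9):
    # Charges q per unit of under-prediction and (1 - q) per unit of
    # over-prediction; the minimizer is the q-th quantile of y.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = t - p
        total += q * e if e >= 0 else (q - 1) * e
    return total / len(y_true)
```

With q = 0.9, under-predicting by 2 costs 1.8 while over-predicting by 2 costs only 0.2, which pushes the forecast toward the upper tail.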
5. MAPE (Mean Absolute Percentage Error) (Classical) – Scale-independent percentage error measuring relative forecast accuracy; undefined when true values are zero and asymmetrically penalizes positive vs. negative errors.
📄 Another Look at Measures of Forecast Accuracy – Rob J. Hyndman, Anne B. Koehler (2006, critical analysis)
💻 Nixtla/neuralforecast (MAPE metric)
6. sMAPE (Symmetric MAPE) (1999) – Symmetric variant of MAPE that normalizes by the average of predicted and true values, addressing MAPE's asymmetry but still problematic near zero.
📄 A Better Measure of Relative Prediction Accuracy for Model Selection and Model Estimation – Chris Tofallis (2015)
💻 Nixtla/neuralforecast (sMAPE metric)
7. MASE (Mean Absolute Scaled Error) (2006) – Scale-free error metric that normalizes MAE by the in-sample MAE of a naive (random walk) forecast, well-defined for zero values and suitable for comparing across series.
📄 Another Look at Measures of Forecast Accuracy – Rob J. Hyndman, Anne B. Koehler
💻 Nixtla/neuralforecast (MASE metric)
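The three percentage-style metrics above can be sketched as follows (illustrative helper names; the `m` parameter in `mase` selects the naive lag, with m = 1 giving the random-walk baseline):

```python
def mape(y_true, y_pred):
    # Percentage error; blows up (ZeroDivisionError) when a true value is 0.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    # Symmetric variant: normalizes by the average magnitude of the pair.
    return 200.0 * sum(
        abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)
    ) / len(y_true)

def mase(y_true, y_pred, y_train, m=1):
    # Scales forecast MAE by the in-sample MAE of a lag-m naive forecast,
    # so values below 1 beat the naive baseline.
    naive_mae = sum(
        abs(y_train[i] - y_train[i - m]) for i in range(m, len(y_train))
    ) / (len(y_train) - m)
    fcst_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return fcst_mae / naive_mae
```

Note the asymmetry the list entry mentions: forecasting 90 against a truth of 100 gives MAPE 10%, but forecasting 100 against a truth of 90 gives about 11.1%.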
8. Negative Log-Likelihood (Gaussian) (Classical) – Probabilistic forecasting loss that jointly learns the predicted mean and variance of a Gaussian distribution, penalizing both inaccurate point predictions and miscalibrated uncertainty.
📄 Pattern Recognition and Machine Learning, §1.2.4 – Christopher M. Bishop (2006)
💻 GluonTS GaussianOutput
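A minimal sketch of the per-point Gaussian NLL (illustrative function name): the model outputs both a mean and a standard deviation for each step, and both terms below are penalized together.

```python
import math

def gaussian_nll(y, mu, sigma):
    # Average negative log-likelihood of y under N(mu, sigma^2).
    # The log term punishes over-wide (overconfident-in-reverse) intervals;
    # the quadratic term punishes means far from the observation.
    return sum(
        0.5 * math.log(2 * math.pi * s * s) + (t - m) ** 2 / (2 * s * s)
        for t, m, s in zip(y, mu, sigma)
    ) / len(y)
```

Even a perfect mean prediction pays the constant 0.5·log(2πσ²), so the model is rewarded for shrinking σ only as far as the data's true noise level allows.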
9. CRPS (Continuous Ranked Probability Score) (2007) – A proper scoring rule for probabilistic forecasts that measures the integrated squared difference between the predicted CDF and the empirical CDF of the observation, generalizing MAE to distributions.
📄 Strictly Proper Scoring Rules, Prediction, and Estimation – Tilmann Gneiting, Adrian E. Raftery
💻 GluonTS EnergyScore / CRPS
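When the forecast distribution is represented by samples (as in DeepAR-style models), CRPS has the well-known energy-form estimator E|X − y| − ½·E|X − X′|. A plain-Python sketch (illustrative name, quadratic in the number of samples):

```python
def crps_samples(samples, y):
    # Sample-based CRPS estimator: E|X - y| - 0.5 * E|X - X'|.
    # The first term rewards accuracy; the second rewards sharpness,
    # preventing the degenerate "predict an infinitely wide distribution" fix.
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2
```

With a single sample the spread term vanishes and the score reduces to the absolute error, which is the sense in which CRPS generalizes MAE.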
10. DILATE Loss (2019) – Combines a shape-based loss (soft-DTW) with a Temporal Distortion Index (TDI) penalty, jointly optimizing for both shape accuracy and temporal alignment in time series prediction.
📄 Shape and Time Distortion Loss for Training Deep Time Series Forecasting Models – Vincent Le Guen, Nicolas Thome
💻 vincent-leguen/DILATE
11. DeepAR Loss (2020) – Autoregressive RNN trained with the negative log-likelihood of parametric distributions (Gaussian, negative binomial, beta, etc.), producing calibrated probabilistic forecasts via ancestral sampling.
📄 DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks – David Salinas, Valentin Flunkert, Jan Gasthaus, Tim Januschowski
💻 awslabs/gluonts (DeepAR)
12. N-BEATS Loss (2020) – Interpretable deep architecture using basis expansion with backward/forward residual stacking; trained with MAPE, sMAPE, or MASE losses depending on the evaluation metric, achieving pure DL state-of-the-art on M4.
📄 N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting – Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, Yoshua Bengio
💻 ServiceNow/N-BEATS
13. Informer Loss (2021) – MSE loss applied to long-sequence time series forecasting with ProbSparse self-attention and a generative-style decoder, enabling direct multi-step prediction without autoregressive accumulation of error.
📄 Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting – Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, Wancai Zhang
💻 zhouhaoyi/Informer2020
14. Autoformer Loss (2021) – MSE loss with a novel auto-correlation mechanism replacing standard self-attention, combined with progressive series decomposition (trend + seasonal) for long-term forecasting.
📄 Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting – Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long
💻 thuml/Autoformer
15. FEDformer Loss (2022) – MSE loss with frequency-enhanced attention that operates in the Fourier/wavelet domain, capturing global temporal patterns with linear complexity for long-term forecasting.
📄 FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting – Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, Rong Jin
💻 MAZiqing/FEDformer
16. PatchTST Loss (2023) – MSE loss with channel-independent patching that segments time series into subseries-level patches fed to a vanilla Transformer, reducing computation and capturing local semantic information for multivariate forecasting.
📄 A Time Series is Worth 64 Words: Long-term Forecasting with Transformers – Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam
💻 yuqinie98/PatchTST
17. TimesNet Loss (2023) – MSE loss with a 2D variation modeling approach that uses FFT-based period detection to reshape 1D time series into 2D tensors, capturing both intra-period and inter-period variations via 2D convolutions.
📄 TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis – Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, Mingsheng Long
💻 thuml/Time-Series-Library (TimesNet)
18. iTransformer Loss (2024) – MSE loss with an inverted Transformer architecture that applies attention on the variate dimension (not time), treating each time series as a token to capture multivariate correlations more effectively.
📄 iTransformer: Inverted Transformers Are Effective for Time Series Forecasting – Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, Mingsheng Long
💻 thuml/iTransformer
19. TimesFM Loss (2024) – Patched decoder-only transformer foundation model trained on a large corpus of real-world and synthetic time series, using quantile heads (quantile loss) for probabilistic forecasting with zero-shot generalization.
📄 A Decoder-Only Foundation Model for Time-Series Forecasting – Abhimanyu Das, Weihao Kong, Rajat Sen, Yichen Zhou
💻 google-research/timesfm
20. DTW Loss (Dynamic Time Warping) (1978) – Alignment-based distance that finds the optimal non-linear warping path between two time series, allowing temporal distortion; non-differentiable in its original form.
📄 Dynamic Programming Algorithm Optimization for Spoken Word Recognition – Hiroaki Sakoe, Seibi Chiba
💻 tslearn DTW
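The classic DTW recurrence can be sketched with a plain dynamic-programming table (illustrative function name; absolute difference used as the local cost):

```python
def dtw(a, b):
    # D[i][j] = cost of aligning a[:i] with b[:j]; each cell extends the
    # cheapest of the three predecessors (match, insertion, deletion).
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because repeating a value costs nothing under this local cost, `[1, 2, 3]` and `[1, 2, 2, 3]` are at DTW distance 0, which is exactly the temporal-distortion tolerance (and the reason DTW alone can reward badly mistimed forecasts). The hard `min` is also why the original form is non-differentiable.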
21. Soft-DTW Loss (2017) – A differentiable relaxation of DTW that replaces the hard minimum with a soft-minimum (log-sum-exp), enabling gradient-based optimization of DTW-like alignment losses for time series.
📄 Soft-DTW: a Differentiable Loss Function for Time-Series – Marco Cuturi, Mathieu Blondel
💻 mblondel/soft-dtw
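The relaxation amounts to one change in the DTW recurrence: the hard `min` becomes a smoothed log-sum-exp controlled by a temperature γ. A self-contained sketch (illustrative names; squared-difference local cost as in the paper, forward pass only):

```python
import math

def softmin(a, b, c, gamma):
    # Smooth minimum: -gamma * log(sum(exp(-x / gamma))), computed stably.
    vals = [-a / gamma, -b / gamma, -c / gamma]
    mx = max(vals)
    return -gamma * (mx + math.log(sum(math.exp(v - mx) for v in vals)))

def soft_dtw(a, b, gamma=0.1):
    # Same DP as DTW, but every min is replaced by softmin, making the
    # whole table (and hence the loss) differentiable in the inputs.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + softmin(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1], gamma)
    return D[n][m]
```

As γ → 0 the soft-minimum approaches the hard minimum and soft-DTW approaches DTW; note that for γ > 0 the value can be slightly negative, since the soft-min lies below the true min.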
22. TDI (Temporal Distortion Index) (2019) – Measures the temporal alignment quality between predicted and true time series by computing the area between the DTW warping path and the diagonal, quantifying how much temporal distortion exists.
📄 Shape and Time Distortion Loss for Training Deep Time Series Forecasting Models – Vincent Le Guen, Nicolas Thome
💻 vincent-leguen/DILATE (TDI component)
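A rough sketch of the idea (not the DILATE implementation, which works on the soft alignment matrix): recover a DTW warping path by backtracking, then average its deviation from the diagonal. Assumes series of length at least 2.

```python
def tdi(a, b):
    # Build the DTW cost matrix with absolute-difference local cost.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from (n, m) to (1, 1), always taking the cheapest predecessor.
    i, j, path = n, m, []
    while (i, j) != (1, 1):
        path.append((i, j))
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: D[s[0]][s[1]])
    path.append((1, 1))
    # Mean absolute deviation of the path from the diagonal (0 = no distortion).
    return sum(
        abs((i - 1) / (n - 1) - (j - 1) / (m - 1)) for i, j in path
    ) / len(path)
```

A perfectly aligned pair yields 0; a forecast whose shape is right but shifted in time yields a positive score, which is exactly the failure mode TDI is meant to expose alongside a shape loss.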
23. Variational Inference Loss for State Space Models (2018) – ELBO-based loss for deep state space models that combines a reconstruction term (negative log-likelihood) with a KL divergence regularizer, enabling probabilistic forecasting with learned latent dynamics.
📄 Deep State Space Models for Time Series Forecasting – Syama Sundar Rangapuram, Matthias W. Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, Tim Januschowski
💻 awslabs/gluonts (DeepState)
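The two-term structure of the negative ELBO can be sketched for the simplest case of scalar Gaussian posteriors with a standard normal prior (illustrative names; real deep SSMs compute the reconstruction NLL and KL over whole latent trajectories):

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ).
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
        - 0.5
    )

def neg_elbo(recon_nll, mu_q, sigma_q):
    # Negative ELBO = reconstruction NLL + KL(q(z|x) || p(z)), p(z) = N(0, 1).
    # The first term rewards explaining the data; the second keeps the
    # learned latent posterior close to the prior.
    return recon_nll + kl_gaussians(mu_q, sigma_q, 0.0, 1.0)
```

When the posterior matches the prior exactly, the KL term vanishes and the loss reduces to the reconstruction NLL alone.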
| Library | Description | Link |
|---|---|---|
| GluonTS | AWS probabilistic time series modeling (DeepAR, DeepState, Transformer, etc.) | awslabs/gluonts |
| NeuralForecast | Nixtla's production-ready neural forecasting (N-BEATS, NHITS, PatchTST, etc.) | Nixtla/neuralforecast |
| pytorch-forecasting | High-level PyTorch forecasting API (TFT, DeepAR, N-BEATS, etc.) | jdb78/pytorch-forecasting |
| TSlib (Time-Series-Library) | Unified benchmark for time series (Informer, Autoformer, TimesNet, iTransformer, etc.) | thuml/Time-Series-Library |
| # | Loss Function | Year | Category | Key Innovation |
|---|---|---|---|---|
| 1 | MAE / L1 Loss | Classical | Point | Absolute error, outlier-robust |
| 2 | MSE / L2 Loss | Classical | Point | Squared error, smooth gradients |
| 3 | Huber Loss | 1964 | Point | L1/L2 hybrid, robust |
| 4 | Quantile / Pinball Loss | 1978 | Probabilistic | Asymmetric quantile regression |
| 5 | MAPE | Classical | Point | Scale-independent percentage error |
| 6 | sMAPE | 1999 | Point | Symmetric percentage error |
| 7 | MASE | 2006 | Point | Scaled by naive forecast baseline |
| 8 | Gaussian NLL | Classical | Probabilistic | Learned mean + variance |
| 9 | CRPS | 2007 | Probabilistic | Proper scoring rule for CDFs |
| 10 | DILATE Loss | 2019 | Shape+Temporal | Soft-DTW + temporal distortion |
| 11 | DeepAR Loss | 2020 | Probabilistic | Autoregressive parametric NLL |
| 12 | N-BEATS Loss | 2020 | Point | Basis expansion + residual stacking |
| 13 | Informer Loss | 2021 | Point | ProbSparse attention + MSE |
| 14 | Autoformer Loss | 2021 | Point | Auto-correlation + decomposition |
| 15 | FEDformer Loss | 2022 | Point | Fourier/wavelet attention + MSE |
| 16 | PatchTST Loss | 2023 | Point | Channel-independent patching + MSE |
| 17 | TimesNet Loss | 2023 | Point | FFT-based 2D variation + MSE |
| 18 | iTransformer Loss | 2024 | Point | Inverted attention (variate dim) |
| 19 | TimesFM Loss | 2024 | Probabilistic | Foundation model + quantile heads |
| 20 | DTW Loss | 1978 | Alignment | Non-linear temporal warping |
| 21 | Soft-DTW Loss | 2017 | Alignment | Differentiable DTW relaxation |
| 22 | TDI | 2019 | Alignment | Warping path distortion area |
| 23 | VI Loss (Deep SSM) | 2018 | Probabilistic | ELBO for latent state dynamics |