A catalog of loss functions for time series forecasting, temporal prediction, and sequential modeling, ordered roughly chronologically within each category (point, probabilistic, and alignment-based losses).
1. Mean Absolute Error (MAE) / L1 Loss (Classical) – The average absolute difference between predicted and true values; robust to outliers and widely used as a baseline point forecasting loss.
📄 Least Absolute Deviations (Wikipedia) – Classical statistical method
💻 PyTorch torch.nn.L1Loss
2. Mean Squared Error (MSE) / L2 Loss (Classical) – The average squared difference between predicted and true values; penalizes large errors disproportionately, making it sensitive to outliers.
📄 Least Squares (Wikipedia) – Classical statistical method (Gauss, Legendre)
💻 PyTorch torch.nn.MSELoss
3. Huber Loss (1964) – A piecewise loss that behaves as L2 for small errors and L1 for large errors, combining MSE's smoothness with MAE's robustness to outliers.
📄 Robust Estimation of a Location Parameter – Peter J. Huber
💻 PyTorch torch.nn.HuberLoss
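For reference, the three classical point losses above can be sketched in a few lines of plain Python (illustrative helper names, not any library's API; in practice you would use the PyTorch classes linked above):

```python
def mae(y_true, y_pred):
    # Mean absolute error: robust to outliers, constant gradient magnitude.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean squared error: smooth, penalizes large errors quadratically.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it (Huber, 1964).
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = abs(t - p)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y_true)
```

Note how `huber` interpolates between the other two: with a large error of 2 and `delta=1.0` it charges 1.5 (linear regime) rather than MSE's 4.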
4. Quantile Loss / Pinball Loss (1978) – Asymmetric loss that penalizes over- and under-prediction differently based on a chosen quantile, enabling prediction interval estimation and probabilistic forecasting.
📄 Regression Quantiles – Roger Koenker, Gilbert Bassett Jr.
💻 GluonTS QuantileLoss
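A minimal sketch of the pinball loss (illustrative function name, not a library API): under-prediction is weighted by q and over-prediction by 1 − q, so minimizing it recovers the q-th conditional quantile.

```python
def pinball_loss(y_true, y_pred, q=0.9):
    # Charges q per unit of under-prediction and (1 - q) per unit of
    # over-prediction; the minimizer is the q-th quantile of y.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = t - p
        total += q * e if e >= 0 else (q - 1) * e
    return total / len(y_true)
```

With q = 0.9, under-predicting by 2 costs 1.8 while over-predicting by 2 costs only 0.2, which pushes the forecast toward the upper tail.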
5. MAPE (Mean Absolute Percentage Error) (Classical) – Scale-independent percentage error measuring relative forecast accuracy; undefined when true values are zero and asymmetrically penalizes positive vs. negative errors.
📄 Another Look at Measures of Forecast Accuracy – Rob J. Hyndman, Anne B. Koehler (2006, critical analysis)
💻 Nixtla/neuralforecast (MAPE metric)
6. sMAPE (Symmetric MAPE) (1999) – Symmetric variant of MAPE that normalizes by the average of predicted and true values, addressing MAPE's asymmetry but still problematic near zero.
📄 A Better Measure of Relative Prediction Accuracy for Model Selection and Model Estimation – Chris Tofallis (2015)
💻 Nixtla/neuralforecast (sMAPE metric)
7. MASE (Mean Absolute Scaled Error) (2006) – Scale-free error metric that normalizes MAE by the in-sample MAE of a naive (random walk) forecast, well-defined for zero values and suitable for comparing across series.
📄 Another Look at Measures of Forecast Accuracy – Rob J. Hyndman, Anne B. Koehler
💻 Nixtla/neuralforecast (MASE metric)
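The three percentage-style metrics above can be sketched as follows (illustrative helper names; the `m` parameter in `mase` selects the naive lag, with m = 1 giving the random-walk baseline):

```python
def mape(y_true, y_pred):
    # Percentage error; blows up (ZeroDivisionError) when a true value is 0.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    # Symmetric variant: normalizes by the average magnitude of the pair.
    return 200.0 * sum(
        abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)
    ) / len(y_true)

def mase(y_true, y_pred, y_train, m=1):
    # Scales forecast MAE by the in-sample MAE of a lag-m naive forecast,
    # so values below 1 beat the naive baseline.
    naive_mae = sum(
        abs(y_train[i] - y_train[i - m]) for i in range(m, len(y_train))
    ) / (len(y_train) - m)
    fcst_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return fcst_mae / naive_mae
```

Note the asymmetry the list entry mentions: forecasting 90 against a truth of 100 gives MAPE 10%, but forecasting 100 against a truth of 90 gives about 11.1%.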
8. Negative Log-Likelihood (Gaussian) (Classical) – Probabilistic forecasting loss that jointly learns the predicted mean and variance of a Gaussian distribution, penalizing both inaccurate point predictions and miscalibrated uncertainty.
📄 Pattern Recognition and Machine Learning, §1.2.4 – Christopher M. Bishop (2006)
💻 GluonTS GaussianOutput
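A minimal sketch of the per-point Gaussian NLL (illustrative function name): the model outputs both a mean and a standard deviation for each step, and both terms below are penalized together.

```python
import math

def gaussian_nll(y, mu, sigma):
    # Average negative log-likelihood of y under N(mu, sigma^2).
    # The log term punishes over-wide (overconfident-in-reverse) intervals;
    # the quadratic term punishes means far from the observation.
    return sum(
        0.5 * math.log(2 * math.pi * s * s) + (t - m) ** 2 / (2 * s * s)
        for t, m, s in zip(y, mu, sigma)
    ) / len(y)
```

Even a perfect mean prediction pays the constant 0.5·log(2πσ²), so the model is rewarded for shrinking σ only as far as the data's true noise level allows.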
9. CRPS (Continuous Ranked Probability Score) (2007) – A proper scoring rule for probabilistic forecasts that measures the integrated squared difference between the predicted CDF and the empirical CDF of the observation, generalizing MAE to distributions.
📄 Strictly Proper Scoring Rules, Prediction, and Estimation – Tilmann Gneiting, Adrian E. Raftery
💻 GluonTS EnergyScore / CRPS
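When the forecast distribution is represented by samples (as in DeepAR-style models), CRPS has the well-known energy-form estimator E|X − y| − ½·E|X − X′|. A plain-Python sketch (illustrative name, quadratic in the number of samples):

```python
def crps_samples(samples, y):
    # Sample-based CRPS estimator: E|X - y| - 0.5 * E|X - X'|.
    # The first term rewards accuracy; the second rewards sharpness,
    # preventing the degenerate "predict an infinitely wide distribution" fix.
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2
```

With a single sample the spread term vanishes and the score reduces to the absolute error, which is the sense in which CRPS generalizes MAE.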
10. DILATE Loss (2019) – Combines a shape-based loss (soft-DTW) with a Temporal Distortion Index (TDI) penalty, jointly optimizing for both shape accuracy and temporal alignment in time series prediction.
📄 Shape and Time Distortion Loss for Training Deep Time Series Forecasting Models – Vincent Le Guen, Nicolas Thome
💻 vincent-leguen/DILATE
11. DeepAR Loss (2020) – Autoregressive RNN trained with the negative log-likelihood of parametric distributions (Gaussian, negative binomial, beta, etc.), producing calibrated probabilistic forecasts via ancestral sampling.
📄 DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks – David Salinas, Valentin Flunkert, Jan Gasthaus, Tim Januschowski
💻 awslabs/gluonts (DeepAR)
12. N-BEATS Loss (2020) – Interpretable deep architecture using basis expansion with backward/forward residual stacking; trained with MAPE, sMAPE, or MASE losses depending on the evaluation metric, achieving pure DL state-of-the-art on M4.
📄 N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting – Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, Yoshua Bengio
💻 ServiceNow/N-BEATS
13. Informer Loss (2021) – MSE loss applied to long-sequence time series forecasting with ProbSparse self-attention and a generative-style decoder, enabling direct multi-step prediction without autoregressive accumulation of error.
📄 Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting – Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, Wancai Zhang
💻 zhouhaoyi/Informer2020
14. Autoformer Loss (2021) – MSE loss with a novel auto-correlation mechanism replacing standard self-attention, combined with progressive series decomposition (trend + seasonal) for long-term forecasting.
📄 Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting – Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long
💻 thuml/Autoformer
15. FEDformer Loss (2022) – MSE loss with frequency-enhanced attention that operates in the Fourier/wavelet domain, capturing global temporal patterns with linear complexity for long-term forecasting.
📄 FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting – Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, Rong Jin
💻 MAZiqing/FEDformer
16. PatchTST Loss (2023) – MSE loss with channel-independent patching that segments time series into subseries-level patches fed to a vanilla Transformer, reducing computation and capturing local semantic information for multivariate forecasting.
📄 A Time Series is Worth 64 Words: Long-term Forecasting with Transformers – Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam
💻 yuqinie98/PatchTST
17. TimesNet Loss (2023) – MSE loss with a 2D variation modeling approach that uses FFT-based period detection to reshape 1D time series into 2D tensors, capturing both intra-period and inter-period variations via 2D convolutions.
📄 TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis – Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, Mingsheng Long
💻 thuml/Time-Series-Library (TimesNet)
18. iTransformer Loss (2024) – MSE loss with an inverted Transformer architecture that applies attention on the variate dimension (not time), treating each time series as a token to capture multivariate correlations more effectively.
📄 iTransformer: Inverted Transformers Are Effective for Time Series Forecasting – Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, Mingsheng Long
💻 thuml/iTransformer
19. TimesFM Loss (2024) – Patched decoder-only transformer foundation model trained on a large corpus of real-world and synthetic time series, using quantile heads (quantile loss) for probabilistic forecasting with zero-shot generalization.
📄 A Decoder-Only Foundation Model for Time-Series Forecasting – Abhimanyu Das, Weihao Kong, Rajat Sen, Yichen Zhou
💻 google-research/timesfm
20. DTW Loss (Dynamic Time Warping) (1978) – Alignment-based distance that finds the optimal non-linear warping path between two time series, allowing temporal distortion; non-differentiable in its original form.
📄 Dynamic Programming Algorithm Optimization for Spoken Word Recognition – Hiroaki Sakoe, Seibi Chiba
💻 tslearn DTW
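The classic DTW recurrence can be sketched with a plain dynamic-programming table (illustrative function name; absolute difference used as the local cost):

```python
def dtw(a, b):
    # D[i][j] = cost of aligning a[:i] with b[:j]; each cell extends the
    # cheapest of the three predecessors (match, insertion, deletion).
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because repeating a value costs nothing under this local cost, `[1, 2, 3]` and `[1, 2, 2, 3]` are at DTW distance 0, which is exactly the temporal-distortion tolerance (and the reason DTW alone can reward badly mistimed forecasts). The hard `min` is also why the original form is non-differentiable.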
21. Soft-DTW Loss (2017) – A differentiable relaxation of DTW that replaces the hard minimum with a soft-minimum (log-sum-exp), enabling gradient-based optimization of DTW-like alignment losses for time series.
📄 Soft-DTW: a Differentiable Loss Function for Time-Series – Marco Cuturi, Mathieu Blondel
💻 mblondel/soft-dtw
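The relaxation amounts to one change in the DTW recurrence: the hard `min` becomes a smoothed log-sum-exp controlled by a temperature γ. A self-contained sketch (illustrative names; squared-difference local cost as in the paper, forward pass only):

```python
import math

def softmin(a, b, c, gamma):
    # Smooth minimum: -gamma * log(sum(exp(-x / gamma))), computed stably.
    vals = [-a / gamma, -b / gamma, -c / gamma]
    mx = max(vals)
    return -gamma * (mx + math.log(sum(math.exp(v - mx) for v in vals)))

def soft_dtw(a, b, gamma=0.1):
    # Same DP as DTW, but every min is replaced by softmin, making the
    # whole table (and hence the loss) differentiable in the inputs.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + softmin(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1], gamma)
    return D[n][m]
```

As γ → 0 the soft-minimum approaches the hard minimum and soft-DTW approaches DTW; note that for γ > 0 the value can be slightly negative, since the soft-min lies below the true min.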
22. TDI (Temporal Distortion Index) (2019) – Measures the temporal alignment quality between predicted and true time series by computing the area between the DTW warping path and the diagonal, quantifying how much temporal distortion exists.
📄 Shape and Time Distortion Loss for Training Deep Time Series Forecasting Models – Vincent Le Guen, Nicolas Thome
💻 vincent-leguen/DILATE (TDI component)
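A rough sketch of the idea (not the DILATE implementation, which works on the soft alignment matrix): recover a DTW warping path by backtracking, then average its deviation from the diagonal. Assumes series of length at least 2.

```python
def tdi(a, b):
    # Build the DTW cost matrix with absolute-difference local cost.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from (n, m) to (1, 1), always taking the cheapest predecessor.
    i, j, path = n, m, []
    while (i, j) != (1, 1):
        path.append((i, j))
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: D[s[0]][s[1]])
    path.append((1, 1))
    # Mean absolute deviation of the path from the diagonal (0 = no distortion).
    return sum(
        abs((i - 1) / (n - 1) - (j - 1) / (m - 1)) for i, j in path
    ) / len(path)
```

A perfectly aligned pair yields 0; a forecast whose shape is right but shifted in time yields a positive score, which is exactly the failure mode TDI is meant to expose alongside a shape loss.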
23. Variational Inference Loss for State Space Models (2018) – ELBO-based loss for deep state space models that combines a reconstruction term (negative log-likelihood) with a KL divergence regularizer, enabling probabilistic forecasting with learned latent dynamics.
📄 Deep State Space Models for Time Series Forecasting – Syama Sundar Rangapuram, Matthias W. Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, Tim Januschowski
💻 awslabs/gluonts (DeepState)
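The two-term structure of the negative ELBO can be sketched for the simplest case of scalar Gaussian posteriors with a standard normal prior (illustrative names; real deep SSMs compute the reconstruction NLL and KL over whole latent trajectories):

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ).
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
        - 0.5
    )

def neg_elbo(recon_nll, mu_q, sigma_q):
    # Negative ELBO = reconstruction NLL + KL(q(z|x) || p(z)), p(z) = N(0, 1).
    # The first term rewards explaining the data; the second keeps the
    # learned latent posterior close to the prior.
    return recon_nll + kl_gaussians(mu_q, sigma_q, 0.0, 1.0)
```

When the posterior matches the prior exactly, the KL term vanishes and the loss reduces to the reconstruction NLL alone.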
| Library | Description | Link |
|---|---|---|
| GluonTS | AWS probabilistic time series modeling (DeepAR, DeepState, Transformer, etc.) | awslabs/gluonts |
| NeuralForecast | Nixtla's production-ready neural forecasting (N-BEATS, NHITS, PatchTST, etc.) | Nixtla/neuralforecast |
| pytorch-forecasting | High-level PyTorch forecasting API (TFT, DeepAR, N-BEATS, etc.) | jdb78/pytorch-forecasting |
| TSlib (Time-Series-Library) | Unified benchmark for time series (Informer, Autoformer, TimesNet, iTransformer, etc.) | thuml/Time-Series-Library |
| # | Loss Function | Year | Category | Key Innovation |
|---|---|---|---|---|
| 1 | MAE / L1 Loss | Classical | Point | Absolute error, outlier-robust |
| 2 | MSE / L2 Loss | Classical | Point | Squared error, smooth gradients |
| 3 | Huber Loss | 1964 | Point | L1/L2 hybrid, robust |
| 4 | Quantile / Pinball Loss | 1978 | Probabilistic | Asymmetric quantile regression |
| 5 | MAPE | Classical | Point | Scale-independent percentage error |
| 6 | sMAPE | 1999 | Point | Symmetric percentage error |
| 7 | MASE | 2006 | Point | Scaled by naive forecast baseline |
| 8 | Gaussian NLL | Classical | Probabilistic | Learned mean + variance |
| 9 | CRPS | 2007 | Probabilistic | Proper scoring rule for CDFs |
| 10 | DILATE Loss | 2019 | Shape+Temporal | Soft-DTW + temporal distortion |
| 11 | DeepAR Loss | 2020 | Probabilistic | Autoregressive parametric NLL |
| 12 | N-BEATS Loss | 2020 | Point | Basis expansion + residual stacking |
| 13 | Informer Loss | 2021 | Point | ProbSparse attention + MSE |
| 14 | Autoformer Loss | 2021 | Point | Auto-correlation + decomposition |
| 15 | FEDformer Loss | 2022 | Point | Fourier/wavelet attention + MSE |
| 16 | PatchTST Loss | 2023 | Point | Channel-independent patching + MSE |
| 17 | TimesNet Loss | 2023 | Point | FFT-based 2D variation + MSE |
| 18 | iTransformer Loss | 2024 | Point | Inverted attention (variate dim) |
| 19 | TimesFM Loss | 2024 | Probabilistic | Foundation model + quantile heads |
| 20 | DTW Loss | 1978 | Alignment | Non-linear temporal warping |
| 21 | Soft-DTW Loss | 2017 | Alignment | Differentiable DTW relaxation |
| 22 | TDI | 2019 | Alignment | Warping path distortion area |
| 23 | VI Loss (Deep SSM) | 2018 | Probabilistic | ELBO for latent state dynamics |