In this paper, we propose the SHoTClean series, a unified framework that bridges soft and hard constraints for effective and efficient multivariate time series cleaning.
As shown in the figure below, the series comprises four specialized algorithms for multivariate time-series cleaning, each targeting a distinct computational scenario under our constrained optimization formulation:
- SHoTClean-B: An offline batch algorithm employing pruned dynamic programming to achieve global optimality.
- SHoTClean-S: An online streaming variant employing incremental dynamic programming to achieve local optimality.
- SHoTClean-P: An online streaming variant that accelerates SHoTClean-S via CDQ divide-and-conquer and a Fenwick tree to attain near-linear complexity.
- SHoTClean-C: An online streaming variant employing causal modeling, designed solely for multivariate datasets.
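For readers unfamiliar with the Fenwick tree mentioned for SHoTClean-P: it is a compact array structure that supports point updates and prefix queries in O(log n), which is what makes it attractive for speeding up dynamic-programming transitions. The sketch below is a generic prefix-maximum variant for illustration only. It is not taken from SHoTClean_P1.py, and the class name `FenwickMax` is hypothetical.

```python
class FenwickMax:
    """Generic Fenwick (binary indexed) tree for prefix-maximum queries.

    Illustrative only; not the structure used inside SHoTClean-P.
    """

    def __init__(self, size):
        self.size = size
        # Internal array is 1-indexed; -inf means "no value stored yet".
        self.tree = [float("-inf")] * (size + 1)

    def update(self, index, value):
        """Raise the value at 0-based `index` to at least `value`."""
        i = index + 1
        while i <= self.size:
            if value > self.tree[i]:
                self.tree[i] = value
            i += i & -i  # move to the next node covering this position

    def prefix_max(self, index):
        """Return the maximum over 0-based positions 0..index inclusive."""
        i = index + 1
        best = float("-inf")
        while i > 0:
            if self.tree[i] > best:
                best = self.tree[i]
            i -= i & -i  # drop the lowest set bit to jump to the previous block
        return best


fw = FenwickMax(5)
fw.update(2, 3)
fw.update(4, 7)
print(fw.prefix_max(1))  # -inf (nothing stored at positions 0..1)
print(fw.prefix_max(3))  # 3
print(fw.prefix_max(4))  # 7
```

Both operations touch at most O(log n) nodes, so a DP that needs "the best value among all earlier states below some threshold" can answer each query without scanning every earlier state.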
Experiments on 10 real-world datasets against 10 state-of-the-art methods demonstrate SHoTClean's superiority. SHoTClean achieves (i) 6.8%–90.0% and 7.8%–82.1% improvements in accuracy (RMSE) over the second-best methods in offline and online settings, respectively; (ii) an average runtime speed-up of two orders of magnitude on large-scale datasets; and (iii) superior robustness, with consistently high performance under an extreme 80% contamination level and on high-dimensional datasets.
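As a reminder of what the accuracy metric measures, here is a minimal RMSE sketch. It assumes the error is taken between a cleaned series and its ground truth of equal length; the function name `rmse` is illustrative, and the experiment scripts may compute the metric differently (for example, aggregated across dimensions).

```python
import math


def rmse(cleaned, truth):
    """Root-mean-square error between two equal-length numeric sequences."""
    if len(cleaned) != len(truth):
        raise ValueError("sequences must have the same length")
    if not truth:
        raise ValueError("sequences must be non-empty")
    squared_error = sum((c - t) ** 2 for c, t in zip(cleaned, truth))
    return math.sqrt(squared_error / len(truth))


# One point is off by 2, so RMSE = sqrt(4 / 3).
print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))  # 1.1547
```

Lower is better: a perfectly repaired series gives an RMSE of 0.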
- Python.version = 3.9.21
- tigramite.version = 5.2.7.0
- sklearn.version = 1.6.1
- Other dependencies are listed in requirements.txt.
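If you want to confirm that your environment matches the versions above before running anything, a small check along these lines may help. It is a sketch, not part of the repository: the helper names are hypothetical, and it assumes the packages are installed under the distribution names `tigramite` and `scikit-learn`.

```python
import sys
from importlib import metadata

# Versions pinned in this README, keyed by distribution name (assumed).
PINNED = {"tigramite": "5.2.7.0", "scikit-learn": "1.6.1"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(pinned=PINNED):
    """Map each distribution name to a (wanted, found) version pair."""
    return {name: (wanted, installed_version(name)) for name, wanted in pinned.items()}


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "(README lists 3.9.21)")
    for name, (wanted, found) in check_environment().items():
        status = "ok" if found == wanted else "differs"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```

A mismatch does not necessarily mean the code will fail; it only flags where your setup differs from the one listed here.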
We evaluate SHoTClean on 10 real-world datasets, with detailed statistics summarized in the table below.
| Dimension | Datasets | Length | # dim | Source | Field |
|---|---|---|---|---|---|
| Univariate | TOTALSA | 593 | 1 | FRED | Trade |
| Univariate | STOCK | 12,824 | 1 | SCREEN | Finance |
| Univariate | COVID-19 | 14,001 | 1 | JHU CSSE | Health |
| Univariate | CA | 43,824 | 1 | California ISO | Energy |
| Univariate | ID_a40b | 137,898 | 1 | AIOps | KPI |
| Multivariate | Porto | 16,749 | 2 | Kaggle | Trajectory |
| Multivariate | ECG | 650,000 | 2 | UCR | Health |
| Multivariate | Exchange | 7,588 | 8 | LSTNet | Finance |
| Multivariate | AEP | 19,735 | 21 | UCI | Energy |
| Multivariate | PSM | 132,481 | 25 | eBay | Server |
| Multivariate | SWaT | 14,996 | 26 | iTrust | Industrial |
| Multivariate | WADI | 784,537 | 73 | iTrust | Industrial |
All datasets have been processed with ./experiments/generate_dataset.py, and the outputs are available under ./datasets. To ensure fair comparisons and account for varying dataset sizes, we applied the following selection criteria; all other datasets are used in their original form:
- TOTALSA: records from January 1, 1976 to May 1, 2025.
- COVID-19: records from China, France, Russia, UK, US.
- CA: all energy-consumption records from 2019 to 2023.
- KPI: only the entry with ID a40b1df87e3f1c87.
- Porto: the first ten trajectories containing more than 1,200 points.
- ECG: records from 118e06 with SNR 6dB.
- SWaT & WADI: only numeric features.
Please note that the SWaT and WADI datasets are not included here, as their distribution rights belong to iTrust; to request access, please use the links provided in the table. You are also welcome to download the original raw datasets and process them however you like.
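If you do process the raw data yourself, the selection criteria above are straightforward to reproduce. As one illustration, the Porto rule (the first ten trajectories with more than 1,200 points) could be sketched as follows. This assumes each trajectory is already loaded as a list of points; the function name and data layout are hypothetical and do not reflect how generate_dataset.py is written.

```python
def select_trajectories(trajectories, min_points=1200, limit=10):
    """Keep the first `limit` trajectories with more than `min_points` points.

    `trajectories` is any iterable of sequences, in their original order.
    """
    selected = []
    for trajectory in trajectories:
        if len(trajectory) > min_points:
            selected.append(trajectory)
            if len(selected) == limit:
                break  # stop early once enough trajectories are collected
    return selected


# Toy data: four trajectories of (latitude, longitude) points.
toy = [[(0.0, 0.0)] * n for n in (5, 1201, 1200, 1300)]
print([len(t) for t in select_trajectories(toy)])  # [1201, 1300]
```

Note the strict inequality: a trajectory with exactly 1,200 points is excluded under this reading of "more than 1,200 points".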
| Dimension | Algorithm | Scenario | Type |
|---|---|---|---|
| Univariate | EWMA | online | smoothing |
| Univariate | SCREEN | online | constraint |
| Univariate | SpeedAcc | online | constraint |
| Univariate | LsGreedy | online | statistical |
| Univariate | Akane | offline | statistical |
| Multivariate | ARIMA | offline | smoothing |
| Multivariate | Clean4MTS | offline | constraint+statistical |
| Multivariate | MTCSC | online | constraint+statistical |
| Multivariate | TranAD | online | anomaly detection |
| Multivariate | IMDiffusion | online | anomaly detection |
We’ve reimplemented SCREEN, SpeedAcc, LsGreedy, MTCSC, and other algorithms in Python and merged them into our framework (see the ./baselines directory for details), while all remaining algorithms still run in their original environments.
The repository is organized into directories and standalone scripts as follows:
- baselines/: Python implementations of some baseline algorithms
- datasets/: processed datasets ready for use
- experiments/: scripts for running each experiment (see the next section for details)
- results/: stores the output of experiments
- tools/: utility classes and helpers
- Optimization_Solver.py: global optimization solver
- SHoTClean_B.py: multi-dimensional SHoTClean-B implementation
- SHoTClean_B1.py: single-dimensional SHoTClean-B implementation
- SHoTClean_C.py: multi-dimensional SHoTClean-C implementation
- SHoTClean_P1.py: single-dimensional SHoTClean-P implementation
- SHoTClean_S.py: multi-dimensional SHoTClean-S implementation
- SHoTClean_S1.py: single-dimensional SHoTClean-S implementation
We provide a set of Python scripts to reproduce all of our experiments and visualizations:
Dataset-specific experiments
- experiment_CA.py: Runs a suite of experiments on the CA dataset.
- experiment_CA_segment.py: Runs a suite of experiments on the CA dataset with the contiguous-segment injection strategy.
- experiment_COVID.py: Runs a suite of experiments on the COVID-19 dataset.
- experiment_exchange.py: Runs a suite of experiments on the Exchange dataset.
- experiment_FRED.py: Runs a suite of experiments on the TOTALSA dataset.
- experiment_KPI.py: Runs a suite of experiments on the KPI dataset.
- experiment_Porto.py: Runs a suite of experiments on the Porto dataset.
- experiment_PSM.py: Runs a suite of experiments on the PSM dataset.
- experiment_stock.py: Runs a suite of experiments on the Stock dataset.
- experiment_SWaT.py: Runs a suite of experiments on the SWaT dataset.
- experiment_SWAT_partial.py: Runs a suite of experiments on the SWaT dataset with the partial-dimension injection strategy.
- experiment_UCI.py: Runs a suite of experiments on the AEP dataset.
- experiment_UCR.py: Runs a suite of experiments on the ECG dataset.
- experiment_WADI.py: Runs a suite of experiments on the WADI dataset.
General-purpose scripts
- generate_dataset.py: Preprocesses all raw datasets and writes them into ./datasets/.
- example_intro.py: Generates the figure for the introduction.
- example_B.py: Usage example for the SHoTClean-B variant.
- example_P.py: Usage example for the SHoTClean-P variant.
- experiment_window_size.py: Benchmarks across different window sizes.
- experiment_tSNE.py: Produces t-SNE visualizations of the cleaned data.
- experiment_DTW.py: Runs Dynamic Time Warping (DTW) experiments.
From the project's root directory (./), run an experiment with a command like:
python -m experiments.experiment_CA
Simply replace experiment_CA with the name of any other experiment script you wish to execute, without the .py extension, since the -m option expects a module name rather than a file name.
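To run several experiments back to back, a small driver script can build the same command for each one. The sketch below is not part of the repository, and its function names are hypothetical. It assumes it is launched from the project root and accepts script names with or without the .py suffix, because `python -m` takes a dotted module name rather than a file name.

```python
import subprocess
import sys


def build_command(script_name):
    """Build the `python -m experiments.<module>` command for one script."""
    # Strip a trailing ".py" so that file names and module names both work.
    module = script_name[:-3] if script_name.endswith(".py") else script_name
    return [sys.executable, "-m", f"experiments.{module}"]


def run_experiments(script_names, dry_run=False):
    """Run each experiment in order; with dry_run=True only print the commands."""
    commands = [build_command(name) for name in script_names]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the batch at the first failing experiment.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    run_experiments(["experiment_CA", "experiment_PSM.py"], dry_run=True)
```

Set `dry_run=False` to actually launch the runs. Using `sys.executable` keeps every experiment on the same interpreter, and therefore the same environment, as the driver.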
