CausalCompass is a flexible and extensible benchmark suite for evaluating the robustness of time-series causal discovery (TSCD) methods under misspecified modeling assumptions.
- Abstract
- Key Features
- Data Generation
- Benchmark Scenarios
- Running Experiments
- Result Analysis
- Project Structure
- Citation
- License
- Contributing
- Contact
Causal discovery from time series is a fundamental task in machine learning. However, its widespread adoption is hindered by a reliance on untestable causal assumptions and by the lack of robustness-oriented evaluation in existing benchmarks. To address these challenges, we propose CausalCompass, a flexible and extensible benchmark suite designed to assess the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. To demonstrate the practical utility of CausalCompass, we conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. Our experimental results indicate that no single method consistently attains optimal performance across all settings. Nevertheless, the methods exhibiting superior overall performance across diverse scenarios are almost invariably deep learning-based approaches. We further provide hyperparameter sensitivity analyses to deepen the understanding of these findings. We also find, somewhat surprisingly, that NTS-NOTEARS relies heavily on standardized preprocessing in practice, performing poorly in the vanilla setting but exhibiting strong performance after standardization. Finally, our work aims to provide a comprehensive and systematic evaluation of TSCD methods under assumption violations, thereby facilitating their broader adoption in real-world applications.
- 8 assumption-violation scenarios: confounders, nonstationarity, measurement error, standardization, missing data, mixed data, min-max normalization, and trend/seasonality
- 2 vanilla models: VAR (linear) and Lorenz-96 (nonlinear)
- 11 TSCD algorithms spanning 6 major methodological categories:
- Granger causality-based: VAR, LGC
- Constraint-based: PCMCI
- Noise-based: VARLiNGAM
- Score-based: DYNOTEARS, NTS-NOTEARS
- Topology-based: TSCI
- Deep learning-based: cMLP, cLSTM, CUTS, CUTS+
- Rigorous experimental protocols:
- Multiple random seeds for statistical reliability
- Comprehensive hyperparameter grids
- Automated infrastructure:
- Shell scripts for reproducible experiment execution
- LaTeX table generation for publication-ready results
- Origin-compatible data export for radar plots
All datasets can be generated using the scripts in the data_generation/ directory.
Generate all datasets for a specific scenario:
cd data_generation
# Vanilla datasets
python vanilla.py
# Assumption violation scenarios
python confounder.py
python measurement_error.py
python missing.py
python mixed_data.py
python nonstationary.py
python standardized.py # Includes z-score and min-max normalization
python trendseason.py
Datasets can also be generated programmatically, for example:
from data_generation.measurement_error import simulate_var_with_measure_error
# Generate VAR data with measurement error
p = 10       # Number of variables
T = 1000     # Time steps
lag = 3      # Lag order
gamma = 1.2  # Error variance = 1.2 × data variance
seed = 0     # Random seed for reproducibility
data, beta, gc = simulate_var_with_measure_error(
    p=p, T=T, lag=lag, gamma=gamma, seed=seed
)
print(f"Data shape: {data.shape}")       # (1000, 10)
print(f"Ground truth GC: {gc.shape}")    # (10, 10)
Generated datasets will be saved in the following structure:
datasets/
├── vanilla/
├── confounder/
├── measurement_error/
├── missing/
├── mixed_data/
├── nonstationary/
├── standardized/
└── trendseason/
The generated datasets follow the naming convention:
[scenario]_[params]_[model]_p[p]_T[T]_[optional]_seed[seed].npz
Example: confounder_rho0.5_VAR_p10_T1000_seed0.npz
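For illustration, such filenames can be parsed with a small regular expression; the pattern below is a hypothetical sketch mirroring the convention above, not part of the repository:

```python
import re

# Hypothetical parser for the naming convention above; not part of the
# repository. The pattern mirrors [scenario]_..._p[p]_T[T]_..._seed[seed].npz.
name = "confounder_rho0.5_VAR_p10_T1000_seed0.npz"
pattern = r"(?P<scenario>[a-z_]+)_.*_p(?P<p>\d+)_T(?P<T>\d+).*_seed(?P<seed>\d+)\.npz"
m = re.match(pattern, name)
print(m.group("scenario"), m.group("p"), m.group("T"), m.group("seed"))
# → confounder 10 1000 0
```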
Each .npz file contains:
- data: Time series observations (T × D)
- gc: Ground-truth causality graph (D × D)
- Additional scenario-specific metadata
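As a minimal sketch of this layout, a round trip through NumPy's .npz format looks like the following (toy shapes and synthetic random data, not an actual benchmark file):

```python
import numpy as np

# Toy round trip illustrating the .npz layout described above (keys "data"
# and "gc"); the shapes mimic a small p=3, T=100 dataset, not a real one.
rng = np.random.default_rng(0)
data = rng.standard_normal((100, 3))          # (T, D) observations
gc = (rng.random((3, 3)) > 0.5).astype(int)   # (D, D) ground-truth graph

np.savez("toy_example.npz", data=data, gc=gc)

loaded = np.load("toy_example.npz")
print(loaded["data"].shape)  # (100, 3)
print(loaded["gc"].shape)    # (3, 3)
```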
The datasets/ directory contains sample datasets. Complete datasets can be generated using the provided scripts.
For convenience and reproducibility, the complete datasets archive is publicly available on Google Drive.
- Vanilla: Standard VAR and Lorenz-96 systems without assumption violations.
- Confounders: Hidden confounders create spurious correlations between observed variables.
- Measurement error: Gaussian noise with variance proportional to the data variance is added to observations.
- Missing data: Values are dropped at random with a specified probability and imputed via zero-order hold.
- Mixed data: A mixture of continuous and discrete variables.
- Nonstationarity: Time-varying noise variance and time-varying coefficients.
- Standardization: Z-score and min-max normalization applied to the time series.
- Trend/seasonality: Trends and seasonal patterns added to observations.
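For instance, the zero-order-hold imputation used in the missing-data scenario can be sketched as follows (a simplified illustration, not the repository's implementation):

```python
import numpy as np

# Simplified sketch of the missing-data scenario: entries are dropped at
# random with a given probability and imputed by zero-order hold, i.e. the
# last observed value is carried forward. Not the repository's exact code.
def zero_order_hold(x: np.ndarray, miss_mask: np.ndarray) -> np.ndarray:
    """Forward-fill masked entries column by column; x has shape (T, D)."""
    filled = x.copy()
    T, D = x.shape
    for d in range(D):
        for t in range(1, T):
            if miss_mask[t, d]:
                filled[t, d] = filled[t - 1, d]
    return filled

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 10))
mask = rng.random(x.shape) < 0.3   # 30% of entries missing at random
mask[0, :] = False                 # keep the first row fully observed
x_imputed = zero_order_hold(x, mask)
print(x_imputed.shape)  # (1000, 10)
```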
Run all TSCD algorithms automatically using the provided shell scripts:
# Navigate to scripts directory
cd scripts
# Run all experiments (11 algorithms)
chmod +x run_all.sh
./run_all.sh
# Or run individual algorithms
chmod +x run_*.sh
./run_var.sh # VAR
./run_lgc.sh # LGC
./run_pcmci.sh # PCMCI
./run_varlingam.sh # VARLiNGAM
./run_dynotears.sh # DYNOTEARS
./run_ntsnotears.sh # NTS-NOTEARS
./run_tsci.sh # TSCI
./run_ngc.sh # NGC (cMLP and cLSTM)
./run_cuts.sh # CUTS
./run_cutsplus.sh # CUTS+
The run_all.sh script orchestrates all 11 algorithms and handles:
- Automatic error detection and reporting
- Progress tracking with timestamps
- Failed script counting and exit code management
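The orchestration pattern above can be sketched roughly as follows (a hypothetical simplification, not the actual contents of run_all.sh):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the run_all.sh pattern: run each script with a
# timestamped progress line, count failures, and return nonzero if any failed.
set -u

run_scripts() {
  local failed=0
  for script in "$@"; do
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Running ${script}..."
    if ! bash "${script}"; then
      echo "[$(date '+%Y-%m-%d %H:%M:%S')] FAILED: ${script}" >&2
      failed=$((failed + 1))
    fi
  done
  echo "Done: $(( $# - failed ))/$# scripts succeeded."
  return "${failed}"
}
```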
Note: Results are saved in JSON format with performance metrics (AUPRC, AUROC) and hyperparameter configurations.
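Such results files can then be post-processed with standard JSON tooling. The key names below ("auprc", "auroc", "hyperparameters") and the metric values are illustrative assumptions, not the repository's exact schema:

```python
import json

# Hypothetical results record; key names and values are illustrative
# assumptions, not the repository's exact schema.
result = {
    "method": "PCMCI",
    "scenario": "confounder",
    "auprc": 0.71,
    "auroc": 0.83,
    "hyperparameters": {"tau_max": 3, "pc_alpha": 0.05},
}
with open("result_example.json", "w") as f:
    json.dump(result, f, indent=2)

with open("result_example.json") as f:
    loaded = json.load(f)
print(f'{loaded["method"]}: AUPRC={loaded["auprc"]}, AUROC={loaded["auroc"]}')
# → PCMCI: AUPRC=0.71, AUROC=0.83
```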
Convert experimental results to publication-ready LaTeX tables:
python result2latex.py
This generates:
- Comparison tables across all scenarios and methods
- Performance metrics (AUPRC/AUROC) with best results highlighted
- Separate tables for VAR and Lorenz-96 with different parameters
Output files: table_VAR_p10_T1000.tex, table_Lorenz_p10_T1000_F10.tex, etc.
Export results for radar plots and visualization:
python generate_origin_tables.py
This script generates .txt files compatible with Origin for creating:
- Radar plots comparing method performance across scenarios
- Hyperparameter sensitivity visualizations
CausalCompass/
│
├── algs/ # Algorithm implementations
│ ├── cuts/ # CUTS implementation
│ ├── cutsplus/ # CUTS+ implementation
│ ├── lgc/ # LGC implementation
│ ├── ngc/ # NGC implementation
│ ├── ntsnotears/ # NTS-NOTEARS implementation
│ ├── tsci/ # TSCI implementation
│ ├── var/ # VAR implementation
│ ├── varlingam/ # VARLiNGAM implementation
│ └── __init__.py # Package initialization
│
├── data_generation/ # Data generation scripts
│ ├── vanilla.py # VAR and Lorenz-96
│ ├── confounder.py # Confounders scenario
│ ├── measurement_error.py # Measurement error scenario
│ ├── missing.py # Missing data scenario
│ ├── mixed_data.py # Mixed data scenario
│ ├── non_gaussian.py # Non-Gaussian noise scenario
│ ├── nonstationary.py # Nonstationarity scenario
│ ├── standardized.py # z-score and min-max scenario
│ └── trendseason.py # Trend and seasonality scenario
│
├── datasets/ # Sample datasets (fully reproducible via scripts)
│ └── [scenario]/ # Organized by scenario
│
├── scripts/ # Experiment execution scripts
│ ├── run_all.sh # Master script to run all experiments
│ ├── run_var.sh # VAR experiments
│ ├── run_lgc.sh # LGC experiments
│ ├── run_pcmci.sh # PCMCI experiments
│ ├── run_varlingam.sh # VARLiNGAM experiments
│ ├── run_dynotears.sh # DYNOTEARS experiments
│ ├── run_ntsnotears.sh # NTS-NOTEARS experiments
│ ├── run_tsci.sh # TSCI experiments
│ ├── run_ngc.sh # NGC experiments
│ ├── run_cuts.sh # CUTS experiments
│ └── run_cutsplus.sh # CUTS+ experiments
│
├── result2latex.py # Generate LaTeX tables from results
├── generate_origin_tables.py # Generate Origin data files
│
└── README.md # This file
If you use this code or datasets in your research, please cite:
@misc{yi2026causalcompass,
title = {{CausalCompass}: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios},
author = {Yi, Huiyang and Shen, Xiaojian and Wu, Yonggang and Chen, Duxin and Wang, He and Yu, Wenwu},
year = {2026},
note = {Under review as a conference paper}
}
Note: The final bibliographic information (e.g., venue and proceedings details) will be updated upon paper acceptance.
- The code in this repository is released under the MIT License.
- The datasets generated and provided by this repository are released under the CC BY 4.0 License.
Contributions are welcome! If you encounter bugs, have suggestions for improvements, or would like to extend CausalCompass with additional assumption-violation scenarios or evaluation protocols, please feel free to open an issue or submit a pull request.
For questions or issues, please:
- Open an issue in this repository
- Email: yihuiyang@seu.edu.cn
