This repository contains the official implementation of the paper "A Multimodal Depression Detection Framework Based on Large-Scale Pre-trained Models: Cross-Domain Generalization and Clinical Deployment Strategies" (submitted to Journal of Medical Systems).
We provide a robust, reproducible framework for detecting depression from speech and text using Wav2Vec 2.0 and Chinese RoBERTa, specifically designed to handle cross-domain shifts in heterogeneous clinical environments.
- Dual-Stream Architecture: Late fusion of acoustic (Wav2Vec 2.0) and semantic (RoBERTa) features to capture complementary clinical cues.
- Cross-Domain Adaptation: A complete pipeline for Zero-shot Evaluation and Few-shot Fine-tuning, addressing the "generalization gap" in multi-center deployment.
- Trustworthy AI: Integrated post-hoc calibration (temperature scaling) to produce reliable risk-probability outputs by minimizing Expected Calibration Error (ECE).
- Reproducibility: Standardized data processing and training protocols with fixed seeds.
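The dual-stream design can be sketched as a concatenation-based late-fusion head over pooled encoder outputs. This is an illustrative sketch, not the repo's exact model: the 768-dim inputs match base-size Wav2Vec 2.0 and Chinese RoBERTa pooled features, and the hidden size and dropout are assumptions.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Fuses pooled acoustic and semantic embeddings for binary classification.

    Dimensions are illustrative: base Wav2Vec 2.0 and Chinese RoBERTa both
    emit 768-dim pooled features; the paper's actual head may differ.
    """

    def __init__(self, audio_dim: int = 768, text_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 2),  # 0 = control, 1 = depressed
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Late fusion: concatenate the two modality embeddings, then classify.
        fused = torch.cat([audio_emb, text_emb], dim=-1)
        return self.classifier(fused)

# Random tensors stand in for real Wav2Vec 2.0 / RoBERTa pooled outputs.
head = LateFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```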
1. Clone the repository

```bash
git clone https://github.com/yourusername/multimodal-depression-detection.git
cd multimodal-depression-detection
```

2. Create a virtual environment (recommended)

```bash
conda create -n depression-detect python=3.10
conda activate depression-detect
```

3. Install dependencies

```bash
pip install -r requirements.txt
```
Note on Privacy: Due to strict medical data privacy regulations, the original clinical datasets (CMDC, EATD, PDCH) cannot be publicly released. Researchers should prepare their own datasets following the format below.
Prepare a .csv file with the following columns:
| Column | Description | Example |
|---|---|---|
| `audio_path` | Path to the `.wav` file | `data/audio/subject_001.wav` |
| `text` | Transcribed text | "我最近睡眠质量很差..." ("My sleep quality has been very poor lately...") |
| `label` | `0` (Control) or `1` (Depressed) | `1` |
| `split` | `train` / `val` / `test` | `train` |
A sample manifest is provided in data/sample_manifest.csv.
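Before training, it can help to sanity-check a manifest against the schema above. The helper below is an illustrative sketch, not part of the repo's CLI:

```python
import csv
import io

REQUIRED = {"audio_path", "text", "label", "split"}

def validate_manifest(f) -> list[dict]:
    """Checks that a manifest CSV has the required columns and sane values."""
    rows = list(csv.DictReader(f))
    missing = REQUIRED - set(rows[0].keys())
    if missing:
        raise ValueError(f"manifest missing columns: {sorted(missing)}")
    for i, row in enumerate(rows):
        if row["label"] not in {"0", "1"}:
            raise ValueError(f"row {i}: label must be 0 or 1, got {row['label']!r}")
        if row["split"] not in {"train", "val", "test"}:
            raise ValueError(f"row {i}: unknown split {row['split']!r}")
    return rows

# In-memory CSV mirroring the schema of data/sample_manifest.csv.
sample = io.StringIO(
    "audio_path,text,label,split\n"
    "data/audio/subject_001.wav,我最近睡眠质量很差...,1,train\n"
)
rows = validate_manifest(sample)
print(len(rows))  # 1
```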
Train the multimodal model on your source dataset (e.g., CMDC-like data):

```bash
python train.py --config configs/config.yaml
```

Evaluate a trained model on a target dataset:
```bash
python infer.py \
    --config configs/config.yaml \
    --checkpoint_path outputs/best_model.ckpt \
    --output_csv outputs/predictions.csv
```

Fine-tune a pre-trained model on a small target dataset (e.g., PDCH) to mitigate distribution shift:
```bash
python fine_tune_on_pdch.py \
    --base_config configs/config.yaml \
    --target_manifest data/target_manifest.csv \
    --pretrained_ckpt outputs/source_best.ckpt
```

Apply temperature scaling to calibrate prediction probabilities:
```bash
python kfold_posthoc_calibration.py --kfold_root outputs/kfold_results
```

| Scenario | Dataset | Metric | Score | Note |
|---|---|---|---|---|
| Baseline | CMDC (In-domain) | F1-Macro | 98.7% | High robustness |
| Direct Transfer | PDCH (Zero-shot) | F1-Macro | 26.0% | Catastrophic drop |
| Adaptation | PDCH (Few-shot) | F1-Macro | 70.5% | +44.5 pts recovered |
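Temperature scaling divides the logits by a scalar T fitted on held-out data before the softmax. The NumPy sketch below shows the idea with a simple grid search and an ECE check; it is a minimal illustration, and `kfold_posthoc_calibration.py` may fit T differently (the synthetic logits and grid are assumptions):

```python
import numpy as np

def temperature_scale(logits: np.ndarray, T: float) -> np.ndarray:
    """Softmax over logits / T; T > 1 softens overconfident probabilities."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Expected Calibration Error over equal-width confidence bins."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            total += mask.mean() * abs(acc - conf[mask].mean())
    return total

# Synthetic, deliberately overconfident validation logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 2)) * 4
labels = rng.integers(0, 2, size=200)

def nll(T: float) -> float:
    p = temperature_scale(logits, T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

# Fit T by minimizing validation NLL over a small grid.
best_T = min(np.linspace(0.5, 5.0, 46), key=nll)
raw_ece = ece(temperature_scale(logits, 1.0), labels)
cal_ece = ece(temperature_scale(logits, best_T), labels)
```

With these overconfident synthetic logits the fitted temperature comes out above 1 and the calibrated ECE drops below the uncalibrated one, which is the behavior the calibration step is designed to produce.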
See the paper for full experimental details.
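F1-Macro, the metric reported above, is the unweighted mean of the per-class F1 scores. A self-contained sketch (illustrative, not the repo's evaluation code):

```python
import numpy as np

def f1_macro(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted mean of per-class F1 for the binary classes 0 and 1."""
    scores = []
    for c in (0, 1):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
score = f1_macro(y_true, y_pred)
print(round(score, 3))  # 0.733
```

Because each class contributes equally, a model that always predicts one class scores poorly even on imbalanced data, which is why F1-Macro is a stricter summary than accuracy for the zero-shot transfer scenario.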
If you find this code useful, please cite our paper:
```bibtex
@article{meng2025multimodal,
  title={A Multimodal Depression Detection Framework Based on Large-Scale Pre-trained Models: Cross-Domain Generalization and Clinical Deployment Strategies},
  author={Meng, Yu and Wang, Yihe and Ouyang, Zhiyuan and others},
  journal={Journal of Medical Systems},
  year={2025}
}
```

We thank the authors of Wav2Vec 2.0 and Transformers for their open-source contributions.