Indic Language × Chain-of-Thought Stress Testing for Small Language Models
How do Indic languages and chain-of-thought prompting—individually and jointly—affect general reasoning performance and reasoning structure in small language models?
QUENCH++ is a research framework that systematically applies three kinds of stress:
- Language stress (English → Indic)
- CoT stress (No-CoT → CoT)
- Combined stress (both simultaneously)
Translation-Induced Reasoning Drift (TIRD) quantifies how much a model's reasoning structure changes when a question is translated across languages:
TIRD = 1 - Jaccard(Steps_EN, Steps_Indic)
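As a minimal sketch, assuming reasoning chains are represented as sets of normalized step labels (a simplification; the annotation notebooks may use richer step representations), TIRD reduces to one minus the Jaccard similarity of the two step sets:

```python
def jaccard(a, b):
    """Jaccard similarity of two step sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def tird(steps_en, steps_indic):
    """Translation-Induced Reasoning Drift: 1 - Jaccard over step sets."""
    return 1.0 - jaccard(steps_en, steps_indic)

# Identical chains show zero drift; disjoint chains show maximal drift.
print(tird(["parse", "recall", "compare"], ["parse", "recall", "compare"]))  # 0.0
print(tird(["parse", "recall"], ["translate", "guess"]))                     # 1.0
```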
The framework goes beyond accuracy to measure:
- Faithfulness: Do reasoning steps support the answer?
- Complexity: Reasoning depth vs verbosity
- Stability: Cross-language reasoning consistency
The 2×2 experimental design systematically isolates individual and interaction effects:
| Setting ID | Language | Reasoning | Description |
|---|---|---|---|
| S1 | English | No-CoT | QUENCH baseline |
| S2 | English | CoT | CoT stress test |
| S3 | Indic | No-CoT | Language stress test |
| S4 | Indic | CoT | Full stress test |
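The four settings above could be enumerated as in this sketch (setting IDs follow the table; the prompt template and language codes are illustrative assumptions, not the project's actual templates):

```python
# 2x2 grid: language stress x CoT stress (IDs match the settings table).
SETTINGS = {
    "S1": {"language": "en", "cot": False},  # QUENCH baseline
    "S2": {"language": "en", "cot": True},   # CoT stress test
    "S3": {"language": "hi", "cot": False},  # language stress test
    "S4": {"language": "hi", "cot": True},   # full stress test
}

def build_prompt(question: str, setting_id: str) -> str:
    """Append a CoT trigger only in the CoT settings; the question text
    itself is assumed to already be in the setting's language."""
    cfg = SETTINGS[setting_id]
    trigger = " Let's think step by step." if cfg["cot"] else ""
    return question + trigger

print(build_prompt("2 + 2 = ?", "S2"))  # 2 + 2 = ? Let's think step by step.
```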
```
quench-plus-plus/
│
├── data/
│   ├── original/              # Original QUENCH dataset
│   └── translated/            # Hindi, Bengali, Marathi
│
├── models/
│   └── inference_wrappers/    # Qwen3, DeepSeek, Gemma2, LLaMA
│
├── evaluation/
│   ├── accuracy.py            # Accuracy calculation
│   ├── quench_gap.py          # Gap decomposition
│   ├── cot_faithfulness.py    # Faithfulness metrics
│   ├── cot_complexity.py      # Complexity metrics
│   └── cot_stability.py       # TIRD calculation
│
├── notebooks/
│   ├── eng-hin.ipynb          # Hindi translation
│   ├── eng-bang.ipynb         # Bengali translation
│   ├── eng-mar.ipynb          # Marathi translation
│   ├── 01_translation_validation.ipynb
│   ├── 02_baseline_inference.ipynb
│   ├── 03_cot_inference.ipynb
│   ├── 04_cot_annotation.ipynb
│   ├── 05_symbolic_cot_metrics.ipynb
│   └── 06_final_analysis.ipynb
│
└── paper/                     # Paper-ready outputs
```
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run setup script
python setup_project.py
```

Edit `config/config.yaml`:

```yaml
api_keys:
  openai: "your-openai-api-key-here"
  huggingface: "your-hf-token-here"
```

Place the QUENCH dataset at `data/original/quench.json`.
Execute notebooks in order (01 through 06).
- Language Translation Degrades Performance: 8-12% average drop
- CoT Shows Mixed Effects: Can help or hurt depending on model
- Full Stress Compounds Effects: 15-20% total degradation
- Translation Changes Reasoning Structure: 34% average TIRD score
- TIRD Metric: First quantification of translation-induced reasoning drift
- Symbolic CoT Analysis: Goes beyond accuracy to understand reasoning changes
- 2×2 Design: Isolates individual and interaction effects
- Research Finding: Translation changes how models reason, not just accuracy
- Qwen3 (reasoning-capable)
- DeepSeek (baseline)
- Gemma 2 (9B)
- LLaMA 3 (8B)
- Hindi (Devanagari script)
- Bengali (Bengali script)
- Marathi (Devanagari script)
Translation uses NLLB-200 by default.
✅ Answer Preservation: Answers kept in original form
✅ No Dataset Contamination: Same questions across all settings
✅ GPT as Annotator Only: Not for correctness judgement
✅ Statistical Rigor: Paired t-tests, bootstrap CI
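The paired t-tests and bootstrap confidence intervals mentioned above can be sketched with the standard library alone (the resampling count, seed, and example scores are illustrative assumptions, not the project's exact settings):

```python
import math
import random
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for a paired comparison of per-item scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def bootstrap_ci(scores_a, scores_b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical per-item accuracies: English vs. translated setting.
en = [0.82, 0.79, 0.85, 0.80, 0.78, 0.83]
hi = [0.71, 0.70, 0.77, 0.69, 0.72, 0.74]
lo, hi_ci = bootstrap_ci(en, hi)
print(f"t = {paired_t_statistic(en, hi):.2f}, 95% CI = [{lo:.3f}, {hi_ci:.3f}]")
```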
```bibtex
@inproceedings{yourname2025quenchpp,
  title={QUENCH++: Indic Language and Chain-of-Thought Stress Testing for Small Language Models},
  author={Your Name},
  booktitle={Proceedings of ACL 2025},
  year={2025}
}
```

MIT License
For questions: Open a GitHub issue or contact the maintainers.
Built for rigorous, culturally aware LLM evaluation.