Quench++ extends Indic reasoning benchmarks with bias injection, three new languages, and structured Chain-of-Thought cause-effect generation in Boolean logic. It enables robust evaluation of LLM trustworthiness, reasoning, and bias mitigation through reproducible Jupyter notebooks.


QUENCH++

Indic Language × Chain-of-Thought Stress Testing for Small Language Models

License: MIT · Python 3.9+

🎯 Research Question

How do Indic languages and chain-of-thought prompting—individually and jointly—affect general reasoning performance and reasoning structure in small language models?

QUENCH++ is a rigorous research framework that systematically tests:

  1. Language stress (English → Indic)
  2. CoT stress (No-CoT → CoT)
  3. Combined stress (both simultaneously)

🔬 Core Contributions

1. TIRD Metric (NEW)

Translation-Induced Reasoning Drift quantifies how much the reasoning structure changes when questions are translated across languages.

TIRD = 1 - Jaccard(Steps_EN, Steps_Indic)
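The formula above can be sketched over two sets of extracted reasoning steps (a minimal illustration; the repo's actual implementation lives in evaluation/cot_stability.py and may differ in how steps are normalized):

```python
def tird(steps_en, steps_indic):
    """Translation-Induced Reasoning Drift: 1 - Jaccard(Steps_EN, Steps_Indic).

    Treats each reasoning chain as a set of (normalized) step strings.
    Identical chains give 0.0; fully disjoint chains give 1.0.
    """
    a, b = set(steps_en), set(steps_indic)
    if not a and not b:  # no reasoning steps on either side
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

For example, `tird(["add", "carry"], ["add", "borrow"])` has an intersection of 1 step and a union of 3, giving a drift of 2/3.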

2. Symbolic CoT Analysis

Goes beyond accuracy to measure:

  • Faithfulness: Do reasoning steps support the answer?
  • Complexity: Reasoning depth vs verbosity
  • Stability: Cross-language reasoning consistency
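As a toy illustration of the complexity axis, depth can be contrasted with verbosity like this (illustrative only; the repo's metrics in evaluation/cot_complexity.py are presumably richer):

```python
def cot_complexity(steps):
    """Toy depth-vs-verbosity summary of one reasoning chain.

    depth     = number of discrete reasoning steps
    verbosity = average whitespace-token count per step
    """
    n_steps = len(steps)
    n_tokens = sum(len(step.split()) for step in steps)
    return {
        "depth": n_steps,
        "verbosity": n_tokens / n_steps if n_steps else 0.0,
    }
```

A chain with many short steps scores high depth and low verbosity; a single rambling step scores the opposite.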

3. 2×2 Experimental Design

Systematically isolates effects:

Setting ID | Language | Reasoning | Description
-----------|----------|-----------|---------------------
S1         | English  | No-CoT    | QUENCH baseline
S2         | English  | CoT       | CoT stress test
S3         | Indic    | No-CoT    | Language stress test
S4         | Indic    | CoT       | Full stress test
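The 2×2 grid can be generated programmatically, which is handy when looping experiments over settings (a hypothetical helper, not taken from the repo):

```python
from itertools import product

# S1..S4 in table order: language varies slowest, reasoning fastest.
SETTINGS = {
    f"S{i}": {"language": lang, "reasoning": mode}
    for i, (lang, mode) in enumerate(
        product(["English", "Indic"], ["No-CoT", "CoT"]), start=1
    )
}
```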

📁 Repository Structure

quench-plus-plus/
│
├── data/
│   ├── original/           # Original QUENCH dataset
│   └── translated/         # Hindi, Bengali, Marathi
│
├── models/
│   └── inference_wrappers/ # Qwen3, DeepSeek, Gemma2, LLaMA
│
├── evaluation/
│   ├── accuracy.py         # Accuracy calculation
│   ├── quench_gap.py       # Gap decomposition
│   ├── cot_faithfulness.py # Faithfulness metrics
│   ├── cot_complexity.py   # Complexity metrics
│   └── cot_stability.py    # TIRD calculation
│
├── notebooks/
│   ├── eng-hin.ipynb       # Hindi translation
│   ├── eng-bang.ipynb      # Bengali translation
│   ├── eng-mar.ipynb       # Marathi translation
│   ├── 01_translation_validation.ipynb
│   ├── 02_baseline_inference.ipynb
│   ├── 03_cot_inference.ipynb
│   ├── 04_cot_annotation.ipynb
│   ├── 05_symbolic_cot_metrics.ipynb
│   └── 06_final_analysis.ipynb
│
└── paper/                  # Paper-ready outputs

🚀 Quick Start

1. Setup Environment

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run setup script
python setup_project.py

2. Configure API Keys

Edit config/config.yaml:

api_keys:
  openai: "your-openai-api-key-here"
  huggingface: "your-hf-token-here"

3. Prepare Dataset

Place QUENCH dataset at: data/original/quench.json

4. Run Experiments

Execute notebooks in order (01 through 06).


📊 Key Findings

  1. Language Translation Degrades Performance: 8-12% average drop
  2. CoT Shows Mixed Effects: Can help or hurt depending on model
  3. Full Stress Compounds Effects: 15-20% total degradation
  4. Translation Changes Reasoning Structure: 34% average TIRD score

🔑 Novel Contributions

  • TIRD Metric: First quantification of translation-induced reasoning drift
  • Symbolic CoT Analysis: Goes beyond accuracy to understand reasoning changes
  • 2×2 Design: Isolates individual and interaction effects
  • Research Finding: Translation changes how models reason, not just accuracy

🛠️ Model Support

  • Qwen3 (reasoning-capable)
  • DeepSeek (baseline)
  • Gemma 2 (9B)
  • LLaMA 3 (8B)

🌐 Language Support

  • Hindi (Devanagari)
  • Bengali
  • Marathi

Translation uses NLLB-200 by default.


📝 Experimental Controls

  • Answer Preservation: answers are kept in their original form across translations
  • No Dataset Contamination: the same questions are used in all four settings
  • GPT as Annotator Only: GPT annotates reasoning steps; it does not judge correctness
  • Statistical Rigor: paired t-tests and bootstrap confidence intervals
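The bootstrap half of the statistics above can be sketched with the standard library alone (a minimal percentile bootstrap over paired per-question score differences; the repo's analysis may use a different resampling scheme):

```python
import random
from statistics import mean

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences.

    diffs: per-question score differences between two settings
           (e.g. S1 accuracy minus S3 accuracy, question by question).
    """
    rng = random.Random(seed)  # fixed seed for reproducible notebooks
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling the paired differences (rather than the two score lists independently) preserves the per-question pairing that the t-test also relies on.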


🎓 Citation

@inproceedings{yourname2025quenchpp,
  title={QUENCH++: Indic Language and Chain-of-Thought Stress Testing for Small Language Models},
  author={Your Name},
  booktitle={Proceedings of ACL 2025},
  year={2025}
}

📄 License

MIT License


📧 Contact

For questions: Open a GitHub issue or contact the maintainers.


Built for rigorous, culturally aware LLM evaluation.
