Commit e21af7d

readme
1 parent e141795 commit e21af7d

1 file changed: +39 -19 lines changed

readme.md

Lines changed: 39 additions & 19 deletions
@@ -22,8 +22,9 @@ Financial Reasoning* (EMNLP 2025 submission)

 ## 📚 Key Features

-- **54 topics** across **12 financial domains**
+- **58 topics** across **12 financial domains**
 - **5 symbolic templates per topic** (2 easy, 2 intermediate, 1 advanced)
+- **2,900 machine-verifiable instances** (58 topics × 5 templates × 10 seeds)
 - **Executable Python traces** for step-level answer verification
 - **ChainEval**, a custom metric for evaluating both final answers and intermediate steps
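The instance count added in this hunk is a plain product of the three figures the README gives. As a quick sanity check, here is a minimal sketch that enumerates the (topic, template, seed) grid. The `topic_NN` identifiers and the ordering of difficulties within a topic are placeholders assumed for illustration, not names or layouts taken from the repository.

```python
from itertools import product

# Counts taken from the README: 58 topics, 5 templates per topic, 10 seeds.
N_TOPICS, N_TEMPLATES, N_SEEDS = 58, 5, 10

# Difficulty split per topic as stated in the README
# (2 easy, 2 intermediate, 1 advanced); the ordering is an assumption.
DIFFICULTIES = ["easy", "easy", "intermediate", "intermediate", "advanced"]


def enumerate_instances():
    """Yield one (topic_id, template_idx, difficulty, seed) tuple per instance.

    The topic ids are hypothetical placeholders, not repository names.
    """
    for t, k, s in product(range(N_TOPICS), range(N_TEMPLATES), range(N_SEEDS)):
        yield (f"topic_{t:02d}", k, DIFFICULTIES[k], s)


instances = list(enumerate_instances())
print(len(instances))  # 2900
```

Under the same assumptions, the advanced tier alone accounts for 58 × 1 × 10 = 580 of those instances.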

@@ -42,9 +43,11 @@ This example shows a symbolic template for Compound Interest:

 ```
 finchain/
+├── assets/ # Logos, taxonomy visuals, and supporting figures
+├── chaineval/ # ChainEval metric implementation & aggregation
 ├── data/
-│ └── templates/ # Symbolic prompt templates for 54 financial topics
-├── eval/ # ChainEval evaluation scripts (coming soon)
+│ └── templates/ # Symbolic prompt templates for 58 financial topics
+├── website/ # Static site & leaderboard assets
 └── README.md
 ```
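The scripts under `data/templates/` are not shown in this commit. To give a rough idea of what a seeded symbolic template with a checkable step trace could look like, here is a minimal sketch for the Compound Interest topic mentioned in the hunk header. The function name, parameter ranges, and the returned dictionary layout are all assumptions made for illustration and may differ from the repository's actual templates.

```python
import random


def compound_interest_instance(seed):
    """Build one question plus a step-by-step trace from a seed.

    Hypothetical layout: not the repository's actual template format.
    """
    rng = random.Random(seed)  # same seed -> same instance
    principal = rng.randrange(1_000, 10_001, 500)
    rate_pct = rng.choice([2, 3, 4, 5, 6])
    years = rng.randint(1, 10)

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly. What is the balance after {years} years?"
    )

    # Each step records an intermediate value so it can be checked separately
    # from the final answer.
    growth_factor = (1 + rate_pct / 100) ** years
    balance = round(principal * growth_factor, 2)
    steps = [
        ("growth_factor", round(growth_factor, 6)),
        ("balance", balance),
    ]
    return {"question": question, "steps": steps, "answer": balance}


first = compound_interest_instance(seed=0)
second = compound_interest_instance(seed=0)
print(first["question"])
```

Seeding a local `random.Random` rather than the global generator keeps each instance reproducible, which is what makes a fixed number of seeds per template meaningful.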

@@ -55,7 +58,7 @@ Each instance includes:

 ## 🧭 Taxonomy of Domains and Topics

-FinChain covers 54 financial topics across 12 domains:
+FinChain covers 58 financial topics across 12 domains:

 <p align="center">
 <img src="assets/taxonomy.png" width="3000"/>
@@ -87,15 +90,17 @@ This allows precise tracking of where models hallucinate, skip, or miscalculate.

 ## 📈 Benchmarking Results

-We evaluate **30 models**, including:
-- GPT-4.1, GPT-4o-mini, LLaMA 3.3 70B
-- Qwen3, DeepSeek-R1, Mixtral, Mathstral
-- Fin-tuned models: Fino1, FinR1, WiroAI Finance Qwen
+We evaluate **26 models** spanning four categories:
+- **Frontier proprietary**: GPT-5/4.1 (+ mini variants), Claude Sonnet 4.5/4/3.7, Gemini 2.5 Pro/Flash, DeepSeek V3.x/R1, Grok 4 Heavy/Fast
+- **Finance-specific**: Fino1, FinR1, DianJin-R1, WiroAI Finance-LLaMA, WiroAI Finance-Qwen
+- **Math-enhanced**: WizardMath, MetaMath, Mathstral, Qwen2.5-Math
+- **General-purpose open**: LLaMA 3.1, Qwen 2.5/3

 **Findings:**
-- Larger models outperform smaller financial-tuned models
-- Even top models struggle on advanced templates and multi-hop symbolic chains
-- FinChain reveals reasoning gaps not captured by standard accuracy metrics
+- Frontier models lead ChainEval yet still struggle on advanced, compositional templates
+- Finance-tuned and math-enhanced 7B models (FinR1, Mathstral) approach frontier performance under ChainEval
+- Domain-wise analysis shows frontier systems remain balanced, while fine-tuned models excel in their target areas (e.g., FinR1 in reporting/risk, Mathstral in quantitative domains)
+- Accuracy drops across all model families from basic to advanced templates, highlighting persistent gaps in symbolic financial reasoning

 ## 🚀 Quick Start

@@ -109,11 +114,25 @@ Explore templates:
 ls data/templates/
 ```

-Evaluate predictions (scripts coming soon):
+Generate sample problems (each template script exposes a `main()` helper):
 ```bash
-python eval/eval_chain.py --pred path/to/your_outputs.jsonl
+python data/templates/investment_analysis/npv.py
 ```

+Evaluate model predictions with ChainEval:
+```bash
+python chaineval/evaluate_predictions.py --input path/to/your_outputs.jsonl --output evals/
+```
+
+Aggregate metrics across domains, subtopics, and difficulty levels:
+```bash
+python chaineval/aggregate.py
+```
+
+## 📘 Documentation
+
+- Detailed methodology, data pipeline, and evaluation discussion are available in the accompanying paper (`paper.pdf`).
+
 ## 💬 Feedback & Contributions

 **FinChain is an ongoing project**, and we’re continuously working to expand its coverage, refine evaluation metrics, and improve data quality. We **welcome feedback, suggestions, and community contributions**—whether it's about financial domains we missed, new evaluation ideas, or improving symbolic template diversity. If you're interested in collaborating or contributing, feel free to open an issue or contact us directly.
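The Quick Start changes above mention aggregating metrics across domains, subtopics, and difficulty levels, but the input format of `chaineval/aggregate.py` is not visible in this commit. As a rough illustration of that kind of grouping only, the sketch below averages a score over hypothetical records. The field names `domain`, `difficulty`, and `score`, and the sample values, are assumptions for illustration and are not FinChain results or the script's real schema.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records, keys):
    """Average `score` over records grouped by the given field names."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec["score"])
    return {group: mean(scores) for group, scores in groups.items()}


# Made-up records for illustration only; not FinChain results.
records = [
    {"domain": "investment_analysis", "difficulty": "easy", "score": 1.0},
    {"domain": "investment_analysis", "difficulty": "advanced", "score": 0.0},
    {"domain": "investment_analysis", "difficulty": "advanced", "score": 0.5},
]

by_difficulty = aggregate(records, ["difficulty"])
by_domain_and_difficulty = aggregate(records, ["domain", "difficulty"])
print(by_difficulty)
```

Passing the grouping keys as a list lets the same helper produce per-domain, per-difficulty, or combined breakdowns without separate code paths.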
@@ -126,7 +145,7 @@ If you find **FinChain** useful in your research, please consider citing our pap

 @article{xie2025finchain,
 title={FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning},
-author={Xie, Zhuohan and Sahnan, Dhruv and Banerjee, Debopriyo and Georgiev, Georgi and Thareja, Rushil and Madmoun, Hachem and Su, Jinyan and Singh, Aaryamonvikram and Wang, Yuxia and Xing, Rui and Koto, Fajri and Li, Haonan and Koychev, Ivan and Chakraborty, Tanmoy and Lahlou, Salem and Stoyanov, Veselin and Nakov, Preslav},
+author={Xie, Zhuohan and Orel, Daniil and Thareja, Rushil and Sahnan, Dhruv and Madmoun, Hachem and Zhang, Fan and Banerjee, Debopriyo and Georgiev, Georgi and Peng, Xueqing and Qian, Lingfei and Huang, Jimin and Su, Jinyan and Singh, Aaryamonvikram and Xing, Rui and Elbadry, Rania and Xu, Chen and Li, Haonan and Koto, Fajri and Koychev, Ivan and Chakraborty, Tanmoy and Wang, Yuxia and Lahlou, Salem and Stoyanov, Veselin and Ananiadou, Sophia and Nakov, Preslav},
 journal={arXiv preprint arXiv:2506.02515},
 year={2025}
 }
@@ -139,12 +158,13 @@ If you find **FinChain** useful in your research, please consider citing our pap

 FinChain is developed by:

-Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, Georgi Georgiev,
-Rushil Thareja, Hachem Madmoun, Jinyan Su, Aaryamonvikram Singh,
-Yuxia Wang, Rui Xing, Fajri Koto, Haonan Li, Ivan Koychev,
-Tanmoy Chakraborty, Salem Lahlou, Veselin Stoyanov, Preslav Nakov
+Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun,
+Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian,
+Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry,
+Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty,
+Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, Preslav Nakov

-Affiliations: MBZUAI, Sofia University, Quantsquare, Cornell University, IIT Delhi
+Affiliations: MBZUAI, Syllogia, The University of Tokyo, Sofia University "St. Kliment Ohridski", The Fin AI, Cornell University, The University of Melbourne, IIT Delhi, INSAIT, The University of Manchester

 For questions or collaborations, contact: **zhuohan.xie@mbzuai.ac.ae**
