Official implementation of "GanitLLM: Bengali Mathematical Reasoning with Language Models"
GanitLLM is the first comprehensive framework for training and evaluating Bengali mathematical reasoning capabilities in language models. We introduce:
- Ganit Dataset: A curated dataset of 25k Bengali math problems with difficulty-stratified splits
- Curriculum GRPO (CGRPO): A novel training approach combining curriculum learning with Group Relative Policy Optimization
- GanitLLM Models: A family of models (0.6B, 1.7B, 4B parameters) achieving state-of-the-art performance on Bengali math reasoning
Key features:

- Multi-stage training pipeline: SFT → Curriculum Learning → GRPO
- Difficulty-aware data curation using ensemble LLM evaluation
- Support for both "thinking" and "instruct" model variants
- Comprehensive evaluation on MGSM, MSVAMP, and in-domain test sets
- Efficient training with LoRA, FlashAttention, and vLLM

## Requirements

- Python 3.12+
- CUDA 12.8+
- GPU with at least 24GB VRAM (for training)

## Installation

```bash
# Clone the repository
git clone https://github.com/dipta007/GanitLLM.git
cd GanitLLM
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
make setup
# Or manually:
uv venv --python 3.12 --seed
source .venv/bin/activate
uv sync

# Install FlashAttention (used for efficient training)
MAX_JOBS=4 uv pip install flash-attn --no-build-isolation --no-cache
```

## Dataset

The Ganit dataset is available on HuggingFace:

| Name | Description | Size |
|---|---|---|
| dipta007/Ganit | GanitSFT + GanitRLVR + GanitDEV | 25k rows |

```python
from datasets import load_dataset

dataset = load_dataset("dipta007/Ganit")
```

Each example contains:

- `problem`: Bengali math problem statement
- `bengali_solution`: Answer in Bengali numerals
- `english_solution`: Answer in English numerals
- `difficulty`: One of {easy, medium, hard, olympiad}
- `messages`: Chat-formatted training data

Difficulty tags are assigned by ensemble LLM evaluation:

| Difficulty | Criteria | Count |
|---|---|---|
| Easy | >75% LLMs correct | ~4k |
| Medium | 50-75% LLMs correct | ~4k |
| Hard | 25-50% LLMs correct | ~4k |
| Olympiad | <25% LLMs correct | ~4k |
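
For example, to work with a single tier, filter on the `difficulty` field. A minimal sketch; the `train` split name is an assumption, check the dataset card for the actual splits:

```python
from datasets import load_dataset

# Split name "train" is an assumption; see the dataset card for actual splits
dataset = load_dataset("dipta007/Ganit", split="train")

# Keep only olympiad-level problems
olympiad = dataset.filter(lambda ex: ex["difficulty"] == "olympiad")
print(len(olympiad))
print(olympiad[0]["problem"])
```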

## Models

All models are available on HuggingFace:

| Model | Parameters | Training | HuggingFace Link |
|---|---|---|---|
| GanitLLM-4B_SFT_CGRPO | 4B | SFT + CGRPO | Link |
| GanitLLM-4B_SFT_GRPO | 4B | SFT + GRPO | Link |
| GanitLLM-4B_SFT | 4B | SFT | Link |
| GanitLLM-4B_CGRPO | 4B | CGRPO | Link |
| GanitLLM-1.7B_SFT_CGRPO | 1.7B | SFT + CGRPO | Link |
| GanitLLM-1.7B_SFT_GRPO | 1.7B | SFT + GRPO | Link |
| GanitLLM-1.7B_SFT | 1.7B | SFT | Link |
| GanitLLM-1.7B_CGRPO | 1.7B | CGRPO | Link |
| GanitLLM-0.6B_SFT_CGRPO | 0.6B | SFT + CGRPO | Link |
| GanitLLM-0.6B_SFT_GRPO | 0.6B | SFT + GRPO | Link |
| GanitLLM-0.6B_SFT | 0.6B | SFT | Link |
| GanitLLM-0.6B_CGRPO | 0.6B | CGRPO | Link |
Model Types:
- SFT_CGRPO: Best performing models with multi-stage training (SFT → Curriculum GRPO)
- SFT_GRPO: Models with standard SFT followed by GRPO
- SFT: Models with supervised fine-tuning only (foundation for RL training)
- CGRPO: Models trained with curriculum GRPO only (no SFT)

## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dipta007/GanitLLM-4B_SFT_CGRPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
# Format the prompt
problem = "একটি দোকানে ১২টি আপেল আছে। যদি ৫টি আপেল বিক্রি হয়, তাহলে কতটি আপেল বাকি থাকবে?"
prompt = f"""A conversation takes place between the user and the assistant. The user asks questions, and the assistant solves them. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>. Reasoning and answer must be in Bengali. Final answer must be a number nothing else.
Question: {problem}"""
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs.to(model.device), max_new_tokens=2048, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
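
For batched inference, the models can also be served with vLLM (part of the training stack above). A minimal sketch, assuming the checkpoint loads directly into vLLM and reusing the instruction format shown earlier (the full system prompt is omitted for brevity):

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "dipta007/GanitLLM-4B_SFT_CGRPO"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)

problems = [
    "একটি দোকানে ১২টি আপেল আছে। যদি ৫টি আপেল বিক্রি হয়, তাহলে কতটি আপেল বাকি থাকবে?",
]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": f"Question: {p}"}],  # prepend the full system prompt in practice
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in problems
]

outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=2048))
for out in outputs:
    print(out.outputs[0].text)
```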

## Training

### Supervised Fine-Tuning (SFT)

```bash
PYTHONPATH=. uv run src/training/sft.py \
--model_name Qwen/Qwen3-4B \
--dataset_path data/sft/train.jsonl \
--output_dir checkpoints/sft
```
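
The file passed as `--dataset_path` is chat-formatted JSONL following the schema above. A hypothetical row, with contents illustrative only and not taken from the dataset:

```python
import json

# Hypothetical SFT row; field names follow the dataset schema above
row = {
    "problem": "একটি দোকানে ১২টি আপেল আছে। যদি ৫টি আপেল বিক্রি হয়, তাহলে কতটি আপেল বাকি থাকবে?",
    "bengali_solution": "৭",
    "english_solution": "7",
    "difficulty": "easy",
    "messages": [
        {"role": "user", "content": "Question: ..."},
        {"role": "assistant", "content": "<think> ১২ - ৫ = ৭ </think><answer> ৭ </answer>"},
    ],
}
print(json.dumps(row, ensure_ascii=False))
```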

### GRPO

```bash
PYTHONPATH=. uv run src/training/unsloth/grpo_lora.py \
--model_name checkpoints/sft \
--dataset_path data/rl/train.jsonl \
--output_dir checkpoints/grpo
```
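
Since GRPO training here is LoRA-based (`grpo_lora.py`), the adapter can be merged into the base model before serving. A sketch using `peft`, assuming `checkpoints/grpo` holds a LoRA adapter rather than merged weights:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="auto")
model = PeftModel.from_pretrained(base, "checkpoints/grpo")  # LoRA adapter directory

# Fold the LoRA weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("checkpoints/grpo_merged")
AutoTokenizer.from_pretrained("Qwen/Qwen3-4B").save_pretrained("checkpoints/grpo_merged")
```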

### Curriculum GRPO (CGRPO)

First, prepare curriculum-ordered data:

```bash
PYTHONPATH=. uv run src/data_gen/curriculum_shuffle.py \
--input data/rl/train.jsonl \
--output data/rl/train_curriculum.jsonl
```
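
Conceptually, this step orders the RL data from easy to olympiad. A minimal sketch of the idea; the actual `curriculum_shuffle.py` may also shuffle within tiers or smooth the schedule:

```python
import json

# Assumed difficulty ordering, easiest first
ORDER = {"easy": 0, "medium": 1, "hard": 2, "olympiad": 3}

with open("data/rl/train.jsonl") as f:
    rows = [json.loads(line) for line in f]

# Stable sort keeps the original order within each difficulty tier
rows.sort(key=lambda r: ORDER[r["difficulty"]])

with open("data/rl/train_curriculum.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```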

Then run GRPO on the curriculum-ordered data:

```bash
PYTHONPATH=. uv run src/training/unsloth/grpo_lora.py \
--model_name checkpoints/sft \
--dataset_path data/rl/train_curriculum.jsonl \
--output_dir checkpoints/cgrpo
```

## Evaluation

Evaluate on MGSM Bengali:

```bash
PYTHONPATH=. uv run src/evaluation/eval.py \
--model dipta007/GanitLLM-4B_SFT_CGRPO \
--dataset mgsm \
--output_dir results/
```
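
To inspect the benchmark itself, MGSM's Bengali split can be loaded from HuggingFace. This assumes the `juletxara/mgsm` mirror; the eval script may load it differently:

```python
from datasets import load_dataset

# Bengali subset of MGSM: grade-school problems translated from GSM8K
mgsm_bn = load_dataset("juletxara/mgsm", "bn", split="test")
print(mgsm_bn[0]["question"])
print(mgsm_bn[0]["answer_number"])
```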

## Data Processing Pipeline

To process your own Bengali math data:

```bash
# Step 1: Filter and validate Bengali content
PYTHONPATH=. uv run src/data_gen/0.filter.py --input raw_data.jsonl --output filtered.jsonl
# Step 2: Remove exact duplicates
PYTHONPATH=. uv run src/data_gen/1.deduplication.py --input filtered.jsonl --output deduped.jsonl
# Step 3: Remove near-duplicates using MinHash
PYTHONPATH=. uv run src/data_gen/2.minhash.py --input deduped.jsonl --output minhash.jsonl
# Step 4: Decontaminate against test sets
PYTHONPATH=. uv run src/data_gen/3.decontamination.py --input minhash.jsonl --output clean.jsonl
# Step 5: Tag difficulty using ensemble LLM evaluation
PYTHONPATH=. uv run src/data_gen/4.tag_difficulty.py --input clean.jsonl --output final.jsonl
```

Or run the full pipeline:

```bash
make process-dataset INPUT=raw_data.jsonl OUTPUT=final.jsonl
```
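
Step 3 removes near-duplicates with MinHash. A minimal sketch of the idea using the `datasketch` library (an assumption; the repository script may tokenize and hash differently):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

# Two near-identical problems; the threshold below is illustrative
problems = [
    "১২টি আপেল থেকে ৫টি বিক্রি হলে কতটি আপেল বাকি থাকে?",
    "১২টি আপেল থেকে ৫টি বিক্রি হলে কয়টি আপেল বাকি থাকে?",
]

lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for i, problem in enumerate(problems):
    m = minhash(problem)
    if not lsh.query(m):  # no near-duplicate indexed yet
        lsh.insert(str(i), m)
        kept.append(problem)

print(kept)  # the second, near-identical problem is dropped
```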

## Reward Design

Our GRPO training uses multi-component rewards:

- Format reward: validates the `<think>` and `<answer>` tag structure
- Bengali reasoning reward: ensures >80% Bengali text in the reasoning
- Answer correctness:
  - Bengali numeral match: +2.0
  - English numeral match: +1.0
- Length penalty: soft penalty for responses longer than 2000 tokens
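
As an illustration of the answer-correctness component, a hedged sketch (not the repository's reward code) that extracts the `<answer>` span and normalizes Bengali digits before matching:

```python
import re

# Map Bengali digits ০-৯ to ASCII 0-9
BN_TO_EN = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")

def extract_answer(text: str) -> str | None:
    """Pull the content of the <answer>...</answer> tag, if present."""
    m = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return m.group(1).strip() if m else None

def answer_reward(completion: str, bengali_gold: str, english_gold: str) -> float:
    """Toy scoring: +2.0 for a Bengali-numeral match, +1.0 for English, else 0."""
    pred = extract_answer(completion)
    if pred is None:
        return 0.0
    if pred == bengali_gold.strip():
        return 2.0
    if pred.translate(BN_TO_EN) == english_gold.strip():
        return 1.0
    return 0.0

print(answer_reward("<think>১২ - ৫ = ৭</think><answer>৭</answer>", "৭", "7"))  # 2.0
print(answer_reward("<think>12 - 5 = 7</think><answer>7</answer>", "৭", "7"))  # 1.0
```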

## Results

For results and more details, please refer to our paper.

## Citation

If you find this work useful, please cite our paper:

Will be updated.

## Contact

For questions or issues, please open a GitHub issue or contact the authors.