This repository contains the official code and resources for the research project, "Evaluating Small Language Models on the JEEBench Expert Benchmark". The project provides a fully automated framework to assess the expert-level reasoning capabilities of Small Language Models (SLMs) on a challenging subset of the JEEBench benchmark.
The deployment of Small Language Models (SLMs) in high-stakes educational applications is hampered by a gap between their purported capabilities and proven performance on expert-level reasoning tasks. This research addresses this challenge by evaluating seven state-of-the-art SLMs on a curated 120-problem subset of JEEBench, a definitive benchmark derived from India's highly competitive IIT JEE-Advanced examinations. Our fully automated framework assesses zero-shot, few-shot, and Chain-of-Thought (CoT) prompting across a diverse portfolio of SLMs (1.5B-8B parameters). A production-grade answer extraction pipeline enforces strict format compliance, measuring both correct reasoning and the ability to generate machine-parsable outputs.
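The answer-extraction step can be illustrated with a minimal sketch. The actual pipeline in `src/finalpy.py` is more elaborate; the `ANSWER:` marker and `\boxed{}` fallback shown here are assumptions about the expected output format, not the repository's exact patterns:

```python
import re

def extract_answer(response):
    """Pull a final answer from a model response.

    Looks for an explicit "ANSWER: ..." marker first, then falls back
    to the last \\boxed{...} expression. Returns None if neither is
    found, which counts as a format-compliance failure.
    """
    # Explicit marker, e.g. "ANSWER: (A)" or "ANSWER: 42.5"
    match = re.search(r"ANSWER:\s*(.+)", response, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    # Fallback: last LaTeX \boxed{...} in the response
    boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    return None
```

Returning `None` rather than guessing keeps reasoning accuracy and format compliance separable in the final metrics.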
- Mathematical Specialization is Decisive: The specialized Qwen2.5-Math-7B-Instruct model achieved 22.5% accuracy, outperforming general-purpose models by 40-92%.
- Parameter-Efficiency Challenges Scaling Laws: The 1.5B parameter Qwen2-1.5B-Instruct ranked second overall (16.1% accuracy), surpassing several larger 7-8B models.
- The "Anti-Prompting" Phenomenon: Meta-Llama-3-8B-Instruct suffered a catastrophic 55% performance collapse with few-shot prompting, exposing a critical deployment vulnerability.
- A Universal Computational Barrier: A systematic 93.3% failure rate on Numeric-type problems was observed across all models, revealing a fundamental architectural limitation in current transformers for tasks requiring high numerical fidelity.
The evaluation used a curated 120-problem subset of JEEBench, spanning Physics, Chemistry, and Mathematics. Seven state-of-the-art SLMs were evaluated using a fully automated pipeline that tested zero-shot, few-shot, and Chain-of-Thought prompting strategies. The pipeline included a strict answer-extraction module to measure both reasoning accuracy and format compliance, ensuring a rigorous, deployment-oriented assessment.
.
├── assets/
│ └── (Your images and visual assets go here)
├── data/
│ ├── dataset120best.json
│ └── few_shot_examples.json
├── src/
│ └── finalpy.py
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt
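Once cloned, the curated dataset can be inspected directly. A minimal loader follows; the `"subject"` field name is an assumption about the JSON schema in `dataset120best.json`, so adapt it if the actual keys differ:

```python
import json
from collections import Counter

def load_problems(path):
    """Load the JEEBench subset and tally problems per subject.

    Assumes the file is a JSON list of problem dicts, each carrying a
    "subject" field (Physics / Chemistry / Mathematics).
    """
    with open(path) as f:
        problems = json.load(f)
    by_subject = Counter(p.get("subject", "unknown") for p in problems)
    return problems, by_subject
```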
Below are detailed instructions for setting up and running the evaluation framework on different platforms.
Prerequisites: A Google account and access to Google Colab. A Colab Pro subscription is recommended for T4 GPU access.
Open Colab and Set Runtime:
- Go to colab.research.google.com.
- Click File > New notebook.
- Click Runtime > Change runtime type and select T4 GPU as the hardware accelerator.
Clone the Repository:
- In a code cell, clone the GitHub repository:
!git clone https://github.com/your-username/Evaluating-SLMs-on-JEEBench.git
%cd Evaluating-SLMs-on-JEEBench
Install Dependencies:
- Install the required Python packages from requirements.txt:
!pip install -r requirements.txt
Run the Evaluation:
Execute the main evaluation script. For example, to run the Mistral 7B model with all prompting methods on the first 10 problems:
!python src/finalpy.py \
--model_name "mistralai/Mistral-7B-Instruct-v0.3" \
--dataset "data/dataset120best.json" \
--method "all" \
--max_problems 10
Results and visualizations will be saved in the repository directory.
Prerequisites: A Kaggle account.
Create a New Kaggle Notebook:
- Go to Kaggle and click Create > New Notebook.
- In the right-hand settings panel, under Accelerator, select GPU T4 x2.
Clone the Repository:
- In a code cell, clone the repository:
!git clone https://github.com/your-username/Evaluating-SLMs-on-JEEBench.git
import os
os.chdir('Evaluating-SLMs-on-JEEBench')
Install Dependencies:
- Install the required packages:
!pip install -r requirements.txt
Run the Evaluation:
- Execute the main script. For example, to run the Qwen 1.5B model with the cot method on 15 problems:
!python src/finalpy.py \
--model_name "Qwen/Qwen2-1.5B-Instruct" \
--dataset "data/dataset120best.json" \
--method "cot" \
--max_problems 15
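The `cot` method wraps each problem in a Chain-of-Thought prompt before generation. A minimal sketch of the pattern follows; the exact instruction wording used by `src/finalpy.py` is an assumption:

```python
COT_SUFFIX = "Let's think step by step."

def build_cot_prompt(question):
    """Wrap a problem in a simple Chain-of-Thought prompt.

    Asks the model to reason aloud and then emit a machine-parsable
    final line, matching the format the extraction pipeline expects.
    """
    return (
        "Solve the following JEE-Advanced problem. Reason step by step, "
        "then give your final answer on a new line as 'ANSWER: <answer>'.\n\n"
        f"{question}\n\n{COT_SUFFIX}"
    )
```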
Prerequisites:
- Python 3.8 or higher.
- An NVIDIA GPU with at least 16GB VRAM and CUDA installed.
- Git installed on your system.
- Visual Studio Code with the Python extension.
Clone the Repository:
- Open a terminal or command prompt and run:
git clone https://github.com/your-username/Evaluating-SLMs-on-JEEBench.git
cd Evaluating-SLMs-on-JEEBench
Create a Virtual Environment:
- It is highly recommended to use a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
Install Dependencies:
- Install the project's dependencies:
pip install -r requirements.txt
Run the Evaluation in VS Code:
- Open the Evaluating-SLMs-on-JEEBench folder in VS Code.
- Open the integrated terminal (Ctrl+` on Windows/Linux, ⌃` on macOS).
- Run the evaluation script. To run an evaluation using the few-shot method, you must also provide the path to the few-shot dataset:
python src/finalpy.py \
--model_name "meta-llama/Meta-Llama-3-8B-Instruct" \
--dataset "data/dataset120best.json" \
--method "few_shot" \
--few_shot_examples "data/few_shot_examples.json" \
--max_problems 5
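The few-shot method prepends worked examples from `data/few_shot_examples.json` (loaded with `json.load`) to each target problem. A minimal sketch of the assembly follows; the `"question"` and `"solution"` field names are assumptions about that file's schema:

```python
def build_few_shot_prompt(examples, question, k=3):
    """Prepend k worked examples to the target question.

    Each example dict is assumed to carry "question" and "solution"
    fields; the target problem is left with an empty Solution slot
    for the model to complete.
    """
    blocks = [
        f"Problem: {ex['question']}\nSolution: {ex['solution']}"
        for ex in examples[:k]
    ]
    blocks.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(blocks)
```

Keeping `k` small matters for the 1.5B-8B models evaluated here, both for context-window limits and because, as the Llama 3 "anti-prompting" result shows, more examples are not always better.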