A comprehensive analysis of seven state-of-the-art SLMs on JEEBench, a rigorous benchmark for mathematical and scientific reasoning. This project explores the impact of zero-shot, few-shot, and Chain-of-Thought prompting on model performance in Physics, Chemistry, and Mathematics.

Evaluating Small Language Models on JEEBench

This repository contains the official code and resources for the research project, "Evaluating Small Language Models on the JEEBench Expert Benchmark". The project provides a fully automated framework to assess the expert-level reasoning capabilities of Small Language Models (SLMs) on a challenging subset of the JEEBench benchmark.

Abstract

The deployment of Small Language Models (SLMs) in high-stakes educational applications is hampered by a gap between their purported capabilities and proven performance on expert-level reasoning tasks. This research addresses this challenge by evaluating seven state-of-the-art SLMs on a curated 120-problem subset of JEEBench, a definitive benchmark derived from India's highly competitive IIT JEE-Advanced examinations. Our fully automated framework assesses zero-shot, few-shot, and Chain-of-Thought (CoT) prompting across a diverse portfolio of SLMs (1.5B-8B parameters). A production-grade answer extraction pipeline enforces strict format compliance, measuring both correct reasoning and the ability to generate machine-parsable outputs.

Key Findings

  • Mathematical Specialization is Decisive: The specialized Qwen2.5-Math-7B-Instruct model achieved 22.5% accuracy, a 40-92% relative improvement over the general-purpose models.
  • Parameter-Efficiency Challenges Scaling Laws: The 1.5B parameter Qwen2-1.5B-Instruct ranked second overall (16.1% accuracy), surpassing several larger 7-8B models.
  • The "Anti-Prompting" Phenomenon: Meta-Llama-3-8B-Instruct suffered a catastrophic 55% performance collapse with few-shot prompting, exposing a critical deployment vulnerability.
  • A Universal Computational Barrier: A systematic 93.3% failure rate on Numeric-type problems was observed across all models, revealing a fundamental architectural limitation in current transformers for tasks requiring high numerical fidelity.
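
The 40-92% figure is a relative improvement over each general-purpose model's own accuracy. Using only the two accuracies reported above, the low end of the range can be checked directly (the high end comes from weaker models not listed in this summary):

```python
# Relative improvement of the math-specialised model over the runner-up,
# using the two accuracies reported in the findings above.
specialised = 22.5  # Qwen2.5-Math-7B-Instruct
runner_up = 16.1    # Qwen2-1.5B-Instruct

relative_gain = (specialised / runner_up - 1) * 100
print(f"{relative_gain:.0f}%")  # ≈ 40%, the low end of the 40-92% range
```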

Methodology Overview

The evaluation used a curated 120-problem subset of JEEBench, spanning Physics, Chemistry, and Mathematics. Seven state-of-the-art SLMs were evaluated using a fully automated pipeline that tested zero-shot, few-shot, and Chain-of-Thought prompting strategies. The pipeline included a strict answer-extraction module to measure both reasoning accuracy and format compliance, ensuring a rigorous, deployment-oriented assessment.
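
The actual extraction logic lives in `src/finalpy.py`; the sketch below is only an illustration of what a strict, format-enforcing extractor can look like (the function name and the `Final Answer:` marker are assumptions, not taken from the repository):

```python
import re
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Return the final answer only if it appears in the required format
    (e.g. 'Final Answer: B' or 'Final Answer: 3.14'); otherwise return
    None so the response is scored as a format-compliance failure.
    The marker and accepted answer shapes here are illustrative."""
    pattern = r"Final Answer:\s*([A-D](?:\s*,\s*[A-D])*|-?\d+(?:\.\d+)?)\s*$"
    match = re.search(pattern, response.strip(), flags=re.IGNORECASE)
    return match.group(1) if match else None
```

Under a pipeline like this, a response that buries a correct answer in free text without the required marker scores zero, which is why the evaluation distinguishes reasoning accuracy from format compliance.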

Repository Structure

.
├── assets/
│   └── (Your images and visual assets go here)
├── data/
│   ├── dataset120best.json
│   └── few_shot_examples.json
├── src/
│   └── finalpy.py
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt

Setup and Usage

Below are detailed instructions for setting up and running the evaluation framework on different platforms.

1. Google Colab Notebooks

Prerequisites: A Google account and access to Google Colab. A Colab Pro subscription is recommended for T4 GPU access.

  1. Open Colab and Set Runtime:

    • Go to colab.research.google.com.
    • Click File > New notebook.
    • Click Runtime > Change runtime type and select T4 GPU as the hardware accelerator.
  2. Clone the Repository:

    • Clone the project and change into its directory:
      !git clone https://github.com/Abduhu1/Evaluating-SLMs-on-JEEBench.git
      %cd Evaluating-SLMs-on-JEEBench

  3. Install Dependencies:

    • Install the required Python packages from requirements.txt:
      !pip install -r requirements.txt
  4. Run the Evaluation:

    • Execute the main evaluation script. For example, to run the Mistral 7B model with all prompting methods on the first 10 problems:
      !python src/finalpy.py \
      --model_name "mistralai/Mistral-7B-Instruct-v0.3" \
      --dataset "data/dataset120best.json" \
      --method "all" \
      --max_problems 10

    • Results and visualizations will be saved in the repository directory.

2. Kaggle Notebooks

Prerequisites: A Kaggle account.

  1. Create a New Kaggle Notebook:

    • Go to Kaggle and click Create > New Notebook.
    • In the right-hand settings panel, under Accelerator, select GPU T4 x2.
  2. Clone the Repository:

    • Clone the project and change into its directory:
      !git clone https://github.com/Abduhu1/Evaluating-SLMs-on-JEEBench.git
      %cd Evaluating-SLMs-on-JEEBench

  3. Install Dependencies:

    • Install the required packages:
      !pip install -r requirements.txt
  4. Run the Evaluation:

    • Execute the main script. For example, to run the Qwen 1.5B model with the cot method on 15 problems:
      !python src/finalpy.py \
      --model_name "Qwen/Qwen2-1.5B-Instruct" \
      --dataset "data/dataset120best.json" \
      --method "cot" \
      --max_problems 15

3. Locally with VS Code

Prerequisites:

  • Python 3.8 or higher.
  • An NVIDIA GPU with at least 16GB VRAM and CUDA installed.
  • Git installed on your system.
  • Visual Studio Code with the Python extension.
  1. Clone the Repository:

    • Clone the project and change into its directory:
      git clone https://github.com/Abduhu1/Evaluating-SLMs-on-JEEBench.git
      cd Evaluating-SLMs-on-JEEBench

  2. Create a Virtual Environment:

    • It is highly recommended to use a virtual environment:
      python3 -m venv venv
      source venv/bin/activate # On Windows, use `venv\Scripts\activate`
  3. Install Dependencies:

    • Install the project's dependencies:
      pip install -r requirements.txt
  4. Run the Evaluation in VS Code:

    • Open the Evaluating-SLMs-on-JEEBench folder in VS Code.
    • Open the integrated terminal (Ctrl+`).
    • Run the evaluation script. To run an evaluation using the few-shot method, you must also provide the path to the few-shot dataset:
      python src/finalpy.py \
      --model_name "meta-llama/Meta-Llama-3-8B-Instruct" \
      --dataset "data/dataset120best.json" \
      --method "few_shot" \
      --few_shot_examples "data/few_shot_examples.json" \
      --max_problems 5
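
The few_shot method needs the extra --few_shot_examples file because worked examples are prepended to the target question before it is sent to the model. The real prompt template is defined in `src/finalpy.py`; the sketch below only illustrates the idea, and the `question`/`answer` field names are assumptions about the JSON layout:

```python
from typing import Dict, List

def build_few_shot_prompt(examples: List[Dict[str, str]], question: str) -> str:
    """Prepend worked examples so the model can imitate the demonstrated
    answer format when it reaches the new question."""
    parts = [
        f"Question: {ex['question']}\nFinal Answer: {ex['answer']}\n"
        for ex in examples
    ]
    parts.append(f"Question: {question}\nFinal Answer:")
    return "\n".join(parts)
```

This construction is also one plausible explanation for the "anti-prompting" collapse reported above: a model that over-imitates the terse demonstrated format may truncate the reasoning it actually needs.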
