A comprehensive analysis of seven state-of-the-art SLMs on JEEBench, a rigorous benchmark for mathematical and scientific reasoning. This project explores the impact of zero-shot, few-shot, and Chain-of-Thought prompting on model performance in Physics, Chemistry, and Mathematics.

Evaluating Small Language Models on JEEBench

This repository contains the official code and resources for the research project, "Evaluating Small Language Models on the JEEBench Expert Benchmark". The project provides a fully automated framework to assess the expert-level reasoning capabilities of Small Language Models (SLMs) on a challenging subset of the JEEBench benchmark.

Abstract

The deployment of Small Language Models (SLMs) in high-stakes educational applications is hampered by a gap between their purported capabilities and proven performance on expert-level reasoning tasks. This research addresses this challenge by evaluating seven state-of-the-art SLMs on a curated 120-problem subset of JEEBench, a definitive benchmark derived from India's highly competitive IIT JEE-Advanced examinations. Our fully automated framework assesses zero-shot, few-shot, and Chain-of-Thought (CoT) prompting across a diverse portfolio of SLMs (1.5B-8B parameters). A production-grade answer extraction pipeline enforces strict format compliance, measuring both correct reasoning and the ability to generate machine-parsable outputs.

Key Findings

  • Mathematical Specialization is Decisive: The specialized Qwen2.5-Math-7B-Instruct model achieved 22.5% accuracy, a 40-92% relative improvement over the general-purpose models.
  • Parameter-Efficiency Challenges Scaling Laws: The 1.5B parameter Qwen2-1.5B-Instruct ranked second overall (16.1% accuracy), surpassing several larger 7-8B models.
  • The "Anti-Prompting" Phenomenon: Meta-Llama-3-8B-Instruct suffered a catastrophic 55% performance collapse with few-shot prompting, exposing a critical deployment vulnerability.
  • A Universal Computational Barrier: A systematic 93.3% failure rate on Numeric-type problems was observed across all models, revealing a fundamental architectural limitation in current transformers for tasks requiring high numerical fidelity.
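
The 40-92% figure is a relative improvement over each general-purpose model's own accuracy. Using only the two accuracies reported above, the low end of the range can be checked directly (the high end comes from weaker models not listed in this summary):

```python
# Relative improvement of the math-specialised model over the runner-up,
# using the two accuracies reported in the findings above.
specialised = 22.5  # Qwen2.5-Math-7B-Instruct
runner_up = 16.1    # Qwen2-1.5B-Instruct

relative_gain = (specialised / runner_up - 1) * 100
print(f"{relative_gain:.0f}%")  # ≈ 40%, the low end of the 40-92% range
```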

Methodology Overview

The evaluation used a curated 120-problem subset of JEEBench, spanning Physics, Chemistry, and Mathematics. Seven state-of-the-art SLMs were evaluated using a fully automated pipeline that tested zero-shot, few-shot, and Chain-of-Thought prompting strategies. The pipeline included a strict answer-extraction module to measure both reasoning accuracy and format compliance, ensuring a rigorous, deployment-oriented assessment.
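
The actual extraction logic lives in `src/finalpy.py`; the sketch below is only an illustration of what a strict, format-enforcing extractor can look like (the function name and the `Final Answer:` marker are assumptions, not taken from the repository):

```python
import re
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Return the final answer only if it appears in the required format
    (e.g. 'Final Answer: B' or 'Final Answer: 3.14'); otherwise return
    None so the response is scored as a format-compliance failure.
    The marker and accepted answer shapes here are illustrative."""
    pattern = r"Final Answer:\s*([A-D](?:\s*,\s*[A-D])*|-?\d+(?:\.\d+)?)\s*$"
    match = re.search(pattern, response.strip(), flags=re.IGNORECASE)
    return match.group(1) if match else None
```

Under a pipeline like this, a response that buries a correct answer in free text without the required marker scores zero, which is why the evaluation distinguishes reasoning accuracy from format compliance.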

Repository Structure

.
├── assets/
│   └── (Your images and visual assets go here)
├── data/
│   ├── dataset120best.json
│   └── few_shot_examples.json
├── src/
│   └── finalpy.py
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt

Setup and Usage

Below are detailed instructions for setting up and running the evaluation framework on different platforms.

1. Google Colab Notebooks

Prerequisites: A Google account and access to Google Colab. A Colab Pro subscription is recommended for T4 GPU access.

  1. Open Colab and Set Runtime:

    • Go to colab.research.google.com.
    • Click File > New notebook.
    • Click Runtime > Change runtime type and select T4 GPU as the hardware accelerator.
  2. Clone the Repository:

    • Clone the project and change into its directory:
      !git clone https://github.com/Abduhu1/Evaluating-SLMs-on-JEEBench.git
      %cd Evaluating-SLMs-on-JEEBench

  3. Install Dependencies:

    • Install the required Python packages from requirements.txt:
      !pip install -r requirements.txt
  4. Run the Evaluation:

    • Execute the main evaluation script. For example, to run the Mistral 7B model with all prompting methods on the first 10 problems:
      !python src/finalpy.py \
      --model_name "mistralai/Mistral-7B-Instruct-v0.3" \
      --dataset "data/dataset120best.json" \
      --method "all" \
      --max_problems 10

    • Results and visualizations will be saved in the repository directory.

2. Kaggle Notebooks

Prerequisites: A Kaggle account.

  1. Create a New Kaggle Notebook:

    • Go to Kaggle and click Create > New Notebook.
    • In the right-hand settings panel, under Accelerator, select GPU T4 x2.
  2. Clone the Repository:

    • Clone the project and change into its directory:
      !git clone https://github.com/Abduhu1/Evaluating-SLMs-on-JEEBench.git
      %cd Evaluating-SLMs-on-JEEBench

  3. Install Dependencies:

    • Install the required packages:
      !pip install -r requirements.txt
  4. Run the Evaluation:

    • Execute the main script. For example, to run the Qwen 1.5B model with the cot method on 15 problems:
      !python src/finalpy.py \
      --model_name "Qwen/Qwen2-1.5B-Instruct" \
      --dataset "data/dataset120best.json" \
      --method "cot" \
      --max_problems 15

3. Locally with VS Code

Prerequisites:

  • Python 3.8 or higher.
  • An NVIDIA GPU with at least 16GB VRAM and CUDA installed.
  • Git installed on your system.
  • Visual Studio Code with the Python extension.
  1. Clone the Repository:

    • Clone the project and change into its directory:
      git clone https://github.com/Abduhu1/Evaluating-SLMs-on-JEEBench.git
      cd Evaluating-SLMs-on-JEEBench

  2. Create a Virtual Environment:

    • It is highly recommended to use a virtual environment:
      python3 -m venv venv
      source venv/bin/activate # On Windows, use `venv\Scripts\activate`
  3. Install Dependencies:

    • Install the project's dependencies:
      pip install -r requirements.txt
  4. Run the Evaluation in VS Code:

    • Open the Evaluating-SLMs-on-JEEBench folder in VS Code.
    • Open the integrated terminal (Ctrl+`).
    • Run the evaluation script. To run an evaluation using the few-shot method, you must also provide the path to the few-shot dataset:
      python src/finalpy.py \
      --model_name "meta-llama/Meta-Llama-3-8B-Instruct" \
      --dataset "data/dataset120best.json" \
      --method "few_shot" \
      --few_shot_examples "data/few_shot_examples.json" \
      --max_problems 5
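
The few_shot method needs the extra --few_shot_examples file because worked examples are prepended to the target question before it is sent to the model. The real prompt template is defined in `src/finalpy.py`; the sketch below only illustrates the idea, and the `question`/`answer` field names are assumptions about the JSON layout:

```python
from typing import Dict, List

def build_few_shot_prompt(examples: List[Dict[str, str]], question: str) -> str:
    """Prepend worked examples so the model can imitate the demonstrated
    answer format when it reaches the new question."""
    parts = [
        f"Question: {ex['question']}\nFinal Answer: {ex['answer']}\n"
        for ex in examples
    ]
    parts.append(f"Question: {question}\nFinal Answer:")
    return "\n".join(parts)
```

This construction is also one plausible explanation for the "anti-prompting" collapse reported above: a model that over-imitates the terse demonstrated format may truncate the reasoning it actually needs.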
