This repository contains the data and analyses for the research paper "Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design" presented at AIED 2025.
Short answer questions (SAQs) are useful in educational testing but can be resource-intensive to grade at scale. This study explores the performance of several off-the-shelf Large Language Models (LLMs) at automatically scoring responses to High School Math and English SAQs using three different rubric types. Results show that LLM performance improves with more detailed rubrics, and performance varies significantly by item characteristics.
Research Questions:
- RQ1: How much item information is needed in the prompt to produce reliable LLM ratings?
- RQ2: How well do various LLMs score SAQs?
- RQ3: What are the features of SAQs with the best and worst LLM performance?
Dataset:
- 20 Short Answer Questions: 10 High School English Language Arts (ELA) and 10 High School Algebra I (Math)
- 800 Student Responses: 40 responses per item (20 correct, 20 incorrect)
- Human Scoring: 3 human judges scored each response using full rubrics
- LLM Scoring: 15 different LLMs scored responses using 3 rubric types
Original Study (11 models):
- Small: Claude 3.5 Haiku, Gemini 1.5 Flash, Gemini 1.5 Flash 8b, GPT-4o mini, Llama 3.1 70b, Llama 3.1 8b
- Large: Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o, Llama 3.1 405b
- Frontier: OpenAI o1
Extended Analysis (4 additional models):
- Claude 4.0 Sonnet, Gemini 2.5 Pro, OpenAI o3, OpenAI o4 Mini
Rubric Types:
- Empty: Question only, no scoring criteria
- Criteria Only: Question + scoring criteria
- Full: Question + scoring criteria + example correct/incorrect answers
Repository Structure:
```
llm-saq-scoring/
├── data/
│   ├── raw/                            # Original datasets
│   │   ├── human_labels.csv            # Human scorer ratings (800 × 8)
│   │   ├── items.csv                   # Item metadata and characteristics (20 × 11)
│   │   ├── llm_labels.csv              # LLM scores - original models (26,400 × 11)
│   │   └── llm_labels_new_models.csv   # LLM scores - newer models (9,600 × 11)
│   └── processed/                      # Analysis-ready datasets
│       ├── dat.csv                     # Main analysis dataset
│       └── dat_new.csv                 # Dataset with newer models
├── analysis/
│   ├── 01-analysis.qmd                 # Primary analysis (original models)
│   └── 02-analysis-new-models.qmd      # Extended analysis (newer models)
├── functions/
│   └── functions.R                     # Reusable analysis functions
├── output/
│   ├── tables/                         # Results in tabular format
│   ├── results/                        # Statistical test outputs
│   └── visualizations/                 # Plots and figures
└── docs/                               # Documentation
```
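As a quick sanity check, the raw files can be loaded and compared against the dimensions noted in the tree above. This is a minimal sketch, not part of the repository's analysis code, and assumes the working directory is the repository root:

```r
# Minimal sketch: load the raw datasets and confirm the dimensions listed above.
library(readr)

human_labels <- read_csv("data/raw/human_labels.csv")           # 800 x 8
items        <- read_csv("data/raw/items.csv")                  # 20 x 11
llm_labels   <- read_csv("data/raw/llm_labels.csv")             # 26,400 x 11
llm_new      <- read_csv("data/raw/llm_labels_new_models.csv")  # 9,600 x 11

stopifnot(
  nrow(human_labels) == 800,
  nrow(items) == 20,
  nrow(llm_labels) == 26400,
  nrow(llm_new) == 9600
)
```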
Key Findings:
- Off-the-shelf LLMs can score SAQs on par with human scorers, demonstrating their viability for automated scoring applications
- Including scoring criteria improved inter-rater reliability and agreement with human ratings, but for some items and models performance was very good even without scoring criteria or examples
- Including examples did not always improve inter-rater reliability or agreement beyond providing criteria alone; the benefit appears to depend on item characteristics
- Larger models tend to outperform smaller models, but some small models (e.g., Gemini 1.5 Flash) still achieved acceptable performance
- LLMs scored items with shorter expected responses and narrower answer scope more accurately; performance declined for items requiring longer responses with broader scope
- Using LLMs to score SAQs during item development could help identify scoring inconsistencies, potentially leading to clearer and more robust rubrics
- LLM inter-rater reliability was often higher than that of the human judges (Fleiss' κ = 0.881 for humans), suggesting highly consistent LLM scoring behavior (an illustrative κ calculation follows this list)
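As an illustration, the human judges' Fleiss' κ could be reproduced with the `irr` package along these lines. The judge column names used here (`human_1`, `human_2`, `human_3`) are hypothetical; consult `docs/DATA_DICTIONARY.txt` for the actual variable names.

```r
# Illustrative sketch only: Fleiss' kappa across the three human judges.
# The judge column names (human_1, human_2, human_3) are hypothetical.
library(tidyverse)
library(irr)

human_labels <- read_csv("data/raw/human_labels.csv")

human_labels %>%
  select(human_1, human_2, human_3) %>%  # one column per human judge
  as.matrix() %>%
  kappam.fleiss()                        # the paper reports kappa = 0.881 for humans
```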
Requirements:
- R (version 4.0+)
- Required R packages: yardstick, broom, emmeans, report, psych, irr, tidyverse, janitor, ggthemes (see the install snippet below)
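To install the packages listed above in one step (a convenience snippet, not part of the repository's scripts):

```r
# Install the R packages used by the analysis notebooks
install.packages(c(
  "yardstick", "broom", "emmeans", "report", "psych",
  "irr", "tidyverse", "janitor", "ggthemes"
))
```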
To reproduce the analyses:

- Clone the repository

  ```bash
  git clone https://github.com/Khan/llm-saq-scoring.git
  cd llm-saq-scoring
  ```

- Run the analyses (in R)

  ```r
  # Original model analysis
  quarto::quarto_render("analysis/01-analysis.qmd")

  # Extended analysis with newer models
  quarto::quarto_render("analysis/02-analysis-new-models.qmd")
  ```

All outputs (tables, results, visualizations) will be generated in the `output/` directory.
Detailed variable descriptions are available in `docs/DATA_DICTIONARY.txt`. Key variables include the following (a brief agreement-check sketch follows the list):
- Response scoring: Binary correct/incorrect ratings from humans and LLMs
- Item characteristics: Response length, cognitive demand, answer scope, scoring subjectivity
- Model information: Model name, size classification, rubric type used
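As an illustration of how these variables support the analyses, the sketch below compares LLM scores against human ratings with `yardstick`. The column names (`human_score`, `llm_score`) are hypothetical; consult `docs/DATA_DICTIONARY.txt` for the actual variable names.

```r
# Illustrative sketch only: agreement between human and LLM binary scores
# in the processed dataset. Column names (human_score, llm_score) are
# hypothetical; see docs/DATA_DICTIONARY.txt for the actual variables.
library(tidyverse)
library(yardstick)

dat <- read_csv("data/processed/dat.csv") %>%
  mutate(
    human_score = factor(human_score, levels = c(0, 1)),
    llm_score   = factor(llm_score, levels = c(0, 1))
  )

accuracy(dat, truth = human_score, estimate = llm_score)  # proportion agreement
kap(dat, truth = human_score, estimate = llm_score)       # Cohen's kappa
```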
If you use this dataset or findings in your research, please cite:
Frohn, S., Burleigh, T., & Chen, J. (2025). Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design. In Proceedings of the 26th International Conference on Artificial Intelligence in Education (AIED 2025).
- Code and Analysis: MIT License
- Data and Assessment Items: CC BY 4.0 License
Contact:
- Scott Frohn - [email protected]
- Tyler Burleigh - [email protected]
- Jing Chen - [email protected]
This research was conducted at Khan Academy. We thank the subject matter experts who created the assessment items and the human judges who provided scoring.
For detailed methodology and complete results, please refer to the published paper and the analysis notebooks in this repository.