
ai-autograding-feedback

Overview

This program is part of an exploratory project to evaluate the quality of LLM-generated feedback in assisting with assignment grading and enhancing student learning. Based on the provided arguments, the program processes the code sections, text sections, or images of a student's submission to a programming assignment, and writes its output to a Markdown file or to standard output.

The large language models used and implementation logic vary depending on whether the selected scope is 'image', 'code' or 'text'.

For the code scope, the program takes three files:

  • An assignment's solution file
  • A student's submission file
  • A test file

For the text scope, the program takes two files:

  • An assignment's solution file
  • A student's submission file

For the image scope, the program takes up to two files, depending on the prompt used:

  • A student's submission file
  • (Optional) An assignment's solution file

Features

  • Handles image, text and code scopes.
  • Reads pre-defined prompts specified in JSON files.
  • Uses an argument parser for structured command-line input.
  • Supports various Large Language Models to evaluate student assignment submissions.
  • Saves response output in Markdown format with a predefined template or prints to stdout.

Argument Details

| Argument | Description | Required |
|---|---|---|
| --submission_type | Type of submission (from arg_options.FileType) | ❌ |
| --prompt | Pre-defined prompt name or file path to a custom prompt file | ❌ ** |
| --prompt_text | String prompt | ❌ ** |
| --scope | Processing scope (image, code, or text) | ✅ |
| --submission | Submission file path | ✅ |
| --question | Specific question to evaluate | ❌ |
| --model | Model type (from arg_options.Models) | ✅ |
| --output | File path for where to record the output | ❌ |
| --solution | File path for the solution file | ❌ |
| --test_output | File path for the file containing the results from tests | ❌ |
| --submission_image | File path for the submission image file | ❌ |
| --solution_image | File path for the solution image file | ❌ |
| --system_prompt | Pre-defined system prompt name or file path to a custom system prompt | ❌ |
| --llama_mode | How to invoke deepSeek-v3 (choices in arg_options.LlamaMode) | ❌ |
| --output_template | Output template file (from arg_options.OutputTemplate) | ❌ |
| --json_schema | File path to a JSON file containing the schema for structured output | ❌ |
| --marking_instructions | File path to marking instructions/rubric | ❌ |
| --model_options | Comma-separated key-value pairs of model options and their values | ❌ |
** At least one of --prompt or --prompt_text must be provided. If both are provided, --prompt_text is appended to the contents of the file specified by --prompt.
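
For example, the following command appends an extra instruction to the pre-defined code_hint prompt (using the bfs example files from the test_submissions directory):

python -m ai_feedback --prompt code_hint \
  --prompt_text "Also point out any off-by-one errors." \
  --scope code \
  --submission test_submissions/bfs_example/bfs_submission.py \
  --solution test_submissions/bfs_example/bfs_solution.py \
  --model openai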

Scope

The program supports three scopes: code, text, and image. Depending on which is selected, the program supports different models and prompts tailored to that option.

If the "code" scope is selected, the program will identify student errors in the code sections of the assignment, comparing them to the solution code. Additionally, if the --scope code option is chosen, the --question option can also be specified to analyze the code for a particular question rather than the entire file. Currently, you can specify a question number if the file type is jupyter notebook. In order to use the --question option, the question code in both the solution and submission file must be delimited by '## Task {#}'. See the File Formatting Assumptions section.

If the "text" scope is selected, the program will identify student errors in the written responses of the assignment, comparing them to the solution's rubric for written responses. If the 'text' scope is chosen, then 'pdf' must be chosen for the submission type.

If the "image" scope is selected, the program will identify issues in submission images, optionally comparing them to reference solutions. Question numbers can be specified by adding the tag markus_question_name: <question name> to the metadata for the code cell that generates the submission image. The previous cell's markdown content will be used as the question's context.

Submission Type

The program automatically detects submission type based on file extensions in the assignment directory:

  • Files ending with _submission.ipynb → jupyter notebook
  • Files ending with _submission.py → python file
  • Files ending with _submission.pdf → PDF document

The user can also explicitly specify the submission type using the --submission_type argument if auto-detection is not suitable.

Currently, Jupyter notebook, PDF, and Python assignments are supported.
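
For example, to evaluate a file that does not follow the _submission.<ext> naming convention, the type can be forced explicitly (hypothetical file path; python is the same type value used in the --model_options example later in this README):

python -m ai_feedback --prompt code_hint --scope code \
  --submission path/to/main.py --submission_type python --model openai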

Prompts

The --prompt argument accepts either pre-defined prompt names or custom file paths:

Pre-defined Prompts

To use pre-defined prompts, specify the prompt name (without extension). Pre-defined prompts are stored as markdown (.md) files in the ai_feedback/data/prompts/user/ directory.

Custom Prompt Files

To use custom prompt files, specify the file path to your custom prompt. The file should be a markdown (.md) file.

Prompt files can contain template placeholders with the following structure:

Consider this question:
{context}

{submission_image}

Do the graphs in the attached image solve the problem? Do not include an example solution.


Prompt Naming Conventions:

  • Prompts used when --scope code is selected follow the naming pattern code_{}.md
  • Prompts used when --scope image is selected follow the naming pattern image_{}.md
  • Prompts used when --scope text is selected follow the naming pattern text_{}.md

Scope validation (prefix matching) only applies to pre-defined prompts. Custom prompt files can be used with any scope.

All prompts are treated as templates that can contain special placeholder blocks. The following template placeholders are automatically replaced:

  • {context} - Question context
  • {file_references} - List of files being analyzed with descriptions
  • {file_contents} - Full contents of files with line numbers
  • {submission_image} - Student submission image
  • {solution_image} - Reference solution image
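
For instance, a custom code-scope prompt might combine several of these placeholders (an illustrative sketch, not one of the shipped prompts):

Identify the errors in the student's code below.

{file_references}

{file_contents}

Refer to specific line numbers from the file contents in your feedback.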

Code Scope Prompts

| Prompt Name | Description |
|---|---|
| code_explanation.md | Outputs a paragraph explanation of errors. |
| code_hint.md | Outputs short hints on what the errors are. |
| code_lines.md | Outputs only the code lines where the errors occur. |
| code_table.md | Outputs a table showing the question requirement, the student's attempt, and the potential issue. |
| code_template.md | Outputs a specified template format that includes error type, description, and solution. |
| code_annotation.md | Outputs a JSON list of annotation objects used to display student errors on MarkUs. Intended for MarkUs integration usage. |

Image Scope Prompts

| Prompt Name | Description |
|---|---|
| image_analyze.md | Outputs whether the submission image answers the question provided by the context. |
| image_analyze_annotations.md | Outputs whether the submission image answers the question provided by the context, as a list of JSON objects, each with a description of the issue and a location on the image. Intended for MarkUs integration usage. |
| image_compare.md | Outputs a table comparing style elements between the submission and solution graphs. |
| image_style.md | Outputs a table checking the style elements in a submission graph. |
| image_style_annotations.md | Outputs evaluations of style elements in a submission graph as a list of JSON objects, each with a description of the issue and a location on the image. Intended for MarkUs integration usage. |

Text Scope Prompts

| Prompt Name | Description |
|---|---|
| text_pdf_analyze.md | Outputs whether the submission's written response matches all the criteria specified in the solution. |

Prompt Text

Additionally, the user can pass a string through the --prompt_text argument. If --prompt is also provided, the string is appended to the contents of that prompt; otherwise it is used as the only prompt.

System Prompts

The --system_prompt argument accepts either pre-defined system prompt names or custom file paths:

Pre-defined System Prompts

To use pre-defined system prompts, specify the system prompt name (without extension). Pre-defined system prompts are stored as markdown (.md) files in the ai_feedback/data/prompts/system/ directory.

Custom System Prompt Files

To use custom system prompt files, specify the file path to your custom system prompt. The file should be a markdown (.md) file.

System prompts define the AI model's behavior, tone, and approach to providing feedback. They are used to set the context and personality of the AI assistant.
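
For example, a custom system prompt file might look like the following (an illustrative sketch, not one of the shipped system prompts):

You are a teaching assistant for an introductory programming course.
Give concise, encouraging feedback. Do not reveal the full solution;
instead, point the student toward the concepts they should review.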

Marking Instructions

The --marking_instructions argument accepts a file path to a text file containing rubric or marking instructions. If the prompt template contains a {marking_instructions} placeholder, the contents of the file will be inserted at that location in the prompt.
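
For example, given a hypothetical rubric file rubric.txt and a custom prompt file containing:

Grade the submission against the following rubric:

{marking_instructions}

the rubric text replaces the placeholder when running (hypothetical file paths):

python -m ai_feedback --prompt path/to/custom_prompt.md --scope text \
  --submission student_submission.pdf --marking_instructions rubric.txt --model openai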

Models

The models used can be seen under the ai_feedback/models folder.

OpenAI Vector Store

  • Model Name: gpt-4-turbo
  • System Prompt: The model's behaviour is set with the INSTRUCTIONS prompt from helpers/constants.py.
  • Features:
    • Assistant: Uses the OpenAI Assistants beta feature, allowing a customized model for specific tasks.
    • Vector Store: The model creates and manages a vector store for data retrieval.
    • Tools Used: Supports file_search for retrieving information from uploaded files.
    • Cleanup: Uploaded files and models are deleted after processing, in order to manage API resources.
  • OpenAI Assistants Documentation

OpenAI

  • Uses the same model as above but doesn't use the vector store functionality. Uploads files as part of the prompt.

Note: If you wish to use OpenAI models, you must specify your API key in a .env file. Create a .env file in your project directory and add your API key:

OPENAI_API_KEY=your_api_key_here

Claude

  • Model Name: claude-3.7-sonnet
  • System Prompt: The model's behaviour is set with the INSTRUCTIONS prompt from helpers/constants.py.
  • Claude Documentation

Note: If you wish to use the Claude model, you must specify your API key in a .env file. Create a .env file in your project directory and add your API key:

CLAUDE_API_KEY=your_api_key_here

Ollama

Various models were also tested by running them locally on the Teach CS Bigmouth server using Ollama. The models used to test the project (as shown in the Example Commands section) include:

Code Scope

  • deepSeek-R1:70B
  • codellama:latest

Image Scope

  • llama3.2-vision:90b

Output Structure

  • When --output filepath is given, the script will:
  1. Load the template for the output based on the --output_template (Options defined in ai_feedback/helpers/arg_options.OutputTemplate)
  2. Format it with the provided arguments and processing results.
  3. Save it under filepath
  • When the --output argument is not given, the prompt used and generated response will be sent to stdout in the format selected by --output_template.
  • When the --output_template argument is not given, it defaults to response_only, which outputs only the model's response.

Question Extraction

Use --question "<section name>" to extract a specific section from a submission. Behavior depends on file type.

Supported inputs

  • PDF: looks up section names in the PDF’s Table of Contents (TOC).

  • Text / Markdown / Code (.txt, .md, .ipynb, .qmd): looks for Markdown-style headings (#, ##, ###, …). For code, write the heading in a comment line (e.g., ### Question 1 in Python). The extractor returns all content that belongs to that heading (up until the next heading at the same or higher level).

Matching is case-insensitive and normalizes smart quotes, dashes, and extra whitespace.
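
For example, given a Python submission containing heading comments (hypothetical file contents):

### Question 1
def bfs(graph, start):
    ...

### Question 2
def dfs(graph, start):
    ...

running with --question "question 1" extracts everything from the ### Question 1 heading up to (but not including) ### Question 2, since matching is case-insensitive.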

Test Files

  • Any subdirectory of /test_submissions can be run locally. More examples can be added to this directory in a similar fashion.

GGR274 Test File Assumptions

Code Scope

To test the program using the GGR274 files, we assume that the test assignment files follow a specific directory structure. Currently, this program has been tested using Homework 5 of the GGR274 class at the University of Toronto.

Directory Structure

Within the test_submissions/ggr274_homework5 directory, mock submissions are contained in separate subdirectories, test_submissions/ggr274_homework5/test#. The following naming convention is used for the files:

  • Homework_5_solution.ipynb – Instructor-provided solution file
  • student_submission.ipynb – Student's submission file
  • test#_error_output.txt – Error trace file for the corresponding test case

Each test folder contains variations of student_submission.ipynb with different errors.

File Formatting Assumptions

To ensure proper extraction and evaluation of student responses, the following format is assumed for Homework_5_solution.ipynb and student_submission.ipynb:

  • Each task must be clearly delimited using markdown headers in the format:

    ## Task {#}

    This allows the program to isolate specific questions when using the --question argument, ensuring the model only evaluates errors related to the specified question.

  • Each file must start with:

    ## Introduction

    This section serves as the general assignment instructions and is not included in error evaluation.
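
Putting these assumptions together, the markdown cells of a conforming notebook are laid out roughly as follows:

## Introduction
(general assignment instructions)

## Task 1
(question 1 instructions, followed by the student's answer cells)

## Task 2
(question 2 instructions, followed by the student's answer cells)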

Image Scope

Test Files

Mock student submissions are stored in ggr274_homework5/image_test#. The following naming convention is used for the files:

  • solution.ipynb – Instructor-provided solution file
  • student_submission.ipynb – Student's submission file
Notebook Preprocessing

To grade a specific question using the --question argument, add the tag markus_question_name: <question name> to the metadata for the code cell that generates an image to be graded. The previous cell's markdown content will be used as the question's context.
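
For example, in the raw .ipynb JSON, the metadata of the image-producing code cell might look roughly like this (hypothetical cell contents; the exact placement of the tag within the metadata may differ):

{
  "cell_type": "code",
  "metadata": {
    "markus_question_name": "Question 5b"
  },
  "source": ["df.plot(kind='bar')"]
}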

Package Usage

In order to run this package locally:

Ensure you have the environment variables set up (see Models section above).

From a terminal at the repository root, run:

pip install -e .

Run the program:

python -m ai_feedback \
  --submission_type <file_type> \
  --prompt <prompt_name> \
  --scope <image|code|text> \
  --submission <submission_file_path> \
  --solution <solution_file_path> \
  --test_output <test_output_path> \
  --submission_image <image_file_path> \
  --solution_image <image_file_path> \
  --question <question_number> \
  --model <model_name> \
  --output <output_file_path> \
  --output_template <file_name> \
  --system_prompt <prompt_file_path> \
  --llama_mode <server|cli>
  • See the Argument Details section for the different command line argument options, or run this command to see help messages and available choices:
python -m ai_feedback -h

Example Commands

Evaluate the cnn_example test using the OpenAI model

python -m ai_feedback --prompt code_lines --scope code --submission test_submissions/cnn_example/cnn_submission.py --solution test_submissions/cnn_example/cnn_solution.py --model openai

Evaluate the cnn_example test using the OpenAI model and a custom prompt

python -m ai_feedback --prompt_text "Evaluate the student's code readability." --scope code --submission test_submissions/cnn_example/cnn_submission.py --model openai

Evaluate the pdf_example test using the OpenAI model

python -m ai_feedback --prompt text_pdf_analyze --scope text --submission test_submissions/pdf_example/student_pdf_submission.pdf --model openai

Evaluate question 1 of test1 of the ggr274 homework using the DeepSeek model

python -m ai_feedback --prompt code_table \
  --scope code --submission test_submissions/ggr274_homework5/test1/student_submission.ipynb --question 1 --model deepSeek-R1:70B

Evaluate the image for question 5b of ggr274 homework with Llama3.2-vision

python -m ai_feedback --prompt image_analyze --scope image --solution ./test_submissions/ggr274_homework5/image_test2/student_submission.ipynb --submission_image test_submissions/ggr274_homework5/image_test2/student_submission.png --question "Question 5b" --model llama3.2-vision:90b

Evaluate the bfs example with the remote model, writing output to test_file using the verbose template

python -m ai_feedback --prompt code_lines --scope code --solution ./test_submissions/bfs_example/bfs_solution.py --submission test_submissions/bfs_example/bfs_submission.py --model remote --output test_file --output_template verbose

Evaluate the Jupyter notebook of test1 of ggr274 using DeepSeek-v3 via the llama.cpp server

python3 -m ai_feedback --prompt code_table --scope code \
        --submission test_submissions/ggr274_homework5/test1/student_submission.ipynb \
        --solution test_submissions/ggr274_homework5/test1/Homework_5_solution.ipynb \
        --model deepSeek-v3 --llama_mode server

Evaluate the Jupyter notebook of test1 of ggr274 using DeepSeek-v3 via the llama.cpp CLI

python3 -m ai_feedback --prompt code_table --scope code \
        --submission test_submissions/ggr274_homework5/test1/student_submission.ipynb \
        --solution test_submissions/ggr274_homework5/test1/Homework_5_solution.ipynb \
        --model deepSeek-v3 --llama_mode cli

Get annotations for the cnn_example test using the OpenAI model

python -m ai_feedback --prompt code_annotations --scope code --submission test_submissions/cnn_example/cnn_submission.py --solution test_submissions/cnn_example/cnn_solution.py --model openai --json_schema ai_feedback/data/schema/code_annotation_schema.json
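
The shipped schema lives at ai_feedback/data/schema/code_annotation_schema.json. A structured-output schema of this general shape (hypothetical field names, not the contents of the actual file) constrains the model to return a list of annotation objects:

{
  "type": "object",
  "properties": {
    "annotations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "filename": {"type": "string"},
          "line_start": {"type": "integer"},
          "line_end": {"type": "integer"},
          "content": {"type": "string"}
        },
        "required": ["filename", "line_start", "line_end", "content"]
      }
    }
  },
  "required": ["annotations"]
}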

Evaluate using custom prompt file path

python -m ai_feedback --prompt ai_feedback/data/prompts/user/code_overall.md --scope code --submission test_submissions/csc108/correct_submission/correct_submission.py --solution test_submissions/csc108/solution.py --model codellama:latest

Evaluate using custom model_options

python3 -m ai_feedback --prompt code_table --scope code --submission ../ai-autograding-feedback-eval/test_submissions/108/hard_coding_submission.py --model openai-vector --submission_type python --model_options "max_tokens=1200,temperature=0.4,top_p=0.92"
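
The option string is a plain list of key=value pairs. As a sketch of how such pairs are typically split into model keyword arguments (illustrative only, not necessarily this package's implementation):

def parse_model_options(raw: str) -> dict:
    """Turn 'max_tokens=1200,temperature=0.4' into {'max_tokens': 1200, 'temperature': 0.4}."""
    options = {}
    for pair in raw.split(","):
        key, _, value = pair.partition("=")
        value = value.strip()
        # Prefer int, then float; otherwise keep the raw string.
        try:
            options[key.strip()] = int(value)
        except ValueError:
            try:
                options[key.strip()] = float(value)
            except ValueError:
                options[key.strip()] = value
    return options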

Using Ollama

In order to run this project on Bigmouth:

  1. SSH into teach.cs.
  2. SSH into bigmouth (access permission required):
ssh bigmouth
  3. Ensure you're in the project directory.
  4. Start Ollama:
ollama start
  5. Ensure the models specified in the repo are downloaded:
ollama list
  6. Run the script according to the Package Usage section above.

MarkUs Integration

This Python package can be used as a dependency in the MarkUs Autotester, in order to display LLM-generated feedback as overall comments and test outputs, and as annotations on the submission file. Follow the instructions below to set up the Autotester; once 'Run Tests' is pressed, these comments and annotations should appear automatically in the MarkUs UI.

MarkUs Test Scripts

  • /markus_test_scripts contains scripts which can be uploaded to the autotester in order to generate LLM Feedback
  • Currently, only OpenAI and Claude models are supported.
  • Within these LLM script files, the models and prompts used can be changed by editing the command-line arguments passed through the run_llm() function.

Files:

  • python_tester_llm_code.py: Runs LLM on any code assignment (solution file, submission file) uploaded to the autotester. First, it creates general feedback and displays it as overall comments and test output (any prompt and model can be used). Second, it feeds the output of the first LLM response back into the model, asking it to create annotations for the student's mistakes. (Be sure to change the submission file import name.)
  • llm_helpers.py: Contains helper functions needed to run the LLM scripts.
  • python_tester_llm_pdf.py: Runs LLM on any PDF assignment (solution file and submission file) uploaded to the autotester. Creates general feedback about whether the student's written responses match the instructor's solution. Displayed in test outputs and overall comments.
  • custom_tester_llm_code.sh: Runs LLM on assignments (solution file, submission file, test output file) uploaded to the custom autotester. Currently supports uploaded Jupyter notebook files. The prompt and model used can be specified in the script. Displays in overall comments and in test outputs. The annotations section can optionally be uncommented to display annotations; however, the annotations will be displayed on the .txt version of the file uploaded by the student, not the .ipynb file.

Python AutoTester Usage

Code Scope
  1. Ensure the student has submitted a submission file.
  2. Ensure the instructor has submitted a solution file, llm_helpers.py (located in /markus_test_scripts), and python_tester_llm_code.py (located in /markus_test_scripts). The instructor can also upload another pytest file, which can be run as its own test group.
  3. Ensure the submission import statement in python_tester_llm_code.py matches the name of the student's submission file name.
  4. Create a Python Autotester Test Group to run the LLM File.
  5. In the Package Requirements section of the Test Group Settings for the LLM file, put:
git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback

Also include any other packages that the submission or solution file uses.

  6. Ensure the Timeout is set to 120 seconds or longer.
  7. Ensure the MarkUs Autotester Docker container has the API keys in a .env file and specified in the docker-compose file.
Text Scope
  • Do the same as for the code scope, but ensure that the student submission and instructor solution are .pdf files following the same naming assumption. Also, ensure that python_tester_llm_pdf.py is uploaded as the test script.
AI Tester Usage
  1. In the Autotest settings of the assignment, click Add Tester and select the ai option.
  2. Fill in all required arguments for the AI tester.
  3. Upload any related files (e.g., JSON schema files, custom prompts, or configuration files).
  4. Ensure the MarkUs Autotester Docker container has the API keys defined in a .env file and that these variables are specified in the docker-compose.yml file.
  5. Ensure the Timeout is set to 120 seconds or longer.

Running Python Autotester Examples

CNN Example
  • Look at the /test_submissions/cnn_example directory for the following files
  • Instructor uploads: cnn_solution.py, cnn_test.py, llm_helpers.py, python_tester_llm_code.py files
  • Separate test groups for cnn_test.py and python_tester_llm_code.py
  • cnn_test.py Autotester package requirements: torch numpy
  • python_tester_llm_code.py Autotester package requirements: git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback numpy torch
  • Student uploads: cnn_submission.py
BFS Example
  • Look at the /test_submissions/bfs_example directory for the following files
  • Instructor uploads: bfs_solution.py, test_bfs.py, llm_helpers.py, python_tester_llm_code.py files
  • Separate test groups for test_bfs.py and python_tester_llm_code.py
  • python_tester_llm_code.py Autotester package requirements: git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback
  • Student uploads: bfs_submission.py
PDF Example

Custom Tester Usage

  1. Ensure the student has submitted a submission file.
  2. Ensure the instructor has submitted a solution file and custom_tester_llm_code.sh (located in /markus_test_scripts). The instructor can also upload another script to run as its own test group. (See below for the GGR274 example.)
  3. In the Markus Autotesting terminal:
 docker exec -it -u 0 markus-autotesting-server-1 /bin/bash

Then as the root user, install the package:

/home/docker/.autotesting/scripts/defaultvenv/bin/pip install git+https://github.com/MarkUsProject/ai-autograding-feedback.git#egg=ai_feedback

Also pip install other packages that the submission or solution file uses.

  4. Create a Custom Autotester Test Group to run the LLM script file.
  5. Ensure the Timeout is set to 120 seconds or longer.
  6. Ensure the MarkUs Autotester Docker container has the API keys in a .env file and specified in the docker-compose file.
GGR274 Test1 Example
  • Look at the /test_submissions/ggr274_hw5_custom_tester directory for the following files
  • Instructor uploads: Homework_5_solution.ipynb, test_hw5.py, test_output.txt, custom_tester_llm_code.sh, run_hw5_test.sh
  • Two separate test groups: one for run_hw5_test.sh, and one for custom_tester_llm_code.sh
  • Student uploads: test1_submission.ipynb, test1_submission.txt

NOTE: If the LLM Test Group appears blank or does not turn green, try increasing the timeout.

Custom Tester

  • custom_tester_llm_code.sh: Runs LLM on any assignment (solution file, submission file, test output file) uploaded to the autotester. Can specify prompt and model used in the script. Displays in overall comments and in test outputs.

Developers

To install project dependencies, including development dependencies:

$ pip install -e .[dev]

To install pre-commit hooks:

$ pre-commit install

To run the test suite:

$ pytest
