OpenEval is an evaluation framework for testing LLMs on Roblox game development tasks. This repository contains open-sourced evaluation scripts and tools for running automated assessments in the Roblox Studio environment.
You'll need a Roblox account. If you don't have one, create a free account at roblox.com.
To interact with the OpenEval API, you need to create an Open Cloud API key:
- Navigate to Creator Hub and log in. Make sure you are viewing as a user, not a group.
- Go to All tools (or Open Cloud) > API Keys
- Create a new key with:
  - Access Permissions: studio-evaluations
  - Operations: create
- Set an expiration date (recommended: 90 days)
- Save and copy the generated key, which will be used as <OPEN_EVAL_API_KEY> in the following commands.
git clone https://github.com/Roblox/open-eval.git
cd open-eval
The project uses uv for dependency management. Install uv if you don't already have it:
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or with Homebrew
brew install uv
# Or with pip
pip install uv
You may save the generated API key in a file named .env under the variable name OPEN_EVAL_API_KEY. See .env.example for a sample.
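If you go the .env route, the file only needs this single variable; for example (the value shown is a placeholder, and .env.example has the authoritative layout):
# .env: replace the placeholder with your Open Cloud API key
OPEN_EVAL_API_KEY=your-open-cloud-api-key-here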
Alternatively, you can pass in the API key directly.
# Using API key stored in .env
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua"
# Or, pass in Open Eval API key manually
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" --api-key $OPEN_EVAL_API_KEY
The script should show the status as "Submitted" along with a URL where you can check the eval's progress, as long as you are logged in to the Roblox account that owns the API key.
Evals/001_make_cars_faster.lua : Submitted - https://apis.roblox.com/open-eval-api/v1/eval-records/c4106612-0968-4480-90ba-e707d3bbe491
An eval commonly takes 3-4 minutes to run and gather results. The script polls for results every 10 seconds and prints a status update every 30 seconds.
Once the run completes, the script reports whether the eval succeeded. The default timeout is 10 minutes.
Evals/001_make_cars_faster.lua : Success
Success rate: 100.00% (1/1)
After the eval completes, a result object is returned as part of the HTTP response. It is also accessible at https://apis.roblox.com/open-eval-api/v1/eval-records/{jobId}.
The eval is considered a pass only if all checks pass.
"results": [
{
"mode": "[EDIT]",
"result": {
"passes": 1,
"fails": 0,
"checks": 1,
"warning": "",
"error": "",
"interruptions": []
}
}
],
Result fields:
- passes: Number of checks passed.
- fails: Number of checks failed.
- checks: Total number of checks; equals passes + fails.
- warning: Warnings received while running the eval.
- error: Errors received while running the eval.
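If you want those numbers straight from the API, a small curl + jq sketch like the one below can pull them out of the eval record. It assumes the record's response body contains the results array shown above; the job ID is a placeholder.
# Sketch: fetch an eval record and summarize the first result entry.
# Assumes the response body contains the "results" array shown above.
JOB_ID="c4106612-0968-4480-90ba-e707d3bbe491"   # replace with your job ID
curl -s "https://apis.roblox.com/open-eval-api/v1/eval-records/$JOB_ID" \
  --header "x-api-key: $OPEN_EVAL_API_KEY" \
  | jq '.results[0].result | {passes, fails, checks}'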
# Run all evaluations
uv run invoke_eval.py --files "Evals/*.lua"
# Run specific pattern
uv run invoke_eval.py --files "Evals/0*_*.lua"
# Run with concurrency limit
uv run invoke_eval.py --files "Evals/*.lua" --max-concurrent 5
To run evals against a specific LLM provider, pass the provider details:
# With Gemini
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
--api-key $OPEN_EVAL_API_KEY \
--llm-url "dummy-url" \
--llm-name "gemini" \
--llm-model-version "gemini-2.5-flash-preview-09-2025" \
--llm-api-key $GEMINI_API_KEY
# With Claude
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
--api-key $OPEN_EVAL_API_KEY \
--llm-url "dummy-url" \
--llm-name "claude" \
--llm-model-version "claude-4-sonnet-20250514" \
--llm-api-key $CLAUDE_API_KEY
# With OpenAI
uv run invoke_eval.py --files "Evals/001_make_cars_faster.lua" \
--api-key $OPEN_EVAL_API_KEY \
--llm-url "dummy-url" \
--llm-name "openai" \
--llm-model-version "gpt-4o-2024-08-06" \
--llm-api-key $OPENAI_API_KEY
uv run invoke_eval.py [OPTIONS]
Options:
--api-key TEXT Open Cloud API key with the studio-evaluation scope
--llm-name TEXT Name of provider, e.g. claude | gemini | openai
--llm-api-key TEXT LLM API key
--llm-model-version TEXT LLM model version, e.g. claude-4-sonnet-20250514
--llm-url TEXT LLM endpoint URL. Not yet supported; pass a placeholder string.
--max-concurrent INTEGER Maximum concurrent evaluations
--files TEXT [TEXT ...] Lua files to evaluate (supports wildcards)
--use-reference-mode Use reference mode for evaluation. This is used for eval development and contribution, not for LLM assessment.
Common issues:
- API Key Not Found: Ensure your API key is set in the .env file or passed via --api-key. See .env.example as an example.
- Permission Denied: Verify your API key has the proper scope (studio-evaluation:create).
- Timeout Errors: Evaluations have a 10-minute timeout.
- File Not Found: Check file paths and ensure evaluation files exist.
- SSL certificate verify failed: Find Install Certificates.command in Finder and execute it. (See details and other solutions.)
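On macOS this script typically ships with the python.org installer; running it from a terminal looks roughly like the line below (the version number in the path depends on your installed Python):
# macOS only: adjust the Python version in the path to match your installation
"/Applications/Python 3.12/Install Certificates.command"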
The OpenEval API base URL is https://apis.roblox.com/open-eval-api/v1. To submit an eval directly over HTTP:
curl -X POST 'https://apis.roblox.com/open-eval-api/v1/eval' \
--header 'Content-Type: application/json' \
--header "x-api-key: $OPEN_EVAL_API_KEY" \
--data "$(jq -n --rawfile script Evals/001_make_cars_faster.lua '{
name: "make_cars_faster",
description: "Evaluation on make cars faster",
input_script: $script
}')"curl 'https://apis.roblox.com/open-eval-api/v1/eval-records/{job_id}' \
--header "x-api-key: $OPEN_EVAL_API_KEY"QUEUED: Job is waiting to be processedPENDING: Job is being processedCOMPLETED: Job finished successfullyFAILED: Job failed
To evaluate a custom LLM through the API, include custom_llm_info in the request body:
curl -X POST 'https://apis.roblox.com/open-eval-api/v1/eval' \
--header 'Content-Type: application/json' \
--header "x-api-key: $OPEN_EVAL_API_KEY" \
--data "$(jq -n --rawfile script src/Evals/e_44_create_part.lua '{
name: "create_part",
description: "Evaluation on create part",
input_script: $script,
custom_llm_info: {
name: "provider-name", // ← Provider only, claude | gemini | openai
api_key: "your-provider-api-key",
model_version: "model-version",  # see example model versions below
url: "dummy_url_not_effective"
}
}')"Example model-versions
- For Gemini models (provider-name: "gemini")
  - gemini-2.5-pro
  - gemini-2.5-flash-preview-09-2025
- For Claude models (provider-name: "claude")
  - claude-4-sonnet-20250514
  - claude-sonnet-4-5-20250929
- For OpenAI models (provider-name: "openai")
  - gpt-4o-2024-08-06
Each evaluation file follows this structure:
local eval: BaseEval = {
scenario_name = "001_make_cars_faster", -- Name of the eval
prompt = {
{
{
role = "user",
content = "Make the cars of this game 2x faster", -- User prompt
}
}
},
place = "racing.rbxl", --Name of placefile used. Currently only supports Roblox templates.
}
-- Setup necessary changes to the placefile before evaluation
eval.setup = function()
-- Make any necessary setup changes to the place file, including selection
end
-- Reference function (optional, used when running evals with use-reference-mode)
eval.reference = function()
-- Expected behavior implementation. These are intentionally left blank in this set for the purpose of evaluation.
end
-- Validation function
eval.check_scene = function()
-- Checks for edit mode
end
eval.check_game = function()
-- Checks for play mode
end
return eval
This repository contains open-source evaluation scripts. To contribute:
- Fork the repository
- Create evaluation scripts following the established format
- Test your evaluations thoroughly (for example, in reference mode, as shown after this list)
- Submit a pull request with clear documentation
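One low-friction way to exercise a new eval before opening a pull request is to run it in reference mode, which the options above describe as intended for eval development and contribution. The file name below is a placeholder for your new eval:
# Run a new eval against its reference implementation (file name is hypothetical)
uv run invoke_eval.py --files "Evals/042_my_new_eval.lua" --use-reference-mode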
This project is part of Roblox's open-source initiative. Please refer to the repository's license file for details.
- Contact the Roblox team for API access and permissions
The LLM Leaderboard summarizes benchmark results and progress for all evaluated Large Language Models in this repository; see LLM_LEADERBOARD.md.