From fd3d48df2a765155645dea1edd9c205ceff71050 Mon Sep 17 00:00:00 2001
From: tzhangR
Date: Wed, 15 Oct 2025 15:23:14 -0700
Subject: [PATCH] Add LLM leaderboard for Roblox Studio Assistant in OpenEval
 open-source repo

This file contains a leaderboard for various LLM models used in Roblox Studio
Assistant, including their pass rates and safety metrics.
The data is pulled from https://docs.google.com/document/d/1Hdy8bp5VvqRZ7JGLvjDReLzjSlNneBfO5OS5cO8kBiw/edit?tab=t.0
---
 LLM_LEADERBOARD.md | 109 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 LLM_LEADERBOARD.md

diff --git a/LLM_LEADERBOARD.md b/LLM_LEADERBOARD.md
new file mode 100644
index 0000000..9c21d31
--- /dev/null
+++ b/LLM_LEADERBOARD.md
@@ -0,0 +1,109 @@
+## LLM Leaderboard on Roblox Studio Assistant
+
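+<table>
+  <thead>
+    <tr><th rowspan="2">Model</th><th colspan="4">Pass Rate</th><th>Safety</th><th colspan="2">Tool Calling</th></tr>
+    <tr><th>Pass@1</th><th>Pass@5</th><th>Cons@5</th><th>All@5</th><th>Safety Pass Rate</th><th>Avg Tool Error Rate</th><th>Explanation Rate with Tools</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>Claude-4-sonnet-20250514</td><td>55.59%</td><td>75.99%</td><td>56.68%</td><td>34.80%</td><td>68.7%</td><td>40%</td><td>97%</td></tr>
+    <tr><td>Claude-sonnet-4-5-20250929</td><td>57.87%</td><td>74.43%</td><td>59.91%</td><td>38.73%</td><td>78.0%</td><td>22%</td><td>90%</td></tr>
+    <tr><td>Qwen3-Coder 480B/A35B Instruct</td><td>49.61%</td><td>76.26%</td><td>50.56%</td><td>23.84%</td><td>69.7%</td><td>67%</td><td>47.30%</td></tr>
+    <tr><td>GLM 4.6</td><td>47.72%</td><td>71.15%</td><td>48.20%</td><td>26.01%</td><td>-</td><td>15%</td><td>-</td></tr>
+    <tr><td>GLM 4.5</td><td>41.34%</td><td>70.57%</td><td>40.18%</td><td>18.11%</td><td>61.6%</td><td>88%</td><td>37.6%</td></tr>
+    <tr><td>Gemini-2.5-pro<br>AUTO thinking, NO web</td><td>48.58%</td><td>67.00%</td><td>49.16%</td><td>30.59%</td><td>59.60%</td><td>40%</td><td>92%</td></tr>
+    <tr><td>Gemini-2.5-flash-preview-09-2025<br>AUTO thinking, NO web</td><td>37.24%</td><td>63.04%</td><td>35.61%</td><td>17.72%</td><td>68.70%</td><td>69%</td><td>49.50%</td></tr>
+  </tbody>
+</table>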
+
+## Metrics Explanation
+- Pass@1 -- average probability of success in 1 attempt
+- Pass@5 -- average probability of success in at least 1 out of 5 attempts
+- Cons@5 -- average probability of success in at least 3 out of 5 attempts
+- All@5 -- average probability of success in all 5 out of 5 attempts
+- Safety Pass Rate -- average rate at which the model handles safety/appropriateness correctly
+- Avg Tool Error Rate -- average error rate of tool calls
+- Explanation Rate with Tools -- quality of explanations when using tools
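
The four pass-rate metrics defined in the list above can be derived from per-task attempt outcomes. The following is a minimal sketch, not the OpenEval implementation: it assumes every task was attempted exactly 5 times and that each attempt is recorded as a boolean. The function name `pass_metrics` and the `results` layout are hypothetical, and Pass@1 is estimated here as the per-task fraction of successful attempts.

```python
from typing import Dict, List


def pass_metrics(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Average Pass@1, Pass@5, Cons@5 and All@5 over tasks.

    `results` maps a task id to the outcomes of its 5 attempts
    (hypothetical layout, assumed for illustration only).
    """
    if not results:
        raise ValueError("results must contain at least one task")

    totals = {"pass@1": 0.0, "pass@5": 0.0, "cons@5": 0.0, "all@5": 0.0}
    for task_id, attempts in results.items():
        if len(attempts) != 5:
            raise ValueError(f"task {task_id!r} needs exactly 5 attempts")
        successes = sum(attempts)
        totals["pass@1"] += successes / 5   # empirical single-attempt success rate
        totals["pass@5"] += successes >= 1  # at least 1 of 5 succeeded
        totals["cons@5"] += successes >= 3  # at least 3 of 5 succeeded
        totals["all@5"] += successes == 5   # all 5 succeeded

    # Average each per-task score over the number of tasks.
    return {name: total / len(results) for name, total in totals.items()}


if __name__ == "__main__":
    demo = {
        "task_a": [True, True, True, True, True],
        "task_b": [True, False, False, False, False],
        "task_c": [False, False, False, False, False],
        "task_d": [True, True, True, False, False],
    }
    for name, value in pass_metrics(demo).items():
        print(f"{name}: {value:.2%}")
```

Under these definitions the averages always satisfy Pass@5 >= Cons@5 >= All@5, and Pass@1 lies between All@5 and Pass@5, which matches the ordering in every row of the leaderboard.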