diff --git a/LLM_LEADERBOARD.md b/LLM_LEADERBOARD.md
new file mode 100644
index 0000000..9c21d31
--- /dev/null
+++ b/LLM_LEADERBOARD.md
@@ -0,0 +1,109 @@
+
+## LLM Leaderboard on Roblox Studio Assistant
+
+| Model | Pass Rate | | | | Safety | Tool Calling | |
+|---|---|---|---|---|---|---|---|
+| | Pass@1 | Pass@5 | Cons@5 | All@5 | Safety Pass Rate | Avg Tool Error Rate | Explanation Rate with Tools |
+| Claude-4-sonnet-20250514 | 55.59% | 75.99% | 56.68% | 34.80% | 68.7% | 40% | 97% |
+| Claude-sonnet-4-5-20250929 | 57.87% | 74.43% | 59.91% | 38.73% | 78.0% | 22% | 90% |
+| Qwen3-Coder 480B/A35B Instruct | 49.61% | 76.26% | 50.56% | 23.84% | 69.7% | 67% | 47.30% |
+| GLM 4.6 | 47.72% | 71.15% | 48.20% | 26.01% | - | 15% | - |
+| GLM 4.5 | 41.34% | 70.57% | 40.18% | 18.11% | 61.6% | 88% | 37.6% |
+| Gemini-2.5-pro (AUTO thinking, NO web) | 48.58% | 67.00% | 49.16% | 30.59% | 59.60% | 40% | 92% |
+| Gemini-2.5-flash-preview-09-2025 (AUTO thinking, NO web) | 37.24% | 63.04% | 35.61% | 17.72% | 68.70% | 69% | 49.50% |
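
The file does not define the four Pass Rate columns, so the following is only a sketch of how such numbers are commonly aggregated, not the leaderboard's actual scoring code. It assumes each task is sampled exactly five times and each sample is graded pass/fail. Under that assumption, Pass@1 is the mean per-sample pass rate, Pass@5 counts a task if any sample passes, Cons@5 counts it if a strict majority passes, and All@5 counts it only if every sample passes. The function name `summarize_pass_rates` and the input layout are hypothetical.

```python
from typing import Dict, List


def summarize_pass_rates(results: Dict[str, List[bool]], k: int = 5) -> Dict[str, float]:
    """Aggregate per-task pass/fail samples into leaderboard-style rates.

    `results` maps a task id to exactly k graded samples (True = passed).
    The metric definitions are assumptions; the source file does not state them.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    for task_id, samples in results.items():
        if len(samples) != k:
            raise ValueError(f"task {task_id!r} has {len(samples)} samples, expected {k}")

    n = len(results)
    runs = list(results.values())
    return {
        # Mean fraction of passing samples, averaged over tasks.
        "pass@1": sum(sum(r) / k for r in runs) / n,
        # Task counts if at least one of the k samples passed.
        f"pass@{k}": sum(any(r) for r in runs) / n,
        # Task counts if a strict majority of the k samples passed (assumed meaning of Cons@k).
        f"cons@{k}": sum(sum(r) > k / 2 for r in runs) / n,
        # Task counts only if all k samples passed.
        f"all@{k}": sum(all(r) for r in runs) / n,
    }


if __name__ == "__main__":
    # Hypothetical grading results for four tasks, five samples each.
    demo = {
        "task_a": [True, True, True, True, True],
        "task_b": [True, True, True, False, False],
        "task_c": [True, False, False, False, False],
        "task_d": [False, False, False, False, False],
    }
    for name, value in summarize_pass_rates(demo).items():
        print(f"{name}: {value:.2%}")
```

With these definitions Pass@5 ≥ Cons@5 ≥ All@5 always holds for any set of results, which matches the ordering in every row of the table. That agreement does not confirm these are the definitions the leaderboard actually uses.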