Generated: 2025-09-09T13:33:19.419743

Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
---|---|---|---|---|---|---|
gpt-5-low | 127 | 46.9% ± 2.9% | 63.0% | 26.8% | $125.87 | 385.8 |
grok-4 | 127 | 31.7% ± 2.9% | 44.9% | 18.1% | $257.41 | 319.8 |
claude-opus-4-1 | 127 | 29.9% ± 0.0% | / | / | $1165.45 | 361.8 |
claude-sonnet-4 | 127 | 28.1% ± 2.6% | 44.9% | 12.6% | $252.41 | 218.3 |
o3 | 127 | 25.4% ± 2.0% | 43.3% | 12.6% | $113.94 | 169.4 |
qwen-3-coder-plus | 127 | 24.8% ± 2.1% | 40.9% | 12.6% | $36.46 | 274.3 |
kimi-k2-0905 | 127 | 21.9% ± 1.2% | 31.5% | 12.6% | $72.57 | 493.8 |
grok-code-fast-1 | 127 | 20.5% ± 3.4% | 30.7% | 9.4% | $16.08 | 156.6 |
kimi-k2-0711 | 127 | 19.1% ± 1.6% | 31.5% | 11.8% | $36.45 | 214.8 |
qwen-3-max | 127 | 17.7% ± 1.3% | 22.8% | 11.0% | $160.00 | 213.6 |
o4-mini | 127 | 17.3% ± 2.3% | 31.5% | 6.3% | $63.62 | 323.3 |
deepseek-chat | 127 | 16.7% ± 1.4% | 28.3% | 7.9% | $35.66 | 269.9 |
gemini-2-5-pro | 127 | 15.8% ± 0.6% | 29.9% | 4.7% | $162.48 | 119.4 |
glm-4-5 | 127 | 15.6% ± 1.2% | 24.4% | 6.3% | $18.27 | 166.3 |
gemini-2-5-flash | 127 | 9.1% ± 0.7% | 18.1% | 3.9% | $41.81 | 114.9 |
gpt-5-mini-low | 127 | 8.3% ± 1.3% | 18.9% | 0.8% | $7.86 | 63.2 |
gpt-4-1 | 127 | 8.1% ± 0.7% | 12.6% | 3.1% | $83.62 | 59.7 |
gpt-oss-120b | 127 | 4.7% ± 1.0% | 13.4% | 0.0% | $0.64 | 27.4 |
gpt-5-nano-low | 127 | 4.3% ± 1.2% | 10.2% | 0.8% | $2.50 | 96.4 |
gpt-4-1-mini | 127 | 3.9% ± 1.0% | 7.1% | 1.6% | $59.96 | 85.7 |
gpt-4-1-nano | 127 | 0.0% ± 0.0% | 0.0% | 0.0% | $2.54 | 39.1 |

Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
---|---|---|---|---|---|---|
gpt-5-low | 30 | 54.2% ± 6.8% | 73.3% | 33.3% | $15.48 | 275.6 |
grok-4 | 30 | 50.8% ± 6.4% | 73.3% | 26.7% | $27.08 | 256.6 |
o3 | 30 | 35.8% ± 2.8% | 50.0% | 26.7% | $45.65 | 277.9 |
claude-opus-4-1 | 30 | 33.3% ± 0.0% | / | / | $132.30 | 267.8 |
claude-sonnet-4 | 30 | 27.5% ± 2.8% | 50.0% | 6.7% | $29.00 | 193.1 |
o4-mini | 30 | 25.0% ± 2.9% | 36.7% | 13.3% | $11.78 | 263.6 |
gemini-2-5-pro | 30 | 24.2% ± 3.6% | 43.3% | 10.0% | $19.61 | 126.1 |
grok-code-fast-1 | 30 | 23.3% ± 7.4% | 40.0% | 10.0% | $1.76 | 75.5 |
kimi-k2-0711 | 30 | 20.0% ± 2.4% | 30.0% | 13.3% | $6.99 | 222.5 |
deepseek-chat | 30 | 15.8% ± 1.4% | 26.7% | 6.7% | $7.25 | 281.7 |
kimi-k2-0905 | 30 | 14.2% ± 1.4% | 23.3% | 6.7% | $12.88 | 376.5 |
qwen-3-coder-plus | 30 | 13.3% ± 6.7% | 26.7% | 3.3% | $5.93 | 157.2 |
gpt-4-1 | 30 | 12.5% ± 1.4% | 20.0% | 3.3% | $9.07 | 41.0 |
gpt-5-nano-low | 30 | 12.5% ± 3.6% | 30.0% | 3.3% | $1.19 | 129.1 |
gpt-5-mini-low | 30 | 12.5% ± 4.9% | 33.3% | 3.3% | $1.13 | 67.7 |
qwen-3-max | 30 | 10.8% ± 1.4% | 13.3% | 10.0% | $14.54 | 133.9 |
gemini-2-5-flash | 30 | 8.3% ± 1.7% | 13.3% | 6.7% | $1.18 | 62.1 |
glm-4-5 | 30 | 7.5% ± 1.4% | 13.3% | 3.3% | $2.08 | 130.2 |
gpt-oss-120b | 30 | 5.8% ± 4.3% | 16.7% | 0.0% | $0.05 | 18.3 |
gpt-4-1-mini | 30 | 3.3% ± 0.0% | 3.3% | 3.3% | $2.43 | 51.5 |
gpt-4-1-nano | 30 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.37 | 28.1 |

Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
---|---|---|---|---|---|---|
gpt-5-low | 23 | 27.2% ± 1.9% | 39.1% | 17.4% | $22.31 | 268.3 |
glm-4-5 | 23 | 22.8% ± 6.4% | 34.8% | 13.0% | $3.77 | 153.5 |
claude-opus-4-1 | 23 | 21.7% ± 0.0% | / | / | $224.18 | 390.2 |
qwen-3-coder-plus | 23 | 19.6% ± 6.5% | 34.8% | 13.0% | $9.20 | 320.4 |
claude-sonnet-4 | 23 | 16.3% ± 5.7% | 30.4% | 8.7% | $49.61 | 196.5 |
kimi-k2-0905 | 23 | 16.3% ± 1.9% | 26.1% | 8.7% | $14.21 | 780.8 |
gemini-2-5-flash | 23 | 15.2% ± 2.2% | 21.7% | 8.7% | $8.37 | 206.4 |
o3 | 23 | 14.1% ± 3.6% | 21.7% | 4.3% | $21.41 | 128.0 |
qwen-3-max | 23 | 14.1% ± 3.6% | 17.4% | 4.3% | $37.56 | 181.5 |
grok-4 | 23 | 14.1% ± 3.6% | 21.7% | 8.7% | $56.18 | 269.0 |
o4-mini | 23 | 14.1% ± 6.4% | 26.1% | 4.3% | $13.79 | 248.8 |
kimi-k2-0711 | 23 | 10.9% ± 2.2% | 13.0% | 4.3% | $5.37 | 205.0 |
deepseek-chat | 23 | 9.8% ± 1.9% | 13.0% | 8.7% | $4.75 | 194.0 |
gemini-2-5-pro | 23 | 9.8% ± 1.9% | 21.7% | 0.0% | $11.96 | 91.3 |
grok-code-fast-1 | 23 | 8.7% ± 5.3% | 17.4% | 4.3% | $3.68 | 182.9 |
gpt-5-mini-low | 23 | 8.7% ± 3.1% | 13.0% | 0.0% | $2.59 | 63.1 |
gpt-4-1 | 23 | 7.6% ± 1.9% | 8.7% | 4.3% | $20.97 | 90.2 |
gpt-4-1-mini | 23 | 6.5% ± 6.5% | 17.4% | 0.0% | $4.35 | 83.1 |
gpt-oss-120b | 23 | 4.3% ± 3.1% | 8.7% | 0.0% | $0.14 | 24.0 |
gpt-5-nano-low | 23 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.29 | 57.7 |
gpt-4-1-nano | 23 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.74 | 51.8 |

Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
---|---|---|---|---|---|---|
gpt-5-low | 28 | 36.6% ± 7.7% | 53.6% | 14.3% | $23.26 | 559.6 |
claude-opus-4-1 | 28 | 35.7% ± 0.0% | / | / | $276.24 | 294.2 |
o3 | 28 | 24.1% ± 3.9% | 46.4% | 7.1% | $14.72 | 171.4 |
glm-4-5 | 28 | 21.4% ± 2.5% | 32.1% | 10.7% | $5.97 | 222.2 |
claude-sonnet-4 | 28 | 21.4% ± 5.1% | 39.3% | 7.1% | $56.10 | 193.2 |
o4-mini | 28 | 20.5% ± 5.9% | 42.9% | 7.1% | $11.44 | 442.4 |
qwen-3-coder-plus | 28 | 19.6% ± 6.4% | 39.3% | 7.1% | $4.52 | 99.7 |
qwen-3-max | 28 | 17.0% ± 4.6% | 25.0% | 3.6% | $33.34 | 183.5 |
kimi-k2-0711 | 28 | 14.3% ± 4.4% | 32.1% | 7.1% | $9.37 | 183.4 |
deepseek-chat | 28 | 12.5% ± 3.1% | 28.6% | 0.0% | $8.00 | 238.9 |
kimi-k2-0905 | 28 | 8.0% ± 3.0% | 10.7% | 3.6% | $19.13 | 467.0 |
gpt-4-1 | 28 | 6.2% ± 1.6% | 14.3% | 0.0% | $7.90 | 48.8 |
gemini-2-5-flash | 28 | 6.2% ± 4.6% | 21.4% | 0.0% | $2.15 | 55.8 |
gpt-5-mini-low | 28 | 5.4% ± 5.4% | 14.3% | 0.0% | $1.07 | 62.6 |
gemini-2-5-pro | 28 | 4.5% ± 3.0% | 7.1% | 0.0% | $17.9 | 102.4 |
gpt-oss-120b | 28 | 3.6% ± 2.5% | 14.3% | 0.0% | $0.15 | 34.0 |
grok-code-fast-1 | 28 | 2.7% ± 1.6% | 3.6% | 0.0% | $3.45 | 334.1 |
grok-4 | 28 | 2.7% ± 1.6% | 3.6% | 0.0% | $62.48 | 554.0 |
gpt-4-1-mini | 28 | 1.8% ± 1.8% | 3.6% | 0.0% | $3.00 | 59.1 |
gpt-5-nano-low | 28 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.19 | 68.0 |
gpt-4-1-nano | 28 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.28 | 32.2 |

Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
---|---|---|---|---|---|---|
gpt-5-low | 25 | 45.0% ± 1.7% | 56.0% | 32.0% | $58.70 | 526.9 |
grok-4 | 25 | 35.0% ± 7.7% | 48.0% | 20.0% | $97.36 | 277.2 |
qwen-3-coder-plus | 25 | 30.0% ± 4.5% | 48.0% | 8.0% | $14.31 | 680.0 |
kimi-k2-0905 | 25 | 30.0% ± 6.0% | 40.0% | 20.0% | $20.51 | 380.6 |
claude-sonnet-4 | 25 | 26.0% ± 6.0% | 36.0% | 8.0% | $94.47 | 278.7 |
grok-code-fast-1 | 25 | 25.0% ± 1.7% | 36.0% | 8.0% | $6.06 | 119.5 |
claude-opus-4-1 | 25 | 24.0% ± 0.0% | / | / | $435.18 | 395.2 |
o3 | 25 | 15.0% ± 5.2% | 32.0% | 8.0% | $28.71 | 153.9 |
gemini-2-5-pro | 25 | 15.0% ± 1.7% | 32.0% | 4.0% | $108.12 | 177.7 |
glm-4-5 | 25 | 13.0% ± 3.3% | 20.0% | 4.0% | $4.90 | 165.6 |
kimi-k2-0711 | 25 | 13.0% ± 3.3% | 16.0% | 8.0% | $11.17 | 221.4 |
o4-mini | 25 | 12.0% ± 2.8% | 28.0% | 0.0% | $25.71 | 530.6 |
gpt-4-1 | 25 | 8.0% ± 2.8% | 12.0% | 4.0% | $43.16 | 92.2 |
qwen-3-max | 25 | 8.0% ± 0.0% | 12.0% | 4.0% | $69.10 | 417.7 |
deepseek-chat | 25 | 7.0% ± 3.3% | 16.0% | 0.0% | $11.78 | 288.3 |
gemini-2-5-flash | 25 | 6.0% ± 2.0% | 12.0% | 0.0% | $29.31 | 205.4 |
gpt-oss-120b | 25 | 3.0% ± 1.7% | 4.0% | 0.0% | $0.26 | 37.3 |
gpt-5-mini-low | 25 | 1.0% ± 1.7% | 4.0% | 0.0% | $2.72 | 67.1 |
gpt-5-nano-low | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.67 | 139.3 |
gpt-4-1-nano | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.98 | 53.8 |
gpt-4-1-mini | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $49.72 | 195.7 |

Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
---|---|---|---|---|---|---|
gpt-5-low | 21 | 73.8% ± 4.1% | 95.2% | 38.1% | $6.11 | 272.3 |
grok-4 | 21 | 58.3% ± 7.8% | 81.0% | 38.1% | $14.32 | 204.3 |
claude-sonnet-4 | 21 | 53.6% ± 6.2% | 71.4% | 38.1% | $23.24 | 239.5 |
qwen-3-coder-plus | 21 | 47.6% ± 5.8% | 61.9% | 38.1% | $2.50 | 140.9 |
grok-code-fast-1 | 21 | 47.6% ± 4.8% | 61.9% | 28.6% | $1.12 | 51.3 |
kimi-k2-0905 | 21 | 47.6% ± 4.8% | 66.7% | 28.6% | $5.84 | 517.3 |
qwen-3-max | 21 | 44.0% ± 2.1% | 52.4% | 38.1% | $5.46 | 159.6 |
deepseek-chat | 21 | 42.9% ± 7.5% | 61.9% | 28.6% | $3.89 | 355.6 |
kimi-k2-0711 | 21 | 40.5% ± 7.9% | 71.4% | 28.6% | $3.55 | 248.4 |
o3 | 21 | 36.9% ± 4.0% | 66.7% | 14.3% | $3.46 | 75.6 |
claude-opus-4-1 | 21 | 33.3% ± 0.0% | / | / | $97.54 | 515.4 |
gemini-2-5-pro | 21 | 26.2% ± 7.9% | 47.6% | 9.5% | $4.89 | 93.5 |
glm-4-5 | 21 | 14.3% ± 7.5% | 23.8% | 0.0% | $1.56 | 158.1 |
gpt-5-mini-low | 21 | 14.3% ± 3.4% | 28.6% | 0.0% | $0.34 | 52.7 |
o4-mini | 21 | 11.9% ± 4.1% | 19.1% | 4.8% | $0.90 | 84.7 |
gemini-2-5-flash | 21 | 10.7% ± 6.2% | 23.8% | 4.8% | $0.81 | 60.9 |
gpt-4-1-mini | 21 | 9.5% ± 3.4% | 14.3% | 4.8% | $0.45 | 42.1 |
gpt-5-nano-low | 21 | 8.3% ± 4.0% | 19.1% | 0.0% | $0.16 | 78.5 |
gpt-oss-120b | 21 | 7.1% ± 2.4% | 23.8% | 0.0% | $0.04 | 23.3 |
gpt-4-1 | 21 | 4.8% ± 0.0% | 4.8% | 4.8% | $2.52 | 28.9 |
gpt-4-1-nano | 21 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.17 | 32.5 |
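
The tables do not define the Pass@1, Pass@4, and Pass^4 columns. The sketch below shows one plausible way to compute them from per-run outcomes. It assumes that Pass@1 is the per-run success rate averaged over runs, that Pass@k counts a task as solved if at least one of the k runs succeeds, and that Pass^k counts a task only if all k runs succeed. The function name `summarize_runs`, the input layout, and the use of a population standard deviation are illustrative assumptions, not the report's actual implementation.

```python
from statistics import mean, pstdev


def summarize_runs(results):
    """Summarize repeated benchmark runs (hypothetical helper).

    `results` is a list of runs; each run is a list of booleans,
    one per task, in the same task order across runs.
    """
    if not results or not results[0]:
        raise ValueError("need at least one run and one task")
    n_tasks = len(results[0])
    if any(len(run) != n_tasks for run in results):
        raise ValueError("every run must cover the same tasks")

    # Success rate of each individual run.
    per_run_rates = [sum(run) / n_tasks for run in results]

    # Regroup outcomes by task: one tuple of k outcomes per task.
    per_task = list(zip(*results))

    return {
        "pass@1_mean": mean(per_run_rates),
        # Assumption: population std; the report may use the sample std.
        "pass@1_std": pstdev(per_run_rates),
        # Solved in at least one of the k runs.
        "pass@k": sum(any(outcomes) for outcomes in per_task) / n_tasks,
        # Solved in every one of the k runs.
        "pass^k": sum(all(outcomes) for outcomes in per_task) / n_tasks,
    }


# Made-up outcomes for 4 runs over 4 tasks (not taken from the tables).
example_runs = [
    [True, False, True, False],
    [True, True, False, False],
    [True, False, False, False],
    [True, False, True, False],
]
summary = summarize_runs(example_runs)
print(summary)
```

Under these definitions Pass^k can never exceed the mean Pass@1, and Pass@k can never fall below it, which matches the ordering of the three columns in every row above that reports all of them.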