eval-sys/mcpmark-experiments

mcpmark-v1-0905 - Evaluation Results

Generated: 2025-09-09T13:33:19.419743

Overall Performance

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5-low | 127 | 46.9% ± 2.9% | 63.0% | 26.8% | $125.87 | 385.8 |
| grok-4 | 127 | 31.7% ± 2.9% | 44.9% | 18.1% | $257.41 | 319.8 |
| claude-opus-4-1 | 127 | 29.9% ± 0.0% | / | / | $1165.45 | 361.8 |
| claude-sonnet-4 | 127 | 28.1% ± 2.6% | 44.9% | 12.6% | $252.41 | 218.3 |
| o3 | 127 | 25.4% ± 2.0% | 43.3% | 12.6% | $113.94 | 169.4 |
| qwen-3-coder-plus | 127 | 24.8% ± 2.1% | 40.9% | 12.6% | $36.46 | 274.3 |
| kimi-k2-0905 | 127 | 21.9% ± 1.2% | 31.5% | 12.6% | $72.57 | 493.8 |
| grok-code-fast-1 | 127 | 20.5% ± 3.4% | 30.7% | 9.4% | $16.08 | 156.6 |
| kimi-k2-0711 | 127 | 19.1% ± 1.6% | 31.5% | 11.8% | $36.45 | 214.8 |
| qwen-3-max | 127 | 17.7% ± 1.3% | 22.8% | 11.0% | $160 | 213.6 |
| o4-mini | 127 | 17.3% ± 2.3% | 31.5% | 6.3% | $63.62 | 323.3 |
| deepseek-chat | 127 | 16.7% ± 1.4% | 28.3% | 7.9% | $35.66 | 269.9 |
| gemini-2-5-pro | 127 | 15.8% ± 0.6% | 29.9% | 4.7% | $162.48 | 119.4 |
| glm-4-5 | 127 | 15.6% ± 1.2% | 24.4% | 6.3% | $18.27 | 166.3 |
| gemini-2-5-flash | 127 | 9.1% ± 0.7% | 18.1% | 3.9% | $41.81 | 114.9 |
| gpt-5-mini-low | 127 | 8.3% ± 1.3% | 18.9% | 0.8% | $7.86 | 63.2 |
| gpt-4-1 | 127 | 8.1% ± 0.7% | 12.6% | 3.1% | $83.62 | 59.7 |
| gpt-oss-120b | 127 | 4.7% ± 1.0% | 13.4% | 0.0% | $0.64 | 27.4 |
| gpt-5-nano-low | 127 | 4.3% ± 1.2% | 10.2% | 0.8% | $2.5 | 96.4 |
| gpt-4-1-mini | 127 | 3.9% ± 1.0% | 7.1% | 1.6% | $59.96 | 85.7 |
| gpt-4-1-nano | 127 | 0.0% ± 0.0% | 0.0% | 0.0% | $2.54 | 39.1 |
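
The metric columns are not defined in this document, so the following is only a sketch of one plausible way to derive them from raw run results. It assumes four independent runs per model, that Pass@1 is the mean per-run success rate, that Pass@4 counts tasks solved in at least one run, and that Pass^4 counts tasks solved in every run. It also assumes the reported spread is a population standard deviation over runs, which is consistent with rows such as gpt-5-mini-low on Playwright (1.0% ± 1.7% over 25 tasks). The function name `summarize` and the input layout are hypothetical, not taken from the MCPMark code.

```python
from statistics import mean, pstdev


def summarize(runs):
    """Summarize pass metrics from repeated runs (hypothetical helper).

    runs[r][t] is True if task t passed in run r. All runs must cover
    the same tasks in the same order.
    """
    n_tasks = len(runs[0])
    # Success rate of each individual run.
    per_run_rates = [sum(run) / n_tasks for run in runs]
    # Transpose so each tuple holds one task's outcomes across runs.
    per_task = list(zip(*runs))
    return {
        # Assumed meaning of "Pass@1 (avg ± std)".
        "pass_at_1": mean(per_run_rates),
        "std": pstdev(per_run_rates),  # population std, an assumption
        # Assumed meaning of "Pass@4": solved in at least one run.
        "pass_at_k": sum(any(t) for t in per_task) / n_tasks,
        # Assumed meaning of "Pass^4": solved in every run.
        "pass_hat_k": sum(all(t) for t in per_task) / n_tasks,
    }


# Toy data: 4 runs over 5 tasks (made up for illustration).
runs = [
    [True, True, False, False, False],
    [True, False, False, False, False],
    [True, True, False, True, False],
    [True, False, False, False, False],
]
result = summarize(runs)
print({name: round(value, 3) for name, value in result.items()})
```

Under these assumptions Pass^4 can never exceed Pass@1, and Pass@1 can never exceed Pass@4, which matches the ordering seen in every row of the tables. A model evaluated with a single run would have a spread of 0.0% and no meaningful Pass@4 or Pass^4, which may be why claude-opus-4-1 shows "/" in those columns.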

Filesystem Performance

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5-low | 30 | 54.2% ± 6.8% | 73.3% | 33.3% | $15.48 | 275.6 |
| grok-4 | 30 | 50.8% ± 6.4% | 73.3% | 26.7% | $27.08 | 256.6 |
| o3 | 30 | 35.8% ± 2.8% | 50.0% | 26.7% | $45.65 | 277.9 |
| claude-opus-4-1 | 30 | 33.3% ± 0.0% | / | / | $132.3 | 267.8 |
| claude-sonnet-4 | 30 | 27.5% ± 2.8% | 50.0% | 6.7% | $29 | 193.1 |
| o4-mini | 30 | 25.0% ± 2.9% | 36.7% | 13.3% | $11.78 | 263.6 |
| gemini-2-5-pro | 30 | 24.2% ± 3.6% | 43.3% | 10.0% | $19.61 | 126.1 |
| grok-code-fast-1 | 30 | 23.3% ± 7.4% | 40.0% | 10.0% | $1.76 | 75.5 |
| kimi-k2-0711 | 30 | 20.0% ± 2.4% | 30.0% | 13.3% | $6.99 | 222.5 |
| deepseek-chat | 30 | 15.8% ± 1.4% | 26.7% | 6.7% | $7.25 | 281.7 |
| kimi-k2-0905 | 30 | 14.2% ± 1.4% | 23.3% | 6.7% | $12.88 | 376.5 |
| qwen-3-coder-plus | 30 | 13.3% ± 6.7% | 26.7% | 3.3% | $5.93 | 157.2 |
| gpt-4-1 | 30 | 12.5% ± 1.4% | 20.0% | 3.3% | $9.07 | 41.0 |
| gpt-5-nano-low | 30 | 12.5% ± 3.6% | 30.0% | 3.3% | $1.19 | 129.1 |
| gpt-5-mini-low | 30 | 12.5% ± 4.9% | 33.3% | 3.3% | $1.13 | 67.7 |
| qwen-3-max | 30 | 10.8% ± 1.4% | 13.3% | 10.0% | $14.54 | 133.9 |
| gemini-2-5-flash | 30 | 8.3% ± 1.7% | 13.3% | 6.7% | $1.18 | 62.1 |
| glm-4-5 | 30 | 7.5% ± 1.4% | 13.3% | 3.3% | $2.08 | 130.2 |
| gpt-oss-120b | 30 | 5.8% ± 4.3% | 16.7% | 0.0% | $0.05 | 18.3 |
| gpt-4-1-mini | 30 | 3.3% ± 0.0% | 3.3% | 3.3% | $2.43 | 51.5 |
| gpt-4-1-nano | 30 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.37 | 28.1 |

GitHub Performance

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5-low | 23 | 27.2% ± 1.9% | 39.1% | 17.4% | $22.31 | 268.3 |
| glm-4-5 | 23 | 22.8% ± 6.4% | 34.8% | 13.0% | $3.77 | 153.5 |
| claude-opus-4-1 | 23 | 21.7% ± 0.0% | / | / | $224.18 | 390.2 |
| qwen-3-coder-plus | 23 | 19.6% ± 6.5% | 34.8% | 13.0% | $9.2 | 320.4 |
| claude-sonnet-4 | 23 | 16.3% ± 5.7% | 30.4% | 8.7% | $49.61 | 196.5 |
| kimi-k2-0905 | 23 | 16.3% ± 1.9% | 26.1% | 8.7% | $14.21 | 780.8 |
| gemini-2-5-flash | 23 | 15.2% ± 2.2% | 21.7% | 8.7% | $8.37 | 206.4 |
| o3 | 23 | 14.1% ± 3.6% | 21.7% | 4.3% | $21.41 | 128.0 |
| qwen-3-max | 23 | 14.1% ± 3.6% | 17.4% | 4.3% | $37.56 | 181.5 |
| grok-4 | 23 | 14.1% ± 3.6% | 21.7% | 8.7% | $56.18 | 269.0 |
| o4-mini | 23 | 14.1% ± 6.4% | 26.1% | 4.3% | $13.79 | 248.8 |
| kimi-k2-0711 | 23 | 10.9% ± 2.2% | 13.0% | 4.3% | $5.37 | 205.0 |
| deepseek-chat | 23 | 9.8% ± 1.9% | 13.0% | 8.7% | $4.75 | 194.0 |
| gemini-2-5-pro | 23 | 9.8% ± 1.9% | 21.7% | 0.0% | $11.96 | 91.3 |
| grok-code-fast-1 | 23 | 8.7% ± 5.3% | 17.4% | 4.3% | $3.68 | 182.9 |
| gpt-5-mini-low | 23 | 8.7% ± 3.1% | 13.0% | 0.0% | $2.59 | 63.1 |
| gpt-4-1 | 23 | 7.6% ± 1.9% | 8.7% | 4.3% | $20.97 | 90.2 |
| gpt-4-1-mini | 23 | 6.5% ± 6.5% | 17.4% | 0.0% | $4.35 | 83.1 |
| gpt-oss-120b | 23 | 4.3% ± 3.1% | 8.7% | 0.0% | $0.14 | 24.0 |
| gpt-5-nano-low | 23 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.29 | 57.7 |
| gpt-4-1-nano | 23 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.74 | 51.8 |

Notion Performance

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5-low | 28 | 36.6% ± 7.7% | 53.6% | 14.3% | $23.26 | 559.6 |
| claude-opus-4-1 | 28 | 35.7% ± 0.0% | / | / | $276.24 | 294.2 |
| o3 | 28 | 24.1% ± 3.9% | 46.4% | 7.1% | $14.72 | 171.4 |
| glm-4-5 | 28 | 21.4% ± 2.5% | 32.1% | 10.7% | $5.97 | 222.2 |
| claude-sonnet-4 | 28 | 21.4% ± 5.1% | 39.3% | 7.1% | $56.1 | 193.2 |
| o4-mini | 28 | 20.5% ± 5.9% | 42.9% | 7.1% | $11.44 | 442.4 |
| qwen-3-coder-plus | 28 | 19.6% ± 6.4% | 39.3% | 7.1% | $4.52 | 99.7 |
| qwen-3-max | 28 | 17.0% ± 4.6% | 25.0% | 3.6% | $33.34 | 183.5 |
| kimi-k2-0711 | 28 | 14.3% ± 4.4% | 32.1% | 7.1% | $9.37 | 183.4 |
| deepseek-chat | 28 | 12.5% ± 3.1% | 28.6% | 0.0% | $8 | 238.9 |
| kimi-k2-0905 | 28 | 8.0% ± 3.0% | 10.7% | 3.6% | $19.13 | 467.0 |
| gpt-4-1 | 28 | 6.2% ± 1.6% | 14.3% | 0.0% | $7.9 | 48.8 |
| gemini-2-5-flash | 28 | 6.2% ± 4.6% | 21.4% | 0.0% | $2.15 | 55.8 |
| gpt-5-mini-low | 28 | 5.4% ± 5.4% | 14.3% | 0.0% | $1.07 | 62.6 |
| gemini-2-5-pro | 28 | 4.5% ± 3.0% | 7.1% | 0.0% | $17.9 | 102.4 |
| gpt-oss-120b | 28 | 3.6% ± 2.5% | 14.3% | 0.0% | $0.15 | 34.0 |
| grok-code-fast-1 | 28 | 2.7% ± 1.6% | 3.6% | 0.0% | $3.45 | 334.1 |
| grok-4 | 28 | 2.7% ± 1.6% | 3.6% | 0.0% | $62.48 | 554.0 |
| gpt-4-1-mini | 28 | 1.8% ± 1.8% | 3.6% | 0.0% | $3 | 59.1 |
| gpt-5-nano-low | 28 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.19 | 68.0 |
| gpt-4-1-nano | 28 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.28 | 32.2 |

Playwright Performance

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5-low | 25 | 45.0% ± 1.7% | 56.0% | 32.0% | $58.7 | 526.9 |
| grok-4 | 25 | 35.0% ± 7.7% | 48.0% | 20.0% | $97.36 | 277.2 |
| qwen-3-coder-plus | 25 | 30.0% ± 4.5% | 48.0% | 8.0% | $14.31 | 680.0 |
| kimi-k2-0905 | 25 | 30.0% ± 6.0% | 40.0% | 20.0% | $20.51 | 380.6 |
| claude-sonnet-4 | 25 | 26.0% ± 6.0% | 36.0% | 8.0% | $94.47 | 278.7 |
| grok-code-fast-1 | 25 | 25.0% ± 1.7% | 36.0% | 8.0% | $6.06 | 119.5 |
| claude-opus-4-1 | 25 | 24.0% ± 0.0% | / | / | $435.18 | 395.2 |
| o3 | 25 | 15.0% ± 5.2% | 32.0% | 8.0% | $28.71 | 153.9 |
| gemini-2-5-pro | 25 | 15.0% ± 1.7% | 32.0% | 4.0% | $108.12 | 177.7 |
| glm-4-5 | 25 | 13.0% ± 3.3% | 20.0% | 4.0% | $4.9 | 165.6 |
| kimi-k2-0711 | 25 | 13.0% ± 3.3% | 16.0% | 8.0% | $11.17 | 221.4 |
| o4-mini | 25 | 12.0% ± 2.8% | 28.0% | 0.0% | $25.71 | 530.6 |
| gpt-4-1 | 25 | 8.0% ± 2.8% | 12.0% | 4.0% | $43.16 | 92.2 |
| qwen-3-max | 25 | 8.0% ± 0.0% | 12.0% | 4.0% | $69.1 | 417.7 |
| deepseek-chat | 25 | 7.0% ± 3.3% | 16.0% | 0.0% | $11.78 | 288.3 |
| gemini-2-5-flash | 25 | 6.0% ± 2.0% | 12.0% | 0.0% | $29.31 | 205.4 |
| gpt-oss-120b | 25 | 3.0% ± 1.7% | 4.0% | 0.0% | $0.26 | 37.3 |
| gpt-5-mini-low | 25 | 1.0% ± 1.7% | 4.0% | 0.0% | $2.72 | 67.1 |
| gpt-5-nano-low | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.67 | 139.3 |
| gpt-4-1-nano | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.98 | 53.8 |
| gpt-4-1-mini | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $49.72 | 195.7 |

Postgres Performance

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5-low | 21 | 73.8% ± 4.1% | 95.2% | 38.1% | $6.11 | 272.3 |
| grok-4 | 21 | 58.3% ± 7.8% | 81.0% | 38.1% | $14.32 | 204.3 |
| claude-sonnet-4 | 21 | 53.6% ± 6.2% | 71.4% | 38.1% | $23.24 | 239.5 |
| qwen-3-coder-plus | 21 | 47.6% ± 5.8% | 61.9% | 38.1% | $2.5 | 140.9 |
| grok-code-fast-1 | 21 | 47.6% ± 4.8% | 61.9% | 28.6% | $1.12 | 51.3 |
| kimi-k2-0905 | 21 | 47.6% ± 4.8% | 66.7% | 28.6% | $5.84 | 517.3 |
| qwen-3-max | 21 | 44.0% ± 2.1% | 52.4% | 38.1% | $5.46 | 159.6 |
| deepseek-chat | 21 | 42.9% ± 7.5% | 61.9% | 28.6% | $3.89 | 355.6 |
| kimi-k2-0711 | 21 | 40.5% ± 7.9% | 71.4% | 28.6% | $3.55 | 248.4 |
| o3 | 21 | 36.9% ± 4.0% | 66.7% | 14.3% | $3.46 | 75.6 |
| claude-opus-4-1 | 21 | 33.3% ± 0.0% | / | / | $97.54 | 515.4 |
| gemini-2-5-pro | 21 | 26.2% ± 7.9% | 47.6% | 9.5% | $4.89 | 93.5 |
| glm-4-5 | 21 | 14.3% ± 7.5% | 23.8% | 0.0% | $1.56 | 158.1 |
| gpt-5-mini-low | 21 | 14.3% ± 3.4% | 28.6% | 0.0% | $0.34 | 52.7 |
| o4-mini | 21 | 11.9% ± 4.1% | 19.1% | 4.8% | $0.9 | 84.7 |
| gemini-2-5-flash | 21 | 10.7% ± 6.2% | 23.8% | 4.8% | $0.81 | 60.9 |
| gpt-4-1-mini | 21 | 9.5% ± 3.4% | 14.3% | 4.8% | $0.45 | 42.1 |
| gpt-5-nano-low | 21 | 8.3% ± 4.0% | 19.1% | 0.0% | $0.16 | 78.5 |
| gpt-oss-120b | 21 | 7.1% ± 2.4% | 23.8% | 0.0% | $0.04 | 23.3 |
| gpt-4-1 | 21 | 4.8% ± 0.0% | 4.8% | 4.8% | $2.52 | 28.9 |
| gpt-4-1-nano | 21 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.17 | 32.5 |

About

Collection of evaluation results for MCPMark
