mcpmark-release - Evaluation Results

Generated: 2025-12-15T06:36:08.102520
Task set: standard

Overall Performance

Model	Total Tasks	Pass@1 (avg ± std)	Pass@4	Pass^4	Per-Run Cost (USD)	Avg Agent Time (s)
gpt-5-2-high	127	57.5% ± 1.1%	66.9%	44.9%	$250.47	732.5
gemini-3-pro-high	127	53.9% ± 0.4%	66.9%	37.8%	$265.59	222.4
gpt-5-medium	127	52.6% ± 1.3%	68.5%	33.9%	$127.46	478.2
gpt-5-high	127	51.6% ± 2.5%	66.1%	33.1%	$153.89	1029.3
gemini-3-pro-low	127	50.8% ± 2.1%	67.7%	30.7%	$257.15	209.4
gpt-5-low	127	46.9% ± 2.9%	63.0%	26.8%	$125.87	385.8
claude-opus-4-5-high	127	42.3% ± 2.0%	53.5%	33.9%	$466.18	216.9
deepseek-v3-2-thinking	127	36.8% ± 1.8%	51.2%	21.3%	$31.28	398.0
claude-sonnet-4-5	127	32.1% ± 2.3%	46.5%	16.5%	$281.6	173.2
grok-4	127	31.7% ± 2.9%	44.9%	18.1%	$257.41	319.8
gpt-5-mini-high	127	30.3% ± 1.7%	46.5%	16.5%	$40.35	349.9
claude-opus-4-1	127	29.9% ± 0.0%	/	/	$1165.45	361.8
deepseek-v3-2-chat	127	29.7% ± 1.5%	46.5%	13.4%	$26.58	298.4
claude-sonnet-4-high	127	28.3% ± 2.4%	40.9%	18.1%	$442.33	185.6
claude-sonnet-4	127	28.1% ± 2.6%	44.9%	12.6%	$252.41	218.3
claude-sonnet-4-low	127	27.4% ± 1.7%	39.4%	18.1%	$460.95	199.4
gpt-5-mini-medium	127	27.4% ± 3.1%	45.7%	9.4%	$26.02	159.9
o3	127	25.4% ± 2.0%	43.3%	12.6%	$113.94	169.4
qwen-3-coder-plus	127	24.8% ± 2.1%	40.9%	12.6%	$36.46	274.3
grok-4-fast	127	24.0% ± 3.1%	38.6%	12.6%	$17.88	109.9
kimi-k2-0905	127	21.9% ± 1.2%	31.5%	12.6%	$72.57	493.8
deepseek-v3-1-terminus-thinking	127	21.3% ± 3.3%	37.0%	5.5%	$10.52	734.5
grok-code-fast-1	127	20.5% ± 3.4%	30.7%	9.4%	$16.08	156.6
kimi-k2-0711	127	19.1% ± 1.6%	31.5%	11.8%	$36.45	214.8
qwen-3-max	127	17.7% ± 1.3%	22.8%	11.0%	$160	213.6
o4-mini	127	17.3% ± 2.3%	31.5%	6.3%	$63.62	323.3
deepseek-chat	127	16.7% ± 1.4%	28.3%	7.9%	$35.66	269.9
deepseek-v3-1-terminus	127	16.5% ± 5.1%	29.9%	3.9%	$12.65	244.9
gemini-2-5-pro	127	15.8% ± 0.6%	29.9%	4.7%	$162.48	119.4
glm-4-5	127	15.6% ± 1.2%	24.4%	6.3%	$18.27	166.3
gemini-2-5-flash	127	9.1% ± 0.7%	18.1%	3.9%	$41.81	114.9
gpt-5-mini-low	127	8.3% ± 1.3%	18.9%	0.8%	$7.86	63.2
gpt-4-1	127	8.1% ± 0.7%	12.6%	3.1%	$83.62	59.7
gpt-5-nano-medium	127	6.3% ± 2.0%	11.8%	1.6%	$4.06	157.2
gpt-5-nano-high	127	5.1% ± 2.1%	14.2%	0.0%	$7.79	309.5
gpt-oss-120b	127	4.7% ± 1.0%	13.4%	0.0%	$0.64	27.4
gpt-5-nano-low	127	4.3% ± 1.2%	10.2%	0.8%	$2.5	96.4
gpt-4-1-mini	127	3.9% ± 1.0%	7.1%	1.6%	$59.96	85.7
gpt-4-1-nano	127	0.0% ± 0.0%	0.0%	0.0%	$2.54	39.1
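
The metric columns are not defined in this file. The sketch below shows one plausible way they could be computed from per-task, per-run outcomes. It assumes Pass@1 is the per-run success rate averaged over runs, Pass@4 counts a task as solved if at least one of four runs succeeds, and Pass^4 counts it only if all four runs succeed. The function name `summarize`, the input layout, and the use of a population standard deviation are illustrative assumptions, not taken from the benchmark code.

```python
from statistics import mean, pstdev


def summarize(results):
    """Summarize pass metrics.

    results: dict mapping task id -> list of bools, one per run.
    Every task is assumed to have the same number of runs (k).
    """
    k = len(next(iter(results.values())))
    n = len(results)
    # Success rate of each individual run across all tasks.
    per_run = [sum(runs[i] for runs in results.values()) / n for i in range(k)]
    return {
        "pass@1_avg": mean(per_run),
        # Assumption: population std; the report may use the sample std instead.
        "pass@1_std": pstdev(per_run),
        # Solved in at least one of the k runs.
        "pass@k": sum(any(runs) for runs in results.values()) / n,
        # Solved in every one of the k runs.
        "pass^k": sum(all(runs) for runs in results.values()) / n,
    }


# Toy data (hypothetical, not from the benchmark): 4 tasks, 4 runs each.
toy = {
    "t1": [True, True, True, True],
    "t2": [True, False, False, False],
    "t3": [False, False, False, False],
    "t4": [True, True, False, True],
}
print(summarize(toy))
```

Under these assumptions Pass^4 ≤ Pass@1 ≤ Pass@4 for every model, which matches the ordering seen in the tables. A model with a single run would have a standard deviation of 0.0 and no Pass@4 or Pass^4 value, which may be what the `/` cells indicate.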

Filesystem Performance

Model	Total Tasks	Pass@1 (avg ± std)	Pass@4	Pass^4	Per-Run Cost (USD)	Avg Agent Time (s)
gpt-5-2-high	30	60.8% ± 2.8%	70.0%	46.7%	$40.75	500.0
gemini-3-pro-high	30	59.2% ± 3.6%	80.0%	40.0%	$41.68	229.3
gpt-5-medium	30	57.5% ± 3.6%	76.7%	36.7%	$13.31	313.1
gemini-3-pro-low	30	56.7% ± 4.1%	80.0%	33.3%	$39.3	209.2
gpt-5-low	30	54.2% ± 6.8%	73.3%	33.3%	$15.48	275.6
gpt-5-high	30	52.5% ± 3.6%	70.0%	36.7%	$17.48	828.0
grok-4	30	50.8% ± 6.4%	73.3%	26.7%	$27.08	256.6
claude-opus-4-5-high	30	40.0% ± 2.4%	50.0%	33.3%	$36.23	87.9
deepseek-v3-2-thinking	30	36.7% ± 4.1%	46.7%	23.3%	$5.47	413.5
o3	30	35.8% ± 2.8%	50.0%	26.7%	$45.65	277.9
gpt-5-mini-high	30	35.0% ± 7.6%	46.7%	23.3%	$5.56	288.9
claude-opus-4-1	30	33.3% ± 0.0%	/	/	$132.3	267.8
gpt-5-mini-medium	30	33.3% ± 6.2%	53.3%	10.0%	$3.74	174.3
claude-sonnet-4-5	30	32.5% ± 4.9%	43.3%	13.3%	$26.32	95.1
grok-4-fast	30	29.2% ± 7.2%	53.3%	16.7%	$3.25	74.7
claude-sonnet-4	30	27.5% ± 2.8%	50.0%	6.7%	$29	193.1
deepseek-v3-2-chat	30	25.0% ± 5.5%	43.3%	6.7%	$5.03	311.2
o4-mini	30	25.0% ± 2.9%	36.7%	13.3%	$11.78	263.6
deepseek-v3-1-terminus-thinking	30	24.2% ± 10.1%	43.3%	6.7%	$1.03	773.4
gemini-2-5-pro	30	24.2% ± 3.6%	43.3%	10.0%	$19.61	126.1
claude-sonnet-4-low	30	23.3% ± 4.7%	36.7%	13.3%	$95.98	176.0
grok-code-fast-1	30	23.3% ± 7.4%	40.0%	10.0%	$1.76	75.5
claude-sonnet-4-high	30	23.3% ± 4.1%	36.7%	10.0%	$82.56	143.4
kimi-k2-0711	30	20.0% ± 2.4%	30.0%	13.3%	$6.99	222.5
deepseek-chat	30	15.8% ± 1.4%	26.7%	6.7%	$7.25	281.7
kimi-k2-0905	30	14.2% ± 1.4%	23.3%	6.7%	$12.88	376.5
qwen-3-coder-plus	30	13.3% ± 6.7%	26.7%	3.3%	$5.93	157.2
gpt-4-1	30	12.5% ± 1.4%	20.0%	3.3%	$9.07	41.0
gpt-5-nano-low	30	12.5% ± 3.6%	30.0%	3.3%	$1.19	129.1
gpt-5-mini-low	30	12.5% ± 4.9%	33.3%	3.3%	$1.13	67.7
deepseek-v3-1-terminus	30	10.8% ± 4.9%	20.0%	3.3%	$1.28	179.8
qwen-3-max	30	10.8% ± 1.4%	13.3%	10.0%	$14.54	133.9
gemini-2-5-flash	30	8.3% ± 1.7%	13.3%	6.7%	$1.18	62.1
glm-4-5	30	7.5% ± 1.4%	13.3%	3.3%	$2.08	130.2
gpt-5-nano-medium	30	6.7% ± 5.3%	16.7%	0.0%	$0.93	129.5
gpt-5-nano-high	30	5.8% ± 4.9%	16.7%	0.0%	$1.47	206.6
gpt-oss-120b	30	5.8% ± 4.3%	16.7%	0.0%	$0.05	18.3
gpt-4-1-mini	30	3.3% ± 0.0%	3.3%	3.3%	$2.43	51.5
gpt-4-1-nano	30	0.0% ± 0.0%	0.0%	0.0%	$0.37	28.1

GitHub Performance

Model	Total Tasks	Pass@1 (avg ± std)	Pass@4	Pass^4	Per-Run Cost (USD)	Avg Agent Time (s)
gpt-5-high	23	50.0% ± 2.2%	60.9%	34.8%	$34.73	1083.5
gpt-5-medium	23	47.8% ± 8.1%	65.2%	17.4%	$23.7	456.6
gpt-5-2-high	23	47.8% ± 5.3%	60.9%	34.8%	$43.34	714.5
gemini-3-pro-high	23	46.7% ± 6.4%	65.2%	30.4%	$30.96	209.5
gemini-3-pro-low	23	45.6% ± 11.7%	65.2%	26.1%	$32.55	205.5
claude-opus-4-5-high	23	37.0% ± 7.2%	52.2%	21.7%	$143.37	555.3
claude-sonnet-4-5	23	29.3% ± 8.9%	43.5%	17.4%	$68.75	220.0
claude-sonnet-4-high	23	28.3% ± 2.2%	43.5%	21.7%	$58.39	170.0
gpt-5-low	23	27.2% ± 1.9%	39.1%	17.4%	$22.31	268.3
claude-sonnet-4-low	23	25.0% ± 3.6%	34.8%	21.7%	$57.93	173.6
glm-4-5	23	22.8% ± 6.4%	34.8%	13.0%	$3.77	153.5
claude-opus-4-1	23	21.7% ± 0.0%	/	/	$224.18	390.2
deepseek-v3-2-thinking	23	20.6% ± 1.9%	43.5%	0.0%	$6.18	411.8
qwen-3-coder-plus	23	19.6% ± 6.5%	34.8%	13.0%	$9.2	320.4
gpt-5-mini-high	23	19.6% ± 2.2%	34.8%	8.7%	$6.92	338.1
gpt-5-mini-medium	23	18.5% ± 7.8%	34.8%	4.3%	$3.89	127.6
deepseek-v3-2-chat	23	17.4% ± 5.3%	39.1%	0.0%	$5.45	292.1
claude-sonnet-4	23	16.3% ± 5.7%	30.4%	8.7%	$49.61	196.5
kimi-k2-0905	23	16.3% ± 1.9%	26.1%	8.7%	$14.21	780.8
gemini-2-5-flash	23	15.2% ± 2.2%	21.7%	8.7%	$8.37	206.4
o3	23	14.1% ± 3.6%	21.7%	4.3%	$21.41	128.0
qwen-3-max	23	14.1% ± 3.6%	17.4%	4.3%	$37.56	181.5
grok-4	23	14.1% ± 3.6%	21.7%	8.7%	$56.18	269.0
o4-mini	23	14.1% ± 6.4%	26.1%	4.3%	$13.79	248.8
grok-4-fast	23	13.0% ± 3.1%	21.7%	0.0%	$2.77	143.1
deepseek-v3-1-terminus-thinking	23	10.9% ± 4.9%	21.7%	0.0%	$2.07	702.3
kimi-k2-0711	23	10.9% ± 2.2%	13.0%	4.3%	$5.37	205.0
deepseek-chat	23	9.8% ± 1.9%	13.0%	8.7%	$4.75	194.0
gemini-2-5-pro	23	9.8% ± 1.9%	21.7%	0.0%	$11.96	91.3
grok-code-fast-1	23	8.7% ± 5.3%	17.4%	4.3%	$3.68	182.9
gpt-5-nano-high	23	8.7% ± 3.1%	17.4%	0.0%	$1.86	317.0
gpt-5-mini-low	23	8.7% ± 3.1%	13.0%	0.0%	$2.59	63.1
gpt-4-1	23	7.6% ± 1.9%	8.7%	4.3%	$20.97	90.2
gpt-5-nano-medium	23	7.6% ± 1.9%	13.0%	0.0%	$1.11	187.3
gpt-4-1-mini	23	6.5% ± 6.5%	17.4%	0.0%	$4.35	83.1
deepseek-v3-1-terminus	23	5.4% ± 7.1%	17.4%	0.0%	$2.2	231.9
gpt-oss-120b	23	4.3% ± 3.1%	8.7%	0.0%	$0.14	24.0
gpt-5-nano-low	23	0.0% ± 0.0%	0.0%	0.0%	$0.29	57.7
gpt-4-1-nano	23	0.0% ± 0.0%	0.0%	0.0%	$0.74	51.8

Notion Performance

Model	Total Tasks	Pass@1 (avg ± std)	Pass@4	Pass^4	Per-Run Cost (USD)	Avg Agent Time (s)
gpt-5-2-high	28	60.7% ± 2.5%	67.9%	50.0%	$60.66	1259.0
gemini-3-pro-high	28	47.3% ± 3.0%	57.1%	28.6%	$19.11	158.6
deepseek-v3-2-thinking	28	45.5% ± 4.6%	57.1%	32.1%	$6.53	408.1
gpt-5-high	28	44.6% ± 1.8%	60.7%	21.4%	$28.98	1161.4
gemini-3-pro-low	28	43.8% ± 3.9%	57.1%	21.4%	$24.13	197.0
gpt-5-medium	28	42.0% ± 3.0%	50.0%	32.1%	$21.98	661.8
claude-opus-4-5-high	28	38.4% ± 3.9%	46.4%	32.1%	$79.08	148.7
gpt-5-low	28	36.6% ± 7.7%	53.6%	14.3%	$23.26	559.6
claude-opus-4-1	28	35.7% ± 0.0%	/	/	$276.24	294.2
deepseek-v3-2-chat	28	32.1% ± 5.7%	46.4%	14.3%	$6.06	264.8
claude-sonnet-4-5	28	25.0% ± 7.6%	46.4%	3.6%	$58.39	165.0
o3	28	24.1% ± 3.9%	46.4%	7.1%	$14.72	171.4
deepseek-v3-1-terminus-thinking	28	22.3% ± 3.0%	39.3%	3.6%	$3.05	919.4
claude-sonnet-4-low	28	22.3% ± 3.0%	32.1%	7.1%	$105.36	220.0
deepseek-v3-1-terminus	28	22.3% ± 8.1%	39.3%	7.1%	$4.28	252.5
glm-4-5	28	21.4% ± 2.5%	32.1%	10.7%	$5.97	222.2
claude-sonnet-4	28	21.4% ± 5.1%	39.3%	7.1%	$56.1	193.2
gpt-5-mini-high	28	20.5% ± 13.0%	42.9%	7.1%	$10.62	522.6
o4-mini	28	20.5% ± 5.9%	42.9%	7.1%	$11.44	442.4
qwen-3-coder-plus	28	19.6% ± 6.4%	39.3%	7.1%	$4.52	99.7
claude-sonnet-4-high	28	19.6% ± 8.2%	35.7%	3.6%	$117.43	190.6
qwen-3-max	28	17.0% ± 4.6%	25.0%	3.6%	$33.34	183.5
gpt-5-mini-medium	28	16.1% ± 5.9%	32.1%	3.6%	$5.63	168.7
kimi-k2-0711	28	14.3% ± 4.4%	32.1%	7.1%	$9.37	183.4
deepseek-chat	28	12.5% ± 3.1%	28.6%	0.0%	$8	238.9
kimi-k2-0905	28	8.0% ± 3.0%	10.7%	3.6%	$19.13	467.0
gpt-4-1	28	6.2% ± 1.6%	14.3%	0.0%	$7.9	48.8
gemini-2-5-flash	28	6.2% ± 4.6%	21.4%	0.0%	$2.15	55.8
gpt-5-mini-low	28	5.4% ± 5.4%	14.3%	0.0%	$1.07	62.6
gemini-2-5-pro	28	4.5% ± 3.0%	7.1%	0.0%	$17.9	102.4
gpt-5-nano-medium	28	3.6% ± 0.0%	3.6%	3.6%	$0.65	170.5
gpt-oss-120b	28	3.6% ± 2.5%	14.3%	0.0%	$0.15	34.0
grok-4-fast	28	3.6% ± 0.0%	3.6%	3.6%	$3.17	152.7
grok-code-fast-1	28	2.7% ± 1.6%	3.6%	0.0%	$3.45	334.1
grok-4	28	2.7% ± 1.6%	3.6%	0.0%	$62.48	554.0
gpt-4-1-mini	28	1.8% ± 1.8%	3.6%	0.0%	$3	59.1
gpt-5-nano-high	28	0.9% ± 1.6%	3.6%	0.0%	$1.43	401.0
gpt-5-nano-low	28	0.0% ± 0.0%	0.0%	0.0%	$0.19	68.0
gpt-4-1-nano	28	0.0% ± 0.0%	0.0%	0.0%	$0.28	32.2

Playwright Performance

Model	Total Tasks	Pass@1 (avg ± std)	Pass@4	Pass^4	Per-Run Cost (USD)	Avg Agent Time (s)
gpt-5-2-high	25	46.0% ± 3.5%	60.0%	32.0%	$88	534.8
gpt-5-low	25	45.0% ± 1.7%	56.0%	32.0%	$58.7	526.9
claude-opus-4-5-high	25	45.0% ± 1.7%	52.0%	40.0%	$153.68	169.6
gpt-5-medium	25	43.0% ± 5.2%	56.0%	36.0%	$61.92	608.2
gpt-5-high	25	42.0% ± 4.5%	56.0%	24.0%	$61.32	1115.9
gemini-3-pro-high	25	40.0% ± 5.7%	48.0%	28.0%	$162.52	325.5
gemini-3-pro-low	25	40.0% ± 4.0%	48.0%	28.0%	$153.19	286.8
grok-4	25	35.0% ± 7.7%	48.0%	20.0%	$97.36	277.2
qwen-3-coder-plus	25	30.0% ± 4.5%	48.0%	8.0%	$14.31	680.0
kimi-k2-0905	25	30.0% ± 6.0%	40.0%	20.0%	$20.51	380.6
claude-sonnet-4-5	25	27.0% ± 5.9%	36.0%	16.0%	$94.37	175.4
grok-4-fast	25	27.0% ± 3.3%	40.0%	16.0%	$7.68	105.8
claude-sonnet-4	25	26.0% ± 6.0%	36.0%	8.0%	$94.47	278.7
claude-sonnet-4-high	25	26.0% ± 2.0%	28.0%	24.0%	$154.28	261.9
grok-code-fast-1	25	25.0% ± 1.7%	36.0%	8.0%	$6.06	119.5
claude-opus-4-1	25	24.0% ± 0.0%	/	/	$435.18	395.2
claude-sonnet-4-low	25	22.0% ± 3.5%	28.0%	20.0%	$157.11	239.1
deepseek-v3-2-chat	25	19.0% ± 3.3%	28.0%	12.0%	$7.19	314.9
deepseek-v3-2-thinking	25	17.0% ± 1.7%	24.0%	12.0%	$10.21	349.5
o3	25	15.0% ± 5.2%	32.0%	8.0%	$28.71	153.9
gemini-2-5-pro	25	15.0% ± 1.7%	32.0%	4.0%	$108.12	177.7
gpt-5-mini-high	25	15.0% ± 5.2%	32.0%	4.0%	$15.42	365.6
glm-4-5	25	13.0% ± 3.3%	20.0%	4.0%	$4.9	165.6
deepseek-v3-1-terminus	25	13.0% ± 1.7%	20.0%	8.0%	$3.57	329.9
kimi-k2-0711	25	13.0% ± 3.3%	16.0%	8.0%	$11.17	221.4
gpt-5-mini-medium	25	12.0% ± 6.3%	24.0%	4.0%	$11.77	216.0
o4-mini	25	12.0% ± 2.8%	28.0%	0.0%	$25.71	530.6
deepseek-v3-1-terminus-thinking	25	9.0% ± 1.7%	20.0%	0.0%	$3.05	775.8
gpt-4-1	25	8.0% ± 2.8%	12.0%	4.0%	$43.16	92.2
qwen-3-max	25	8.0% ± 0.0%	12.0%	4.0%	$69.1	417.7
deepseek-chat	25	7.0% ± 3.3%	16.0%	0.0%	$11.78	288.3
gemini-2-5-flash	25	6.0% ± 2.0%	12.0%	0.0%	$29.31	205.4
gpt-oss-120b	25	3.0% ± 1.7%	4.0%	0.0%	$0.26	37.3
gpt-5-nano-high	25	2.0% ± 2.0%	4.0%	0.0%	$2.32	325.0
gpt-5-mini-low	25	1.0% ± 1.7%	4.0%	0.0%	$2.72	67.1
gpt-5-nano-low	25	0.0% ± 0.0%	0.0%	0.0%	$0.67	139.3
gpt-4-1-nano	25	0.0% ± 0.0%	0.0%	0.0%	$0.98	53.8
gpt-5-nano-medium	25	0.0% ± 0.0%	0.0%	0.0%	$1.07	171.1
gpt-4-1-mini	25	0.0% ± 0.0%	0.0%	0.0%	$49.72	195.7

Postgres Performance

Model	Total Tasks	Pass@1 (avg ± std)	Pass@4	Pass^4	Per-Run Cost (USD)	Avg Agent Time (s)
gemini-3-pro-high	21	79.8% ± 5.2%	85.7%	66.7%	$11.32	188.8
gpt-5-medium	21	76.2% ± 7.5%	100.0%	47.6%	$6.55	338.2
gpt-5-low	21	73.8% ± 4.1%	95.2%	38.1%	$6.11	272.3
gpt-5-high	21	72.6% ± 4.0%	85.7%	52.4%	$11.37	977.9
gpt-5-2-high	21	72.6% ± 2.1%	76.2%	61.9%	$17.73	617.9
gemini-3-pro-low	21	70.2% ± 4.0%	90.5%	47.6%	$7.97	138.6
gpt-5-mini-high	21	66.7% ± 3.4%	81.0%	42.9%	$1.83	201.2
deepseek-v3-2-thinking	21	66.7% ± 5.8%	90.5%	38.1%	$2.88	405.0
gpt-5-mini-medium	21	61.9% ± 5.8%	90.5%	28.6%	$1	96.5
deepseek-v3-2-chat	21	59.5% ± 5.3%	81.0%	38.1%	$2.84	311.9
grok-4	21	58.3% ± 7.8%	81.0%	38.1%	$14.32	204.3
claude-sonnet-4	21	53.6% ± 6.2%	71.4%	38.1%	$23.24	239.5
claude-opus-4-5-high	21	53.6% ± 5.2%	71.4%	42.9%	$53.82	177.6
grok-4-fast	21	52.4% ± 8.9%	81.0%	28.6%	$1	71.8
claude-sonnet-4-5	21	50.0% ± 4.1%	66.7%	38.1%	$33.77	241.8
claude-sonnet-4-high	21	50.0% ± 7.1%	66.7%	38.1%	$29.68	165.1
claude-sonnet-4-low	21	48.8% ± 7.0%	71.4%	33.3%	$44.57	186.2
qwen-3-coder-plus	21	47.6% ± 5.8%	61.9%	38.1%	$2.5	140.9
grok-code-fast-1	21	47.6% ± 4.8%	61.9%	28.6%	$1.12	51.3
kimi-k2-0905	21	47.6% ± 4.8%	66.7%	28.6%	$5.84	517.3
qwen-3-max	21	44.0% ± 2.1%	52.4%	38.1%	$5.46	159.6
deepseek-chat	21	42.9% ± 7.5%	61.9%	28.6%	$3.89	355.6
deepseek-v3-1-terminus-thinking	21	41.7% ± 7.8%	61.9%	19.1%	$1.31	418.5
kimi-k2-0711	21	40.5% ± 7.9%	71.4%	28.6%	$3.55	248.4
o3	21	36.9% ± 4.0%	66.7%	14.3%	$3.46	75.6
claude-opus-4-1	21	33.3% ± 0.0%	/	/	$97.54	515.4
deepseek-v3-1-terminus	21	33.3% ± 19.9%	57.1%	0.0%	$1.34	240.7
gemini-2-5-pro	21	26.2% ± 7.9%	47.6%	9.5%	$4.89	93.5
gpt-5-nano-medium	21	15.5% ± 5.2%	28.6%	4.8%	$0.3	129.1
glm-4-5	21	14.3% ± 7.5%	23.8%	0.0%	$1.56	158.1
gpt-5-mini-low	21	14.3% ± 3.4%	28.6%	0.0%	$0.34	52.7
o4-mini	21	11.9% ± 4.1%	19.1%	4.8%	$0.9	84.7
gemini-2-5-flash	21	10.7% ± 6.2%	23.8%	4.8%	$0.81	60.9
gpt-5-nano-high	21	9.5% ± 3.4%	33.3%	0.0%	$0.72	307.7
gpt-4-1-mini	21	9.5% ± 3.4%	14.3%	4.8%	$0.45	42.1
gpt-5-nano-low	21	8.3% ± 4.0%	19.1%	0.0%	$0.16	78.5
gpt-oss-120b	21	7.1% ± 2.4%	23.8%	0.0%	$0.04	23.3
gpt-4-1	21	4.8% ± 0.0%	4.8%	4.8%	$2.52	28.9
gpt-4-1-nano	21	0.0% ± 0.0%	0.0%	0.0%	$0.17	32.5
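
For readers who want to work with these tables programmatically, the following is a minimal sketch of a row parser. It assumes rows are tab-separated exactly as shown above and treats `/` as a missing value. The names `Row` and `parse_row` are hypothetical helpers introduced here, not part of the repository.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    total_tasks: int
    pass1_avg: float             # percent
    pass1_std: float             # percent
    pass_at_4: Optional[float]   # percent, None when the cell is "/"
    pass_hat_4: Optional[float]  # percent, None when the cell is "/"
    cost_usd: float
    agent_time_s: float


def _pct(cell):
    """Convert a cell such as '66.9%' to 66.9; '/' becomes None."""
    cell = cell.strip()
    return None if cell == "/" else float(cell.rstrip("%"))


def parse_row(line):
    """Parse one tab-separated data row (not the header row)."""
    model, tasks, pass1, p_at4, p_hat4, cost, secs = line.rstrip("\n").split("\t")
    # 'avg% ± std%' -> two numbers, in that order.
    avg, std = (float(x) for x in re.findall(r"[\d.]+", pass1))
    return Row(model, int(tasks), avg, std, _pct(p_at4), _pct(p_hat4),
               float(cost.lstrip("$")), float(secs))


row = parse_row("claude-opus-4-1\t127\t29.9% ± 0.0%\t/\t/\t$1165.45\t361.8")
print(row.model, row.pass1_avg, row.pass_at_4, row.cost_usd)
```

Keeping the missing cells as `None` rather than 0 avoids treating a model without Pass@4 data as if it had failed every task.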

