Commit 949e6a5

Replace static chart with SVG representation in OSWorld blog post to enhance visual clarity and maintain responsiveness. Update content to reflect current model performance metrics against human benchmarks, ensuring accurate and engaging presentation.
1 parent 26d3515 commit 949e6a5

2 files changed

+88
-40
lines changed

data/blog_posts/osworld-verified.md

Lines changed: 4 additions & 40 deletions
@@ -251,46 +251,10 @@ The performance distribution reveals distinct tiers with substantial improvement
 While the gaps between tiers remain significant, the dramatic upward shift across all categories demonstrates accelerating progress.
 This indicates that OSWorld continues to provide meaningful developmental signal, particularly highlighting the effectiveness of reasoning-enhanced agentic approaches while revealing remaining challenges in areas requiring complex multi-step reasoning, robust error recovery, and dynamic adaptation to interface changes.
 
-<div style="width: 100%; max-width: 800px; margin: 30px auto; padding: 20px; background: #f8f9fa; border-radius: 10px; box-shadow: 0 2px 10px rgba(0,0,0,0.1); font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;">
-<h3 style="text-align: center; font-size: 24px; font-weight: bold; color: #2c3e50; margin-bottom: 30px;">Gap to Human Performance</h3>
-
-<div style="display: flex; align-items: end; justify-content: space-around; height: 300px; margin: 20px 0; padding: 20px; background: white; border-radius: 8px;">
-
-<div style="display: flex; flex-direction: column; align-items: center; margin: 0 10px;">
-<div style="width: 80px; background: linear-gradient(45deg, #667eea, #764ba2); border-radius: 4px 4px 0 0; position: relative; display: flex; align-items: end; justify-content: center; color: white; font-weight: bold; font-size: 14px; padding: 8px 4px;" data-height="240">
-<div style="height: 240px; width: 100%; background: linear-gradient(45deg, #667eea, #764ba2); border-radius: 4px 4px 0 0; display: flex; align-items: end; justify-content: center; padding-bottom: 8px;">60.76%</div>
-</div>
-<div style="margin-top: 8px; font-size: 12px; font-weight: 600; text-align: center; color: #333; word-wrap: break-word; width: 100px;">CoACT-1</div>
-</div>
-
-<div style="display: flex; flex-direction: column; align-items: center; margin: 0 10px;">
-<div style="width: 80px; background: linear-gradient(45deg, #667eea, #764ba2); border-radius: 4px 4px 0 0; position: relative; display: flex; align-items: end; justify-content: center; color: white; font-weight: bold; font-size: 14px; padding: 8px 4px;" data-height="220">
-<div style="height: 220px; width: 100%; background: linear-gradient(45deg, #667eea 70%, #764ba2); border-radius: 4px 4px 0 0; display: flex; align-items: end; justify-content: center; padding-bottom: 8px;">56.0%</div>
-</div>
-<div style="margin-top: 8px; font-size: 12px; font-weight: 600; text-align: center; color: #333; word-wrap: break-word; width: 100px;">Agent S2.5 w/ o3</div>
-</div>
-
-<div style="display: flex; flex-direction: column; align-items: center; margin: 0 10px;">
-<div style="width: 80px; background: linear-gradient(45deg, #764ba2, #9b59b6); border-radius: 4px 4px 0 0; position: relative; display: flex; align-items: end; justify-content: center; color: white; font-weight: bold; font-size: 14px; padding: 8px 4px;" data-height="175">
-<div style="height: 175px; width: 100%; background: linear-gradient(45deg, #764ba2, #9b59b6); border-radius: 4px 4px 0 0; display: flex; align-items: end; justify-content: center; padding-bottom: 8px;">43.9%</div>
-</div>
-<div style="margin-top: 8px; font-size: 12px; font-weight: 600; text-align: center; color: #333; word-wrap: break-word; width: 100px;">Claude 4 Sonnet</div>
-</div>
-
-<div style="display: flex; flex-direction: column; align-items: center; margin: 0 10px;">
-<div style="width: 80px; background: linear-gradient(45deg, #2ecc71, #27ae60); border-radius: 4px 4px 0 0; position: relative; display: flex; align-items: end; justify-content: center; color: white; font-weight: bold; font-size: 14px; padding: 8px 4px;" data-height="288">
-<div style="height: 288px; width: 100%; background: linear-gradient(45deg, #2ecc71, #27ae60); border-radius: 4px 4px 0 0; display: flex; align-items: end; justify-content: center; padding-bottom: 8px;">72%</div>
-</div>
-<div style="margin-top: 8px; font-size: 12px; font-weight: 600; text-align: center; color: #333; word-wrap: break-word; width: 100px;">Human Performance</div>
-</div>
-
-</div>
-
-<div style="text-align: center; margin-top: 20px;">
-<div style="font-size: 16px; font-weight: bold; color: #666; margin-bottom: 5px;">Best Models vs Human Performance</div>
-<div style="font-size: 14px; color: #888;">Success Rate (%)</div>
-</div>
-</div>
+<figure style="text-align: center;">
+<img src="/blog/osworld-verified/human_gap_svg.svg" height=400>
+<figcaption style="text-align: center;">Figure 2. Gap to Human Performance - Current best models still show significant gaps compared to human performance.</figcaption>
+</figure>
 
 **Agentic frameworks with reasoning models dominate the leaderboard.** Agentic frameworks powered by reasoning models like o3 have achieved breakthrough performance.CoACT-1 leads with 60.76% success rate, followed closely by Agent S2.5 w/ o3 (56.0%) and GTA1 w/ o3 (53.1%).
 This demonstrates that sophisticated orchestration layers can dramatically amplify the capabilities of reasoning models, even when those models weren't specifically trained for computer use tasks.
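
Reviewer note: the "gap to human performance" that the new figure caption refers to can be checked with a little arithmetic on the success rates from the chart this commit removes. The sketch below is illustrative only and is not part of the commit; the names `HUMAN`, `scores`, and `gap_to_human` are hypothetical, and the values are copied from the removed HTML bars.

```python
# Success rates (%) copied from the HTML bar chart removed in this commit.
HUMAN = 72.0
scores = {
    "CoACT-1": 60.76,
    "Agent S2.5 w/ o3": 56.0,
    "Claude 4 Sonnet": 43.9,
}


def gap_to_human(score, human=HUMAN):
    """Return the shortfall versus the human baseline, in percentage points."""
    return round(human - score, 2)


# Hypothetical helper output: one gap per charted system.
gaps = {name: gap_to_human(score) for name, score in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} percentage points below human")
```

Under these numbers the best charted system, CoACT-1, sits roughly 11 percentage points below the 72% human baseline, which is consistent with the caption's statement that a significant gap remains.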
Lines changed: 84 additions & 0 deletions