Replace static chart with SVG representation in OSWorld blog post to enhance visual clarity and maintain responsiveness. Update content to reflect current model performance metrics against human benchmarks, ensuring accurate and engaging presentation.
data/blog_posts/osworld-verified.md
4 additions & 40 deletions
@@ -251,46 +251,10 @@ The performance distribution reveals distinct tiers with substantial improvement
While the gaps between tiers remain significant, the dramatic upward shift across all categories demonstrates accelerating progress.
This indicates that OSWorld continues to provide meaningful developmental signal, particularly highlighting the effectiveness of reasoning-enhanced agentic approaches while revealing remaining challenges in areas requiring complex multi-step reasoning, robust error recovery, and dynamic adaptation to interface changes.
<figcaption style="text-align: center;">Figure 2. Gap to Human Performance - Current best models still show significant gaps compared to human performance.</figcaption>
+
</figure>
**Agentic frameworks with reasoning models dominate the leaderboard.** Agentic frameworks powered by reasoning models like o3 have achieved breakthrough performance. CoACT-1 leads with a 60.76% success rate, followed closely by Agent S2.5 w/ o3 (56.0%) and GTA1 w/ o3 (53.1%).
This demonstrates that sophisticated orchestration layers can dramatically amplify the capabilities of reasoning models, even when those models weren't specifically trained for computer use tasks.