Replace the inline HTML/CSS bar chart in the OSWorld-Verified blog post with an SVG image wrapped in a captioned figure, for visual clarity. The chart compares the success rates of the current best models against human performance.

Timothyxxx · Timothyxxx · commit 949e6a5cf472 · 2025-08-07T11:33:12.000Z
diff --git a/data/blog_posts/osworld-verified.md b/data/blog_posts/osworld-verified.md
@@ -251,46 +251,10 @@ The performance distribution reveals distinct tiers with substantial improvement
 While the gaps between tiers remain significant, the dramatic upward shift across all categories demonstrates accelerating progress. 
 This indicates that OSWorld continues to provide meaningful developmental signal, particularly highlighting the effectiveness of reasoning-enhanced agentic approaches while revealing remaining challenges in areas requiring complex multi-step reasoning, robust error recovery, and dynamic adaptation to interface changes.
 
-<div style="width: 100%; max-width: 800px; margin: 30px auto; padding: 20px; background: #f8f9fa; border-radius: 10px; box-shadow: 0 2px 10px rgba(0,0,0,0.1); font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;">
-  <h3 style="text-align: center; font-size: 24px; font-weight: bold; color: #2c3e50; margin-bottom: 30px;">Gap to Human Performance</h3>
-  
-  <div style="display: flex; align-items: end; justify-content: space-around; height: 300px; margin: 20px 0; padding: 20px; background: white; border-radius: 8px;">
-    
-    <div style="display: flex; flex-direction: column; align-items: center; margin: 0 10px;">
-      <div style="width: 80px; background: linear-gradient(45deg, #667eea, #764ba2); border-radius: 4px 4px 0 0; position: relative; display: flex; align-items: end; justify-content: center; color: white; font-weight: bold; font-size: 14px; padding: 8px 4px;" data-height="240">
-        <div style="height: 240px; width: 100%; background: linear-gradient(45deg, #667eea, #764ba2); border-radius: 4px 4px 0 0; display: flex; align-items: end; justify-content: center; padding-bottom: 8px;">60.76%</div>
-      </div>
-      <div style="margin-top: 8px; font-size: 12px; font-weight: 600; text-align: center; color: #333; word-wrap: break-word; width: 100px;">CoACT-1</div>
-    </div>
-    
-    <div style="display: flex; flex-direction: column; align-items: center; margin: 0 10px;">
-      <div style="width: 80px; background: linear-gradient(45deg, #667eea, #764ba2); border-radius: 4px 4px 0 0; position: relative; display: flex; align-items: end; justify-content: center; color: white; font-weight: bold; font-size: 14px; padding: 8px 4px;" data-height="220">
-        <div style="height: 220px; width: 100%; background: linear-gradient(45deg, #667eea 70%, #764ba2); border-radius: 4px 4px 0 0; display: flex; align-items: end; justify-content: center; padding-bottom: 8px;">56.0%</div>
-      </div>
-      <div style="margin-top: 8px; font-size: 12px; font-weight: 600; text-align: center; color: #333; word-wrap: break-word; width: 100px;">Agent S2.5 w/ o3</div>
-    </div>
-    
-    <div style="display: flex; flex-direction: column; align-items: center; margin: 0 10px;">
-      <div style="width: 80px; background: linear-gradient(45deg, #764ba2, #9b59b6); border-radius: 4px 4px 0 0; position: relative; display: flex; align-items: end; justify-content: center; color: white; font-weight: bold; font-size: 14px; padding: 8px 4px;" data-height="175">
-        <div style="height: 175px; width: 100%; background: linear-gradient(45deg, #764ba2, #9b59b6); border-radius: 4px 4px 0 0; display: flex; align-items: end; justify-content: center; padding-bottom: 8px;">43.9%</div>
-      </div>
-      <div style="margin-top: 8px; font-size: 12px; font-weight: 600; text-align: center; color: #333; word-wrap: break-word; width: 100px;">Claude 4 Sonnet</div>
-    </div>
-    
-    <div style="display: flex; flex-direction: column; align-items: center; margin: 0 10px;">
-      <div style="width: 80px; background: linear-gradient(45deg, #2ecc71, #27ae60); border-radius: 4px 4px 0 0; position: relative; display: flex; align-items: end; justify-content: center; color: white; font-weight: bold; font-size: 14px; padding: 8px 4px;" data-height="288">
-        <div style="height: 288px; width: 100%; background: linear-gradient(45deg, #2ecc71, #27ae60); border-radius: 4px 4px 0 0; display: flex; align-items: end; justify-content: center; padding-bottom: 8px;">72%</div>
-      </div>
-      <div style="margin-top: 8px; font-size: 12px; font-weight: 600; text-align: center; color: #333; word-wrap: break-word; width: 100px;">Human Performance</div>
-    </div>
-    
-  </div>
-  
-  <div style="text-align: center; margin-top: 20px;">
-    <div style="font-size: 16px; font-weight: bold; color: #666; margin-bottom: 5px;">Best Models vs Human Performance</div>
-    <div style="font-size: 14px; color: #888;">Success Rate (%)</div>
-  </div>
-</div>
+<figure style="text-align: center;">  
+  <img src="/blog/osworld-verified/human_gap_svg.svg" alt="Bar chart: Gap to Human Performance" height="400">  
+  <figcaption style="text-align: center;">Figure 2. Gap to Human Performance - Current best models still show significant gaps compared to human performance.</figcaption>  
+</figure>
 
 **Agentic frameworks with reasoning models dominate the leaderboard.** Agentic frameworks powered by reasoning models like o3 have achieved breakthrough performance. CoACT-1 leads with a 60.76% success rate, followed closely by Agent S2.5 w/ o3 (56.0%) and GTA1 w/ o3 (53.1%). 
 This demonstrates that sophisticated orchestration layers can dramatically amplify the capabilities of reasoning models, even when those models weren't specifically trained for computer use tasks. 
diff --git a/public/blog/osworld-verified/human_gap_svg.svg b/public/blog/osworld-verified/human_gap_svg.svg
@@ -0,0 +1,84 @@
+<svg width="800" height="500" xmlns="http://www.w3.org/2000/svg">
+  <!-- Background -->
+  <rect width="800" height="500" fill="#f8f9fa" rx="10"/>
+  
+  <!-- Title -->
+  <text x="400" y="40" text-anchor="middle" font-family="Arial, sans-serif" font-size="24" font-weight="bold" fill="#2c3e50">
+    Gap to Human Performance
+  </text>
+  
+  <!-- Subtitle -->
+  <text x="400" y="65" text-anchor="middle" font-family="Arial, sans-serif" font-size="16" fill="#7f8c8d">
+    Best Models vs Human Performance
+  </text>
+  
+  <!-- Chart area background -->
+  <rect x="100" y="100" width="600" height="300" fill="white" stroke="#e9ecef" stroke-width="1" rx="5"/>
+  
+  <!-- Grid lines -->
+  <g stroke="#e9ecef" stroke-width="1">
+    <!-- Horizontal grid lines -->
+    <line x1="100" y1="160" x2="700" y2="160"/>
+    <line x1="100" y1="220" x2="700" y2="220"/>
+    <line x1="100" y1="280" x2="700" y2="280"/>
+    <line x1="100" y1="340" x2="700" y2="340"/>
+  </g>
+  
+  <!-- Y-axis labels -->
+  <g font-family="Arial, sans-serif" font-size="12" fill="#666" text-anchor="end">
+    <text x="95" y="405">0</text>
+    <text x="95" y="345">20</text>
+    <text x="95" y="285">40</text>
+    <text x="95" y="225">60</text>
+    <text x="95" y="165">80</text>
+  </g>
+  
+  <!-- Y-axis title -->
+  <text x="30" y="250" text-anchor="middle" font-family="Arial, sans-serif" font-size="14" font-weight="bold" fill="#666" transform="rotate(-90, 30, 250)">
+    Success Rate (%)
+  </text>
+  
+  <!-- Bars -->
+  <!-- Bar heights follow the y-axis scale: 3 px per percentage point, baseline at y=400 -->
+  <rect x="140" y="217.72" width="80" height="182.28" fill="rgba(102, 126, 234, 0.8)" stroke="rgba(102, 126, 234, 1)" stroke-width="2" rx="3"/> <!-- CoACT-1: 60.76% -->
+  
+  <!-- Agent S2.5 w/ o3: 56.0% -->
+  <rect x="270" y="232" width="80" height="168" fill="rgba(102, 126, 234, 0.6)" stroke="rgba(102, 126, 234, 1)" stroke-width="2" rx="3"/>
+  
+  <!-- Claude 4 Sonnet: 43.9% -->
+  <rect x="400" y="268.3" width="80" height="131.7" fill="rgba(118, 75, 162, 0.8)" stroke="rgba(118, 75, 162, 1)" stroke-width="2" rx="3"/>
+  
+  <!-- Human Performance: 72% -->
+  <rect x="530" y="184" width="80" height="216" fill="rgba(46, 204, 113, 0.8)" stroke="rgba(46, 204, 113, 1)" stroke-width="2" rx="3"/>
+  
+  <!-- Value labels on bars -->
+  <g font-family="Arial, sans-serif" font-size="14" font-weight="bold" fill="white" text-anchor="middle">
+    <text x="180" y="238">60.8%</text>
+    <text x="310" y="252">56.0%</text>
+    <text x="440" y="288">43.9%</text>
+    <text x="570" y="204">72.0%</text>
+  </g>
+  
+  <!-- X-axis labels -->
+  <g font-family="Arial, sans-serif" font-size="12" fill="#666" text-anchor="middle">
+    <text x="180" y="430">CoACT-1</text>
+    <text x="310" y="430">Agent S2.5</text>
+    <text x="310" y="445">w/ o3</text>
+    <text x="440" y="430">Claude 4</text>
+    <text x="440" y="445">Sonnet</text>
+    <text x="570" y="430">Human</text>
+    <text x="570" y="445">Performance</text>
+  </g>
+  
+  <!-- Legend indicators -->
+  <g font-family="Arial, sans-serif" font-size="11" fill="#666">
+    <circle cx="120" cy="470" r="4" fill="rgba(102, 126, 234, 0.8)"/>
+    <text x="130" y="474">Agentic Frameworks</text>
+    
+    <circle cx="280" cy="470" r="4" fill="rgba(118, 75, 162, 0.8)"/>
+    <text x="290" y="474">Foundation Models</text>
+    
+    <circle cx="450" cy="470" r="4" fill="rgba(46, 204, 113, 0.8)"/>
+    <text x="460" y="474">Human Baseline</text>
+  </g>
+</svg>
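
Note on bar geometry: the y-axis labels in the SVG place 0 at y=400 and 80 at y=160, which implies a linear scale of 3 px per percentage point. The sketch below is not part of the commit. It is a minimal, hypothetical helper (the names `bar_geometry`, `BASELINE_Y`, and `PX_PER_POINT` are assumptions for illustration) showing how each bar's `y` and `height` attributes could be derived from a success rate so that the bars line up with the grid lines.

```python
from typing import Tuple

# Assumed from the SVG's y-axis labels: 0% is drawn at y=400 and 80% at y=160.
BASELINE_Y = 400.0
PX_PER_POINT = (400.0 - 160.0) / 80.0  # 3 px per percentage point


def bar_geometry(rate_percent: float) -> Tuple[float, float]:
    """Return (y, height) for a bar whose top marks rate_percent on the axis."""
    height = round(rate_percent * PX_PER_POINT, 2)
    y = round(BASELINE_Y - height, 2)  # SVG y grows downward, so subtract
    return y, height


# Success rates as stated in the blog post and SVG comments.
rates = {
    "CoACT-1": 60.76,
    "Agent S2.5 w/ o3": 56.0,
    "Claude 4 Sonnet": 43.9,
    "Human Performance": 72.0,
}

for name, rate in rates.items():
    y, height = bar_geometry(rate)
    print(f'{name}: y="{y}" height="{height}"')
```

Under this mapping the human bar (72%) tops out at y=184, below the 80% grid line at y=160, as the axis labels would lead a reader to expect.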