
Commit d97dec9

Enhance OSWorld blog post by expanding on fragile dependencies and timing issues in task execution. Add visual examples to illustrate complex initialization sequences and sequential dependencies. Introduce new images to support explanations of loading delays and access challenges, ensuring clarity in the discussion of task failures and solutions.
1 parent 949e6a5 commit d97dec9

7 files changed: +39 -3 lines changed

data/blog_posts/osworld-verified.md

Lines changed: 39 additions & 3 deletions
@@ -68,7 +68,20 @@ Next, we'll discuss the uncontrollable factors we discovered in actual operation
 
 #### Fragile Dependencies with Timing Issues
 
-e.g., Some tasks' config and postconfig rely on "hardcoded" operations, such as "Copy the screenshot 1.png from the desktop to where my cursor is located." Our configuration uses `pyautogui.press("down", presses=8, interval=0.01)` to move the cursor to the 9th line, which requires LibreOffice Writer to be fully loaded with the document open and cursor blinking at the first line when executing this command. The previous fragile dependencies couldn't guarantee sequential execution, causing initial setup issues in some tasks.
+Many tasks exhibit complex temporal dependencies where proper initialization requires precise sequential execution. Software applications and web pages often require significant loading and response times, creating timing-sensitive scenarios that can lead to task failures.
+
+<div style="display: flex; justify-content: center; align-items: center; gap: 20px; margin: 25px auto; max-width: 800px; flex-wrap: wrap;">
+<figure style="flex: 1; min-width: 300px; max-width: 380px; text-align: center; margin: 0;">
+<img src="/blog/osworld-verified/loading.png" style="max-width: 100%; height: auto; border: 1px solid #ddd; border-radius: 8px;">
+<figcaption style="text-align: center; font-size: 14px; color: #666; margin-top: 8px;">Loading Delays - Applications requiring extended initialization time</figcaption>
+</figure>
+<figure style="flex: 1; min-width: 300px; max-width: 380px; text-align: center; margin: 0;">
+<img src="/blog/osworld-verified/stucked_open.png" style="max-width: 100%; height: auto; border: 1px solid #ddd; border-radius: 8px;">
+<figcaption style="text-align: center; font-size: 14px; color: #666; margin-top: 8px;">Sequential Dependencies - Precise cursor positioning requirements</figcaption>
+</figure>
+</div>
+
+**Specific Example**: Tasks involving document operations, such as "Copy the screenshot 1.png from the desktop to where my cursor is located," require complex initialization sequences. Our configuration uses `pyautogui.press("down", presses=8, interval=0.01)` to move the cursor to the 9th line, which demands that LibreOffice Writer be fully loaded with the document open and cursor positioned at the first line before executing this command. These fragile dependencies previously couldn't guarantee sequential execution, causing initial setup failures in multiple tasks.
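The ordering problem described in this hunk can be illustrated with a small readiness gate. This is only a sketch under stated assumptions, not the fix OSWorld actually shipped: `wait_until`, `run_setup`, `is_ready`, and `move_cursor` are hypothetical names, and how readiness is detected (window title, process state, a screenshot check) is left to the caller.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool],
               timeout: float = 30.0,
               interval: float = 0.5) -> bool:
    """Poll is_ready() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def run_setup(is_ready: Callable[[], bool],
              move_cursor: Callable[[], None],
              timeout: float = 30.0,
              interval: float = 0.5) -> None:
    """Run the cursor-positioning step only after the app reports ready.

    Hypothetical helper: in a real setup script, move_cursor might wrap
    pyautogui.press("down", presses=8, interval=0.01).
    """
    if not wait_until(is_ready, timeout=timeout, interval=interval):
        raise TimeoutError("application was not ready before the timeout")
    move_cursor()
```

Failing loudly on timeout, rather than sending the keystrokes anyway, keeps a slow application load from silently producing a wrong initial state.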
 
 ### Incompleteness of Initial Tasks Annotation
 
@@ -94,6 +107,18 @@ e.g., Some tasks' config and postconfig rely on "hardcoded" operations, such as
 
 **False negatives from limited ground truth**:
 - e.g., "Change the first two paragraphs to double line spacing" - the empty line between the two paragraphs can either be set to double spacing or left unchanged; both approaches should be considered correct.
+
+<div style="display: flex; justify-content: center; align-items: center; gap: 20px; margin: 25px auto; max-width: 800px; flex-wrap: wrap;">
+<figure style="flex: 1; min-width: 300px; max-width: 380px; text-align: center; margin: 0;">
+<img src="/blog/osworld-verified/2_linespace_line_by_line_1.png" style="max-width: 100%; height: auto; border: 1px solid #ddd; border-radius: 8px;">
+<figcaption style="text-align: center; font-size: 14px; color: #666; margin-top: 8px;">Method 1: Sequential paragraph selection - selecting paragraphs individually</figcaption>
+</figure>
+<figure style="flex: 1; min-width: 300px; max-width: 380px; text-align: center; margin: 0;">
+<img src="/blog/osworld-verified/2_linespace_line_by_line_2.png" style="max-width: 100%; height: auto; border: 1px solid #ddd; border-radius: 8px;">
+<figcaption style="text-align: center; font-size: 14px; color: #666; margin-top: 8px;">Method 2: Bulk selection - selecting both paragraphs together</figcaption>
+</figure>
+</div>
+
 - e.g., Different but functionally equivalent spreadsheet formulas marked incorrect
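The double-spacing case above suggests a checker that ignores the empty paragraph instead of comparing against one fixed ground-truth document. The following is a minimal sketch of that idea, assuming a document is modelled as a list of `(text, line_spacing)` pairs; the function name and data model are hypothetical and are not OSWorld's evaluator API.

```python
from typing import List, Tuple

# Hypothetical document model: one (text, line_spacing) pair per paragraph.
Paragraph = Tuple[str, float]


def first_two_paragraphs_have_spacing(paragraphs: List[Paragraph],
                                      target: float = 2.0) -> bool:
    """Return True if the first two non-empty paragraphs use `target` spacing.

    Empty paragraphs are skipped, so the blank line between the two
    paragraphs may be double-spaced or left unchanged.
    """
    spacings = [spacing for text, spacing in paragraphs if text.strip()]
    if len(spacings) < 2:
        return False
    return all(abs(s - target) < 1e-6 for s in spacings[:2])
```

Both selection methods shown in the figures then pass: the result is the same whether the blank paragraph in between ends up with spacing `2.0` or `1.0`. A stricter checker could also verify that later paragraphs were left untouched; that is omitted here for brevity.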
 
 ### Decentralized Evaluation Reduces Motivation to Contribute Error Discovery
@@ -148,6 +173,17 @@ For tasks we identified as genuinely problematic, we primarily modified only the
 
 **Problem**: Websites blocking automated access through CAPTCHA, IP restrictions, or bot detection.
 
+<div style="display: flex; justify-content: center; align-items: center; gap: 20px; margin: 20px auto; max-width: 800px; flex-wrap: wrap;">
+<figure style="flex: 1; min-width: 300px; max-width: 380px; text-align: center; margin: 0;">
+<img src="/blog/osworld-verified/access_denied.png" style="max-width: 100%; height: auto; border: 1px solid #ddd; border-radius: 8px;">
+<figcaption style="text-align: center; font-size: 14px; color: #666; margin-top: 8px;">Access Denied - Websites blocking automated agents</figcaption>
+</figure>
+<figure style="flex: 1; min-width: 300px; max-width: 380px; text-align: center; margin: 0;">
+<img src="/blog/osworld-verified/amazon_captcha.png" style="max-width: 100%; height: auto; border: 1px solid #ddd; border-radius: 8px;">
+<figcaption style="text-align: center; font-size: 14px; color: #666; margin-top: 8px;">CAPTCHA Challenge - Human verification requirements</figcaption>
+</figure>
+</div>
+
 **Solutions Deployed**:
 - **Proxy infrastructure**: Added `proxy` field support for websites with aggressive anti-crawling
 - **Alternative website selection**: For heavily protected sites (e.g., SeatGeek → Ticketek, TripAdvisor proxy issues), switched to functionally equivalent alternatives
@@ -251,8 +287,8 @@ The performance distribution reveals distinct tiers with substantial improvement
 While the gaps between tiers remain significant, the dramatic upward shift across all categories demonstrates accelerating progress.
 This indicates that OSWorld continues to provide meaningful developmental signal, particularly highlighting the effectiveness of reasoning-enhanced agentic approaches while revealing remaining challenges in areas requiring complex multi-step reasoning, robust error recovery, and dynamic adaptation to interface changes.
 
-<figure style="text-align: center;">
-<img src="/blog/osworld-verified/human_gap_svg.svg" height=400>
+<figure style="text-align: center; margin: 30px auto; max-width: 100%;">
+<img src="/blog/osworld-verified/human_gap_svg.svg" height=400 style="display: block; margin: 0 auto;">
 <figcaption style="text-align: center;">Figure 2. Gap to Human Performance - Current best models still show significant gaps compared to human performance.</figcaption>
 </figure>
 