Commit 37edfe4

Merge pull request #272 from hud-evals/l/agnet-updates

Scenario and agent upgrades

2 parents dd0fbe5 + a18d869

File tree: 18 files changed, +1468 -371 lines changed

docs/docs.json

Lines changed: 9 additions & 2 deletions
````diff
@@ -33,7 +33,7 @@
       "icon": "code",
       "versions": [
         {
-          "version": "0.5.6",
+          "version": "0.5.7",
           "groups": [
             {
               "group": "Get Started",
@@ -213,7 +213,14 @@
             {
               "group": "Guides",
               "pages": [
-                "platform/publishing-leaderboards"
+                "platform/publishing-leaderboards",
+                "platform/subagent"
+              ]
+            },
+            {
+              "group": "Integrations",
+              "pages": [
+                "platform/slack"
               ]
             }
           ]
````

docs/migration.mdx

Lines changed: 112 additions & 78 deletions
````diff
@@ -7,10 +7,10 @@ icon: "arrow-right-arrow-left"
 v4 separated environments (Docker containers) from evaluation logic (Task objects). v5 unifies everything in the `Environment` class—tools, setup, and scoring live together.
 
 <Warning>
-**Deprecation Notice**: `LegacyTask`, `setup_tool`, and `evaluate_tool` are deprecated in v0.5.0 and will be removed in v0.6.0 (no earlier than March 1st, 2026). Use `Task.from_v4()` for quick migration or `@env.scenario()` for new code.
+**Deprecation Notice**: `LegacyTask`, `setup_tool`, and `evaluate_tool` are deprecated in v0.5.0 and will be removed in v0.6.0 (no earlier than March 1st, 2026). Migrate to `@env.scenario()` for new code.
 </Warning>
 
-## Good News: Your Code Still Works
+## MCPServer → Environment
 
 `Environment` inherits from `MCPServer`. Same API, same behavior. Just change the import:
 
````
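The import swap itself sits in unchanged lines between these hunks, so it doesn't appear in the diff. A minimal sketch of what the section describes, assuming a v4 `MCPServer` import path (only `from hud import Environment` is confirmed elsewhere in this diff):

```python
# BEFORE (v4): assumed import path, not shown in this diff
from hud.server import MCPServer

env = MCPServer("browser")

# AFTER (v5): Environment is a drop-in replacement with the same API
from hud import Environment

env = Environment("browser")

env.run()  # Dockerfile, tools, and the run() call stay unchanged
```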
````diff
@@ -38,77 +38,78 @@ env.run()
 
 That's it. Your Dockerfile, your tools, your `run()` call—all unchanged. Environment adds scenarios, connectors, and integrations on top.
 
-## Migration Path 1: Quick Conversion with Task.from_v4()
+## Migrating Tasks: Prompt Passthrough Pattern
 
-The fastest way to migrate existing v4 code—no changes to task definitions needed:
+The recommended migration uses the **prompt passthrough pattern**—scenario arguments become the prompt content:
 
 ```python
-# BEFORE (deprecated in v0.6.0)
-from hud.datasets import LegacyTask
-
-legacy_task = LegacyTask(
-    prompt="Navigate to google.com",
-    mcp_config={"hud": {...}},
-    setup_tool={"name": "navigate", "arguments": {"url": "https://google.com"}},
-    evaluate_tool={"name": "check_url", "arguments": {}}
-)
-
-# AFTER - One-line conversion
-from hud.eval import Task
+from hud import Environment
 
-task = Task.from_v4(legacy_task)  # Converts LegacyTask → Task
-# Also works with: Task.from_v4(dict), Task.from_v4(json_string)
+env = Environment("browser").connect_hub("hud-evals/browser")
 
-# Works the same with agents
-agent = ClaudeAgent.create()
-result = await agent.run(task)
+@env.scenario("web-task")
+async def web_task(instruction: str, start_url: str = "https://example.com"):
+    """
+    The instruction arg passes through directly to the prompt.
+    One scenario, infinite test cases.
+    """
+    # Setup phase (before yield)
+    await env.call_tool("navigate", url=start_url)
+
+    # Prompt - the instruction IS the prompt
+    answer = yield instruction
+
+    # Evaluate phase (after yield)
+    result = await env.call_tool("check_completion")
+    yield 1.0 if result["success"] else 0.0
+
+# Create tasks by passing the actual prompt as an arg
+task1 = env("web-task", instruction="Find the contact page and extract the support email")
+task2 = env("web-task", instruction="Add a MacBook Pro to cart", start_url="https://store.example.com")
+task3 = env("web-task", instruction="Fill out the signup form with test data")
 ```
 
-`Task.from_v4()` automatically:
-- Runs `setup_tool` at the start of evaluation
-- Runs `evaluate_tool` at the end to compute reward
-- Preserves all existing behavior
-
-## Migration Path 2: Full Scenario Migration (Recommended)
-
-For new code or when refactoring, migrate `setup_tool` and `evaluate_tool` to `@env.scenario()`.
+This pattern:
+- **Args ARE the prompt**: The instruction flows directly through as the agent's task
+- **Enables parametric evaluation**: Same scenario, different instructions
+- **Replaces hardcoded prompts**: Instead of `LegacyTask(prompt="...")`, pass the prompt as an arg
+- **Type-safe**: Arguments are validated against the scenario signature
 
-**The rule is simple:**
-- `setup_tool` code → **before the first yield**
-- `evaluate_tool` code → **after the first yield**
+### Before/After Comparison
 
 ```python
 # BEFORE (deprecated in v0.6.0)
 task = LegacyTask(
-    prompt="What's the current URL?",
+    prompt="Find all products under $50 and add the cheapest to cart",
     mcp_config={"hud": {...}},
-    setup_tool={"name": "navigate", "arguments": {"url": "https://google.com"}},
-    evaluate_tool={"name": "check_url", "arguments": {"expected": "google.com"}}
+    setup_tool={"name": "navigate", "arguments": {"url": "https://shop.example.com"}},
+    evaluate_tool={"name": "check_cart", "arguments": {}}
 )
 
-# AFTER
-from hud import Environment
-
-env = Environment("browser").connect_hub("hud-evals/browser")
-
-@env.scenario("navigate-google")
-async def navigate_google():
-    # ===== SETUP SECTION (replaces setup_tool) =====
-    await env.call_tool("navigate", url="https://google.com")
-
-    # ===== PROMPT (first yield) =====
-    answer = yield "What's the current URL?"
+# AFTER - Prompt passthrough pattern
+@env.scenario("shopping")
+async def shopping(task: str, shop_url: str):
+    await env.call_tool("navigate", url=shop_url)
 
-    # ===== EVALUATE SECTION (replaces evaluate_tool) =====
-    result = await env.call_tool("check_url", expected="google.com")
+    answer = yield task  # The task arg IS the prompt
 
-    # ===== REWARD (second yield) =====
-    yield 1.0 if result else 0.0
-
-# Create task from scenario
-task = env("navigate-google")
+    result = await env.call_tool("check_cart")
+    yield 1.0 if result["has_items"] else 0.0
+
+# Now create multiple tasks with different instructions
+tasks = [
+    env("shopping", task="Find all products under $50 and add the cheapest to cart", shop_url="https://shop.example.com"),
+    env("shopping", task="Search for 'laptop' and add the first result to cart", shop_url="https://shop.example.com"),
+    env("shopping", task="Apply promo code SAVE20 at checkout", shop_url="https://shop.example.com"),
+]
 ```
 
+### The Migration Rule
+
+- `prompt` → **scenario arg** (passthrough)
+- `setup_tool` → **code before first yield**
+- `evaluate_tool` → **code after first yield**
+
 ### Multiple setup_tool Calls
 
 If you have multiple setup tools, just call them in sequence:
````
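Since `task1` through `tasks` above are plain task objects, the parametric pattern pairs naturally with batch runs. A sketch under stated assumptions: `ClaudeAgent` is taken from a later hunk in this diff, and one agent instance per task is assumed, since the diff only shows single `agent.run(task)` calls, not concurrent reuse of one agent:

```python
import asyncio

from hud.agents import ClaudeAgent

async def run_all(tasks):
    # Assumption: one agent per task; concurrent reuse of a single
    # agent is not demonstrated anywhere in this diff.
    results = await asyncio.gather(
        *(ClaudeAgent.create().run(task) for task in tasks)
    )
    for result in results:
        print(result)

# tasks = [env("shopping", ...), ...] as defined in the hunk above
# asyncio.run(run_all(tasks))
```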
````diff
@@ -118,45 +119,80 @@ If you have multiple setup tools, just call them in sequence:
 setup_tool=[
     {"name": "navigate", "arguments": {"url": "..."}},
     {"name": "login", "arguments": {"user": "..."}},
-    {"name": "go_to_page", "arguments": {"page": "settings"}}
 ]
 
 # AFTER
-@env.scenario("settings-test")
-async def settings_test():
-    # Multiple setup steps - just call them in order
-    await env.call_tool("navigate", url="...")
-    await env.call_tool("login", user="...")
-    await env.call_tool("go_to_page", page="settings")
+@env.scenario("authenticated-task")
+async def authenticated_task(instruction: str, username: str):
+    await env.call_tool("navigate", url="https://app.example.com")
+    await env.call_tool("login", user=username)
 
-    answer = yield "Verify the settings page loaded correctly"
+    answer = yield instruction
 
-    result = await env.call_tool("check_settings")
+    result = await env.call_tool("check_completion")
     yield 1.0 if result else 0.0
 ```
 
+## JSON Task Format (Platform Ready)
+
+For JSON-based task definitions that can be uploaded to the HUD platform, use this format:
+
+```json
+{
+  "env": {
+    "name": "hud-evals/browser"
+  },
+  "scenario": "web-task",
+  "args": {
+    "instruction": "Find the contact page and extract the support email",
+    "start_url": "https://example.com"
+  }
+}
+```
+
+This maps directly to the scenario call: `env("web-task", instruction="...", start_url="...")`.
+
+**Example: Task set for platform upload**
+
+```json
+[
+  {
+    "env": { "name": "hud-ops-diagnostics-sentry" },
+    "scenario": "sentry-agent:investigate",
+    "args": {
+      "issue_id": "PROJ-1234",
+      "max_depth": 3
+    }
+  },
+  {
+    "env": { "name": "hud-evals/browser" },
+    "scenario": "web-task",
+    "args": {
+      "instruction": "Add a MacBook Pro to cart and proceed to checkout"
+    }
+  }
+]
+```
+
+The `args` field uses prompt passthrough—the values flow directly into the scenario's yield statement.
+
 ## Using with Built-in Agents
 
-Built-in agents (ClaudeAgent, OpenAIAgent, etc.) work with both patterns:
+Built-in agents work with scenarios:
 
 ```python
 from hud.agents import ClaudeAgent
 
 agent = ClaudeAgent.create()
-
-# Works with Task from scenario
-result = await agent.run(env("navigate-google"))
-
-# Works with Task.from_v4() conversion
-result = await agent.run(Task.from_v4(legacy_task))
+result = await agent.run(env("web-task", instruction="Find the pricing page"))
 ```
 
-## Optional: Bring Your Own Agent
+## Bring Your Own Agent
 
 v5 gives you the `hud.eval()` context manager for maximum flexibility:
 
 ```python
-async with hud.eval(env("checkout", product="laptop")) as ctx:
+async with hud.eval(env("shopping", task="Add item to cart", shop_url="https://shop.example.com")) as ctx:
     # Use OpenAI, Anthropic, your own agent—whatever you want
     response = await client.chat.completions.create(
         model="gpt-4o",
````
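The JSON task format added above maps one-to-one onto scenario calls, so a local runner can be a thin loop. A sketch with assumptions marked: the `tasks.json` file name is hypothetical, and resolving the `env.name` field via `connect_hub` mirrors the diff's `Environment("browser").connect_hub("hud-evals/browser")` example rather than any documented loader API:

```python
import asyncio
import json

from hud import Environment
from hud.agents import ClaudeAgent

async def main():
    # Hypothetical file holding entries in the {env, scenario, args} format above
    with open("tasks.json") as f:
        entries = json.load(f)

    agent = ClaudeAgent.create()
    for entry in entries:
        hub_name = entry["env"]["name"]
        # Assumption: hub environments attach via connect_hub, as in the diff's examples
        env = Environment(hub_name.split("/")[-1]).connect_hub(hub_name)
        task = env(entry["scenario"], **entry["args"])
        result = await agent.run(task)
        print(entry["scenario"], result)

asyncio.run(main())
```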
````diff
@@ -170,14 +206,12 @@ async with hud.eval(env("checkout", product="laptop")) as ctx:
 print(ctx.reward)
 ```
 
-The old `ClaudeAgent` and `OperatorAgent` still work—even with the new `hud.eval()` system. But now you're not locked into a specific agent spec. Pair with the [Gateway](/quick-links/gateway) to use any model through one API.
-
 ## Quick Reference
 
-| v4 (deprecated in v0.6.0) | v5 |
-|---------------------------|-----|
-| `LegacyTask(...)` | `Task.from_v4(...)` (quick) or `env("scenario", ...)` (recommended) |
+| v4 (deprecated in v0.6.0) | v5 (recommended) |
+|---------------------------|------------------|
+| `LegacyTask(prompt=...)` | `env("scenario", instruction=...)` — prompt passthrough |
 | `setup_tool` | Code before first yield in `@env.scenario()` |
 | `evaluate_tool` | Code after first yield in `@env.scenario()` |
 | `MCPServer` | `Environment` (drop-in replacement) |
-| `agent.run(task)` | Still works, or use `hud.eval()` for BYOA |
+| JSON with `mcp_config` + `prompt` | JSON with `env` + `scenario` + `args` |
````
