Commit 8f1e553
Merge branch 'main' into fix/validate-unverified-providers
2 parents 61555c7 + d7b3617

345 files changed: +36106 -3359 lines


.agents/skills/code-review.md

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@

---
name: code-review
description: Structured code review covering style, readability, and security concerns with actionable feedback. Use when reviewing pull requests or merge requests to identify issues and suggest improvements.
triggers:
- /codereview
---

# OpenHands/software-agent-sdk Code Review Guidelines

You are an expert code reviewer for the **OpenHands/software-agent-sdk** repository. This skill provides repo-specific review guidelines. Be direct but constructive.

## Review Decisions

You have permission to **APPROVE** or **COMMENT** on PRs. Do not use REQUEST_CHANGES.

**Default to APPROVE**: If your review finds no issues at "important" level or higher, approve the PR. Minor suggestions or nitpicks alone are not sufficient reason to withhold approval.

**IMPORTANT: If you determine a PR is worth merging, you should approve it.** Don't just say a PR is "worth merging" or "ready to merge" without actually submitting an approval. Your words and actions should be consistent.

### When to APPROVE

Examples of straightforward and low-risk PRs you should approve (non-exhaustive):

- **Configuration changes**: Adding models to config files, updating CI/workflow settings
- **CI/Infrastructure changes**: Changing runner types, fixing workflow paths, updating job configurations
- **Cosmetic changes**: Typo fixes, formatting, comment improvements, README updates
- **Documentation-only changes**: Docstring updates, clarifying notes, API documentation improvements
- **Simple additions**: Adding entries to lists/dictionaries following existing patterns
- **Test-only changes**: Adding or updating tests without changing production code
- **Dependency updates**: Version bumps with passing CI

Examples:
- A PR adding a new model to `resolve_model_config.py` or `verified_models.py` with corresponding test updates
- A PR adding documentation notes to docstrings clarifying method behavior (e.g., security considerations, bypass behaviors)
- A PR changing CI runners or fixing workflow infrastructure issues (e.g., standardizing runner types to fix path inconsistencies)

### When to COMMENT

Use COMMENT when you have feedback or concerns:

- Issues that need attention (bugs, security concerns, missing tests)
- Suggestions for improvement
- Questions about design decisions
- Minor style preferences

If there are significant issues, leave detailed comments explaining the concerns, but let a human maintainer decide whether to block the PR.

## Core Principles

1. **Simplicity First**: Question complexity. If something feels overcomplicated, ask "what's the use case?" and seek simpler alternatives. Features should solve real problems, not imaginary ones.

2. **Pragmatic Testing**: Test what matters. Avoid duplicate test coverage. Don't test library features (e.g., `BaseModel.model_dump()`). Focus on the specific logic implemented in this codebase.

3. **Type Safety**: Avoid `# type: ignore` - treat it as a last resort. Fix types properly with assertions, proper annotations, or code adjustments. Prefer explicit type checking over `getattr`/`hasattr` guards.

4. **Backward Compatibility**: Evaluate breaking change impact carefully. Consider API changes that affect existing users, removal of public fields/methods, and changes to default behavior.

## What to Check

- **Complexity**: Over-engineered solutions, unnecessary abstractions, complex logic that could be refactored
- **Testing**: Duplicate test coverage, tests for library features, missing edge case coverage
- **Type Safety**: `# type: ignore` usage, missing type annotations, `getattr`/`hasattr` guards, mocking non-existent arguments
- **Breaking Changes**: API changes affecting users, removed public fields/methods, changed defaults
- **Code Quality**: Code duplication, missing comments for non-obvious decisions, inline imports (unless necessary for circular deps)
- **Repository Conventions**: Use `pyright` not `mypy`, put fixtures in `conftest.py`, avoid `sys.path.insert` hacks

## What NOT to Comment On

Do not leave comments for:

- **Nitpicks**: Minor style preferences, optional improvements, or "nice-to-haves" that don't affect correctness or maintainability
- **Good behavior observed**: Don't comment just to praise code that follows best practices - this adds noise. Simply approve if the code is good.
- **Suggestions for additional tests on simple changes**: For straightforward PRs (config changes, model additions, etc.), don't suggest adding test coverage unless tests are clearly missing for new logic
- **Obvious or self-explanatory code**: Don't ask for comments on code that is already clear
- **`.pr/` directory artifacts**: Files in the `.pr/` directory are temporary PR-specific documents (design notes, analysis, scripts) that are automatically cleaned up when the PR is approved. Do not comment on their presence or suggest removing them.

If a PR is approvable, just approve it. Don't add "one small suggestion" or "consider doing X" comments that delay merging without adding real value.

## Communication Style

- Be direct and concise - don't over-explain
- Use casual, friendly tone ("lgtm", "WDYT?", emojis are fine 👀)
- Ask questions to understand use cases before suggesting changes
- Suggest alternatives, not mandates
- Approve quickly when code is good ("LGTM!")
- Use GitHub suggestion syntax for code fixes
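
As a reminder of what the last bullet means: a GitHub suggested change is a fenced `suggestion` block in a review comment that the PR author can apply with one click. A minimal, hypothetical example (the function shown is invented for illustration):

````markdown
Consider avoiding the mutable default argument here:

```suggestion
def add_item(items: list[str] | None = None) -> list[str]:
```
````

The suggested lines replace exactly the lines the review comment is anchored to.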
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@

---
name: debug-test-examples-workflow
description: Guide for debugging failing example tests in the `test-examples` labeled workflow. Use this skill when investigating CI failures in the run-examples.yml workflow, when example scripts fail to run correctly, when needing to isolate specific test failures, or when analyzing workflow logs and failure patterns.
---

# Debugging test-examples Workflow

## Overview

The `run-examples.yml` workflow runs example scripts from the `examples/` directory. Triggers:
- Adding the `test-examples` label to a PR
- Manual workflow dispatch
- Scheduled nightly runs

## Debugging Steps

### 1. Isolate Failing Tests

Modify `tests/examples/test_examples.py` to focus on specific tests:

```python
_TARGET_DIRECTORIES = (
    # EXAMPLES_ROOT / "01_standalone_sdk",
    EXAMPLES_ROOT / "02_remote_agent_server",  # Keep only the failing directory
)
```

### 2. Exclude Tests

Add to `_EXCLUDED_EXAMPLES` with an explanation:

```python
_EXCLUDED_EXAMPLES = {
    # Reason for exclusion
    "examples/path/to/failing_test.py",
}
```
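
For clarity, here is a simplified sketch of how these two knobs interact during collection. The file names and the `collect_examples` helper are invented for illustration; the real collection logic lives in `tests/examples/test_examples.py`:

```python
from pathlib import Path

EXAMPLES_ROOT = Path("examples")

# Hypothetical values mirroring the snippets above.
_TARGET_DIRECTORIES = (EXAMPLES_ROOT / "02_remote_agent_server",)
_EXCLUDED_EXAMPLES = {"examples/02_remote_agent_server/flaky_demo.py"}


def collect_examples(candidates: list[str]) -> list[str]:
    """Keep scripts that live under a target directory and are not excluded."""
    selected = []
    for candidate in candidates:
        path = Path(candidate)
        in_target = any(target in path.parents for target in _TARGET_DIRECTORIES)
        if in_target and candidate not in _EXCLUDED_EXAMPLES:
            selected.append(candidate)
    return selected


examples = [
    "examples/01_standalone_sdk/hello.py",            # filtered: directory commented out
    "examples/02_remote_agent_server/flaky_demo.py",  # filtered: explicitly excluded
    "examples/02_remote_agent_server/stable_demo.py",
]
print(collect_examples(examples))  # only the stable remote-agent example remains
```

Narrowing `_TARGET_DIRECTORIES` shrinks the whole run; `_EXCLUDED_EXAMPLES` surgically skips known-bad scripts while the rest of the directory still runs.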

### 3. Trigger Workflow

Toggle the `test-examples` label:

```bash
# Remove label
curl -X DELETE -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/issues/${PR_NUMBER}/labels/test-examples"

# Add label
curl -X POST -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/issues/${PR_NUMBER}/labels" \
  -d '{"labels":["test-examples"]}'
```

### 4. Monitor Progress

```bash
# Check status
curl -s -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/actions/runs/${RUN_ID}" | jq '{status, conclusion}'

# Download logs
curl -sL -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/actions/runs/${RUN_ID}/logs" -o logs.zip
unzip logs.zip -d logs
```

## Common Failure Patterns

| Pattern | Cause | Solution |
|---------|-------|----------|
| Port conflicts | Fixed ports (8010, 8011) | Run with `-n 1` or use different ports |
| Container issues | Docker/Apptainer setup | Check Docker availability, image pulls |
| LLM failures | Transient API errors | Retry the test |
| Example bugs | Code errors | Check the traceback |

## Key Configuration

**Workflow** (`.github/workflows/run-examples.yml`):
- Runner: `blacksmith-2vcpu-ubuntu-2404`
- Timeout: 60 minutes
- Parallelism: `-n 4` (pytest-xdist: 4 parallel workers)

**Tests** (`tests/examples/test_examples.py`):
- Timeout per example: 600 seconds
- Target directories: `_TARGET_DIRECTORIES`
- Excluded examples: `_EXCLUDED_EXAMPLES`
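
Before re-triggering CI, it is often faster to reproduce a single example locally. A command sketch, assuming the repo is run with `uv` and pytest-xdist/pytest-timeout as configured above (the `-k` pattern is a placeholder, not a real test name):

```shell
# Run one example test serially (-n 0 disables xdist workers),
# with the same 600-second per-example timeout CI uses.
uv run pytest tests/examples/test_examples.py -n 0 --timeout=600 -k "failing_example"
```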

.agents/skills/run-eval.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@

---
name: run-eval
description: Trigger and monitor evaluation runs for benchmarks like SWE-bench, GAIA, and others. Use when running evaluations via GitHub Actions or monitoring eval progress through Datadog and kubectl.
triggers:
- run eval
- trigger eval
- evaluation run
- swebench eval
---

# Running Evaluations

## Trigger via GitHub API

```bash
curl -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/actions/workflows/run-eval.yml/dispatches" \
  -d '{
    "ref": "main",
    "inputs": {
      "benchmark": "swebench",
      "sdk_ref": "main",
      "eval_limit": "50",
      "model_ids": "claude-sonnet-4-5-20250929",
      "reason": "Description of eval run",
      "benchmarks_branch": "main"
    }
  }'
```

**Key parameters:**
- `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`
- `eval_limit`: `1`, `50`, `100`, `200`, `500`
- `model_ids`: See `.github/run-eval/resolve_model_config.py` for available models
- `benchmarks_branch`: Use a feature branch from the benchmarks repo to test benchmark changes before merging

**Note:** When running a full eval, select an `eval_limit` greater than or equal to the actual number of instances in the benchmark. If you specify a smaller limit, only that many instances will be evaluated (a partial eval).

## Monitoring

**Datadog script** (requires the `OpenHands/evaluation` repo and the `DD_API_KEY`, `DD_APP_KEY`, and `DD_SITE` environment variables):
```bash
DD_API_KEY=$DD_API_KEY DD_APP_KEY=$DD_APP_KEY DD_SITE=$DD_SITE \
  python scripts/analyze_evals.py --job-prefix <EVAL_RUN_ID> --time-range 60
# EVAL_RUN_ID format: typically the workflow run ID from GitHub Actions
```

**kubectl** (for users with cluster access; the agent does not have kubectl access):
```bash
kubectl logs -f job/eval-eval-<RUN_ID>-<MODEL_SLUG> -n evaluation-jobs
```

## Common Errors

| Error | Cause | Fix |
|-------|-------|-----|
| `503 Service Unavailable` | Infrastructure overloaded | Ask the user to stop some evaluation runs |
| `429 Too Many Requests` | Rate limiting | Wait or reduce concurrency |
| `failed after 3 retries` | Instance failures | Check Datadog logs for the root cause |

## Limits

- Max 256 parallel runtimes (jobs will queue if this limit is exceeded)
- Full evals typically take 1-3 hours depending on benchmark size
Lines changed: 1 addition & 3 deletions

```diff
@@ -1,8 +1,6 @@
 ---
 name: write-behavior-test
-type: knowledge
-version: 1.0.0
-agent: CodeActAgent
+description: Guide for writing behavior tests that verify agents follow system message guidelines and avoid undesirable behaviors. Use when creating integration tests for agent behavior validation.
 triggers:
 - /write_behavior_test
 ---
```
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@

```yaml
---
name: OpenHands PR Review
description: Automated PR review using OpenHands agent
author: OpenHands

branding:
  icon: code
  color: blue

inputs:
  llm-model:
    description: LLM model to use for the review
    required: false
    default: anthropic/claude-sonnet-4-5-20250929
  llm-base-url:
    description: LLM base URL (optional, for custom LLM endpoints)
    required: false
    default: ''
  review-style:
    description: "Review style: 'standard' (balanced review covering style, readability, and security) or 'roasted' (Linus Torvalds-style brutally honest feedback focusing on data structures, simplicity, and pragmatism)"
    required: false
    default: roasted
  sdk-repo:
    description: GitHub repository for the SDK (owner/repo)
    required: false
    default: OpenHands/software-agent-sdk
  sdk-version:
    description: Git ref to use for the SDK (tag, branch, or commit SHA, e.g., v1.0.0, main, or abc1234)
    required: false
    default: main
  llm-api-key:
    description: LLM API key (required)
    required: true
  github-token:
    description: GitHub token for API access (required)
    required: true
  lmnr-api-key:
    description: Laminar API key for observability (optional)
    required: false
    default: ''

runs:
  using: composite
  steps:
    - name: Checkout software-agent-sdk repository
      uses: actions/checkout@v4
      with:
        repository: ${{ inputs.sdk-repo }}
        ref: ${{ inputs.sdk-version }}
        path: software-agent-sdk

    - name: Checkout PR repository
      uses: actions/checkout@v4
      with:
        repository: ${{ github.event.pull_request.head.repo.full_name }}
        ref: ${{ github.event.pull_request.head.ref }}
        fetch-depth: 0
        persist-credentials: false
        path: pr-repo

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.12'

    - name: Install uv
      uses: astral-sh/setup-uv@v6
      with:
        enable-cache: true

    - name: Install GitHub CLI
      shell: bash
      run: |
        sudo apt-get update
        sudo apt-get install -y gh

    - name: Install OpenHands dependencies
      shell: bash
      run: |
        uv pip install --system ./software-agent-sdk/openhands-sdk ./software-agent-sdk/openhands-tools lmnr

    - name: Check required configuration
      shell: bash
      env:
        LLM_API_KEY: ${{ inputs.llm-api-key }}
        GITHUB_TOKEN: ${{ inputs.github-token }}
      run: |
        if [ -z "$LLM_API_KEY" ]; then
          echo "Error: llm-api-key is required."
          exit 1
        fi
        if [ -z "$GITHUB_TOKEN" ]; then
          echo "Error: github-token is required."
          exit 1
        fi

        echo "PR Number: ${{ github.event.pull_request.number }}"
        echo "PR Title: ${{ github.event.pull_request.title }}"
        echo "Repository: ${{ github.repository }}"
        echo "SDK Version: ${{ inputs.sdk-version }}"
        echo "LLM model: ${{ inputs.llm-model }}"
        if [ -n "${{ inputs.llm-base-url }}" ]; then
          echo "LLM base URL: ${{ inputs.llm-base-url }}"
        fi

    - name: Run PR review
      shell: bash
      env:
        LLM_MODEL: ${{ inputs.llm-model }}
        LLM_BASE_URL: ${{ inputs.llm-base-url }}
        REVIEW_STYLE: ${{ inputs.review-style }}
        LLM_API_KEY: ${{ inputs.llm-api-key }}
        GITHUB_TOKEN: ${{ inputs.github-token }}
        LMNR_PROJECT_API_KEY: ${{ inputs.lmnr-api-key }}
        PR_NUMBER: ${{ github.event.pull_request.number }}
        PR_TITLE: ${{ github.event.pull_request.title }}
        PR_BODY: ${{ github.event.pull_request.body }}
        PR_BASE_BRANCH: ${{ github.event.pull_request.base.ref }}
        PR_HEAD_BRANCH: ${{ github.event.pull_request.head.ref }}
        REPO_NAME: ${{ github.repository }}
      run: |
        cd pr-repo
        uv run python ../software-agent-sdk/examples/03_github_workflows/02_pr_review/agent_script.py

    - name: Upload logs as artifact
      uses: actions/upload-artifact@v4
      if: always()
      with:
        name: openhands-pr-review-logs
        path: |
          *.log
          output/
        retention-days: 7

    - name: Upload Laminar trace info for evaluation
      uses: actions/upload-artifact@v4
      if: success()
      with:
        name: pr-review-trace-${{ github.event.pull_request.number }}
        path: pr-repo/laminar_trace_info.json
        retention-days: 30
        if-no-files-found: ignore
```
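
A sketch of how a consuming workflow might invoke this composite action. The `uses:` path is an assumption (the action's location in the repo is not shown in this diff), and the secret names are placeholders:

```yaml
# .github/workflows/pr-review.yml (hypothetical consumer workflow)
name: PR Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      # Path assumed for illustration; point this at wherever the action lives.
      - uses: OpenHands/software-agent-sdk/examples/03_github_workflows/02_pr_review@main
        with:
          review-style: standard
          llm-api-key: ${{ secrets.LLM_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
```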
