Commit 8f1e553
Merge branch 'main' into fix/validate-unverified-providers
2 parents 61555c7 + d7b3617

345 files changed: +36106 -3359 lines


.agents/skills/code-review.md

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@

---
name: code-review
description: Structured code review covering style, readability, and security concerns with actionable feedback. Use when reviewing pull requests or merge requests to identify issues and suggest improvements.
triggers:
- /codereview
---

# OpenHands/software-agent-sdk Code Review Guidelines

You are an expert code reviewer for the **OpenHands/software-agent-sdk** repository. This skill provides repo-specific review guidelines. Be direct but constructive.

## Review Decisions

You have permission to **APPROVE** or **COMMENT** on PRs. Do not use REQUEST_CHANGES.

**Default to APPROVE**: If your review finds no issues at "important" level or higher, approve the PR. Minor suggestions or nitpicks alone are not sufficient reason to withhold approval.

**IMPORTANT: If you determine a PR is worth merging, you should approve it.** Don't just say a PR is "worth merging" or "ready to merge" without actually submitting an approval. Your words and actions should be consistent.

### When to APPROVE

Examples of straightforward and low-risk PRs you should approve (non-exhaustive):

- **Configuration changes**: Adding models to config files, updating CI/workflow settings
- **CI/Infrastructure changes**: Changing runner types, fixing workflow paths, updating job configurations
- **Cosmetic changes**: Typo fixes, formatting, comment improvements, README updates
- **Documentation-only changes**: Docstring updates, clarifying notes, API documentation improvements
- **Simple additions**: Adding entries to lists/dictionaries following existing patterns
- **Test-only changes**: Adding or updating tests without changing production code
- **Dependency updates**: Version bumps with passing CI

Examples:
- A PR adding a new model to `resolve_model_config.py` or `verified_models.py` with corresponding test updates
- A PR adding documentation notes to docstrings clarifying method behavior (e.g., security considerations, bypass behaviors)
- A PR changing CI runners or fixing workflow infrastructure issues (e.g., standardizing runner types to fix path inconsistencies)

### When to COMMENT

Use COMMENT when you have feedback or concerns:

- Issues that need attention (bugs, security concerns, missing tests)
- Suggestions for improvement
- Questions about design decisions
- Minor style preferences

If there are significant issues, leave detailed comments explaining the concerns, but let a human maintainer decide whether to block the PR.

## Core Principles

1. **Simplicity First**: Question complexity. If something feels overcomplicated, ask "what's the use case?" and seek simpler alternatives. Features should solve real problems, not imaginary ones.

2. **Pragmatic Testing**: Test what matters. Avoid duplicate test coverage. Don't test library features (e.g., `BaseModel.model_dump()`). Focus on the specific logic implemented in this codebase.

3. **Type Safety**: Avoid `# type: ignore` - treat it as a last resort. Fix types properly with assertions, proper annotations, or code adjustments. Prefer explicit type checking over `getattr`/`hasattr` guards.

4. **Backward Compatibility**: Evaluate breaking change impact carefully. Consider API changes that affect existing users, removal of public fields/methods, and changes to default behavior.

## What to Check

- **Complexity**: Over-engineered solutions, unnecessary abstractions, complex logic that could be refactored
- **Testing**: Duplicate test coverage, tests for library features, missing edge case coverage
- **Type Safety**: `# type: ignore` usage, missing type annotations, `getattr`/`hasattr` guards, mocking non-existent arguments
- **Breaking Changes**: API changes affecting users, removed public fields/methods, changed defaults
- **Code Quality**: Code duplication, missing comments for non-obvious decisions, inline imports (unless necessary for circular deps)
- **Repository Conventions**: Use `pyright` not `mypy`, put fixtures in `conftest.py`, avoid `sys.path.insert` hacks

## What NOT to Comment On

Do not leave comments for:

- **Nitpicks**: Minor style preferences, optional improvements, or "nice-to-haves" that don't affect correctness or maintainability
- **Good behavior observed**: Don't comment just to praise code that follows best practices - this adds noise. Simply approve if the code is good.
- **Suggestions for additional tests on simple changes**: For straightforward PRs (config changes, model additions, etc.), don't suggest adding test coverage unless tests are clearly missing for new logic
- **Obvious or self-explanatory code**: Don't ask for comments on code that is already clear
- **`.pr/` directory artifacts**: Files in the `.pr/` directory are temporary PR-specific documents (design notes, analysis, scripts) that are automatically cleaned up when the PR is approved. Do not comment on their presence or suggest removing them.

If a PR is approvable, just approve it. Don't add "one small suggestion" or "consider doing X" comments that delay merging without adding real value.

## Communication Style

- Be direct and concise - don't over-explain
- Use casual, friendly tone ("lgtm", "WDYT?", emojis are fine 👀)
- Ask questions to understand use cases before suggesting changes
- Suggest alternatives, not mandates
- Approve quickly when code is good ("LGTM!")
- Use GitHub suggestion syntax for code fixes
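
As a reminder of what the last bullet means: a GitHub suggested change is a fenced `suggestion` block in a review comment that the PR author can apply with one click. A minimal, hypothetical example (the function shown is invented for illustration):

````markdown
Consider avoiding the mutable default argument here:

```suggestion
def add_item(items: list[str] | None = None) -> list[str]:
```
````

The suggested lines replace exactly the lines the review comment is anchored to.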
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@

---
name: debug-test-examples-workflow
description: Guide for debugging failing example tests in the `test-examples` labeled workflow. Use this skill when investigating CI failures in the run-examples.yml workflow, when example scripts fail to run correctly, when needing to isolate specific test failures, or when analyzing workflow logs and failure patterns.
---

# Debugging test-examples Workflow

## Overview

The `run-examples.yml` workflow runs example scripts from the `examples/` directory. Triggers:
- Adding the `test-examples` label to a PR
- Manual workflow dispatch
- Scheduled nightly runs

## Debugging Steps

### 1. Isolate Failing Tests

Modify `tests/examples/test_examples.py` to focus on specific tests:

```python
_TARGET_DIRECTORIES = (
    # EXAMPLES_ROOT / "01_standalone_sdk",
    EXAMPLES_ROOT / "02_remote_agent_server",  # Keep only the failing directory
)
```

### 2. Exclude Tests

Add to `_EXCLUDED_EXAMPLES` with an explanation:

```python
_EXCLUDED_EXAMPLES = {
    # Reason for exclusion
    "examples/path/to/failing_test.py",
}
```
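
For clarity, here is a simplified sketch of how these two knobs interact during collection. The file names and the `collect_examples` helper are invented for illustration; the real collection logic lives in `tests/examples/test_examples.py`:

```python
from pathlib import Path

EXAMPLES_ROOT = Path("examples")

# Hypothetical values mirroring the snippets above.
_TARGET_DIRECTORIES = (EXAMPLES_ROOT / "02_remote_agent_server",)
_EXCLUDED_EXAMPLES = {"examples/02_remote_agent_server/flaky_demo.py"}


def collect_examples(candidates: list[str]) -> list[str]:
    """Keep scripts that live under a target directory and are not excluded."""
    selected = []
    for candidate in candidates:
        path = Path(candidate)
        in_target = any(target in path.parents for target in _TARGET_DIRECTORIES)
        if in_target and candidate not in _EXCLUDED_EXAMPLES:
            selected.append(candidate)
    return selected


examples = [
    "examples/01_standalone_sdk/hello.py",            # filtered: directory commented out
    "examples/02_remote_agent_server/flaky_demo.py",  # filtered: explicitly excluded
    "examples/02_remote_agent_server/stable_demo.py",
]
print(collect_examples(examples))  # only the stable remote-agent example remains
```

Narrowing `_TARGET_DIRECTORIES` shrinks the whole run; `_EXCLUDED_EXAMPLES` surgically skips known-bad scripts while the rest of the directory still runs.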

### 3. Trigger Workflow

Toggle the `test-examples` label:

```bash
# Remove label
curl -X DELETE -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/issues/${PR_NUMBER}/labels/test-examples"

# Add label
curl -X POST -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/issues/${PR_NUMBER}/labels" \
  -d '{"labels":["test-examples"]}'
```

### 4. Monitor Progress

```bash
# Check status
curl -s -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/actions/runs/${RUN_ID}" | jq '{status, conclusion}'

# Download logs
curl -sL -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/actions/runs/${RUN_ID}/logs" -o logs.zip
unzip logs.zip -d logs
```

## Common Failure Patterns

| Pattern | Cause | Solution |
|---------|-------|----------|
| Port conflicts | Fixed ports (8010, 8011) | Run with `-n 1` or use different ports |
| Container issues | Docker/Apptainer setup | Check Docker availability, image pulls |
| LLM failures | Transient API errors | Retry the test |
| Example bugs | Code errors | Check the traceback |

## Key Configuration

**Workflow** (`.github/workflows/run-examples.yml`):
- Runner: `blacksmith-2vcpu-ubuntu-2404`
- Timeout: 60 minutes
- Parallelism: `-n 4` (pytest-xdist: 4 parallel workers)

**Tests** (`tests/examples/test_examples.py`):
- Timeout per example: 600 seconds
- Target directories: `_TARGET_DIRECTORIES`
- Excluded examples: `_EXCLUDED_EXAMPLES`
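
Before re-triggering CI, it is often faster to reproduce a single example locally. A command sketch, assuming the repo is run with `uv` and pytest-xdist/pytest-timeout as configured above (the `-k` pattern is a placeholder, not a real test name):

```shell
# Run one example test serially (-n 0 disables xdist workers),
# with the same 600-second per-example timeout CI uses.
uv run pytest tests/examples/test_examples.py -n 0 --timeout=600 -k "failing_example"
```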

.agents/skills/run-eval.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@

---
name: run-eval
description: Trigger and monitor evaluation runs for benchmarks like SWE-bench, GAIA, and others. Use when running evaluations via GitHub Actions or monitoring eval progress through Datadog and kubectl.
triggers:
- run eval
- trigger eval
- evaluation run
- swebench eval
---

# Running Evaluations

## Trigger via GitHub API

```bash
curl -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  "https://api.github.com/repos/OpenHands/software-agent-sdk/actions/workflows/run-eval.yml/dispatches" \
  -d '{
    "ref": "main",
    "inputs": {
      "benchmark": "swebench",
      "sdk_ref": "main",
      "eval_limit": "50",
      "model_ids": "claude-sonnet-4-5-20250929",
      "reason": "Description of eval run",
      "benchmarks_branch": "main"
    }
  }'
```

**Key parameters:**
- `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`
- `eval_limit`: `1`, `50`, `100`, `200`, `500`
- `model_ids`: See `.github/run-eval/resolve_model_config.py` for available models
- `benchmarks_branch`: Use a feature branch from the benchmarks repo to test benchmark changes before merging

**Note:** When running a full eval, select an `eval_limit` greater than or equal to the actual number of instances in the benchmark. If you specify a smaller limit, only that many instances will be evaluated (a partial eval).

## Monitoring

**Datadog script** (requires the `OpenHands/evaluation` repo and the `DD_API_KEY`, `DD_APP_KEY`, and `DD_SITE` environment variables):
```bash
DD_API_KEY=$DD_API_KEY DD_APP_KEY=$DD_APP_KEY DD_SITE=$DD_SITE \
  python scripts/analyze_evals.py --job-prefix <EVAL_RUN_ID> --time-range 60
# EVAL_RUN_ID format: typically the workflow run ID from GitHub Actions
```

**kubectl** (for users with cluster access; the agent does not have kubectl access):
```bash
kubectl logs -f job/eval-eval-<RUN_ID>-<MODEL_SLUG> -n evaluation-jobs
```

## Common Errors

| Error | Cause | Fix |
|-------|-------|-----|
| `503 Service Unavailable` | Infrastructure overloaded | Ask the user to stop some evaluation runs |
| `429 Too Many Requests` | Rate limiting | Wait or reduce concurrency |
| `failed after 3 retries` | Instance failures | Check Datadog logs for the root cause |

## Limits

- Max 256 parallel runtimes (jobs will queue if this limit is exceeded)
- Full evals typically take 1-3 hours depending on benchmark size
Lines changed: 1 addition & 3 deletions

```diff
@@ -1,8 +1,6 @@
 ---
 name: write-behavior-test
-type: knowledge
-version: 1.0.0
-agent: CodeActAgent
+description: Guide for writing behavior tests that verify agents follow system message guidelines and avoid undesirable behaviors. Use when creating integration tests for agent behavior validation.
 triggers:
 - /write_behavior_test
 ---
```
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@

```yaml
---
name: OpenHands PR Review
description: Automated PR review using OpenHands agent
author: OpenHands

branding:
  icon: code
  color: blue

inputs:
  llm-model:
    description: LLM model to use for the review
    required: false
    default: anthropic/claude-sonnet-4-5-20250929
  llm-base-url:
    description: LLM base URL (optional, for custom LLM endpoints)
    required: false
    default: ''
  review-style:
    description: "Review style: 'standard' (balanced review covering style, readability, and security) or 'roasted' (Linus Torvalds-style brutally honest feedback focusing on data structures, simplicity, and pragmatism)"
    required: false
    default: roasted
  sdk-repo:
    description: GitHub repository for the SDK (owner/repo)
    required: false
    default: OpenHands/software-agent-sdk
  sdk-version:
    description: Git ref to use for the SDK (tag, branch, or commit SHA, e.g., v1.0.0, main, or abc1234)
    required: false
    default: main
  llm-api-key:
    description: LLM API key (required)
    required: true
  github-token:
    description: GitHub token for API access (required)
    required: true
  lmnr-api-key:
    description: Laminar API key for observability (optional)
    required: false
    default: ''

runs:
  using: composite
  steps:
    - name: Checkout software-agent-sdk repository
      uses: actions/checkout@v4
      with:
        repository: ${{ inputs.sdk-repo }}
        ref: ${{ inputs.sdk-version }}
        path: software-agent-sdk

    - name: Checkout PR repository
      uses: actions/checkout@v4
      with:
        repository: ${{ github.event.pull_request.head.repo.full_name }}
        ref: ${{ github.event.pull_request.head.ref }}
        fetch-depth: 0
        persist-credentials: false
        path: pr-repo

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.12'

    - name: Install uv
      uses: astral-sh/setup-uv@v6
      with:
        enable-cache: true

    - name: Install GitHub CLI
      shell: bash
      run: |
        sudo apt-get update
        sudo apt-get install -y gh

    - name: Install OpenHands dependencies
      shell: bash
      run: |
        uv pip install --system ./software-agent-sdk/openhands-sdk ./software-agent-sdk/openhands-tools lmnr

    - name: Check required configuration
      shell: bash
      env:
        LLM_API_KEY: ${{ inputs.llm-api-key }}
        GITHUB_TOKEN: ${{ inputs.github-token }}
      run: |
        if [ -z "$LLM_API_KEY" ]; then
          echo "Error: llm-api-key is required."
          exit 1
        fi
        if [ -z "$GITHUB_TOKEN" ]; then
          echo "Error: github-token is required."
          exit 1
        fi

        echo "PR Number: ${{ github.event.pull_request.number }}"
        echo "PR Title: ${{ github.event.pull_request.title }}"
        echo "Repository: ${{ github.repository }}"
        echo "SDK Version: ${{ inputs.sdk-version }}"
        echo "LLM model: ${{ inputs.llm-model }}"
        if [ -n "${{ inputs.llm-base-url }}" ]; then
          echo "LLM base URL: ${{ inputs.llm-base-url }}"
        fi

    - name: Run PR review
      shell: bash
      env:
        LLM_MODEL: ${{ inputs.llm-model }}
        LLM_BASE_URL: ${{ inputs.llm-base-url }}
        REVIEW_STYLE: ${{ inputs.review-style }}
        LLM_API_KEY: ${{ inputs.llm-api-key }}
        GITHUB_TOKEN: ${{ inputs.github-token }}
        LMNR_PROJECT_API_KEY: ${{ inputs.lmnr-api-key }}
        PR_NUMBER: ${{ github.event.pull_request.number }}
        PR_TITLE: ${{ github.event.pull_request.title }}
        PR_BODY: ${{ github.event.pull_request.body }}
        PR_BASE_BRANCH: ${{ github.event.pull_request.base.ref }}
        PR_HEAD_BRANCH: ${{ github.event.pull_request.head.ref }}
        REPO_NAME: ${{ github.repository }}
      run: |
        cd pr-repo
        uv run python ../software-agent-sdk/examples/03_github_workflows/02_pr_review/agent_script.py

    - name: Upload logs as artifact
      uses: actions/upload-artifact@v4
      if: always()
      with:
        name: openhands-pr-review-logs
        path: |
          *.log
          output/
        retention-days: 7

    - name: Upload Laminar trace info for evaluation
      uses: actions/upload-artifact@v4
      if: success()
      with:
        name: pr-review-trace-${{ github.event.pull_request.number }}
        path: pr-repo/laminar_trace_info.json
        retention-days: 30
        if-no-files-found: ignore
```
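
A sketch of how a consuming workflow might invoke this composite action. The `uses:` path is an assumption (the action's location in the repo is not shown in this diff), and the secret names are placeholders:

```yaml
# .github/workflows/pr-review.yml (hypothetical consumer workflow)
name: PR Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      # Path assumed for illustration; point this at wherever the action lives.
      - uses: OpenHands/software-agent-sdk/examples/03_github_workflows/02_pr_review@main
        with:
          review-style: standard
          llm-api-key: ${{ secrets.LLM_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
```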
