Commit 8db8a82

Merge branch 'main' into fix/windows-fcntl-import-crash
2 parents 3967d7a + 2d3d96d commit 8db8a82

194 files changed, +16849 -3472 lines changed

Lines changed: 30 additions & 4 deletions
@@ -1,6 +1,6 @@
 ---
-name: code-review
-description: Structured code review covering style, readability, and security concerns with actionable feedback. Use when reviewing pull requests or merge requests to identify issues and suggest improvements.
+name: custom-codereview-guide
+description: Repo-specific code review guidelines for OpenHands/software-agent-sdk. Provides SDK-specific review rules in addition to the default code review skill.
 triggers:
 - /codereview
 ---
@@ -13,9 +13,27 @@ You are an expert code reviewer for the **OpenHands/software-agent-sdk** reposit
 
 You have permission to **APPROVE** or **COMMENT** on PRs. Do not use REQUEST_CHANGES.
 
-**Default to APPROVE**: If your review finds no issues at "important" level or higher, approve the PR. Minor suggestions or nitpicks alone are not sufficient reason to withhold approval.
+### Review decision policy (eval / benchmark risk)
 
-**IMPORTANT: If you determine a PR is worth merging, you should approve it.** Don’t just say a PR is "worth merging" or "ready to merge" without actually submitting an approval. Your words and actions should be consistent.
+Do **NOT** submit an **APPROVE** review when the PR changes agent behavior or anything
+that could plausibly affect benchmark/evaluation performance.
+
+Examples include: prompt templates, tool calling/execution, planning/loop logic,
+memory/condenser behavior, terminal/stdin/stdout handling, or evaluation harness code.
+
+If a PR is in this category (or you are uncertain), leave a **COMMENT** review and
+explicitly flag it for a human maintainer to decide after running lightweight evals.
+
+### Default approval policy
+
+**Default to APPROVE**: If your review finds no issues at "important" level or higher,
+approve the PR. Minor suggestions or nitpicks alone are not sufficient reason to
+withhold approval.
+
+**IMPORTANT:** If you determine a PR is worth merging **and it is not in the eval-risk
+category above**, you should approve it. Don’t just say a PR is "worth merging" or
+"ready to merge" without actually submitting an approval. Your words and actions should
+be consistent.
 
 ### When to APPROVE
 
@@ -29,6 +47,14 @@ Examples of straightforward and low-risk PRs you should approve (non-exhaustive)
 - **Test-only changes**: Adding or updating tests without changing production code
 - **Dependency updates**: Version bumps with passing CI
 
+### When NOT to APPROVE - Blocking Issues
+
+**DO NOT APPROVE** PRs that have any of the following issues:
+
+- **Package version bumps in non-release PRs**: If any `pyproject.toml` file has changes to the `version` field (e.g., `version = "1.12.0"` → `version = "1.13.0"`), and the PR is NOT explicitly a release PR (title/description doesn't indicate it's a release), **DO NOT APPROVE**. Version numbers should only be changed in dedicated release PRs managed by maintainers.
+  - Check: Look for changes to `version = "..."` in any `*/pyproject.toml` files
+  - Exception: PRs with titles like "release: v1.x.x" or "chore: bump version to 1.x.x" from maintainers
+
 Examples:
 - A PR adding a new model to `resolve_model_config.py` or `verified_models.py` with corresponding test updates
 - A PR adding documentation notes to docstrings clarifying method behavior (e.g., security considerations, bypass behaviors)
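The version-bump check in that bullet can be approximated mechanically. A minimal sketch, assuming a unified-diff string as input; the helper names, regex, and release-title heuristic are illustrative only, not code from the repo:

```python
import re

# Matches an added or removed `version = "..."` line in a unified diff.
VERSION_LINE = re.compile(r'^[+-]\s*version\s*=\s*"[^"]+"', re.MULTILINE)


def has_version_bump(diff_text: str) -> bool:
    """Return True if the diff touches a `version = "..."` line."""
    return bool(VERSION_LINE.search(diff_text))


def is_release_pr(title: str) -> bool:
    """Heuristic from the guideline: release PR titles look like
    'release: v1.x.x' or 'chore: bump version to 1.x.x'."""
    return bool(re.match(r"^(release:|chore: bump version)", title, re.IGNORECASE))


diff = '''--- a/sdk/pyproject.toml
+++ b/sdk/pyproject.toml
-version = "1.12.0"
+version = "1.13.0"
'''
print(has_version_bump(diff))  # True
print(is_release_pr("fix: typo in docs"))  # False
```

A real implementation would also restrict the scan to `*/pyproject.toml` file paths, which this sketch omits for brevity.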

.github/actions/pr-review/action.yml

Lines changed: 5 additions & 1 deletion
@@ -67,10 +67,14 @@ runs:
       with:
         python-version: '3.12'
 
+    # Security: this workflow executes untrusted PR content (diff/title/body) via an
+    # LLM-powered reviewer agent that can run Bash. GitHub Actions caches are shared
+    # across workflows within a repository and can enable cache-poisoning pivots into
+    # more-privileged workflows. Keep caching disabled here.
     - name: Install uv
       uses: astral-sh/setup-uv@v6
       with:
-        enable-cache: true
+        enable-cache: false
 
     - name: Install GitHub CLI
       shell: bash

.github/run-eval/resolve_model_config.py

Lines changed: 160 additions & 13 deletions
@@ -1,9 +1,12 @@
 #!/usr/bin/env python3
 """
-Resolve model IDs to full model configurations.
+Resolve model IDs to full model configurations and verify model availability.
 
 Reads:
 - MODEL_IDS: comma-separated model IDs
+- LLM_API_KEY: API key for litellm_proxy (optional, for preflight check)
+- LLM_BASE_URL: Base URL for litellm_proxy (optional, defaults to eval proxy)
+- SKIP_PREFLIGHT: Set to 'true' to skip the preflight LLM check
 
 Outputs to GITHUB_OUTPUT:
 - models_json: JSON array of full model configs with display names
@@ -12,6 +15,13 @@
 import json
 import os
 import sys
+from typing import Any
+
+
+# SDK-specific parameters that should not be passed to litellm.
+# These parameters are used by the SDK's LLM wrapper but are not part of litellm's API.
+# Keep this list in sync with SDK LLM config parameters that are SDK-internal.
+SDK_ONLY_PARAMS = {"disable_vision"}
 
 
 # Model configurations dictionary
@@ -48,6 +58,14 @@
             "litellm_extra_body": {"enable_thinking": True},
         },
     },
+    "qwen3.5-flash": {
+        "id": "qwen3.5-flash",
+        "display_name": "Qwen3.5 Flash",
+        "llm_config": {
+            "model": "litellm_proxy/dashscope/qwen3.5-flash-2026-02-23",
+            "temperature": 0.0,
+        },
+    },
     "claude-4.5-opus": {
         "id": "claude-4.5-opus",
         "display_name": "Claude 4.5 Opus",
@@ -64,6 +82,14 @@
             "temperature": 0.0,
         },
     },
+    "claude-sonnet-4-6": {
+        "id": "claude-sonnet-4-6",
+        "display_name": "Claude Sonnet 4.6",
+        "llm_config": {
+            "model": "litellm_proxy/anthropic/claude-sonnet-4-6",
+            "temperature": 0.0,
+        },
+    },
     "gemini-3-pro": {
         "id": "gemini-3-pro",
         "display_name": "Gemini 3 Pro",
@@ -74,6 +100,11 @@
         "display_name": "Gemini 3 Flash",
         "llm_config": {"model": "litellm_proxy/gemini-3-flash-preview"},
     },
+    "gemini-3.1-pro": {
+        "id": "gemini-3.1-pro",
+        "display_name": "Gemini 3.1 Pro",
+        "llm_config": {"model": "litellm_proxy/gemini-3.1-pro-preview"},
+    },
     "gpt-5.2": {
         "id": "gpt-5.2",
         "display_name": "GPT-5.2",
@@ -100,22 +131,17 @@
     "minimax-m2.5": {
         "id": "minimax-m2.5",
         "display_name": "MiniMax M2.5",
-        "llm_config": {"model": "litellm_proxy/minimax/MiniMax-M2.5"},
+        "llm_config": {
+            "model": "litellm_proxy/minimax/MiniMax-M2.5",
+            "temperature": 1.0,
+            "top_p": 0.95,
+        },
     },
     "minimax-m2.1": {
         "id": "minimax-m2.1",
         "display_name": "MiniMax M2.1",
         "llm_config": {"model": "litellm_proxy/minimax/MiniMax-M2.1"},
     },
-    "jade-spark-2862": {
-        "id": "jade-spark-2862",
-        "display_name": "Jade Spark 2862",
-        "llm_config": {
-            "model": "litellm_proxy/jade-spark-2862",
-            "temperature": 1.0,
-            "top_p": 0.95,
-        },
-    },
     "deepseek-v3.2-reasoner": {
         "id": "deepseek-v3.2-reasoner",
         "display_name": "DeepSeek V3.2 Reasoner",
@@ -150,6 +176,8 @@
         "display_name": "GLM-5",
         "llm_config": {
             "model": "litellm_proxy/openrouter/z-ai/glm-5",
+            # OpenRouter glm-5 is text-only despite LiteLLM reporting vision support
+            "disable_vision": True,
         },
     },
     "qwen3-coder-next": {
@@ -207,6 +235,122 @@ def find_models_by_id(model_ids: list[str]) -> list[dict]:
     return resolved
 
 
+def check_model(
+    model_config: dict[str, Any],
+    api_key: str,
+    base_url: str,
+    timeout: int = 60,
+) -> tuple[bool, str]:
+    """Check a single model with a simple completion request using litellm.
+
+    Args:
+        model_config: Model configuration dict with 'llm_config' key
+        api_key: API key for authentication
+        base_url: Base URL for the LLM proxy
+        timeout: Request timeout in seconds
+
+    Returns:
+        Tuple of (success: bool, message: str)
+    """
+    import litellm
+
+    llm_config = model_config.get("llm_config", {})
+    model_name = llm_config.get("model", "unknown")
+    display_name = model_config.get("display_name", model_name)
+
+    try:
+        # Build kwargs from llm_config, excluding 'model' and SDK-specific params
+        kwargs = {
+            k: v
+            for k, v in llm_config.items()
+            if k != "model" and k not in SDK_ONLY_PARAMS
+        }
+
+        # Use simple arithmetic prompt that works reliably across all models
+        # max_tokens=100 provides enough room for models to respond
+        # (some need >10 tokens)
+        response = litellm.completion(
+            model=model_name,
+            messages=[{"role": "user", "content": "1+1="}],
+            max_tokens=100,
+            api_key=api_key,
+            base_url=base_url,
+            timeout=timeout,
+            **kwargs,
+        )
+
+        content = response.choices[0].message.content if response.choices else None
+
+        if content:
+            return True, f"✓ {display_name}: OK"
+        else:
+            # Check if there's any other data in the response for diagnostics
+            finish_reason = (
+                response.choices[0].finish_reason if response.choices else None
+            )
+            usage = getattr(response, "usage", None)
+            return (
+                False,
+                (
+                    f"✗ {display_name}: Empty response "
+                    f"(finish_reason={finish_reason}, usage={usage})"
+                ),
+            )
+
+    except litellm.exceptions.Timeout:
+        return False, f"✗ {display_name}: Request timed out after {timeout}s"
+    except litellm.exceptions.APIConnectionError as e:
+        return False, f"✗ {display_name}: Connection error - {e}"
+    except litellm.exceptions.BadRequestError as e:
+        return False, f"✗ {display_name}: Bad request - {e}"
+    except litellm.exceptions.NotFoundError as e:
+        return False, f"✗ {display_name}: Model not found - {e}"
+    except Exception as e:
+        return False, f"✗ {display_name}: {type(e).__name__} - {e}"
+
+
+def run_preflight_check(models: list[dict[str, Any]]) -> bool:
+    """Run preflight LLM check for all models.
+
+    Args:
+        models: List of model configurations to test
+
+    Returns:
+        True if all models passed, False otherwise
+    """
+    api_key = os.environ.get("LLM_API_KEY")
+    base_url = os.environ.get("LLM_BASE_URL", "https://llm-proxy.eval.all-hands.dev")
+    skip_preflight = os.environ.get("SKIP_PREFLIGHT", "").lower() == "true"
+
+    if skip_preflight:
+        print("Preflight check: SKIPPED (SKIP_PREFLIGHT=true)")
+        return True
+
+    if not api_key:
+        print("Preflight check: SKIPPED (LLM_API_KEY not set)")
+        return True
+
+    print(f"\nPreflight LLM check for {len(models)} model(s)...")
+    print("-" * 50)
+
+    all_passed = True
+    for model_config in models:
+        success, message = check_model(model_config, api_key, base_url)
+        print(message)
+        if not success:
+            all_passed = False
+
+    print("-" * 50)
+
+    if all_passed:
+        print(f"✓ All {len(models)} model(s) passed preflight check\n")
+    else:
+        print("✗ Some models failed preflight check")
+        print("Evaluation aborted to avoid wasting compute resources.\n")
+
+    return all_passed
+
+
 def main() -> None:
     model_ids_str = get_required_env("MODEL_IDS")
     github_output = get_required_env("GITHUB_OUTPUT")
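The skip conditions in `run_preflight_check` (skip when `SKIP_PREFLIGHT=true` or when `LLM_API_KEY` is unset) can be isolated into a tiny sketch. `preflight_enabled` is a hypothetical helper, taking the environment as a dict parameter for easy testing rather than reading `os.environ` as the diff's code does:

```python
def preflight_enabled(env: dict[str, str]) -> bool:
    """Mirror of run_preflight_check's gating: the check is skipped when
    SKIP_PREFLIGHT=true (case-insensitive) or when LLM_API_KEY is missing."""
    if env.get("SKIP_PREFLIGHT", "").lower() == "true":
        return False
    if not env.get("LLM_API_KEY"):
        return False
    return True


print(preflight_enabled({"LLM_API_KEY": "sk-test"}))  # True
print(preflight_enabled({"LLM_API_KEY": "sk-test", "SKIP_PREFLIGHT": "True"}))  # False
print(preflight_enabled({}))  # False
```

Note that in the real script a skipped check still returns `True` (the pipeline proceeds); this helper only answers whether the network check would actually run.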
@@ -216,14 +360,17 @@ def main() -> None:
 
     # Resolve model configs
     resolved = find_models_by_id(model_ids)
+    print(f"Resolved {len(resolved)} model(s): {', '.join(model_ids)}")
+
+    # Run preflight check
+    if not run_preflight_check(resolved):
+        error_exit("Preflight LLM check failed")
 
     # Output as JSON
     models_json = json.dumps(resolved, separators=(",", ":"))
     with open(github_output, "a", encoding="utf-8") as f:
         f.write(f"models_json={models_json}\n")
 
-    print(f"Resolved {len(resolved)} model(s): {', '.join(model_ids)}")
-
 
 if __name__ == "__main__":
     main()
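The `SDK_ONLY_PARAMS` filtering inside `check_model` can be exercised standalone. A minimal sketch with the `litellm.completion` call omitted; `litellm_kwargs` is a hypothetical name for the inline dict comprehension, and the values are copied from the GLM-5 entry in the diff:

```python
from typing import Any

# Mirrors SDK_ONLY_PARAMS from the diff: SDK-internal keys litellm must not see.
SDK_ONLY_PARAMS = {"disable_vision"}


def litellm_kwargs(llm_config: dict[str, Any]) -> dict[str, Any]:
    """Strip 'model' and SDK-only keys, as check_model does before calling
    litellm.completion (the model name is passed separately)."""
    return {
        k: v
        for k, v in llm_config.items()
        if k != "model" and k not in SDK_ONLY_PARAMS
    }


cfg = {
    "model": "litellm_proxy/openrouter/z-ai/glm-5",
    "disable_vision": True,
    "temperature": 0.0,
}
print(litellm_kwargs(cfg))  # {'temperature': 0.0}
```

Without this filtering, `disable_vision` would be forwarded as an unknown completion parameter, which is exactly why the diff introduces the shared `SDK_ONLY_PARAMS` set.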
