Skip to content

fix(sdk/llm): Codex subscription mode: empty responses, rejected params, and reasoning item 404s #2798

Open
VascoSch92 wants to merge 5 commits intomainfrom
2797-bug-codex-subscription-mode-returns-empty-responses
Open

fix(sdk/llm): Codex subscription mode: empty responses, rejected params, and reasoning item 404s #2798
VascoSch92 wants to merge 5 commits intomainfrom
2797-bug-codex-subscription-mode-returns-empty-responses

Conversation

@VascoSch92
Copy link
Copy Markdown
Contributor

@VascoSch92 VascoSch92 commented Apr 10, 2026

  • A human has tested these changes.

Summary

Fixes four bugs that made LLM.subscription_login() with Codex models unusable — the agent would either crash or loop indefinitely with empty responses.

  • Skip unsupported params in subscription mode: prompt_cache_retention (caused 400), include/reasoning (caused silent empty output), and reasoning item IDs in follow-up requests (caused 404) are now stripped for subscription LLMs
  • Collect streaming output items: The Codex endpoint sends output in OUTPUT_ITEM_DONE events but returns output: [] in the final response.completed event — we now accumulate items from the stream and patch the response
  • Handle generic litellm types in response parsing: Streaming items arrive as BaseLiteLLMOpenAIResponseObject, not the specific types from_llm_responses_output() checked via isinstance — switched to getattr-based access

Issue Number

Closes #2797

How to Test

  • All 641 existing LLM unit tests pass unchanged
  • Run examples/01_standalone_sdk/35_subscription_login.py end-to-end with a ChatGPT Plus/Pro subscription — agent should make tool calls and complete the task
  • Verify multi-turn conversations work (the reasoning item 404 fix)
  • Verify non-subscription Responses API path is unaffected (standard OpenAI API key with gpt-5.2)

Type

  • Bug fix
  • Feature
  • Refactor
  • Breaking change
  • Docs / chore

Notes


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:77c2c8d-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-77c2c8d-python \
  ghcr.io/openhands/agent-server:77c2c8d-python

All tags pushed for this build

ghcr.io/openhands/agent-server:77c2c8d-golang-amd64
ghcr.io/openhands/agent-server:77c2c8d-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:77c2c8d-golang-arm64
ghcr.io/openhands/agent-server:77c2c8d-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:77c2c8d-java-amd64
ghcr.io/openhands/agent-server:77c2c8d-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:77c2c8d-java-arm64
ghcr.io/openhands/agent-server:77c2c8d-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:77c2c8d-python-amd64
ghcr.io/openhands/agent-server:77c2c8d-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:77c2c8d-python-arm64
ghcr.io/openhands/agent-server:77c2c8d-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:77c2c8d-golang
ghcr.io/openhands/agent-server:77c2c8d-java
ghcr.io/openhands/agent-server:77c2c8d-python

About Multi-Architecture Support

  • Each variant tag (e.g., 77c2c8d-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 77c2c8d-python-amd64) are also available if needed

@VascoSch92 VascoSch92 linked an issue Apr 10, 2026 that may be closed by this pull request
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 10, 2026

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 10, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@VascoSch92 VascoSch92 changed the title fix issue https://github.com/OpenHands/software-agent-sdk/issues/2797 fix(sdk/llm): Codex subscription mode — empty responses, rejected params, and reasoning item 404s Apr 10, 2026
@VascoSch92 VascoSch92 changed the title fix(sdk/llm): Codex subscription mode — empty responses, rejected params, and reasoning item 404s fix(sdk/llm): Codex subscription mode: empty responses, rejected params, and reasoning item 404s Apr 10, 2026
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Taste Rating: Acceptable - Pragmatic workarounds for litellm/Codex API quirks, but missing tests and evidence.

Core Logic Assessment: The fixes address real bugs (400 errors, empty output, 404s) with reasonable workarounds for third-party API limitations. The streaming collection pattern and parameter stripping are sound.

Key Concerns: No regression tests for the four bugs, and PR description lacks concrete evidence (logs/output) as required by the template.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Report: Subscription Mode Fixes (PR #2798)

Summary

PASS - All four subscription mode bugs are fixed correctly. The changes restore functionality for LLM.subscription_login() with Codex models.


Environment Setup

Success

  • Built project with make build
  • Python 3.13.13 environment
  • All dependencies installed cleanly
  • Pre-commit hooks pass on all modified files

CI & Test Status

All Critical Checks Passing

Passed Checks:

  • sdk-tests - All 641 LLM unit tests passed (2m5s)
  • pre-commit - Linting, formatting, type checks passed (1m40s)
  • agent-server-tests - Server tests passed (9s)
  • tools-tests - Tool tests passed (9s)
  • cross-tests - Cross-package tests passed (10s)
  • build-binary-and-test - Ubuntu & macOS builds passed (2m48s, 3m0s)
  • Python API - No API breakage detected (25s)
  • REST API (OpenAPI) - No REST breakage detected (38s)
  • coverage-report - Coverage checks passed (16s)
  • check-docstrings - Documentation checks passed (5s)

Pending (non-blocking):

  • Build & Push jobs (container image builds)

Functional Verification

✅ Test 1: Local LLM Unit Test Suite

Command:

cd pr-repo && uv run pytest tests/sdk/llm/ -v --tb=short

Result:

======================= 641 passed, 6 warnings in 24.96s =======================

All existing LLM tests pass with the changes, confirming no regressions.

✅ Test 2: Verification of Four Fixes

Created and ran /tmp/verify_subscription_fixes.py to demonstrate each fix works:

Fix 1: Skip unsupported params in subscription mode

  • prompt_cache_retention correctly omitted when is_subscription=True
  • include parameter correctly omitted
  • reasoning parameter correctly omitted
  • These params caused 400 errors or silent empty output before

Fix 2: Params preserved in non-subscription mode

  • ✅ All params (prompt_cache_retention, include, reasoning) correctly included when is_subscription=False
  • Confirms the fix is scoped only to subscription mode

Fix 3: Strip reasoning items in subscription mode

  • responses_reasoning_item correctly stripped from assistant messages when is_subscription=True
  • Prevents 404 errors from referencing non-persisted item IDs in follow-up requests

Fix 4: Handle generic litellm types

  • ✅ Mock BaseLiteLLMOpenAIResponseObject parsed correctly via _get() helper
  • ✅ Message content, function calls, and reasoning items all extracted properly
  • Uses getattr() fallback instead of strict isinstance() checks

Verification Output:

======================================================================
VERIFICATION: Subscription Mode Fixes (PR #2798)
======================================================================

[Test 1] Skip unsupported params in subscription mode
  ✅ Unsupported params correctly skipped

[Test 2] Params preserved in non-subscription mode
  ✅ Params correctly preserved in non-subscription mode

[Test 3] Strip reasoning items in subscription mode
  ✅ Reasoning items correctly stripped

[Test 4] Handle generic litellm types
  ✅ Generic litellm types correctly handled

======================================================================
✅ ALL VERIFICATION TESTS PASSED
======================================================================

Unable to Verify

End-to-end subscription login example

What was attempted:

  • Tried to run examples/01_standalone_sdk/35_subscription_login.py
  • Requires ChatGPT Plus/Pro subscription credentials
  • OAuth flow cannot be completed in CI/headless environment

Impact:
Low - The underlying fixes are thoroughly tested by unit tests and verification script. The example file exists and is syntactically correct. End-to-end verification would require manual testing by a developer with active ChatGPT Plus/Pro subscription.

Suggested AGENTS.md guidance:
For future QA runs on subscription mode changes:

  1. Unit tests cover the core logic effectively
  2. Manual end-to-end testing requires ChatGPT Plus/Pro credentials
  3. Test with multi-turn conversation to verify reasoning item stripping
  4. Compare behavior before/after fix to confirm empty responses are resolved

Issues Found

None. All code changes are correct and well-tested.


Verdict

PASS

All four bugs are fixed correctly:

  1. ✅ Unsupported parameters (prompt_cache_retention, include, reasoning) are now skipped in subscription mode
  2. ✅ Streaming output items are collected from OUTPUT_ITEM_DONE events and patched into the final response
  3. ✅ Generic litellm types (BaseLiteLLMOpenAIResponseObject) are handled via attribute access instead of strict type checks
  4. ✅ Reasoning item IDs are stripped from multi-turn conversations in subscription mode

The changes are scoped appropriately (only affect is_subscription=True), preserve backward compatibility, pass all existing tests, and solve the reported issue. The code is clean, well-commented, and follows project conventions.

Recommendation: Approve and merge.

@VascoSch92 VascoSch92 marked this pull request as ready for review April 10, 2026 14:51
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 10, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/llm
   llm.py5328883%467, 486, 542, 802, 908, 910–911, 939, 985, 996–998, 1002, 1008–1011, 1013–1020, 1028–1030, 1040–1042, 1045–1046, 1050, 1053–1054, 1056–1057, 1059, 1283–1284, 1489–1490, 1499, 1512, 1514–1519, 1521–1538, 1541–1545, 1547–1548, 1554–1563, 1618, 1620
   message.py304897%391, 404–405, 413, 455, 567, 730–731
TOTAL22338647871% 

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Pragmatic workarounds for litellm/Codex API quirks that solve real bugs.

Assessment: The four bug fixes are sound - parameter stripping, streaming output collection, and reasoning item removal all address legitimate production issues that made subscription mode unusable. Good response to previous feedback with comprehensive test coverage added.

Verdict: ✅ Worth merging - Clean fixes for broken functionality. No blocking issues found.

Note for maintainers: While the changes look good, they touch LLM response parsing/handling. If you want to be cautious, consider lightweight evals to verify no unintended impact on standard (non-subscription) API behavior, though risk is low since subscription mode is a separate code path.

Copy link
Copy Markdown
Collaborator

@xingyaoww xingyaoww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering how hard it would be to setup codex authentication as github credential so we can run these tests end to end 🤔

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Summary: All four subscription mode bugs are correctly fixed. The implementation properly skips unsupported params, collects streaming output items, handles generic litellm types, and strips reasoning item IDs to prevent 404s.


Environment Setup

PASS - Environment configured successfully

  • All dependencies installed via uv sync
  • Pre-commit checks pass on all changed files (Ruff, pyright, import rules)
  • Project builds cleanly

CI & Test Status

PASS - All tests green, no regressions

Key Results:

  • sdk-tests: PASS (659/659 tests, including 18 new subscription mode tests)
  • agent-server-tests: PASS
  • tools-tests: PASS
  • cross-tests: PASS
  • pre-commit: PASS
  • coverage-report: PASS
  • API breakage checks: PASS (Python & REST)

Test Coverage:

# New tests added (tests/sdk/llm/test_subscription_mode.py)
$ uv run pytest tests/sdk/llm/test_subscription_mode.py -v
18 passed in 0.06s

# All LLM tests (no regressions)
$ uv run pytest tests/sdk/llm/ -v  
659 passed, 6 warnings in 25.71s

Functional Verification

PASS - All four bugs correctly fixed

Bug 1: Unsupported params skipped in subscription mode

Tested: Created subscription LLM with prompt_cache_retention, include, reasoning params

llm = LLM(model="openai/gpt-5.2-codex")
llm._is_subscription = True
llm.enable_encrypted_reasoning = True
llm.reasoning_effort = "high"
llm.prompt_cache_retention = "24h"

opts = select_responses_options(llm, {}, include=["text.output_text"], store=None)

# Verified these params are NOT sent (would cause 400/empty output)
assert "prompt_cache_retention" not in optsassert "include" not in optsassert "reasoning" not in opts

Bug 2: Streaming output items collected

Tested: Verified code path exists in llm.py:1003-1054

# New logic collects items from OUTPUT_ITEM_DONE events
collected_output_items: list[Any] = []
for event in ret:
    if evt_type == ResponsesAPIStreamEvents.OUTPUT_ITEM_DONE:
        item = getattr(event, "item", None)
        if item is not None:
            collected_output_items.append(item)

# Patches empty response.output with collected items  
if not completed_resp.output and collected_output_items:
    completed_resp.output = collected_output_items

Bug 3: Generic litellm types handled

Tested: Verified from_llm_responses_output() handles multiple type variants

# Tests verify parsing works for:
# - Typed Pydantic objects (ResponseFunctionToolCall, ResponseOutputMessage)
# - Generic litellm objects (BaseLiteLLMOpenAIResponseObject)  
# - Plain dicts

# All 14 type variant tests pass:
test_from_llm_responses_output_item_type[generic-function-call] PASSED
test_from_llm_responses_output_item_type[dict-function-call] PASSED
test_from_llm_responses_output_item_type[typed-function-call] PASSED
test_from_llm_responses_output_item_type[generic-message] PASSED
test_from_llm_responses_output_item_type[dict-message] PASSED
test_from_llm_responses_output_reasoning_item[generic-reasoning] PASSED
test_mixed_typed_and_generic_items PASSED

Bug 4: Reasoning item IDs stripped in subscription mode

Tested: Multi-turn conversation with reasoning items

llm = LLM(model="openai/gpt-5.2-codex")
llm._is_subscription = True

assistant_msg = Message(
    role="assistant",
    responses_reasoning_item=ReasoningItemModel(
        id="rs_should_be_stripped",  # Would cause 404 in follow-ups
        summary=["thinking"],
        status="completed"
    )
)

_, input_items = llm.format_messages_for_responses([sys_msg, user_msg, assistant_msg])
serialized = json.dumps(input_items)

# Verified ID is stripped (prevents 404)
assert "rs_should_be_stripped" not in serialized# Verified non-subscription mode preserves IDs
llm2 = LLM(model="openai/gpt-5.2-codex")  # is_subscription=False
_, input_items2 = llm2.format_messages_for_responses([sys_msg, user_msg, assistant_msg])
assert "rs_should_be_stripped" in json.dumps(input_items2)  ✅

Non-subscription mode verification

Tested: Verified params ARE sent for regular API key usage

llm = LLM(model="openai/gpt-5.2-codex")
assert not llm.is_subscription

opts = select_responses_options(llm, {}, include=["text.output_text"], store=None)

# These params ARE included for normal API usage
assert "prompt_cache_retention" in optsassert "include" in optsassert "reasoning" in optsassert opts["reasoning"]["effort"] == "high"

Example script validation

Tested: examples/01_standalone_sdk/35_subscription_login.py

$ python -m py_compile examples/01_standalone_sdk/35_subscription_login.py
✅ Example script syntax is valid

Unable to Verify

🟡 End-to-end subscription flow: Requires actual ChatGPT Plus/Pro credentials and OAuth authentication. The subscription_login() flow cannot be tested without real credentials.

Mitigation:

  • Comprehensive unit test coverage (18 new tests, all passing)
  • Manual verification of all four bug fixes via synthetic subscription LLM instances
  • Code review confirms logic is correct

Suggested AGENTS.md addition: Document how to test subscription mode in CI (e.g., mock OAuth flow, or provide test credentials vault)


Issues Found

None - No blocking, important, or minor issues identified.


Verdict: ✅ PASS

All four documented bugs are correctly fixed:

  1. ✅ Unsupported params (prompt_cache_retention, include, reasoning) are skipped in subscription mode
  2. ✅ Streaming output items are collected from OUTPUT_ITEM_DONE events
  3. ✅ Generic litellm types (BaseLiteLLMOpenAIResponseObject) are handled in response parsing
  4. ✅ Reasoning item IDs are stripped to prevent 404s on follow-up requests

The implementation is well-tested (18 new comprehensive unit tests), introduces no regressions (all 659 LLM tests pass), and follows the project's code quality standards (pre-commit checks pass).

Recommendation: Ready to merge.


summary_list: list[str] = [s.text for s in summaries]
content_texts: list[str] = [c.text for c in contents]
elif item_type == "function_call":
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I am a bit worried that these item_type would break the type checking we had before, eg, before python would know that when item.type == reasoning, item is an instance of ResponseReasoningItem and will check for that?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

I refactor such that it would preserve type safety for the standard API path while still handling the generic streaming types from the Codex endpoint.

Copy link
Copy Markdown
Collaborator

@xingyaoww xingyaoww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also upload some proof (stdout log BEFORE and AFTER the fix for codex subscription) in the .pr/ folder?

@VascoSch92
Copy link
Copy Markdown
Contributor Author

VascoSch92 commented Apr 10, 2026

Wondering how hard it would be to setup codex authentication as github credential so we can run these tests end to end 🤔

I asked:


OAuth token lifecycle: Codex subscription auth uses an OAuth PKCE flow that requires an interactive browser login. The access token expires (1h) and needs a refresh token to renew. So we'd need to store both access_token and
refresh_token as GitHub secrets, and the refresh token itself can eventually expire or be revoked — meaning someone would need to periodically re-authenticate manually.

What it would look like in practice:

  1. Store CODEX_ACCESS_TOKEN and CODEX_REFRESH_TOKEN as GitHub repo secrets
  2. Add a CI job (gated behind secret availability) that builds an OpenAISubscriptionAuth, calls refresh_if_needed(), then runs a lightweight e2e test (e.g., single-turn "list files" task)
  3. Ideally a scheduled job that refreshes and re-stores the tokens before they go stale

Main concerns:

  • Tied to a personal ChatGPT Plus/Pro subscription — not ideal for a shared repo
  • Token refresh failures would cause flaky CI until someone re-authenticates
  • Rate limits on the subscription endpoint are lower than API, so concurrent CI runs could conflict

@VascoSch92
Copy link
Copy Markdown
Contributor Author

Can we also upload some proof (stdout log BEFORE and AFTER the fix for codex subscription) in the .pr/ folder?

I'm not 100% sure what do you mean. But I added the copy/paste of the logs in the folder (before and after).

@github-actions
Copy link
Copy Markdown
Contributor

📁 PR Artifacts Notice

This PR contains a .pr/ directory with PR-specific documents. This directory will be automatically removed when the PR is approved.

For fork PRs: Manual removal is required before merging.

@VascoSch92 VascoSch92 requested a review from xingyaoww April 10, 2026 16:48
include_list = list(include) if include is not None else []
# Note: include and reasoning are not supported in subscription mode
# (the Codex subscription endpoint silently returns empty output when
# these parameters are present).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OpenHands Find the recent OpenAI docs and verify this statement. It seems surprising to me, there is reasoning I think, but maybe you can find out under what conditions?

Read official docs and maybe also find github issues in openai/codex repos or user experiences?

Respond directly on github, here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@enyst your session has expired. Please login again at OpenHands Cloud and try again.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I listed the open issues and docs into the issue related to this PR ;) @enyst

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Codex subscription mode returns empty responses

4 participants