Skip to content

fix(security): first-message WebSocket auth to prevent token leakage#2790

Open
simonrosenberg wants to merge 3 commits intomainfrom
fix/websocket-first-message-auth
Open

fix(security): first-message WebSocket auth to prevent token leakage#2790
simonrosenberg wants to merge 3 commits intomainfrom
fix/websocket-first-message-auth

Conversation

@simonrosenberg
Copy link
Copy Markdown
Collaborator

@simonrosenberg simonrosenberg commented Apr 9, 2026

Summary

  • _accept_authenticated_websocket now supports first-message auth: when no key is found in query params or headers, the server accepts the WebSocket, reads the first frame, and validates {"type": "auth", "session_api_key": "..."} before processing any other messages. 10-second timeout for the auth frame.
  • Query parameter and header-based auth are kept as deprecated fallbacks so existing clients continue to work (with a deprecation warning logged).
  • _safe_close_websocket now accepts optional code/reason parameters.
  • 14 new unit tests covering: valid/invalid first-message auth, malformed JSON, client disconnect, timeout, wrong payload type, missing key, and end-to-end through events_socket.

Addresses the server-side component of OpenHands/OpenHands#13777 — moves the first-message auth from the V0 listen_socket.py to the SDK as requested in tofarr's review.

Test plan

🤖 Generated with Claude Code


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:e7987de-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-e7987de-python \
  ghcr.io/openhands/agent-server:e7987de-python

All tags pushed for this build

ghcr.io/openhands/agent-server:e7987de-golang-amd64
ghcr.io/openhands/agent-server:e7987de-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:e7987de-golang-arm64
ghcr.io/openhands/agent-server:e7987de-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:e7987de-java-amd64
ghcr.io/openhands/agent-server:e7987de-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:e7987de-java-arm64
ghcr.io/openhands/agent-server:e7987de-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:e7987de-python-amd64
ghcr.io/openhands/agent-server:e7987de-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:e7987de-python-arm64
ghcr.io/openhands/agent-server:e7987de-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:e7987de-golang
ghcr.io/openhands/agent-server:e7987de-java
ghcr.io/openhands/agent-server:e7987de-python

About Multi-Architecture Support

  • Each variant tag (e.g., e7987de-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., e7987de-python-amd64) are also available if needed

Clients can now send {"type": "auth", "session_api_key": "..."} as the
first WebSocket frame instead of passing the token as a query parameter.
This keeps sk-oh-* tokens out of reverse-proxy / load-balancer access
logs (Traefik, Datadog, etc.).

Query param and header auth are preserved as deprecated fallbacks for
backwards compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@simonrosenberg simonrosenberg requested a review from tofarr April 9, 2026 22:02
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-agent-server/openhands/agent_server
   sockets.py1883382%163, 166, 169, 246–248, 369–370, 375–377, 383–385, 389, 391–392, 396–397, 400, 403, 405–406, 409–410, 430, 432, 446–450, 460
TOTAL229011040454% 

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solves a real security problem (tokens in URLs leak to logs) with proper backward compatibility. The three-path auth flow is complex but necessary given WebSocket protocol constraints. A few improvements would help with operational visibility and clarity.

ws.send_json = AsyncMock()
ws.close = AsyncMock()
ws.headers = headers or {}
return ws
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: Test gap - No test for the precedence case where BOTH query param AND first-message auth are provided. Based on the code in sockets.py, query param takes precedence (it's checked first at line 113). Add a test verifying this behavior:

@pytest.mark.asyncio
async def test_legacy_query_param_takes_precedence_over_first_message():
    """When both query param and first-message auth are provided, query param wins."""
    ws = _make_mock_websocket()
    ws.receive_text.return_value = json.dumps(
        {"type": "auth", "session_api_key": "sk-oh-different"}
    )
    with patch("openhands.agent_server.sockets.get_default_config") as mock_config:
        mock_config.return_value.session_api_keys = ["sk-oh-valid"]
        result = await _accept_authenticated_websocket(ws, session_api_key="sk-oh-valid")

    assert result is True
    ws.accept.assert_called_once()
    # Should NOT read first message because query param already authenticated
    ws.receive_text.assert_not_called()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Report: WebSocket First-Message Authentication

Verdict: ⚠️ PASS WITH ISSUES

The WebSocket first-message authentication feature works correctly and provides the intended security improvement. However, there is one minor formatting issue that needs to be fixed before merge.


Environment Setup

Dependencies installed: uv sync --frozen completed successfully
Python environment: Ready (Python 3.13.13)


CI & Test Status

CI Checks

  • Run tests/agent-server-tests: Passed (48s)
  • Run tests/sdk-tests: Passed (9s)
  • Run tests/tools-tests: Passed (9s)
  • Python API breakage checks: Passed
  • REST API breakage checks: Passed
  • Pre-commit checks: Failed due to formatting issues (see Issues Found below)

Local Test Results

$ uv run pytest tests/agent_server/test_websocket_first_message_auth.py -v
========== 14 passed in 0.27s ==========

$ uv run pytest tests/agent_server/ -k "websocket" -v
========== 33 passed, 679 deselected in 8.49s ==========

Test Coverage Analysis:

  • ✅ All 14 new first-message auth tests pass
  • ✅ All 19 existing WebSocket tests pass (no regressions)
  • ✅ Comprehensive edge case coverage:
    • No auth configured
    • Legacy query param auth (valid/invalid)
    • Legacy header auth (valid/invalid)
    • First-message auth (valid/invalid/malformed/timeout/disconnect)
    • End-to-end integration tests

Functional Verification

Code Review

I performed a thorough code review of the implementation:

Security Analysis:

  • ✅ No authentication bypass vulnerabilities
  • ✅ Proper timeout handling (10 seconds) prevents DoS
  • ✅ Generic error messages ("Authentication failed") prevent information leakage
  • ✅ Race conditions prevented by synchronous auth before event processing
  • ✅ First-message frame properly consumed by auth flow

Logic Verification:

# Auth flow order (as implemented):
1. If no session_api_keys configuredaccept unconditionally2. If legacy key provided (query/header):
   - Validlog deprecation warning, accept- Invalidclose with 4001 (before accept) ✓
3. If no legacy keyfirst-message auth:
   - Accept connection first (required by WebSocket protocol) ✓
   - Read first frame with 10s timeout- Validate: must be dict with type="auth" and valid key- Invalidclose with 4001

Backward Compatibility:

  • ✅ Legacy query param auth still works (with deprecation warning logged)
  • ✅ Legacy header auth still works (with deprecation warning logged)
  • ✅ Existing WebSocket tests all pass (no breaking changes)

Test Coverage Completeness:

Test scenarios covered (14 tests):
1. ✓ No auth configured - immediate acceptance
2. ✓ Legacy query param - valid key
3. ✓ Legacy query param - invalid key
4. ✓ Legacy header - valid key
5. ✓ Legacy header - invalid key  
6. ✓ First-message auth - valid key
7. ✓ First-message auth - invalid key
8. ✓ First-message auth - wrong type field
9. ✓ First-message auth - missing key field
10. ✓ First-message auth - malformed JSON
11. ✓ First-message auth - client disconnect
12. ✓ First-message auth - timeout (10s)
13. ✓ End-to-end events_socket - valid auth
14. ✓ End-to-end events_socket - rejected auth

Issues Found

🟡 Minor: Line Length Violation

File: tests/agent_server/test_websocket_first_message_auth.py:205
Issue: Comment line exceeds 88 character limit (89 characters)
Impact: Causes pre-commit check to fail; does not affect functionality

Current line (89 chars):

# First call: auth message (read by _accept_authenticated_websocket via receive_text)

Suggested fix:

# First call: auth message (read by _accept_authenticated_websocket)
# via receive_text, then receive_json calls disconnect

Or simply shorten to:

# First call: auth message read by _accept_authenticated_websocket

(The inline comment on the next line already explains "Then receive_json calls: disconnect")


Summary

Security Enhancement: Successfully prevents API keys from appearing in URLs/logs
Backward Compatible: Legacy auth methods work with deprecation warnings
Well Tested: 14 new tests + 19 existing tests all pass
Clean Implementation: Proper error handling, timeouts, and validation
🟡 Minor Fix Needed: One line length violation in test file

Recommendation: Fix the line length issue and merge. The feature is functionally complete and ready for production use.

- Add comment explaining 10-second timeout rationale
- Add comment explaining why accept() must precede first-message read
- Log specific failure reasons (timeout, malformed JSON, disconnect,
  wrong type, invalid key) instead of a single generic message
- Log info on successful first-message auth for adoption tracking
- Add test for query-param-takes-precedence-over-first-message case

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants