fix(shell): improve reliability — keepalive, error classification, agent-timeout detection#634

Merged
Guimove merged 1 commit into main from improve/shell-ux on May 5, 2026

Guimove commented on May 4, 2026

Context

Customer bug (Jonathan Petitcolas): `qovery shell` WebSocket connections drop with "Deadline of 30s exceeded for receiving agent response" because the K8s namespace lookup + pod list + exec can take longer than the hardcoded 30s gateway timeout.

Companion MR on rust-backend: qovery/backend/rust-backend!593

Changes

Keepalive

  • Replace unbounded read deadline with ping/pong keepalive (`PingInterval=30s`, `ReadTimeout=75s`).
    Idle-but-healthy connections now survive indefinitely; only truly dead connections are detected and closed.

WebSocket error classification (`wserror.go`)

  • Permanent errors (1007/1008): cancel the reconnect loop immediately — permission denied or auth rejected, retrying won't help.
  • Transient errors (1011): reconnect with delay.
  • Agent timeout subset (`IsAgentResponseTimeout`): 1011 errors whose text contains one of:
    • `"exceeded for receiving agent response"` — gateway DEFAULT_AGENT_RESPONSE_TIMEOUT
    • `"while connecting to pod"` — shell-agent KUBE_OPERATION_TIMEOUT (K8s exec)
    • `"while setting up port forward"` — shell-agent KUBE_PORT_FORWARD_TIMEOUT
    • `"Retry budget exhausted"` — shell-agent retry budget guard
    Errors in this subset show a specific "retrying…" message instead of the generic "service unavailable" one.

Port-forward

  • Replace `log.Fatal` with `log.Errorf` + return so the process does not exit on a connection error.
  • Improved error messages matching shell-agent timeout strings.

Tests

  • `go test ./pkg` — all wserror tests pass.
  • 19 tests covering `IsPermanentCloseError`, `IsInternalServerError`, `IsAgentResponseTimeout` (including wrapped errors and ordering invariant), `ServiceUnavailableMessage`.

Guimove force-pushed the improve/shell-ux branch from d421e88 to 5e24490 on May 4, 2026 15:36
…ent-timeout detection

- Replace unbounded read deadline with ping/pong keepalive (PingInterval=30s, ReadTimeout=75s)
- Classify WebSocket close codes: 1007/1008 permanent (cancel+no retry), 1011 transient (retry)
- Detect agent-side timeout messages within 1011 errors and show specific retry guidance:
    "exceeded for receiving agent response" — gateway DEFAULT_AGENT_RESPONSE_TIMEOUT
    "while connecting to pod"              — shell-agent KUBE_OPERATION_TIMEOUT (K8s exec)
    "while setting up port forward"        — shell-agent KUBE_PORT_FORWARD_TIMEOUT
    "Retry budget exhausted"               — shell-agent retry budget guard
- Stop reconnect loop on permanent close errors (permission denied, auth rejected)
- Fix port-forward: replace log.Fatal with log.Errorf+return, improve error messages
- Add wserror.go with IsPermanentCloseError, IsInternalServerError, IsAgentResponseTimeout,
  ServiceUnavailableMessage helpers and full test coverage
Guimove force-pushed the improve/shell-ux branch from 5e24490 to e49f29a on May 4, 2026 15:40
Guimove changed the title from "improve(shell): apply ReadTimeout and improve port-forward error messages" to "fix(shell): improve reliability — keepalive, error classification, agent-timeout detection" on May 4, 2026
Guimove merged commit 8fcbd20 into main on May 5, 2026
6 checks passed
Guimove deleted the improve/shell-ux branch on May 5, 2026 08:13