Skip to content

feat: auto-stop idle sessions and preserve git repo state across restarts#651

Merged
Gkrumbach07 merged 4 commits intoambient-code:mainfrom
adalton:andalton/RHOAIENG-49782
Feb 19, 2026
Merged

feat: auto-stop idle sessions and preserve git repo state across restarts#651
Gkrumbach07 merged 4 commits intoambient-code:mainfrom
adalton:andalton/RHOAIENG-49782

Conversation

@adalton
Copy link
Contributor

@adalton adalton commented Feb 18, 2026

Summary

  • Add configurable inactivity timeout that automatically stops idle sessions, reclaiming cluster resources
  • Preserve git repo state (local branches, uncommitted/staged changes) to S3 on pod shutdown and restore on resume, so work is never lost when pods are stopped or recycled
  • Three-tier timeout resolution: session-level spec.inactivityTimeout > project-level ProjectSettings.spec.inactivityTimeoutSeconds > default 24h; set to 0 to disable

What changed

CRDs:

  • AgenticSession: added spec.inactivityTimeout, status.lastActivityTime, status.stoppedReason
  • ProjectSettings: added spec.inactivityTimeoutSeconds

Backend:

  • Debounced activity tracking in agui_proxy.go updates status.lastActivityTime on AG-UI events (RUN_STARTED, TEXT_MESSAGE_START, TEXT_MESSAGE_CONTENT, TOOL_CALL_START)
  • parseStatus() extracts new status fields for API responses

Operator:

  • New inactivity.go with shouldAutoStop(), resolveInactivityTimeout(), triggerInactivityStop(), and per-namespace ProjectSettings cache
  • monitorPod() checks inactivity on each tick; re-reads CR before stopping to avoid races
  • reconciler.go reads stop-reason annotation to set status.stoppedReason and condition reason

Frontend:

  • "Stopped (idle)" badge on session cards
  • Inactivity alert banner in session header (full/actions-only modes)
  • "This session was automatically stopped after being idle" message in hibernated section

State-sync:

  • sync.sh: on SIGTERM, creates git bundles, uncommitted/staged patches, and metadata.json per repo
  • hydrate.sh: restores repos from bundles, checks out saved branch, applies patches (best-effort)
  • TerminationGracePeriodSeconds increased from 30 to 60

Tests:

  • inactivity_test.go: 16 tests covering shouldAutoStop, resolveInactivityTimeout, triggerInactivityStop, getProjectInactivityTimeout
  • agui_proxy_test.go: 18 subtests for isActivityEvent

Test plan

  • Operator and backend unit tests pass (go test -race ./...)
  • Frontend builds with zero errors/warnings (npm run build)
  • Create session with 180s inactivity timeout, verify auto-stop after idle period
  • Verify status.stoppedReason=inactivity and "Stopped (idle)" badge in UI
  • Verify git repo with local branch + uncommitted changes survives stop/resume cycle
  • Verify inactivityTimeout: 0 disables auto-stop
  • Verify sessions without explicit timeout fall back to project settings, then 24h default

Fixes: RHOAIENG-49782

…arts

Sessions now automatically stop after a configurable inactivity timeout,
and git repo state (local branches, uncommitted/staged changes) is
preserved to S3 on pod shutdown and restored on resume.

- Track last AG-UI activity time on the session CR status (debounced to
  once per 60s to avoid excessive API calls)
- Operator monitors running sessions and triggers auto-stop when idle
  beyond the configured timeout
- Three-tier timeout resolution: session spec > project settings >
  default (24h); set to 0 to disable
- Race-condition safe: re-reads CR and re-checks activity before
  stopping
- Frontend shows "Stopped (idle)" badge and inactivity message in the
  session hibernated section

- On SIGTERM, sync.sh creates git bundles (all local refs), uncommitted
  patches, staged patches, and metadata.json for each repo
- On resume, hydrate.sh restores repos from bundles, checks out the
  saved branch, and applies patches (best-effort)
- Handles runtime-added repos not in the original session spec
- TerminationGracePeriodSeconds increased from 30 to 60 for backup time

- CRDs: inactivityTimeout (spec), lastActivityTime/stoppedReason
  (status), projectsettings inactivityTimeoutSeconds
- Backend: activity tracking in agui_proxy.go, parseStatus extracts new
  fields, session types updated
- Operator: inactivity detection extracted to inactivity.go with cache,
  reconciler handles stop-reason annotation
- Frontend: status badge, session header alert, hibernated section
  message
- State-sync: git backup in sync.sh, git restore in hydrate.sh
- Tests: unit tests for shouldAutoStop, resolveInactivityTimeout,
  triggerInactivityStop, getProjectInactivityTimeout, isActivityEvent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@adalton
Copy link
Contributor Author

adalton commented Feb 18, 2026

RHOAIENG-49782 Testing Notes

Prerequisites

  • make build-all push-all REGISTRY=quay.io/rh-ee-andalton
  • (cd e2e && ./scripts/setup-kind.sh)
  • (cd e2e && ./scripts/deploy.sh)
  • ./scripts/setup-vertex-kind.sh
  • Patch the state sync image:
    $ kubectl -n ambient-code set env deployment/agentic-operator \ STATE_SYNC_IMAGE=quay.io/rh-ee-andalton/vteam_state_sync:latest \ AMBIENT_CODE_RUNNER_IMAGE=quay.io/rh-ee-andalton/vteam_claude_runner:latest

1. CRD Validation

1.1 AgenticSession CRD fields

$ kubectl get crd agenticsessions.vteam.ambient-code -o json | \
  jq '.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties.inactivityTimeout'
{
  "default": 86400,
  "description": "Seconds of inactivity before auto-stopping an interactive session. 0 disables auto-shutdown.",
  "minimum": 0,
  "type": "integer"
}
  • type: integer, minimum: 0, default: 86400
$ kubectl get crd agenticsessions.vteam.ambient-code -o json | \
  jq '.spec.versions[0].schema.openAPIV3Schema.properties.status.properties | {lastActivityTime, stoppedReason}'
{
  "lastActivityTime": {
    "description": "Timestamp of last recorded AG-UI activity in this session.",
    "format": "date-time",
    "type": "string"
  },
  "stoppedReason": {
    "description": "Reason the session was stopped.",
    "enum": [
      "user",
      "inactivity"
    ],
    "type": "string"
  }
}
  • lastActivityTime: type: string, format: date-time
  • stoppedReason: type: string, enum: [user, inactivity]

1.2 ProjectSettings CRD field

$ kubectl get crd projectsettings.vteam.ambient-code -o json | \
  jq '.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties.inactivityTimeoutSeconds'
{
  "default": 86400,
  "description": "Default inactivity timeout for sessions in this project (seconds). 0 disables. Overridden by session-level spec.inactivityTimeout.",
  "minimum": 0,
  "type": "integer"
}
  • type: integer, minimum: 0, default: 86400

2. Git Repo State Preservation

2.1 Backup on pod shutdown

  1. Configure GitHub PAT

  2. Create an interactive session with a repo

  3. Wait for Running state

  4. In the session, create some changes with the following prompt:

    > Create three files in /workspace/repos/ambient-test-repo-1/:
      1. committed.txt containing "committed" — add and commit it, do not push
      2. staged.txt containing "staged" — add it to the index but do not commit
      3. Modify README.md by appending "modified" — do not stage or commit
    
    
    AI: All three files are in the correct states:
    
        committed.txt — committed (not shown in status since it's already committed)
        staged.txt — staged (listed under "Changes to be committed")
        README.md — modified but unstaged (listed under "Changes not staged for commit")
    
  5. Monitor the state-sync logs

    kubectl -n test-project logs -f $(kubectl -n test-project get pods | grep session | awk '{ print $1 }')  -c state-sync
  6. Stop the session (user stop)

    =========================================
    [2026-02-17T18:57:10+00:00] SIGTERM received, performing final sync...
    =========================================
      Backing up git state for ambient-test-repo-1...
      Backed up ambient-test-repo-1 (branch: ambient/session-1771354357, HEAD: 54dfcc0f)
      Syncing 1 repo state(s) to S3...
      Git repo state backup complete (1 repos)
    
  • Logs show Backing up git state for <repo>
  • Logs show Backed up <repo> (branch: test-branch, HEAD: ...)
  • Logs show Git repo state backup complete
  1. Verify S3 contents:

    $ kubectl exec -n ambient-code deploy/minio -- mc alias set local http://localhost:9000 minioadmin minioadmin123 && \
        kubectl exec -n ambient-code deploy/minio -- mc ls --recursive local/ambient-sessions/test-project/session-1771354357/repo-state/
    Added `local` successfully.
    [2026-02-17 18:57:10 UTC]   294B STANDARD ambient-test-repo-1/metadata.json
    [2026-02-17 18:57:10 UTC] 1.2KiB STANDARD ambient-test-repo-1/repo.bundle
    [2026-02-17 18:57:10 UTC]   134B STANDARD ambient-test-repo-1/staged.patch
    [2026-02-17 18:57:10 UTC]   270B STANDARD ambient-test-repo-1/uncommitted.patch
  • repo.bundle exists
  • uncommitted.patch exists and is non-empty
  • staged.patch exists and is non-empty
  • metadata.json exists with correct currentBranch, remoteUrl, headSha

2.2 Restore on pod restart

  1. Resume the session stopped in 2.1

  2. Wait for Running state

  3. Check init container logs:

    $ kubectl -n test-project logs -f session-1771354357-runner -c init-hydrate
    =========================================
    Ambient Code Session State Hydration
    =========================================
    Session: test-project/session-1771354357
    S3 Endpoint: http://minio.ambient-code.svc:9000
    S3 Bucket: ambient-sessions
    =========================================
    Creating workspace structure...
    Setting up rclone...
    Testing S3 connection...
               0 2000-01-01 00:00:00        -1 test-project
    S3 connection successful
    Checking for existing session state in S3...
    Found existing session state, downloading from S3...
      Downloading .claude/...
    Transferred:      161.590 KiB / 161.590 KiB, 100%, 0 B/s, ETA -
    Transferred:           12 / 12, 100%
    Elapsed time:         0.0s
      No data for artifacts/
      No data for file-uploads/
    State hydration complete!
    Setting ownership and permissions on subdirectories...
    =========================================
    Setting up repositories and workflows...
    =========================================
    Cloning repositories from spec...
    Found 1 repositories to clone
      Cloning ambient-test-repo-1 (branch: ambient/session-1771354357)...
    Cloning into '/workspace/repos/ambient-test-repo-1'...
    warning: Could not find remote branch ambient/session-1771354357 to clone.
    fatal: Remote branch ambient/session-1771354357 not found in upstream origin
      ⚠ Failed to clone ambient-test-repo-1 (may require authentication)
    =========================================
    Checking for git repo state backup...
    =========================================
    Found git repo state backup, restoring...
    Transferred:        1.904 KiB / 1.904 KiB, 100%, 0 B/s, ETA -
    Transferred:            4 / 4, 100%
    Elapsed time:         0.0s
      Restoring git state for ambient-test-repo-1...
      Cloning missing repo ambient-test-repo-1 from https://github.com/adalton/ambient-test-repo-1...
    Cloning into '/workspace/repos/ambient-test-repo-1'...
      Fetching refs from bundle for ambient-test-repo-1...
    From /tmp/repo-state/ambient-test-repo-1//repo.bundle
     * [new branch]      ambient/session-1771354357 -> ambient/session-1771354357
      Checking out branch: ambient/session-1771354357
    Previous HEAD position was 6965201 Initial commit
    Switched to branch 'ambient/session-1771354357'
      Applying uncommitted changes...
      Applying staged changes...
      Restored ambient-test-repo-1 (branch: ambient/session-1771354357)
    Git repo state restore complete
    =========================================
    Workspace initialized successfully
  • Logs show Found git repo state backup, restoring...
  • Logs show Fetching refs from bundle...
  • Logs show Checking out branch: test-branch
  • Logs show Applying uncommitted changes...
  • Logs show Applying staged changes...
  • Logs show Restored <repo> (branch: test-branch)
  1. Exec into the running pod and verify:

    $ kubectl -n test-project exec -it session-1771354357-runner -c ambient-code-runner -- /bin/bash
    (app-root) bash-5.1$ cd /workspace/repos/ambient-test-repo-1/
    (app-root) bash-5.1$ git config --global --add safe.directory /workspace/repos/ambient-test-repo-1
    (app-root) bash-5.1$ git branch
    * ambient/session-1771354357
      main
    (app-root) bash-5.1$ git status
    On branch ambient/session-1771354357
    Changes to be committed:
      (use "git restore --staged <file>..." to unstage)
            new file:   staged.txt
    
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git restore <file>..." to discard changes in working directory)
            modified:   README.md
            modified:   committed.txt
            modified:   staged.txt
    
    (app-root) bash-5.1$ cat README.md
    # Test Repo 1
    modified
    (app-root) bash-5.1$ cat committed.txt
    committed
    (app-root) bash-5.1$ cat staged.txt
    staged
  • Correct branch checked out
  • Uncommitted changes present
  • Staged changes present

2.3 Termination grace period

$ kubectl -n test-project get pod session-1771354357-runner -o jsonpath='{.spec.terminationGracePeriodSeconds}'
60
  • Returns 60

2.4 Edge case: empty repo / no repos

  1. Create a session with no repos configured

  2. Stop the session

  3. Check state-sync logs

    =========================================
    [2026-02-17T19:13:47+00:00] SIGTERM received, performing final sync...
    =========================================
      No git repos to back up
    [2026-02-17T19:13:47+00:00] Starting sync to S3...
      Syncing .claude/...
      Syncing artifacts/...
      Syncing file-uploads/...
    [2026-02-17T19:13:48+00:00] Sync complete (3 paths synced)
    =========================================
    [2026-02-17T19:13:48+00:00] Final sync complete, exiting
    =========================================
    
  • Logs show No /workspace/repos directory, skipping git backup or No git repos to back up
  • No errors
  1. Resume session, check init-hydrate logs

    =========================================
    Ambient Code Session State Hydration
    =========================================
    Session: test-project/session-1771355474
    S3 Endpoint: http://minio.ambient-code.svc:9000
    S3 Bucket: ambient-sessions
    =========================================
    Creating workspace structure...
    Setting up rclone...
    Testing S3 connection...
               0 2000-01-01 00:00:00        -1 test-project
    S3 connection successful
    Checking for existing session state in S3...
    Found existing session state, downloading from S3...
      Downloading .claude/...
    Transferred:       36.090 KiB / 36.090 KiB, 100%, 0 B/s, ETA -
    Transferred:            5 / 5, 100%
    Elapsed time:         0.0s
      No data for artifacts/
      No data for file-uploads/
    State hydration complete!
    Setting ownership and permissions on subdirectories...
    =========================================
    Setting up repositories and workflows...
    =========================================
    No repositories configured in spec
    =========================================
    Checking for git repo state backup...
    =========================================
    No git repo state backup found
    =========================================
    Workspace initialized successfully
    =========================================
    
  • Logs show 'Not git repo state backup found
  • No errors

3. Activity Tracking

3.1 lastActivityTime updated on message send

  1. Create and start an interactive session

  2. Wait for Running state

  3. Send a message via the UI

  4. After ~5 seconds, check the CR:

    $ kubectl -n test-project get agenticsession session-1771355474 -o jsonpath='{.status.lastActivityTime}'
    2026-02-17T19:18:46Z
  • lastActivityTime is set to a recent RFC3339 timestamp

3.2 Debounce behavior

  1. Send multiple messages in quick succession (< 60s apart)

  2. After each message, check the CR:

    $ kubectl -n test-project get agenticsession session-1771355474 -o jsonpath='{.status.lastActivityTime}'
    2026-02-17T19:28:47Z
    ...
    $ kubectl -n test-project get agenticsession session-1771355474 -o jsonpath='{.status.lastActivityTime}'
    2026-02-17T19:28:47Z
    ...
    $ kubectl -n test-project get agenticsession session-1771355474 -o jsonpath='{.status.lastActivityTime}'
    2026-02-17T19:30:12Z
    
  • lastActivityTime does not change more than once per 60s window
  • Sending a message after 60s+ gap updates the timestamp

4. Inactivity Auto-Stop

4.1 Auto-stop with short timeout

  1. Create a session with a short inactivity timeout:

    $ kubectl -n test-project patch agenticsession session-1771358236 --type merge -p '{"spec":{"inactivityTimeout": 120}}'
    agenticsession.vteam.ambient-code/session-1771358236 patched
  2. Start the session, send one message, then leave idle for > 2 minutes

  3. Watch operator logs:

    $ kubectl -n ambient-code logs deploy/agentic-operator | grep -i inactivity
    2026/02/17 20:01:46 [Inactivity] Session test-project/session-1771358236: idle beyond timeout, triggering auto-stop
    2026/02/17 20:01:46 [Inactivity] Session test-project/session-1771358236: set desired-phase=Stopped with reason=inactivity
  • Operator logs: [Inactivity] Session <ns>/<name>: idle beyond timeout, triggering auto-stop

  • Session transitions to Stopping, then Stopped

  • status.stoppedReason is "inactivity":

    $ kubectl -n test-project get agenticsession session-1771358236 -o jsonpath='{.status.stoppedReason}'
    inactivity

4.2 Condition reasons reflect inactivity

$ kubectl -n test-project get agenticsession session-1771358236 -o json | \
        jq '.status.conditions[] | select(.type == "PodCreated" or .type == "RunnerStarted")'
{
  "lastTransitionTime": "2026-02-17T20:17:34Z",
  "message": "Pod deleted due to inactivity timeout",
  "reason": "InactivityTimeout",
  "status": "False",
  "type": "PodCreated"
}
{
  "lastTransitionTime": "2026-02-17T20:17:34Z",
  "message": "Runner stopped due to inactivity",
  "reason": "InactivityTimeout",
  "status": "False",
  "type": "RunnerStarted"
}
  • reason: "InactivityTimeout"
  • message mentions inactivity

4.3 Timeout disabled (inactivityTimeout: 0)

  1. Create/patch a session with inactivityTimeout: 0
  2. Leave idle beyond default timeout
  • Session remains Running (no auto-stop)

4.4 Project-level timeout precedence

  1. Patch ProjectSettings in the namespace:

    $ kubectl -n test-project patch projectsetting projectsettings \
        --type merge -p '{"spec":{"inactivityTimeoutSeconds": 180}}'
  2. Create a session WITHOUT spec.inactivityTimeout set

  3. Leave idle for > 3 minutes

  • Session auto-stops after ~3 minutes (uses project-level setting)

4.5 Session-level overrides project-level

  1. With project-level set to 180s (from 4.4)
  2. Create a session with inactivityTimeout: 0 (disabled)
  3. Leave idle for > 3 minutes
  • Session remains Running (session-level 0 overrides project 180)

4.6 Manual stop sets stoppedReason=user

  1. Manually stop a running session via UI or annotation

  2. Check CR:

    $ kubectl -n test-project get agenticsession session-1771358236 -o jsonpath='{.status.stoppedReason}'
    user
    
    $ kubectl -n test-project get agenticsession session-1771358236 -o json | jq '.status.conditions[] | select(.type == "PodCreated" or .type == "RunnerStarted")'
    {
      "lastTransitionTime": "2026-02-17T21:37:02Z",
      "message": "Pod deleted by user stop request",
      "reason": "UserStopped",
      "status": "False",
      "type": "PodCreated"
    }
    {
      "lastTransitionTime": "2026-02-17T21:37:02Z",
      "message": "Runner stopped by user",
      "reason": "UserStopped",
      "status": "False",
      "type": "RunnerStarted"
    }
  • Returns "user"
  • Conditions show reason: "UserStopped"

5. Frontend UI

5.1 Inactivity badge on session list

  1. Auto-stop a session via inactivity (from test 4.1)
  2. Navigate to the workspace sessions list
  • Session shows badge: Stopped (idle) instead of just "Stopped"

5.2 Manual stop badge unchanged

  1. Manually stop a session
  2. Navigate to the workspace sessions list
  • Session shows badge: Stopped (no "(idle)" suffix)

5.3 Inactivity alert banner on session detail

  1. Navigate to the auto-stopped session detail page
    I see:

    Session Hibernated
    
    This session was automatically stopped after being idle.
    You can resume it to continue working.
    

    with a "Resume Session" button.

5.4 No banner for manual stop

  1. Navigate to a manually stopped session detail page
    I see:

    Session Hibernated
    

    with a "Resume Session" button.

  • No inactivity alert banner shown

5.5 Resume after inactivity stop

  1. Click Resume on an inactivity-stopped session
  2. Wait for Running state
  • Session starts successfully
  • Inactivity banner disappears
  • Badge changes to "Running"
  • Git repo state restored (if applicable)

@github-actions

This comment has been minimized.

@adalton
Copy link
Contributor Author

adalton commented Feb 18, 2026

Review Response

Thanks for the thorough review. Addressing each item:

✅ Fixed

Critical 1 — Goroutine leak in updateLastActivityTime and Major 5 — No timeout on context

Both addressed in the same change to agui_proxy.go:

  • Added a bounded semaphore (activityUpdateSem, cap 50) — if all slots are busy, the update is dropped (the debounce will retry on the next event). Under normal conditions the 60s debounce means at most one goroutine per session is active; the semaphore protects against pathological bursts (e.g., many sessions starting simultaneously with immediate=true).
  • Replaced context.Background() with context.WithTimeout(context.Background(), 10*time.Second) so hung API calls don't leak goroutines indefinitely.

🚫 Not Changed — Pre-existing / Out of Scope

Critical 2 — Shell injection in sync.sh:14-15: This sanitization pattern (${VAR//[^a-zA-Z0-9-]/}) is pre-existing code present in both sync.sh and hydrate.sh prior to this PR. Both values (NAMESPACE, SESSION_NAME) come from Kubernetes downward API environment variables set by the operator, not user input. Additionally, Kubernetes namespace/name validation already restricts these to [a-z0-9-]. Changing the sanitization approach is out of scope for this Jira issue.

Major 4 — Type assertion at sessions.go:70: This code (stMap["phase"].(string)) is pre-existing and not modified in this PR. Our diff for sessions.go is limited to: (1) stop-reason annotation handling in the Stopping phase, (2) TerminationGracePeriodSeconds change, (3) GCE_METADATA_HOST env var, (4) DAC_READ_SEARCH capability, and (5) inactivity check in monitorPod. The type assertion at line 70 was there before.

🚫 Not Changed — Reviewer Error

Major 3 — Missing error handling in inactivity.go:192: The reviewer states "Update errors are logged but not propagated to reconciler." This is incorrect — looking at the actual code:

_, err = config.DynamicClient.Resource(gvr).Namespace(namespace).Update(...)
if err != nil {
    if errors.IsNotFound(err) {
        return nil
    }
    return fmt.Errorf("failed to set desired-phase for %s/%s: %w", namespace, name, err)
}

Errors are returned (wrapped with fmt.Errorf). The caller in monitorPod handles this with continue (retry on next tick). Only IsNotFound returns nil, which is correct — the session was deleted.

adalton and others added 2 commits February 18, 2026 11:29
Add a semaphore (cap 50) to limit concurrent goroutines spawned by
updateLastActivityTime, and replace context.Background() with a
10-second timeout to prevent goroutine leaks from hung API calls.

Under normal conditions the 60s debounce ensures at most one goroutine
per session; the semaphore protects against bursts when many sessions
start simultaneously with immediate=true. If all slots are busy the
update is dropped and the debounce will retry on the next event.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 18, 2026

Claude Code Review

Summary

This PR implements auto-stop for idle sessions and git state preservation - excellent work on resource management and user experience. Well-architected with proper race condition handling and comprehensive test coverage.

Overall Assessment: APPROVE with minor recommendations.

Issues by Severity

Critical Issues

Operator: Missing Context Timeout

  • File: components/operator/internal/handlers/inactivity.go:73,192
  • Issue: context.TODO() has no timeout - could cause goroutine leaks
  • Fix: Add 10s timeouts like backend (agui_proxy.go:817)

Major Issues

  1. Backend Type Assertions (sessions.go:74-82)

    • Uses direct assertions vs unstructured.NestedString helpers
    • Safe but inconsistent with project patterns
  2. Frontend Type Safety (status-badge.tsx:376)

    • stoppedReason should be "user" | "inactivity" not string
  3. Git Credential Risk (sync.sh:196-198)

    • Pipe directly to sed to avoid logging credentials

Minor Issues

  1. Duplicate inactivity message (extract to component)
  2. Cache race condition (optimization opportunity)
  3. No maximum on inactivityTimeout CRD

Positive Highlights

Exceptional Architecture

  • Race-safe auto-stop (re-reads CR before stopping)
  • Goroutine leak prevention (semaphore + timeouts)
  • Git state preservation (bundles + patches + runtime repos)

Code Quality

  • 34 unit tests with comprehensive coverage
  • Follows all security-standards.md patterns
  • Zero panic() in production
  • Proper RBAC throughout

Frontend Integration

  • Shadcn components used correctly
  • React Query patterns followed
  • Proper type definitions

Recommendations

High Priority:

  1. Add operator context timeouts
  2. Fix git credential logging

Medium Priority:
3. Add CRD validation maximum
4. Extract duplicate component

Low Priority:
5. Use unstructured helpers in parseStatus
6. Add literal types to SessionPhaseBadge

Final Assessment

Excellent software engineering. Issues are polish items, not fundamental flaws. Core implementation is production-ready.

Recommendation: Merge after addressing Critical #1 (operator timeouts)

Generated with Claude Code


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 18, 2026

Claude Code Review

Summary

This PR implements auto-stop for idle sessions with git repo state preservation. The implementation is comprehensive and well-tested, with excellent test coverage (16 operator tests + 18 backend tests). The code follows repository patterns closely, with only a few minor issues to address.

Issues by Severity

🔴 Critical Issues

1. Potential Goroutine Leak in Activity Tracking (MITIGATED in commit 2)

  • Location: components/backend/websocket/agui_proxy.go:662
  • Issue: Commit 1 spawned unbounded goroutines without timeout context
  • Status:FIXED in commit e246b15 - Added semaphore (cap 50) and 10s timeout
  • Verification: Good defensive programming - debounce provides natural rate limiting, semaphore prevents burst pathology

🟡 Major Issues

2. Missing Type Safety Check in Status Parsing

  • Location: components/backend/handlers/sessions.go:78
  • Pattern Violation: Direct type assertion without checking
// ❌ Current (violates type-safety rules)
if stoppedReason, ok := status["stoppedReason"].(string); ok && stoppedReason != "" {
    result.StoppedReason = types.StringPtr(stoppedReason)
}
  • Expected Pattern (per CLAUDE.md:452-456):
// ✅ Should use unstructured helpers
if stoppedReason, found, err := unstructured.NestedString(status, "stoppedReason"); found && err == nil && stoppedReason != "" {
    result.StoppedReason = types.StringPtr(stoppedReason)
}
  • Impact: Could panic if stoppedReason exists but is not a string
  • Same issue applies to: lastActivityTime parsing on line 74

3. Race Condition in Cache TTL Check

  • Location: components/operator/internal/handlers/inactivity.go:562-569
  • Issue: Lock is released between TTL check and cache read
// ❌ Race: cache could be updated between unlock and re-read
psTimeoutCache.mu.Lock()
if entry, ok := psTimeoutCache.entries[namespace]; ok {
    if time.Since(entry.fetchedAt) < inactivityTimeoutCacheTTL {
        psTimeoutCache.mu.Unlock()  // <-- Lock released here
        return entry.timeout        // <-- Cache could change here
    }
}
psTimeoutCache.mu.Unlock()
  • Fix: Store value under lock before returning
psTimeoutCache.mu.Lock()
if entry, ok := psTimeoutCache.entries[namespace]; ok {
    if time.Since(entry.fetchedAt) < inactivityTimeoutCacheTTL {
        timeout := entry.timeout  // Copy under lock
        psTimeoutCache.mu.Unlock()
        return timeout
    }
}
psTimeoutCache.mu.Unlock()

4. Incorrect Error Handling Pattern

  • Location: components/operator/internal/handlers/sessions.go:1767-1771
  • Issue: continue in error case abandons monitoring instead of retrying
if err := triggerInactivityStop(sessionNamespace, sessionName); err != nil {
    log.Printf("[Inactivity] Failed to auto-stop %s/%s: %v", sessionNamespace, sessionName, err)
    continue // ❌ Abandons monitoring - operator will never retry auto-stop
}
return
  • Expected: Either return error (for operator retry) or just log and continue monitoring:
if err := triggerInactivityStop(sessionNamespace, sessionName); err != nil {
    log.Printf("[Inactivity] Failed to auto-stop %s/%s, will retry: %v", sessionNamespace, sessionName, err)
    // Don't return - keep monitoring and retry on next tick
} else {
    return  // Only exit monitor if stop was successful
}

🔵 Minor Issues

5. Missing Security Context Documentation

  • Location: components/operator/internal/handlers/sessions.go:1380-1383
  • Issue: DAC_READ_SEARCH capability added without ADR or security review
  • Concern: Elevated privilege for state-sync container (reads all files regardless of permissions)
  • Recommendation: Document in security standards why this is necessary and safe

6. Hardcoded Magic Numbers

  • Location: components/backend/websocket/agui_proxy.go:686
  • Issue: Semaphore size 50 not documented
  • Fix: Add constant with comment explaining sizing rationale

7. Frontend Type Safety

  • Location: components/frontend/src/components/status-badge.tsx:375
  • Issue: stoppedReason parameter not strongly typed
// ❌ Current
export function SessionPhaseBadge({ phase, stoppedReason }: { phase: string; stoppedReason?: string })

// ✅ Should use proper type
export function SessionPhaseBadge({ phase, stoppedReason }: { 
  phase: string; 
  stoppedReason?: "user" | "inactivity" 
})

8. Shell Script Robustness

  • Location: components/runners/state-sync/sync.sh:1600
  • Issue: Credential stripping regex could fail on unusual URL formats
  • Current: sed 's|://[^@]*@|://|'
  • Risk: Might not catch all credential formats (e.g., URI-encoded passwords)
  • Recommendation: Test with various credential formats or use git's built-in credential helper

Positive Highlights

Excellent Test Coverage - 16 operator tests + 18 backend tests covering all edge cases
Race Condition Protection - Re-reads CR before auto-stop to avoid stale data
Proper Three-Tier Resolution - Session > Project > Default timeout hierarchy
Defensive Programming - Semaphore + timeout added proactively in commit 2
CRD Schema Validation - Proper minimum: 0 constraint and enum for stoppedReason
Type-Safe Unstructured Access (Operator) - Correctly uses unstructured.NestedInt64, NestedMap
Proper OwnerReferences - Not needed (no child resources created)
Error Logging with Context - All errors include namespace/session name
No Token Leaks - Credentials stripped from URLs in logs (sync.sh:1600, hydrate.sh:1455)
Frontend Follows Patterns - Uses Shadcn Alert component, proper conditional rendering
Documentation - Clear commit messages, inline comments explain non-obvious logic

Recommendations

Priority 1 (Before Merge):

  1. Fix type assertions in sessions.go:74,78 to use unstructured.Nested* helpers
  2. Fix cache race condition in inactivity.go:562-569
  3. Fix error handling in sessions.go:1767-1771 (don't abandon monitoring on error)

Priority 2 (Follow-up PR OK):
4. Document DAC_READ_SEARCH security rationale in .claude/context/security-standards.md
5. Add constant for semaphore size with sizing rationale
6. Strongly type stoppedReason in frontend ("user" | "inactivity")

Priority 3 (Nice to Have):
7. Add integration test for full idle → auto-stop → resume cycle
8. Test credential stripping with URI-encoded passwords
9. Consider adding Prometheus metric for auto-stop events

Architecture Alignment

✅ Follows Backend Development Context patterns
✅ Follows K8s Client Usage patterns (service account for status updates)
✅ Follows Error Handling patterns (mostly - see issue #4)
⚠️ Violates Type-Safe Unstructured Access in backend (issue #2)
✅ Follows Security Standards (token redaction, no sensitive logging)
✅ Follows Frontend patterns (Shadcn components, type safety mostly)

Test Coverage Assessment

Operator Tests (inactivity_test.go): 16 tests

  • shouldAutoStop: 9 tests (all edge cases covered)
  • resolveInactivityTimeout: 4 tests (precedence hierarchy validated)
  • triggerInactivityStop: 2 tests (race protection verified)
  • getProjectInactivityTimeout: 5 tests (cache behavior validated)

Backend Tests (agui_proxy_test.go): 18 tests

  • isActivityEvent: All AG-UI event types tested

Missing Tests:

  • ⚠️ Integration test for updateLastActivityTime goroutine behavior
  • ⚠️ E2E test for idle session auto-stop (mentioned in PR description as manual test)

Overall Assessment: Strong implementation with excellent test coverage. The three critical issues should be addressed before merge, but the core logic is sound. This is production-ready after addressing the type safety and race condition issues.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@Gkrumbach07 Gkrumbach07 merged commit 7d442a7 into ambient-code:main Feb 19, 2026
15 of 20 checks passed
@adalton adalton deleted the andalton/RHOAIENG-49782 branch February 19, 2026 18:29
Gkrumbach07 added a commit that referenced this pull request Feb 19, 2026
…bility (#658)

## Summary

- Replace `DAC_READ_SEARCH` capability on the state-sync sidecar with `RunAsUser: 1001` to match the runner container UID
- Fixes pod creation failures on OpenShift where `restricted-v2` SCC blocks `DAC_READ_SEARCH`

## Root Cause

PR #651 introduced `DAC_READ_SEARCH` on the state-sync container so it could read workspace files owned by the runner (UID 1001, mode 700). This works on kind/Kubernetes but fails on OpenShift because `restricted-v2` does not allow adding that capability.

## Fix

Running state-sync as the same UID (1001) as the runner gives it native read access to the workspace files without any elevated capabilities. Fully compatible with `restricted-v2` and all OpenShift SCCs.

## Test plan

- [x] `go build ./...` passes for operator
- [ ] Deploy on OpenShift and verify sessions create pods successfully
- [ ] Verify state-sync can still read workspace files and sync to S3

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments