feat: auto-stop idle sessions and preserve git repo state across restarts #651
…arts

Sessions now automatically stop after a configurable inactivity timeout, and git repo state (local branches, uncommitted/staged changes) is preserved to S3 on pod shutdown and restored on resume.

- Track last AG-UI activity time on the session CR status (debounced to once per 60s to avoid excessive API calls)
- Operator monitors running sessions and triggers auto-stop when idle beyond the configured timeout
- Three-tier timeout resolution: session spec > project settings > default (24h); set to 0 to disable
- Race-condition safe: re-reads CR and re-checks activity before stopping
- Frontend shows "Stopped (idle)" badge and inactivity message in the session hibernated section
- On SIGTERM, sync.sh creates git bundles (all local refs), uncommitted patches, staged patches, and metadata.json for each repo
- On resume, hydrate.sh restores repos from bundles, checks out the saved branch, and applies patches (best-effort)
- Handles runtime-added repos not in the original session spec
- TerminationGracePeriodSeconds increased from 30 to 60 for backup time
- CRDs: inactivityTimeout (spec), lastActivityTime/stoppedReason (status), projectsettings inactivityTimeoutSeconds
- Backend: activity tracking in agui_proxy.go, parseStatus extracts new fields, session types updated
- Operator: inactivity detection extracted to inactivity.go with cache, reconciler handles stop-reason annotation
- Frontend: status badge, session header alert, hibernated section message
- State-sync: git backup in sync.sh, git restore in hydrate.sh
- Tests: unit tests for shouldAutoStop, resolveInactivityTimeout, triggerInactivityStop, getProjectInactivityTimeout, isActivityEvent

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RHOAIENG-49782 Testing Notes
Prerequisites
1. CRD Validation
1.1 AgenticSession CRD fields
$ kubectl get crd agenticsessions.vteam.ambient-code -o json | \
jq '.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties.inactivityTimeout'
{
"default": 86400,
"description": "Seconds of inactivity before auto-stopping an interactive session. 0 disables auto-shutdown.",
"minimum": 0,
"type": "integer"
}
$ kubectl get crd agenticsessions.vteam.ambient-code -o json | \
jq '.spec.versions[0].schema.openAPIV3Schema.properties.status.properties | {lastActivityTime, stoppedReason}'
{
"lastActivityTime": {
"description": "Timestamp of last recorded AG-UI activity in this session.",
"format": "date-time",
"type": "string"
},
"stoppedReason": {
"description": "Reason the session was stopped.",
"enum": [
"user",
"inactivity"
],
"type": "string"
}
}
1.2 ProjectSettings CRD field
$ kubectl get crd projectsettings.vteam.ambient-code -o json | \
jq '.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties.inactivityTimeoutSeconds'
{
"default": 86400,
"description": "Default inactivity timeout for sessions in this project (seconds). 0 disables. Overridden by session-level spec.inactivityTimeout.",
"minimum": 0,
"type": "integer"
}
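The precedence implied by the two CRD fields above (session-level value, then project-level value, then the 24h default, with 0 meaning "disabled") can be sketched as follows. This is a minimal illustration, not the PR's `resolveInactivityTimeout`: the helper name `resolveTimeout` and the use of nil pointers to mean "not set at this level" are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// defaultInactivityTimeout mirrors the 86400-second default shown in the CRD output.
const defaultInactivityTimeout = 24 * time.Hour

// resolveTimeout is a hypothetical helper. A nil pointer means the field is
// not set at that level; a value of 0 means auto-stop is disabled and is
// returned unchanged so the caller can skip the idle check.
func resolveTimeout(sessionSeconds, projectSeconds *int64) time.Duration {
	if sessionSeconds != nil {
		return time.Duration(*sessionSeconds) * time.Second
	}
	if projectSeconds != nil {
		return time.Duration(*projectSeconds) * time.Second
	}
	return defaultInactivityTimeout
}

func main() {
	session, project := int64(300), int64(3600)
	fmt.Println(resolveTimeout(&session, &project)) // session-level value wins
	fmt.Println(resolveTimeout(nil, &project))      // falls back to the project setting
	fmt.Println(resolveTimeout(nil, nil))           // falls back to the default
}
```

Returning 0 rather than a sentinel keeps "disabled" distinguishable from "unset", which is why the sketch uses pointers for the inputs.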
2. Git Repo State Preservation
2.1 Backup on pod shutdown
2.2 Restore on pod restart
2.3 Termination grace period
$ kubectl -n test-project get pod session-1771354357-runner -o jsonpath='{.spec.terminationGracePeriodSeconds}'
60
2.4 Edge case: empty repo / no repos
3. Activity Tracking
3.1 lastActivityTime updated on message send
3.2 Debounce behavior
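For readers checking the debounce behavior, a rough sketch of a per-session debounce at the 60-second interval the commit message describes is shown below. The type `activityDebouncer`, the method `shouldWrite`, and the demo helper `at` are all hypothetical names; the actual logic in agui_proxy.go may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// activityDebounceInterval matches the "once per 60s" figure from the commit message.
const activityDebounceInterval = 60 * time.Second

// activityDebouncer remembers, per session, when the last status write was allowed.
type activityDebouncer struct {
	mu       sync.Mutex
	interval time.Duration
	lastSent map[string]time.Time
}

func newActivityDebouncer(interval time.Duration) *activityDebouncer {
	return &activityDebouncer{interval: interval, lastSent: make(map[string]time.Time)}
}

// shouldWrite reports whether a write for session is due at time now and,
// if so, records now as the time of the latest write.
func (d *activityDebouncer) shouldWrite(session string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if last, ok := d.lastSent[session]; ok && now.Sub(last) < d.interval {
		return false
	}
	d.lastSent[session] = now
	return true
}

// at is a demo-only helper: a fixed start time plus the given number of seconds.
func at(seconds int) time.Time {
	start := time.Date(2026, 2, 17, 20, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(seconds) * time.Second)
}

func main() {
	d := newActivityDebouncer(activityDebounceInterval)
	fmt.Println(d.shouldWrite("session-a", at(0)))  // true: first event always writes
	fmt.Println(d.shouldWrite("session-a", at(10))) // false: inside the 60s window
	fmt.Println(d.shouldWrite("session-a", at(61))) // true: window has elapsed
}
```

Passing `now` in explicitly, rather than calling `time.Now()` inside, keeps the check deterministic in tests.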
4. Inactivity Auto-Stop
4.1 Auto-stop with short timeout
4.2 Condition reasons reflect inactivity
$ kubectl -n test-project get agenticsession session-1771358236 -o json | \
jq '.status.conditions[] | select(.type == "PodCreated" or .type == "RunnerStarted")'
{
"lastTransitionTime": "2026-02-17T20:17:34Z",
"message": "Pod deleted due to inactivity timeout",
"reason": "InactivityTimeout",
"status": "False",
"type": "PodCreated"
}
{
"lastTransitionTime": "2026-02-17T20:17:34Z",
"message": "Runner stopped due to inactivity",
"reason": "InactivityTimeout",
"status": "False",
"type": "RunnerStarted"
}
4.3 Timeout disabled (inactivityTimeout: 0)
4.4 Project-level timeout precedence
4.5 Session-level overrides project-level
4.6 Manual stop sets stoppedReason=user
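The auto-stop cases in this section rest on two ideas from the commit message: an idle check against the resolved timeout, and a re-read of the session before stopping so that late activity cancels the stop. A minimal sketch under stated assumptions follows. The names `isIdle`, `stopIfStillIdle`, `fixedActivity`, and `at` are hypothetical, and treating a missing activity timestamp as "not idle" is an assumption of this example, not something the PR confirms.

```go
package main

import (
	"fmt"
	"time"
)

// isIdle reports whether a session has been idle for at least timeout.
// A timeout of 0 disables auto-stop. A zero lastActivity (nothing recorded
// yet) is treated as "not idle" in this sketch.
func isIdle(lastActivity, now time.Time, timeout time.Duration) bool {
	if timeout <= 0 || lastActivity.IsZero() {
		return false
	}
	return now.Sub(lastActivity) >= timeout
}

// stopIfStillIdle re-reads the latest activity time through fetch (a stand-in
// for re-reading the CR) and calls stop only if the fresh value is still idle.
func stopIfStillIdle(fetch func() time.Time, stop func(), now time.Time, timeout time.Duration) bool {
	if !isIdle(fetch(), now, timeout) {
		return false
	}
	stop()
	return true
}

// fixedActivity is a demo-only fetch function that always returns t.
func fixedActivity(t time.Time) func() time.Time {
	return func() time.Time { return t }
}

// at is a demo-only helper: a fixed start time plus the given number of seconds.
func at(seconds int) time.Time {
	start := time.Date(2026, 2, 17, 20, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(seconds) * time.Second)
}

func main() {
	timeout := 30 * time.Second
	// The monitor last saw activity at t=0, but a message arrived at t=25.
	fresh := fixedActivity(at(25))
	stopped := stopIfStillIdle(fresh, func() { fmt.Println("stopping session") }, at(31), timeout)
	fmt.Println("stopped:", stopped) // stopped: false
}
```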
5. Frontend UI
5.1 Inactivity badge on session list
5.2 Manual stop badge unchanged
5.3 Inactivity alert banner on session detail
5.4 No banner for manual stop
5.5 Resume after inactivity stop
Review Response
Thanks for the thorough review. Addressing each item:
✅ Fixed
Critical 1: Goroutine leak in
Both addressed in the same change to
🚫 Not Changed: Pre-existing / Out of Scope
Critical 2 (shell injection in sync.sh:14-15): This sanitization pattern (
Major 4 (type assertion at sessions.go:70): This code (
🚫 Not Changed: Reviewer Error
Major 3 (missing error handling in inactivity.go:192): The reviewer states "Update errors are logged but not propagated to reconciler." This is incorrect. Looking at the actual code:
_, err = config.DynamicClient.Resource(gvr).Namespace(namespace).Update(...)
if err != nil {
	if errors.IsNotFound(err) {
		return nil
	}
	return fmt.Errorf("failed to set desired-phase for %s/%s: %w", namespace, name, err)
}
Errors are returned (wrapped with
Add a semaphore (cap 50) to limit concurrent goroutines spawned by updateLastActivityTime, and replace context.Background() with a 10-second timeout to prevent goroutine leaks from hung API calls.

Under normal conditions the 60s debounce ensures at most one goroutine per session; the semaphore protects against bursts when many sessions start simultaneously with immediate=true. If all slots are busy the update is dropped and the debounce will retry on the next event.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
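The pattern this commit describes, a buffered channel used as a non-blocking semaphore plus a per-call context timeout, might look roughly like the sketch below. It is an illustration only: `boundedRunner`, `tryGo`, and `waitFor` are hypothetical names, and the demo uses a capacity of 1 instead of 50 so the "all slots busy" path is easy to see.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// boundedRunner limits how many background updates run at the same time.
type boundedRunner struct {
	slots   chan struct{} // buffered channel used as a counting semaphore
	timeout time.Duration // upper bound on each update call
	wg      sync.WaitGroup
}

func newBoundedRunner(capacity int, timeout time.Duration) *boundedRunner {
	return &boundedRunner{slots: make(chan struct{}, capacity), timeout: timeout}
}

// tryGo starts update in a goroutine if a slot is free and returns true.
// If every slot is busy it returns false without blocking, so the caller
// can drop the update and let a later event retry.
func (r *boundedRunner) tryGo(update func(ctx context.Context) error) bool {
	select {
	case r.slots <- struct{}{}:
	default:
		return false
	}
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		defer func() { <-r.slots }() // free the slot once the update returns
		ctx, cancel := context.WithTimeout(context.Background(), r.timeout)
		defer cancel()
		if err := update(ctx); err != nil {
			fmt.Println("update failed:", err)
		}
	}()
	return true
}

// waitFor is a demo-only update that blocks until release is closed or the
// context deadline passes, standing in for a slow API call.
func waitFor(release <-chan struct{}) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-release:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	r := newBoundedRunner(1, 10*time.Second)
	release := make(chan struct{})
	first := r.tryGo(waitFor(release))
	second := r.tryGo(waitFor(release)) // the only slot is still held
	fmt.Println(first, second)          // true false
	close(release)
	r.wg.Wait()
}
```

Because the slot is acquired before the goroutine starts and released only after the update returns, a hung call holds its slot for at most the timeout rather than indefinitely.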
Claude Code Review
Summary
This PR implements auto-stop for idle sessions and git state preservation - excellent work on resource management and user experience. Well-architected with proper race condition handling and comprehensive test coverage.
Overall Assessment: APPROVE with minor recommendations.
Issues by Severity
Critical Issues
Operator: Missing Context Timeout
Major Issues
Minor Issues
Positive Highlights
Exceptional Architecture
Code Quality
Frontend Integration
Recommendations
High Priority:
Medium Priority:
Low Priority:
Final Assessment
Excellent software engineering. Issues are polish items, not fundamental flaws. Core implementation is production-ready.
Recommendation: Merge after addressing Critical #1 (operator timeouts)
Generated with Claude Code
What Amber Loaded for Code Review
Amber automatically loaded these repository standards from the memory system:
Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Code Review
Summary
This PR implements auto-stop for idle sessions with git repo state preservation. The implementation is comprehensive and well-tested, with excellent test coverage (16 operator tests + 18 backend tests). The code follows repository patterns closely, with only a few minor issues to address.
Issues by Severity
🔴 Critical Issues
1. Potential Goroutine Leak in Activity Tracking (MITIGATED in commit 2)
🟡 Major Issues
2. Missing Type Safety Check in Status Parsing
// ❌ Current (violates type-safety rules)
if stoppedReason, ok := status["stoppedReason"].(string); ok && stoppedReason != "" {
	result.StoppedReason = types.StringPtr(stoppedReason)
}
// ✅ Should use unstructured helpers
if stoppedReason, found, err := unstructured.NestedString(status, "stoppedReason"); found && err == nil && stoppedReason != "" {
	result.StoppedReason = types.StringPtr(stoppedReason)
}
3. Race Condition in Cache TTL Check
// ❌ Race: cache could be updated between unlock and re-read
psTimeoutCache.mu.Lock()
if entry, ok := psTimeoutCache.entries[namespace]; ok {
	if time.Since(entry.fetchedAt) < inactivityTimeoutCacheTTL {
		psTimeoutCache.mu.Unlock() // <-- Lock released here
		return entry.timeout       // <-- Cache could change here
	}
}
psTimeoutCache.mu.Unlock()
// ✅ Suggested: copy the value while the lock is still held
psTimeoutCache.mu.Lock()
if entry, ok := psTimeoutCache.entries[namespace]; ok {
	if time.Since(entry.fetchedAt) < inactivityTimeoutCacheTTL {
		timeout := entry.timeout // Copy under lock
		psTimeoutCache.mu.Unlock()
		return timeout
	}
}
psTimeoutCache.mu.Unlock()
4. Incorrect Error Handling Pattern
// ❌ Current
if err := triggerInactivityStop(sessionNamespace, sessionName); err != nil {
	log.Printf("[Inactivity] Failed to auto-stop %s/%s: %v", sessionNamespace, sessionName, err)
	continue // ❌ Abandons monitoring - operator will never retry auto-stop
}
return
// ✅ Suggested: keep monitoring and retry on the next tick
if err := triggerInactivityStop(sessionNamespace, sessionName); err != nil {
	log.Printf("[Inactivity] Failed to auto-stop %s/%s, will retry: %v", sessionNamespace, sessionName, err)
	// Don't return - keep monitoring and retry on next tick
} else {
	return // Only exit monitor if stop was successful
}
🔵 Minor Issues
5. Missing Security Context Documentation
6. Hardcoded Magic Numbers
7. Frontend Type Safety
// ❌ Current
export function SessionPhaseBadge({ phase, stoppedReason }: { phase: string; stoppedReason?: string })
// ✅ Should use proper type
export function SessionPhaseBadge({ phase, stoppedReason }: {
  phase: string;
  stoppedReason?: "user" | "inactivity"
})
8. Shell Script Robustness
Positive Highlights
✅ Excellent Test Coverage - 16 operator tests + 18 backend tests covering all edge cases
Recommendations
Priority 1 (Before Merge):
Priority 2 (Follow-up PR OK):
Priority 3 (Nice to Have):
Architecture Alignment
✅ Follows Backend Development Context patterns
Test Coverage Assessment
Operator Tests (
Backend Tests (
Missing Tests:
Overall Assessment: Strong implementation with excellent test coverage. The three critical issues should be addressed before merge, but the core logic is sound. This is production-ready after addressing the type safety and race condition issues.
…bility (#658)

## Summary

- Replace `DAC_READ_SEARCH` capability on the state-sync sidecar with `RunAsUser: 1001` to match the runner container UID
- Fixes pod creation failures on OpenShift where `restricted-v2` SCC blocks `DAC_READ_SEARCH`

## Root Cause

PR #651 introduced `DAC_READ_SEARCH` on the state-sync container so it could read workspace files owned by the runner (UID 1001, mode 700). This works on kind/Kubernetes but fails on OpenShift because `restricted-v2` does not allow adding that capability.

## Fix

Running state-sync as the same UID (1001) as the runner gives it native read access to the workspace files without any elevated capabilities. Fully compatible with `restricted-v2` and all OpenShift SCCs.

## Test plan

- [x] `go build ./...` passes for operator
- [ ] Deploy on OpenShift and verify sessions create pods successfully
- [ ] Verify state-sync can still read workspace files and sync to S3

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Summary

- `spec.inactivityTimeout` > project-level `ProjectSettings.spec.inactivityTimeoutSeconds` > default 24h; set to 0 to disable

What changed

CRDs:
- `AgenticSession`: added `spec.inactivityTimeout`, `status.lastActivityTime`, `status.stoppedReason`
- `ProjectSettings`: added `spec.inactivityTimeoutSeconds`

Backend:
- `agui_proxy.go` updates `status.lastActivityTime` on AG-UI events (RUN_STARTED, TEXT_MESSAGE_START, TEXT_MESSAGE_CONTENT, TOOL_CALL_START)
- `parseStatus()` extracts new status fields for API responses

Operator:
- `inactivity.go` with `shouldAutoStop()`, `resolveInactivityTimeout()`, `triggerInactivityStop()`, and per-namespace ProjectSettings cache
- `monitorPod()` checks inactivity on each tick; re-reads CR before stopping to avoid races
- `reconciler.go` reads `stop-reason` annotation to set `status.stoppedReason` and condition reason

Frontend:

State-sync:
- `sync.sh`: on SIGTERM, creates git bundles, uncommitted/staged patches, and metadata.json per repo
- `hydrate.sh`: restores repos from bundles, checks out saved branch, applies patches (best-effort)
- `TerminationGracePeriodSeconds` increased from 30 to 60

Tests:
- `inactivity_test.go`: 16 tests covering `shouldAutoStop`, `resolveInactivityTimeout`, `triggerInactivityStop`, `getProjectInactivityTimeout`
- `agui_proxy_test.go`: 18 subtests for `isActivityEvent`

Test plan

- (`go test -race ./...`)
- (`npm run build`)
- `status.stoppedReason=inactivity` and "Stopped (idle)" badge in UI
- `inactivityTimeout: 0` disables auto-stop

Fixes: RHOAIENG-49782
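As an illustration of the activity classification listed under Backend, a set-membership check over the four AG-UI event types named in the description could be sketched like this. The function name `countsAsActivity` is hypothetical and is not the PR's `isActivityEvent`; `SOME_OTHER_EVENT` is a made-up placeholder for any type outside the list.

```go
package main

import "fmt"

// activityEvents holds the AG-UI event types the PR description names as
// updating status.lastActivityTime.
var activityEvents = map[string]struct{}{
	"RUN_STARTED":          {},
	"TEXT_MESSAGE_START":   {},
	"TEXT_MESSAGE_CONTENT": {},
	"TOOL_CALL_START":      {},
}

// countsAsActivity reports whether eventType is one of the listed types.
func countsAsActivity(eventType string) bool {
	_, ok := activityEvents[eventType]
	return ok
}

func main() {
	for _, e := range []string{"RUN_STARTED", "TOOL_CALL_START", "SOME_OTHER_EVENT"} {
		fmt.Printf("%s -> %v\n", e, countsAsActivity(e))
	}
}
```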