feat: cleanup timeout for process termination (#207)
Implement configurable cleanup timeout to control how long the system waits
for process cleanup verification before triggering forced shutdown. This
replaces the hardcoded 10-second timeout with a user-configurable value.
Configuration:
- Add --cleanup-timeout CLI flag (default: 10s, env: COG_CLEANUP_TIMEOUT)
- Flow: CLI → Config → Runner for proper field injection
Implementation changes:
- Replace hardcoded timeout with context.WithTimeout() pattern
- Shorten the check interval from 100ms to 10ms for more responsive detection
- Add injectable verifyFn field to Runner for comprehensive testing
- Update terminology from "ungraceful shutdown" to "forced shutdown"
Testing improvements:
- Migrate time-dependent tests to Go 1.25 synctest for deterministic execution
- Add comprehensive test coverage for timeout scenarios and multiple ForceKill calls
- Use safe high PID values (9999999) and proper mocking to prevent real process operations
- Configure linter to handle synctest patterns correctly
This enables users to customize cleanup wait times based on their specific
workload requirements while maintaining safe defaults.
* Address leaking coglets from test
The test harness now properly cleans up orphaned coglet processes when interrupted:
Key Features:
- Cross-platform compatibility: Works on both macOS (Darwin) and Linux
- Robust process discovery: Uses pgrep -f coglet as primary method, falls back to ps with platform-specific flags
- Safe process killing: Validates we can signal processes before attempting to kill them
- Process group cleanup: Kills both process groups and individual processes
- Safety measures: Never kills ourselves, our parent, or PID 1
- Signal handling: Responds to both SIGINT (ctrl+c) and SIGTERM
Implementation Details:
- killAllChildProcesses() finds and kills coglet processes during cleanup
- findCogletProcesses() tries pgrep first, falls back to ps
- Platform-specific ps flags: -ax on macOS, -e on Linux
- Process ownership validation through signal testing before killing
- Integrated into existing TestMain signal handler
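As a rough illustration of the discovery strategy above, the sketch below tries `pgrep -f` first and falls back to `ps` with the platform-specific flag. The function names and the `-o pid= -o command=` output selection are assumptions made for easy parsing; the commit message only states the `-ax` and `-e` flags.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"runtime"
	"strconv"
	"strings"
)

// psFlags returns the process-listing flag per platform: -ax on macOS, -e elsewhere.
func psFlags(goos string) []string {
	if goos == "darwin" {
		return []string{"-ax"}
	}
	return []string{"-e"}
}

// parsePgrep reads pgrep output, which has one PID per line.
func parsePgrep(out string) []int {
	var pids []int
	for _, line := range strings.Split(out, "\n") {
		if pid, err := strconv.Atoi(strings.TrimSpace(line)); err == nil {
			pids = append(pids, pid)
		}
	}
	return pids
}

// parsePs reads "PID COMMAND..." lines and keeps PIDs whose command contains pattern.
func parsePs(out, pattern string) []int {
	var pids []int
	for _, line := range strings.Split(out, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		pid, err := strconv.Atoi(fields[0])
		if err != nil {
			continue
		}
		if strings.Contains(strings.Join(fields[1:], " "), pattern) {
			pids = append(pids, pid)
		}
	}
	return pids
}

// findProcesses tries pgrep first and falls back to ps if pgrep cannot run.
func findProcesses(pattern string) []int {
	out, err := exec.Command("pgrep", "-f", pattern).Output()
	if err == nil {
		return parsePgrep(string(out))
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
		return nil // pgrep ran but matched nothing
	}
	args := append(psFlags(runtime.GOOS), "-o", "pid=", "-o", "command=")
	out, err = exec.Command("ps", args...).Output()
	if err != nil {
		return nil
	}
	return parsePs(string(out), pattern)
}

func main() {
	fmt.Println("coglet PIDs:", findProcesses("coglet"))
}
```

The returned PIDs would still need the safety filtering described earlier before any signal is sent.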
---------
Co-Authored-By: Michael Dwan <[email protected]>
Co-authored-by: Morgan Fainberg <[email protected]>
cmd/cog/main.go
34 additions & 15 deletions
@@ -19,13 +19,15 @@ import (
 )
 
 type ServerCmd struct {
-	Host string `help:"Host address to bind the HTTP server to" default:"0.0.0.0"`
-	Port int `help:"Port number for the HTTP server" default:"5000"`
-	UseProcedureMode bool `help:"Enable procedure mode for concurrent predictions" name:"use-procedure-mode"`
-	AwaitExplicitShutdown bool `help:"Wait for explicit shutdown signal instead of auto-shutdown" name:"await-explicit-shutdown"`
-	UploadURL string `help:"Base URL for uploading prediction output files" name:"upload-url"`
-	WorkingDirectory string `help:"Override the working directory for predictions" name:"working-directory"`
-	RunnerShutdownGracePeriod time.Duration `help:"Grace period before force-killing prediction runners" name:"runner-shutdown-grace-period" default:"600s"`
+	Host string `help:"Host address to bind the HTTP server to" default:"0.0.0.0" env:"COG_HOST"`
+	Port int `help:"Port number for the HTTP server" default:"5000" env:"COG_PORT"`
+	UseProcedureMode bool `help:"Enable procedure mode for concurrent predictions" name:"use-procedure-mode" env:"COG_USE_PROCEDURE_MODE"`
+	AwaitExplicitShutdown bool `help:"Wait for explicit shutdown signal instead of auto-shutdown" name:"await-explicit-shutdown" env:"COG_AWAIT_EXPLICIT_SHUTDOWN"`
+	OneShot bool `help:"Enable one-shot mode (single runner, wait for cleanup before ready)" name:"one-shot" env:"COG_ONE_SHOT"`
+	UploadURL string `help:"Base URL for uploading prediction output files" name:"upload-url" env:"COG_UPLOAD_URL"`
+	WorkingDirectory string `help:"Override the working directory for predictions" name:"working-directory" env:"COG_WORKING_DIRECTORY"`
+	RunnerShutdownGracePeriod time.Duration `help:"Grace period before force-killing prediction runners" name:"runner-shutdown-grace-period" default:"600s" env:"COG_RUNNER_SHUTDOWN_GRACE_PERIOD"`
+	CleanupTimeout time.Duration `help:"Maximum time to wait for process cleanup before hard exit" name:"cleanup-timeout" default:"10s" env:"COG_CLEANUP_TIMEOUT"`
 }
 
 type SchemaCmd struct{}
@@ -43,6 +45,12 @@ var logger = util.CreateLogger("cog")