Skip to content

feat: enhance random run with attach/detach, status, log-dir, and abort#128

Open
NETIZEN-11 wants to merge 5 commits intokrkn-chaos:mainfrom
NETIZEN-11:feature/random-run-enhancements
Open

feat: enhance random run with attach/detach, status, log-dir, and abort#128
NETIZEN-11 wants to merge 5 commits intokrkn-chaos:mainfrom
NETIZEN-11:feature/random-run-enhancements

Conversation

@NETIZEN-11
Copy link
Copy Markdown

Related Issue

Closes #126

#Summary

Adds full lifecycle management capabilities to krknctl random run <plan.json>, improving control, observability, and flexibility for chaos scenarios.

Changes

1. Attach / Detach Support

  • Added --detach (-d) flag to run chaos scenarios in background
  • Default behavior remains foreground execution
  • Detach mode re-spawns the binary as a background process
  • Updated RunGraph interface to include logDir parameter across docker, podman, and tests

2. Status Command

  • Introduced new command:

    krknctl random status
    
  • Added new package pkg/randomstate for state management:

    • SaveState
    • LoadState
    • ClearState
  • State stored at:

    $TMPDIR/.krknctl/random_run.json
    
  • Displays:

    • Running status
    • Scenario name
    • PID
    • Start time
    • Log directory

3. Custom Log Directory

  • Added --log-dir <path> flag
  • Automatically creates directory if it does not exist
  • Logs are written to the specified directory instead of current working directory

4. Abort Command

  • Introduced:

    krknctl random abort
    
  • Terminates running chaos scenario using PID from state file

  • Cleans up state file after termination

Why this change is needed

These enhancements provide:

  • Better control over long-running chaos experiments (detach/abort)
  • Improved observability (status command)
  • Flexible log management (custom log directory)

#Testing

# foreground
krknctl random run plan.json --max-parallel 2

# background
krknctl random run plan.json --max-parallel 2 --detach

# status
krknctl random status

# custom logs
krknctl random run plan.json --max-parallel 2 --log-dir /tmp/chaos-logs

# abort
krknctl random abort

Checklist

  • Code builds successfully
  • Tested locally
  • Backward compatible changes
  • Documentation updated (if required)

Nitesh added 4 commits March 30, 2026 21:43
Introduce --detach / -d flag on 'krknctl random run'. Default is
foreground (attached). When --detach is set the binary re-executes
itself as a background child process via os.Executable + exec.Command,
strips --detach from the forwarded argv in all token forms (--detach,
-d, --detach=true, -d=true), and returns immediately printing the PID.

- Extract runRandomPlan() as the shared execution core for both modes
- Add runDetached() with correct argv stripping and no unused parameters
- Register --detach / -d flag in root.go
- Thread logDir through RunGraph interface and CommonRunGraph so the
  log-dir feature works end-to-end without a second interface change
- Update docker, podman, graph command, and test call sites for the
  new 9-argument RunGraph signature

Signed-off-by: Nitesh <nitesh@example.com>
Introduce 'krknctl random status' to inspect a running chaos plan.

- Add pkg/randomstate with SaveState / LoadState / ClearState helpers
  that persist a JSON state file at os.TempDir()/.krknctl/random_run.json
  (portable across Linux, macOS, Windows - no hardcoded /tmp path)
- State written by runRandomPlan at startup, cleared on exit via defer
- isProcessAlive() uses syscall.Signal(0) to probe liveness without
  delivering an actual signal; os.Signal(nil) is a nil interface and
  must not be used here
- Status output: running (true/false), scenario name, plan-file path,
  PID, start time, log-dir when set
- Stale state (process dead) is detected and auto-cleared
- ScenarioName stores filepath.Base(planFile) for a clean display name

Signed-off-by: Nitesh <nitesh@example.com>
Add --log-dir flag to 'krknctl random run'.

- Directory is created automatically with os.MkdirAll if it does not
  exist (permissions 0750)
- CommonRunGraph writes each scenario log file into logDir via
  path.Join(logDir, containerName+'.log') instead of the current
  working directory when the flag is provided
- The full resolved log file path is reported in error messages and
  stored in state so 'random status' can display it accurately
- When --log-dir is omitted behaviour is identical to before

Signed-off-by: Nitesh <nitesh@example.com>
Introduce 'krknctl random abort' to immediately stop a running chaos
plan.

- Reads PID from the state file written by runRandomPlan
- Calls process.Kill() on the stored PID
- Clears the state file regardless of kill outcome so no stale state
  is left behind
- Prints confirmation with PID and scenario name on success
- Gracefully handles the case where no plan is running (no state file)

Signed-off-by: Nitesh <nitesh@example.com>
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add lifecycle management for random chaos runs: detach, status, abort, log-dir

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Add detach/attach mode for background chaos execution with PID tracking
• Introduce status command to inspect running chaos plan details
• Add abort command to terminate running chaos scenarios cleanly
• Implement custom log directory support with automatic creation
• Create randomstate package for cross-process state persistence
Diagram
flowchart LR
  A["krknctl random run"] -->|--detach| B["runDetached"]
  B -->|re-exec with --attach| C["Child Process"]
  C -->|runRandomPlan| D["SaveState"]
  D -->|persist to tmpdir| E["randomstate"]
  A -->|foreground| F["runRandomPlan"]
  F -->|execute chaos| G["RunGraph"]
  G -->|logDir| H["Custom Log Directory"]
  I["krknctl random status"] -->|LoadState| E
  J["krknctl random abort"] -->|LoadState| E
  J -->|kill PID| K["Terminate Process"]
  K -->|ClearState| E
Loading

Grey Divider

File Changes

1. cmd/random.go ✨ Enhancement +350/-183

Refactor random run with lifecycle management features

cmd/random.go


2. cmd/root.go ✨ Enhancement +7/-0

Register detach, log-dir flags and new subcommands

cmd/root.go


3. pkg/randomstate/state.go ✨ Enhancement +69/-0

New state management package for chaos run persistence

pkg/randomstate/state.go


View more (6)
4. pkg/scenarioorchestrator/common_functions.go ✨ Enhancement +12/-5

Thread logDir parameter through execution pipeline

pkg/scenarioorchestrator/common_functions.go


5. pkg/scenarioorchestrator/docker/scenario_orchestrator.go ✨ Enhancement +2/-1

Update RunGraph signature to accept logDir parameter

pkg/scenarioorchestrator/docker/scenario_orchestrator.go


6. pkg/scenarioorchestrator/podman/scenario_orchestrator.go ✨ Enhancement +2/-2

Update RunGraph signature to accept logDir parameter

pkg/scenarioorchestrator/podman/scenario_orchestrator.go


7. pkg/scenarioorchestrator/scenario_orchestrator.go ✨ Enhancement +1/-0

Add logDir parameter to RunGraph interface definition

pkg/scenarioorchestrator/scenario_orchestrator.go


8. pkg/scenarioorchestrator/scenarioorchestratortest/common_test_functions.go 🧪 Tests +2/-2

Update test calls to RunGraph with empty logDir

pkg/scenarioorchestrator/scenarioorchestratortest/common_test_functions.go


9. cmd/graph.go ✨ Enhancement +1/-1

Update RunGraph call with empty logDir parameter

cmd/graph.go


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Mar 30, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (1)

Grey Divider


Action required

1. runDetached uses --attach📎 Requirement gap ≡ Correctness
Description
Detach mode respawns krknctl with random run --attach, but no --attach flag is defined for
random run, so the child process will fail to start and detach/attach behavior will not work as
specified.
Code

cmd/random.go[R307-314]

+	// Rebuild the child argv: replace --detach with --attach
+	childArgs := []string{"random", "run", "--attach"}
+	for _, a := range os.Args[1:] {
+		if a == "--detach" || a == "-d" {
+			continue
+		}
+		childArgs = append(childArgs, a)
+	}
Evidence
Compliance requires working attach/detach modes. The detach implementation hard-codes --attach for
the child process, but random run only defines --detach (no --attach), so the detach workflow
is broken.

krknctl random run supports attach/detach execution modes
cmd/random.go[299-314]
cmd/root.go[86-97]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`runDetached()` respawns the binary with `random run --attach`, but `random run` does not define an `--attach` flag. This causes the detached child process to fail with an unknown flag error, breaking detach/attach behavior.

## Issue Context
Foreground execution already represents “attached” mode, so the child process can either run without any extra flag, or you can explicitly add a hidden/optional `--attach` flag and handle it.

## Fix Focus Areas
- cmd/random.go[299-314]
- cmd/root.go[86-97]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. RunGraph missing logDir📎 Requirement gap ≡ Correctness
Description
random run reads/creates --log-dir, but does not pass logDir into RunGraph, so logs will not
be redirected to the user-specified directory (and the call signature no longer matches the updated
interface).
Code

cmd/random.go[R240-243]

+	commChannel := make(chan *models.GraphCommChannel)
+	go func() {
+		(*scenarioOrchestrator).RunGraph(nodes, executionPlan, environment, volumes, false, commChannel, registrySettings, nil)
+	}()
Evidence
Compliance requires that logs are written to the user-specified directory. While logDir is parsed
and stored, the RunGraph invocation in random run omits the new logDir parameter even though
the orchestrator interface/implementations were updated to accept it.

krknctl random run supports specifying a custom log directory
cmd/random.go[66-80]
cmd/random.go[240-243]
pkg/scenarioorchestrator/scenario_orchestrator.go[46-50]
pkg/scenarioorchestrator/common_functions.go[69-82]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`random run` supports `--log-dir` but does not pass `logDir` into `ScenarioOrchestrator.RunGraph(...)`, so custom log directory support is not actually applied (and the call does not match the updated interface).

## Issue Context
The orchestrator interface and implementations now accept `logDir` and `CommonRunGraph` uses it when creating log files.

## Fix Focus Areas
- cmd/random.go[240-243]
- pkg/scenarioorchestrator/scenario_orchestrator.go[46-50]
- pkg/scenarioorchestrator/common_functions.go[69-82]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. status uses nil signal📎 Requirement gap ☼ Reliability
Description
random status checks liveness with process.Signal(os.Signal(nil)), which is not a valid signal-0
check and may panic or always report not running, preventing reliable status reporting.
Code

cmd/random.go[R352-359]

+			// Verify the process is still alive
+			process, err := os.FindProcess(state.PID)
+			running := false
+			if err == nil {
+				// On Unix, FindProcess always succeeds; send signal 0 to check
+				if sigErr := process.Signal(os.Signal(nil)); sigErr == nil {
+					running = true
+				}
Evidence
Compliance requires the status command to correctly indicate whether a chaos run is currently
running. The current liveness check uses a nil os.Signal, which is not a valid cross-platform
“signal 0” liveness probe and can misreport the running state.

krknctl provides a status command/option for running chaos tests
cmd/random.go[352-360]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`random status` uses `process.Signal(os.Signal(nil))` to check if a PID is alive. This is not a correct liveness check and can misbehave (panic/false negatives).

## Issue Context
On Unix, the typical approach is `process.Signal(syscall.Signal(0))`. On Windows, you may need a different approach or accept limited support.

## Fix Focus Areas
- cmd/random.go[352-360]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


View more (1)
4. State not cleared on exit 🐞 Bug ☼ Reliability
Description
With --exit-on-error, runRandomPlan calls os.Exit, which bypasses the deferred ClearState and leaves
a stale state file that status/abort will act on afterward.
Code

cmd/random.go[R263-266]

+				if exitOnError {
+					_, _ = color.New(color.FgHiRed).Println(fmt.Sprintf("aborting chaos run with exit status %d", statErr.ExitStatus))
+					os.Exit(statErr.ExitStatus)
+				}
Evidence
The state file cleanup is deferred, but os.Exit terminates immediately without running defers, so
the saved PID/state persists even though the run is no longer active.

cmd/random.go[223-235]
cmd/random.go[263-266]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When `--exit-on-error` is enabled, `runRandomPlan` calls `os.Exit(...)`, which prevents deferred functions from running. This leaves `random_run.json` on disk with a PID that may no longer be associated with a running chaos plan.

### Issue Context
`random status`/`random abort` rely on the state file to determine what to show/kill.

### Fix Focus Areas
- cmd/random.go[223-235]
- cmd/random.go[263-266]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

5. abort only kills PID 📎 Requirement gap ☼ Reliability
Description
random abort only kills the krknctl process PID from the state file and does not explicitly stop
running chaos containers/scenarios, so it may not reliably stop all running chaos scenarios as
required.
Code

cmd/random.go[R405-414]

+			process, err := os.FindProcess(state.PID)
			if err != nil {
-				return err
+				_ = randomstate.ClearState()
+				return fmt.Errorf("finding process %d: %w", state.PID, err)
			}
-			table.Print()
-			fmt.Print("\n\n")
-			spinner.Suffix = "starting chaos scenarios..."
-			spinner.Start()
-
-			commChannel := make(chan *models.GraphCommChannel)
-
-			go func() {
-				(*scenarioOrchestrator).RunGraph(nodes, executionPlan, environment, volumes, false, commChannel, registrySettings, nil)
-			}()
-
-			for {
-				c := <-commChannel
-				if c == nil {
-					break
-				} else {
-					if c.Err != nil {
-						spinner.Stop()
-						var statErr *utils.ExitError
-						if errors.As(c.Err, &statErr) {
-							if c.ScenarioID != nil && c.ScenarioLogFile != nil {
-								_, err = color.New(color.FgHiRed).Println(fmt.Sprintf("scenario %s at step %d with exit status %d, check log file %s aborting chaos run.",
-									*c.ScenarioID,
-									*c.Layer,
-									statErr.ExitStatus,
-									*c.ScenarioLogFile))
-								if err != nil {
-									return err
-								}
-							}
-							if exitOnerror {
-								_, err = color.New(color.FgHiRed).Println(fmt.Sprintf("aborting chaos run with exit status %d", statErr.ExitStatus))
-								if err != nil {
-									return err
-								}
-								os.Exit(statErr.ExitStatus)
-							}
-							spinner.Start()
-						}

-					}
-					spinner.Suffix = fmt.Sprintf("Running step %d scenario(s): %s", *c.Layer, strings.Join(executionPlan[*c.Layer], ", "))
-
-				}
+			if err = process.Kill(); err != nil {
+				_ = randomstate.ClearState()
+				return fmt.Errorf("killing process %d: %w", state.PID, err)
+			}
Evidence
Compliance requires aborting to immediately stop all running chaos scenarios. The implementation
terminates only the controller process via process.Kill() and performs no orchestrator/container
cleanup, which may leave scenarios running.

krknctl supports aborting running chaos scenarios
cmd/random.go[405-418]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`random abort` currently kills the `krknctl` PID but does not ensure that any running chaos scenario containers/processes are stopped. This may not meet the requirement to stop all running chaos scenarios reliably.

## Issue Context
Consider implementing a more graceful stop (e.g., SIGTERM then SIGKILL) and/or invoking orchestrator/container cleanup (e.g., stop/remove containers started by the run) before clearing state.

## Fix Focus Areas
- cmd/random.go[388-424]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


6. Log path displayed wrong🐞 Bug ≡ Correctness
Description
When --log-dir is set, cmd/random.go prepends logDir to ScenarioLogFile even though ScenarioLogFile
is already a full path, producing a doubled/incorrect log path in error output.
Code

cmd/random.go[R254-258]

+				if c.ScenarioID != nil && c.ScenarioLogFile != nil {
+					logFile := *c.ScenarioLogFile
+					if logDir != "" {
+						logFile = fmt.Sprintf("%s/%s", logDir, *c.ScenarioLogFile)
+					}
Evidence
CommonRunGraph constructs logFilePath as path.Join(logDir, logFilename) and sends that as
ScenarioLogFile. random.go then prefixes logDir again when printing the error, resulting in paths
like /logs//logs/<file>.log.

cmd/random.go[254-258]
pkg/scenarioorchestrator/common_functions.go[68-83]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`cmd/random.go` builds a user-facing `logFile` path by prepending `logDir` to `c.ScenarioLogFile`. But `ScenarioLogFile` now already includes `logDir` when provided, so the displayed path becomes incorrect.

### Issue Context
The orchestrator layer now emits the full log file path via `ScenarioLogFile`.

### Fix Focus Areas
- cmd/random.go[254-258]
- pkg/scenarioorchestrator/common_functions.go[68-83]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


7. Non-portable path.Join used 🐞 Bug ≡ Correctness
Description
CommonRunGraph uses the path package (slash-separated) to build filesystem paths for log files;
this can break on Windows and is inconsistent with randomstate’s use of filepath for OS-specific
paths.
Code

pkg/scenarioorchestrator/common_functions.go[R68-78]

			containerName := utils.GenerateContainerName(config, scenario.Name, &scID)
-			filename := fmt.Sprintf("%s.log", containerName)
-			file, err := os.Create(path.Clean(filename))
-
+			logFilename := fmt.Sprintf("%s.log", containerName)
+			// honour custom log directory when provided
+			var logFilePath string
+			if logDir != "" {
+				logFilePath = path.Join(logDir, logFilename)
+			} else {
+				logFilePath = logFilename
+			}
+			file, err := os.Create(path.Clean(logFilePath))
			if err != nil {
Evidence
Log paths are constructed with path.Join/path.Clean instead of filepath.Join/filepath.Clean,
which is the correct OS-specific filesystem path package. The new randomstate package explicitly
uses filepath for cross-platform correctness.

pkg/scenarioorchestrator/common_functions.go[68-78]
pkg/randomstate/state.go[26-30]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Filesystem log paths are built using `path.Join`/`path.Clean`, which are intended for URL-style paths and can mis-handle Windows path separators.

### Issue Context
The state management code already uses `filepath.Join` for OS-appropriate paths.

### Fix Focus Areas
- pkg/scenarioorchestrator/common_functions.go[68-78]
- pkg/randomstate/state.go[26-30]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread cmd/random.go Outdated
Comment thread cmd/random.go
Comment thread cmd/random.go Outdated
Comment thread cmd/random.go
Implement client-side validation mode for 'krknctl run' that validates
scenario configuration without requiring cluster access, kubeconfig, or
any Kubernetes API calls.

Changes:
- cmd/dryrun.go: new file with parseDryRunFlag(), validateScenarioLocally(),
  and DryRunResult with coloured Print() output
- cmd/dryrun_test.go: 13 unit tests covering flag parsing, required fields,
  type validation, global fields, nil safety, and multiple errors
- cmd/run.go: PreRunE skips registry fetch in dry-run mode; RunE fetches
  scenario metadata from image registry only (no cluster calls), runs
  validateScenarioLocally(), prints results, returns error on failure
- cmd/root.go: register --dry-run flag on runCmd with help text
- cmd/random.go: fix RunGraph call to pass logDir (9-arg signature),
  fix runDetached() signature, fix isProcessAlive() to use
  syscall.Signal(0) instead of os.Signal(nil)

Exit codes: 0 valid, non-zero on validation errors
Output format matches issue spec (checkmark/cross symbols)

Signed-off-by: Nitesh <nitesh@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Enhancements to random run Command in krknctl

1 participant