Skip to content

fix: implement rollback support for Node Outage Scenario (#916)#1210

Open
NETIZEN-11 wants to merge 1 commit intokrkn-chaos:mainfrom
NETIZEN-11:fix/node-outage-rollback-916-v2
Open

fix: implement rollback support for Node Outage Scenario (#916)#1210
NETIZEN-11 wants to merge 1 commit intokrkn-chaos:mainfrom
NETIZEN-11:fix/node-outage-rollback-916-v2

Conversation

@NETIZEN-11
Copy link
Copy Markdown
Contributor

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

Implements rollback support for node outage scenarios to ensure the cluster is restored to its original state when a failure occurs during execution.

This integrates the node actions scenario with the existing rollback framework and ensures safe recovery from partial failures.

Changes

  • Add @set_rollback_context_decorator to run() to integrate with the rollback framework

  • Add REVERSIBLE_ACTIONS map:

    • node_stop_scenarionode_start_scenario
    • stop_kubelet_scenariorestart_kubelet_scenario
  • Add _register_rollback() to encode node/action metadata and register rollback callable before each reversible action executes

  • Add rollback_node_action() static method with local imports (required by serializer)

  • Remove unused imports: log_exception, krkn.utils

  • Fix dead else branch in cloud-type dispatch inside rollback callable

  • Switch to percent-style logging to match project conventions

Irreversible actions (no rollback registered)

  • node_termination_scenario
  • node_crash_scenario
  • node_reboot_scenario
  • node_disk_detach_attach_scenario
  • node_stop_start_scenario

Related Tickets & Documents

Documentation

  • Is documentation needed for this update?

Related Documentation PR (if applicable)

N/A

Checklist before requesting a review

  • Changes discussed in the issue and aligned with rollback feature expectations
  • Performed self-review of the code
  • Tested changes by running krkn with node scenarios
  • Added unit tests (not applicable for this change)

Test Execution

python run_kraken.py

Sample Output

INFO Starting Kraken run
INFO Executing node_stop_scenario on node worker-1
ERROR Scenario failed due to exception
INFO Triggering rollback for node actions
INFO Executing rollback: node_start_scenario on node worker-1
INFO Rollback completed successfully
INFO Kraken run completed

)

- Add @set_rollback_context_decorator to run() to integrate with
  the rollback framework
- Add REVERSIBLE_ACTIONS map: node_stop_scenario->node_start_scenario,
  stop_kubelet_scenario->restart_kubelet_scenario
- Add _register_rollback() to encode node/action metadata and register
  rollback callable before each reversible action executes
- Add rollback_node_action() static method with all imports local
  (required by version_template.j2 serializer)
- Remove unused imports: log_exception, krkn.utils
- Fix dead else-branch in cloud-type dispatch inside rollback callable
- Switch to percent-style logging to match project conventions

Irreversible actions (no rollback registered):
node_termination_scenario, node_crash_scenario, node_reboot_scenario,
node_disk_detach_attach_scenario, node_stop_start_scenario

Closes krkn-chaos#916

Signed-off-by: Nitesh <nitesh@example.com>
@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Implement rollback support for node outage scenarios

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Implement rollback support for node outage scenarios using decorator pattern
• Add REVERSIBLE_ACTIONS map for node_stop and kubelet_stop with inverse actions
• Register rollback callables before executing reversible actions for safe recovery
• Encode node metadata as base64 JSON for self-contained rollback serialization
• Switch to percent-style logging and remove unused imports for consistency
Diagram
flowchart LR
  A["Node Action Execution"] -->|@set_rollback_context_decorator| B["Rollback Framework"]
  A -->|reversible action| C["_register_rollback"]
  C -->|encode metadata| D["base64 JSON payload"]
  D -->|set_rollback_callable| E["rollback_node_action"]
  E -->|decode payload| F["Reconstruct cloud provider"]
  F -->|execute inverse action| G["Cluster Restored"]
Loading

Grey Divider

File Changes

1. krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py ✨ Enhancement +215/-20

Add rollback support for reversible node actions

• Add Apache 2.0 license header and copyright notice
• Import base64, json, RollbackContent, and set_rollback_context_decorator for rollback integration
• Remove unused imports (log_exception, krkn.utils)
• Define REVERSIBLE_ACTIONS map mapping node_stop_scenario to node_start_scenario and
 stop_kubelet_scenario to restart_kubelet_scenario
• Decorate run() method with @set_rollback_context_decorator to integrate with rollback framework
• Add _register_rollback() method to encode node/action metadata and register rollback callable
 before reversible actions execute
• Add rollback_node_action() static method with local imports for serialization compatibility,
 handling cloud-type dispatch and executing inverse actions
• Add rollback registration calls before node_stop_scenario and stop_kubelet_scenario execution in
 run_node()
• Switch logging to percent-style format for project consistency
• Clean up formatting and comments throughout the file

krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Mar 27, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (1) 📐 Spec deviations (0)

Grey Divider


Action required

1. RollbackHandler uses placeholder scenario_type 📎 Requirement gap ⛯ Reliability
Description
Rollback callables are serialized using the RollbackHandler's scenario_type, but
NodeActionsScenarioPlugin does not override __init__, so it inherits the default
placeholder_scenario_type. As a result, rollback version files may not be discovered/executed for
node_scenarios failures, breaking expected restore behavior.
Code

krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[R145-148]

+        self.rollback_handler.set_rollback_callable(
+            NodeActionsScenarioPlugin.rollback_node_action,
+            RollbackContent(resource_identifier=resource_identifier),
+        )
Evidence
The PR registers rollback callables via self.rollback_handler.set_rollback_callable(...), which
names rollback files using RollbackHandler.scenario_type. Since NodeActionsScenarioPlugin does
not set scenario_type in __init__, it uses AbstractScenarioPlugin's default
placeholder_scenario_type, while rollback execution filters by scenario_telemetry.scenario_type
(node_scenarios), so the saved rollback files may not match the searched scenario type and won't
run on failure.

Scenario plugins implement rollback/restore logic for node outage actions
Scenario Plugin API restore hook is invoked when run() raises an exception
Rollback handler integration is used to perform restore operations
krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[145-148]
krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[66-75]
krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[518-519]
krkn/scenario_plugins/abstract_scenario_plugin.py[19-28]
krkn/rollback/handler.py[243-246]
krkn/scenario_plugins/abstract_scenario_plugin.py[86-135]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Rollback version files are serialized under `RollbackHandler.scenario_type`, but `NodeActionsScenarioPlugin` inherits `AbstractScenarioPlugin.__init__`'s default `scenario_type="placeholder_scenario_type"`. Rollback execution searches by `scenario_telemetry.scenario_type` (`node_scenarios`), so rollback files may not be executed.

## Issue Context
`NodeActionsScenarioPlugin` now registers rollback callables; it must ensure its `RollbackHandler` is initialized with the same scenario type returned by `get_scenario_types()`.

## Fix Focus Areas
- krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[66-76]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Rollback failures swallowed 🐞 Bug ⛯ Reliability
Description
NodeActionsScenarioPlugin.rollback_node_action() catches all exceptions and only logs them, so
rollback execution failures will not propagate to the rollback framework. As a result,
execute_rollback_version_files() will treat the rollback as successful and rename the rollback file
to .executed even when nothing was rolled back.
Code

krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[R248-249]

+        except Exception as e:
+            logging.error("Rollback failed for node action: %s", e)
Evidence
rollback_node_action() swallows exceptions (logs only), while execute_rollback_version_files()
determines success/failure based on whether the rollback callable raises; no raise means it proceeds
as success and renames the file.

krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[188-250]
krkn/rollback/handler.py[145-170]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`rollback_node_action()` catches `Exception` and does not re-raise. This prevents the rollback framework from detecting failures and will incorrectly mark the rollback version file as executed.

### Issue Context
`execute_rollback_version_files()` wraps rollback execution in a try/except and uses exceptions to decide whether to rename the version file to `.executed`.

### Fix Focus Areas
- krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[188-250]
- krkn/rollback/handler.py[145-170]

### Expected fix
- Either remove the broad try/except in `rollback_node_action()`, or log and `raise` the original exception so failures propagate.
- If you want to keep contextual logging, do:
 - `except Exception as e: logging.exception(...); raise`

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Rollback cloud dispatch incomplete 🐞 Bug ✓ Correctness
Description
rollback_node_action() falls back to general_node_scenarios for cloud types not in its hardcoded
allow-list; general_node_scenarios.node_start_scenario() is a no-op, so rollback of
node_stop_scenario will not restart nodes for cloud types that the forward path supports (e.g.,
openstack, alibaba, bm). This also means rollback entries are still registered and executed but do
nothing, giving a false sense of recovery.
Code

krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[R207-217]

+            # Reconstruct the cloud-provider scenario object.
+            # Unknown / generic cloud types all fall through to general_node_scenarios.
+            cloud_type_lower = cloud_type.lower()
+            known_cloud_types = {
+                "aws", "gcp", "azure", "az", "docker",
+                "vsphere", "vmware", "ibm", "ibmcloud",
+                "ibmpower", "ibmcloudpower",
+            }
+            if cloud_type_lower not in known_cloud_types:
+                scenario_obj = general_node_scenarios(kubecli, True, affected_nodes_status)
+            elif cloud_type_lower == "aws":
Evidence
Forward execution supports openstack/alibaba/bm in get_node_scenario_object(), and run_node
registers rollback for node_stop_scenario regardless of cloud type. During rollback, cloud types not
in known_cloud_types fall back to general_node_scenarios, where node_start_scenario explicitly
performs no action—so stopped nodes remain stopped.

krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[252-307]
krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[439-448]
krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[207-235]
krkn/scenario_plugins/node_actions/general_cloud_node_scenarios.py[14-32]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Rollback reconstructs provider scenarios using a limited `known_cloud_types` set. For supported forward providers like `openstack`, `alibaba/alicloud`, and `bm`, rollback falls back to `general_node_scenarios`, where `node_start_scenario()` is a no-op. This breaks rollback for `node_stop_scenario` on those providers.

### Issue Context
- Forward path: `get_node_scenario_object()` supports openstack/alibaba/bm.
- Rollback path: `rollback_node_action()` does not handle these cloud types.
- `general_node_scenarios.node_start_scenario()` logs and takes no action.

### Fix Focus Areas
- krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[100-149]
- krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[175-237]
- krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[252-307]
- krkn/scenario_plugins/node_actions/general_cloud_node_scenarios.py[14-32]

### Expected fix (choose one approach)
1) **Implement full rollback support for these providers**
  - Extend rollback dispatch to include `openstack`, `alibaba/alicloud`, and (if feasible) `bm`.
  - Add local imports and instantiate their scenario objects.
  - If `bm` requires extra config (bmc info/user/password), encode the required data in the rollback payload (or otherwise make it available).

2) **Make rollback registration conditional**
  - In `_register_rollback()`, only register rollback for `node_stop_scenario` when the cloud type is one that rollback can actually restart.
  - Log a warning when rollback is intentionally not registered for a cloud type/action pair to avoid a misleading "rollback registered" signal.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

4. BOM before shebang 🐞 Bug ⚙ Maintainability
Description
The file begins with a UTF-8 BOM character before the shebang. This prevents the OS from recognizing
the shebang if the file is ever executed directly and can confuse tooling/linters.
Code

krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[1]

+#!/usr/bin/env python
Evidence
The first character of the file is a BOM (U+FEFF) preceding #!, so the file does not literally
start with #!.

krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[1-1]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
A UTF-8 BOM (U+FEFF) is present before the shebang line.

### Issue Context
This can break direct execution via shebang and may confuse some tools.

### Fix Focus Areas
- krkn/scenario_plugins/node_actions/node_actions_scenario_plugin.py[1-1]

### Expected fix
- Remove the BOM so the file starts exactly with `#!/usr/bin/env python`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant