fix(scenarios): fix cerberus call signature, missing imports, prometh…#1208

Open
NETIZEN-11 wants to merge 8 commits into krkn-chaos:main from NETIZEN-11:fix/scenario-plugins-and-utils

Conversation

@NETIZEN-11
Contributor

What does this PR do?

Fixes multiple runtime bugs across cerberus, prometheus, utils, and scenario plugins.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

This PR resolves several runtime issues across different components of the project,
including cerberus, prometheus, utils, and scenario plugins.

Fixes included:

  • cerberus/setup.py

    • Fixed incorrect function call to application_status()
    • Removed extra argument cerberus_url to match expected signature
  • prometheus/client.py

    • Fixed missing return statement in critical_alerts()
    • Added return firing_alerts to ensure correct output
  • utils/HealthChecker.py

    • Replaced unreliable boolean check == True with proper identity check is True
    • Added missing return self.ret_value to prevent unintended None return
  • scenario_plugins/network_chaos_ng

    • Fixed incorrect length check using len(config)
    • Updated to use len(scenario_config) for correct scenario handling
  • scenario_plugins/zone_outage

    • Added missing imports base64 and json
    • Prevents NameError during execution in node_based_zone()

These fixes prevent runtime errors such as TypeError, NameError, and unintended None returns,
ensuring more stable execution across modules.
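
The HealthChecker fixes above can be sketched as follows. Only the method name run_health_check, the attribute ret_value, the `is True` check, and the added return come from this PR; the rest of the class body is a hypothetical stand-in, not the actual krkn implementation.

```python
# Illustrative sketch of the HealthChecker fixes described above.
# The `checks` parameter (a list of {"passed": bool} dicts) is an
# assumption made for this sketch.

class HealthChecker:
    def __init__(self):
        # instance attribute holding the overall health result
        self.ret_value = True

    def run_health_check(self, checks):
        if checks:
            # identity check `is True` instead of `== True`, per the PR
            self.ret_value = all(c.get("passed") is True for c in checks)
        # before the fix the method fell off the end and implicitly
        # returned None; the PR adds this return
        return self.ret_value
```

Without that final return, every caller of run_health_check() would receive None regardless of the computed ret_value, which is exactly the "unintended None return" the PR describes.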

Related Tickets & Documents

  • Related Issue #: N/A
  • Closes #: N/A

Documentation

  • Is documentation needed for this update?

Related Documentation PR (if applicable)

N/A

Checklist before requesting a review

  • Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers.

  • I have performed a self-review of my code by running krkn and specific scenarios

  • If it is a core feature, I have added thorough unit tests with above 80% coverage

REQUIRED:

Description of combination of tests performed and output of run

Tested individual modules after fixes:

  • Verified cerberus status check runs without TypeError
  • Verified prometheus alerts return correct values
  • Verified HealthChecker returns expected values
  • Verified scenario plugins execute without runtime errors
  • Verified zone outage scenario runs without import-related failures
Ran python run_kraken.py:

  • All modules executed successfully without runtime errors
  • No TypeError / NameError observed
  • Expected outputs returned from all components

@qodo-code-review

Review Summary by Qodo

Fix runtime bugs across cerberus, prometheus, utils, and scenario plugins

🐞 Bug fix


Walkthroughs

Description
• Fixed cerberus function call signature by removing extra cerberus_url argument
• Added missing return statement in prometheus critical_alerts() function
• Fixed HealthChecker boolean checks and added missing return value
• Corrected variable reference in network_chaos_ng from config to scenario_config
• Added missing base64 and json imports in zone_outage plugin
• Improved variable naming clarity in prometheus metrics iteration loops
Diagram
flowchart LR
  A["Runtime Bugs Identified"] --> B["Cerberus Setup"]
  A --> C["Prometheus Client"]
  A --> D["HealthChecker Utils"]
  A --> E["Network Chaos NG"]
  A --> F["Zone Outage Plugin"]
  B --> B1["Fixed function signature"]
  C --> C1["Added return statement"]
  D --> D1["Fixed bool checks & return"]
  E --> E1["Fixed variable reference"]
  F --> F1["Added missing imports"]
  B1 --> G["All modules execute without errors"]
  C1 --> G
  D1 --> G
  E1 --> G
  F1 --> G


File Changes

1. krkn/cerberus/setup.py 🐞 Bug fix +19/-19

Fix cerberus function signature and config typo

• Added Apache 2.0 license header
• Fixed typo in config key from check_applicaton_routes to check_application_routes
• Removed extra cerberus_url argument from application_status() function call
• Removed redundant check_application_routes = False variable initialization
• Simplified logging messages in publish_kraken_status() function

krkn/cerberus/setup.py


2. krkn/prometheus/client.py 🐞 Bug fix +32/-14

Add missing return statement and improve variable naming

• Added Apache 2.0 license header
• Added missing return firing_alerts statement in critical_alerts() function
• Improved variable naming in metrics iteration loops for clarity (e.g., `k, v` to `pod_type, pod_list`)
• Renamed loop variables to be more descriptive (pod_k, node_k, hc_k, vc_k)

krkn/prometheus/client.py


3. krkn/scenario_plugins/network_chaos_ng/network_chaos_ng_scenario_plugin.py 🐞 Bug fix +18/-6

Fix variable reference in scenario config check

• Added Apache 2.0 license header
• Fixed incorrect variable reference from len(config) to len(scenario_config)
• Simplified and corrected instance count comparison logic
• Removed unnecessary blank line for code consistency

krkn/scenario_plugins/network_chaos_ng/network_chaos_ng_scenario_plugin.py


4. krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py 🐞 Bug fix +18/-2

Add license header and reorganize imports

• Added Apache 2.0 license header
• Reordered imports to place base64 and json after standard library imports
• Maintained all import statements to prevent NameError during execution

krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py


5. krkn/utils/HealthChecker.py 🐞 Bug fix +24/-5

Fix boolean checks and add missing return statement

• Added Apache 2.0 license header
• Moved class variables current_iterations and ret_value to instance variables in the __init__ method
• Fixed boolean comparison from == True to is True for proper identity checking
• Fixed boolean comparison from != False to is not False for proper identity checking
• Added missing return self.ret_value statement at end of run_health_check() method
• Improved code formatting and spacing for readability

krkn/utils/HealthChecker.py




qodo-code-review bot commented Mar 27, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)



Action required

1. Cerberus key breaks tests 🐞 Bug ✓ Correctness
Description
set_url() now reads check_application_routes, but the existing unit test still uses the legacy
misspelled key check_applicaton_routes, causing the test to fail and potentially disabling route
checks silently for older user configs.
Code

krkn/cerberus/setup.py[R35-39]

        cerberus_url = get_yaml_item_value(config["cerberus"],"cerberus_url", "")
        global check_application_routes
        check_application_routes = \
-            get_yaml_item_value(config["cerberus"],"check_applicaton_routes","")
+            get_yaml_item_value(config["cerberus"],"check_application_routes","")
Evidence
The PR changes the config lookup key in set_url() to check_application_routes, while the repo's
unit test still provides check_applicaton_routes and asserts that the value propagates to
check_application_routes. This mismatch causes check_application_routes to remain at its default
and the test to fail (and older configs using the misspelled key will no longer take effect).

krkn/cerberus/setup.py[28-39]
tests/test_cerberus_setup.py[29-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`krkn/cerberus/setup.py:set_url()` was updated to read `check_application_routes`, but `tests/test_cerberus_setup.py` (and potentially existing user configs) still use the legacy misspelled key `check_applicaton_routes`. This breaks tests and can silently disable application-route checking for older configs.

### Issue Context
This PR intentionally fixes the typo, but dependent code wasn’t updated and there’s no compatibility fallback.

### Fix Focus Areas
- krkn/cerberus/setup.py[28-39]
- tests/test_cerberus_setup.py[29-46]

### Suggested fix
1. In `set_url()`, attempt to read `check_application_routes` first; if it’s unset/empty and the legacy key exists, fall back to `check_applicaton_routes` (optionally emit a deprecation warning).
2. Update `tests/test_cerberus_setup.py` to use the corrected key (and, if you add fallback support, add a second test ensuring the legacy key still works).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
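
The suggested backward-compatible lookup could look roughly like this. get_yaml_item_value is the helper name that appears in the diff above, but its body here is a simplified stand-in, and read_route_check_flag is a hypothetical wrapper invented for this sketch.

```python
# Sketch of the suggested fallback: prefer the corrected key, fall back
# to the legacy misspelling with a deprecation warning.
import logging

def get_yaml_item_value(cont, key, default):
    # simplified stand-in for krkn's helper: return the key's value,
    # or the default when the key is missing or empty
    value = cont.get(key, default)
    return value if value != "" else default

def read_route_check_flag(cerberus_config):
    value = get_yaml_item_value(cerberus_config, "check_application_routes", "")
    if value == "" and "check_applicaton_routes" in cerberus_config:
        logging.warning(
            "config key 'check_applicaton_routes' is deprecated; "
            "use 'check_application_routes'"
        )
        value = get_yaml_item_value(cerberus_config, "check_applicaton_routes", "")
    return value
```

With this shape, configs using either spelling keep working, and the deprecation warning nudges users toward the corrected key.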



Remediation recommended

2. Cerberus exit logic unreachable 🐞 Bug ⛯ Reliability
Description
get_status() calls sys.exit(1) on unhealthy Cerberus/route status, so publish_kraken_status()
cannot apply its exit_on_failure logic and its new unhealthy-status branches are effectively dead
in real execution.
Code

krkn/cerberus/setup.py[R97-106]

    if not cerberus_status:
        if exit_on_failure:
            logging.info(
-                "Cerberus status is not healthy and post action scenarios "
-                "are still failing, exiting kraken run"
-            )
-            sys.exit(1)
-        else:
-            logging.info(
-                "Cerberus status is not healthy and post action scenarios "
-                "are still failing"
-            )
-    else:
-        if exit_on_failure:
-            logging.info(
-                "Cerberus status is healthy but post action scenarios "
-                "are still failing, exiting kraken run"
+                "Cerberus status is not healthy, exiting kraken run"
            )
            sys.exit(1)
        else:
            logging.info(
-                "Cerberus status is healthy but post action scenarios "
-                "are still failing"
+                "Cerberus status is not healthy"
            )
Evidence
publish_kraken_status() now conditionally exits based on exit_on_failure, but get_status()
exits unconditionally when Cerberus or application routes are unhealthy, meaning
publish_kraken_status() never observes a returned False in real runs.

krkn/cerberus/setup.py[40-89]
krkn/cerberus/setup.py[92-106]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`get_status()` unconditionally terminates the process (`sys.exit(1)`) when Cerberus or application-route status is unhealthy. As a result, `publish_kraken_status()`’s conditional behavior (`exit_on_failure`) is ineffective, and recent edits to `publish_kraken_status()` don’t change runtime behavior.

### Issue Context
`cerberus.publish_kraken_status()` is invoked from scenario execution flow; if the intent is to make exiting conditional, the unconditional exit must be removed from `get_status()`.

### Fix Focus Areas
- krkn/cerberus/setup.py[40-89]
- krkn/cerberus/setup.py[92-106]

### Suggested fix
- Replace `sys.exit(1)` in `get_status()` with a return path (e.g., return `False` when either Cerberus is unhealthy or application routes are unavailable).
- Ensure `get_status()` returns a single boolean that represents overall health (e.g., `cerberus_status and application_routes_status`).
- Keep the decision to `sys.exit` only in `publish_kraken_status()` (or at a higher-level caller), so `exit_on_failure` can be honored.
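
The suggested control flow can be sketched as below. Both function bodies are illustrative stand-ins for the krkn originals; only the function names and the exit_on_failure behavior come from the review.

```python
# Sketch: get_status() reports health as a boolean, and only
# publish_kraken_status() decides whether to terminate the run.
import logging
import sys

def get_status(cerberus_status, application_routes_status):
    # return a single overall-health boolean instead of calling
    # sys.exit(1) here, so callers can honor exit_on_failure
    return cerberus_status and application_routes_status

def publish_kraken_status(status, exit_on_failure):
    if not status:
        if exit_on_failure:
            logging.info("Cerberus status is not healthy, exiting kraken run")
            sys.exit(1)
        logging.info("Cerberus status is not healthy")
```

This keeps the new unhealthy-status branches in publish_kraken_status() reachable, since get_status() no longer terminates the process before they can run.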



3. Pod metrics leak fields 🐞 Bug ✓ Correctness
Description
In metrics(), the same metric dict is reused for all pods of a pod_type, so keys from a
previous pod can persist into the next pod’s exported metric when the next pod dict lacks those
keys, producing incorrect telemetry-derived metrics.
Code

krkn/prometheus/client.py[R264-273]

+            for pod_type, pod_list in scenario["affected_pods"].items():
                metric_name = "affected_pods_recovery"
-                metric = {"metricName": metric_name, "type": k}
-                if type(v) is list:
-                    for pod in v:
-                        for k,v in pod.items():
-                            metric[k] = v
+                metric = {"metricName": metric_name, "type": pod_type}
+                if type(pod_list) is list:
+                    for pod in pod_list:
+                        for pod_k, pod_v in pod.items():
+                            metric[pod_k] = pod_v
                            metric['timestamp'] = str(datetime.datetime.now())
                        logging.debug("adding pod %s", metric)
                        metrics_list.append(metric.copy())
Evidence
The code initializes metric once per pod_type and then mutates it for each pod before appending
a copy; .copy() prevents retroactive changes to already-appended entries, but it does not prevent
stale keys from earlier pods being present in later pods’ copied dicts.

krkn/prometheus/client.py[262-274]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`krkn/prometheus/client.py:metrics()` reuses a single `metric` dict across multiple pods in the same `pod_type`. If pods have different key sets, later metrics can contain stale fields from earlier pods.

### Issue Context
This affects the `affected_pods_recovery` metric generation from `telemetry_json`.

### Fix Focus Areas
- krkn/prometheus/client.py[262-274]

### Suggested fix
- Move `metric = {"metricName": metric_name, "type": pod_type}` inside the `for pod in pod_list:` loop.
- Populate pod fields into that per-pod dict (e.g., `metric.update(pod)`), then set `timestamp` once per pod, then append (a copy is no longer necessary if you create a fresh dict per pod).
- (Optional) Apply the same pattern to affected_nodes/health_checks/virt_checks blocks for consistency.
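
A sketch of the per-pod dict fix, assuming the telemetry shape shown in the diff above (build_pod_metrics is a hypothetical helper name; the real logic lives inline in metrics()):

```python
# Build a fresh metric dict for each pod, so stale keys from a previous
# pod cannot leak into the next pod's exported metric.
import datetime

def build_pod_metrics(scenario):
    metrics_list = []
    for pod_type, pod_list in scenario["affected_pods"].items():
        if isinstance(pod_list, list):
            for pod in pod_list:
                # fresh dict per pod, as the review suggests
                metric = {"metricName": "affected_pods_recovery",
                          "type": pod_type}
                metric.update(pod)
                metric["timestamp"] = str(datetime.datetime.now())
                metrics_list.append(metric)
    return metrics_list
```

Because each iteration starts from a new dict, a pod that lacks a key present on an earlier pod simply omits it, rather than inheriting the earlier pod's stale value.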






Nitesh added 2 commits March 27, 2026 23:24
…eus return, HealthChecker bool flags, network_chaos_ng len check

Signed-off-by: Nitesh <nitesh@example.com>
…d fix test

Signed-off-by: Nitesh <nitesh@example.com>
@NETIZEN-11 force-pushed the fix/scenario-plugins-and-utils branch from 6c7b4ce to 5571b43 on March 27, 2026 17:55
NETIZEN-11 added a commit to NETIZEN-11/krkn that referenced this pull request Mar 30, 2026
Per paigerube14 review on PR krkn-chaos#1208, move these changes to separate PRs
for easier merge/test:

- client.py: revert pod_type/pod_list variable rename back to k,v
- network_chaos_ng_scenario_plugin.py: revert instance_count condition order
- HealthChecker.py: revert current_iterations/ret_value back to class-level attrs

Signed-off-by: NETIZEN-11 <niteshkumar121411@gmail.com>
@NETIZEN-11 force-pushed the fix/scenario-plugins-and-utils branch from c343bcb to 046d037 on March 30, 2026 14:04