fix(scenarios): fix cerberus call signature, missing imports, prometh…#1208

Open
NETIZEN-11 wants to merge 8 commits into krkn-chaos:main from NETIZEN-11:fix/scenario-plugins-and-utils

Conversation

@NETIZEN-11
Contributor

What does this PR do?

Fixes multiple runtime bugs across cerberus, prometheus, utils, and scenario plugins.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

This PR resolves several runtime issues across different components of the project,
including cerberus, prometheus, utils, and scenario plugins.

Fixes included:

  • cerberus/setup.py

    • Fixed incorrect function call to application_status()
    • Removed extra argument cerberus_url to match expected signature
  • prometheus/client.py

    • Fixed missing return statement in critical_alerts()
    • Added return firing_alerts to ensure correct output
  • utils/HealthChecker.py

    • Replaced unreliable boolean check == True with proper identity check is True
    • Added missing return self.ret_value to prevent unintended None return
  • scenario_plugins/network_chaos_ng

    • Fixed incorrect length check using len(config)
    • Updated to use len(scenario_config) for correct scenario handling
  • scenario_plugins/zone_outage

    • Added missing imports base64 and json
    • Prevents NameError during execution in node_based_zone()

These fixes prevent runtime errors such as TypeError, NameError, and unintended None returns,
ensuring more stable execution across modules.
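
The HealthChecker fixes above can be sketched as follows. Only the method name run_health_check, the attribute ret_value, the `is True` check, and the added return come from this PR; the rest of the class body is a hypothetical stand-in, not the actual krkn implementation.

```python
# Illustrative sketch of the HealthChecker fixes described above.
# The `checks` parameter (a list of {"passed": bool} dicts) is an
# assumption made for this sketch.

class HealthChecker:
    def __init__(self):
        # instance attribute holding the overall health result
        self.ret_value = True

    def run_health_check(self, checks):
        if checks:
            # identity check `is True` instead of `== True`, per the PR
            self.ret_value = all(c.get("passed") is True for c in checks)
        # before the fix the method fell off the end and implicitly
        # returned None; the PR adds this return
        return self.ret_value
```

Without that final return, every caller of run_health_check() would receive None regardless of the computed ret_value, which is exactly the "unintended None return" the PR describes.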

Related Tickets & Documents

  • Related Issue #: N/A
  • Closes #: N/A

Documentation

  • Is documentation needed for this update?

Related Documentation PR (if applicable)

N/A

Checklist before requesting a review

  • Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers.

  • I have performed a self-review of my code by running krkn and specific scenarios

  • If it is a core feature, I have added thorough unit tests with above 80% coverage

REQUIRED:

Description of combination of tests performed and output of run

Tested individual modules after fixes:

  • Verified cerberus status check runs without TypeError
  • Verified prometheus alerts return correct values
  • Verified HealthChecker returns expected values
  • Verified scenario plugins execute without runtime errors
  • Verified zone outage scenario runs without import-related failures
Ran python run_kraken.py:

  • All modules executed successfully without runtime errors
  • No TypeError / NameError observed
  • Expected outputs returned from all components

@qodo-code-review

Review Summary by Qodo

Fix runtime bugs across cerberus, prometheus, utils, and scenario plugins

🐞 Bug fix


Walkthroughs

Description
• Fixed cerberus function call signature by removing extra cerberus_url argument
• Added missing return statement in prometheus critical_alerts() function
• Fixed HealthChecker boolean checks and added missing return value
• Corrected variable reference in network_chaos_ng from config to scenario_config
• Added missing base64 and json imports in zone_outage plugin
• Improved variable naming clarity in prometheus metrics iteration loops
Diagram
flowchart LR
  A["Runtime Bugs Identified"] --> B["Cerberus Setup"]
  A --> C["Prometheus Client"]
  A --> D["HealthChecker Utils"]
  A --> E["Network Chaos NG"]
  A --> F["Zone Outage Plugin"]
  B --> B1["Fixed function signature"]
  C --> C1["Added return statement"]
  D --> D1["Fixed bool checks & return"]
  E --> E1["Fixed variable reference"]
  F --> F1["Added missing imports"]
  B1 --> G["All modules execute without errors"]
  C1 --> G
  D1 --> G
  E1 --> G
  F1 --> G


File Changes

1. krkn/cerberus/setup.py 🐞 Bug fix +19/-19

Fix cerberus function signature and config typo

• Added Apache 2.0 license header
• Fixed typo in config key from check_applicaton_routes to check_application_routes
• Removed extra cerberus_url argument from application_status() function call
• Removed redundant check_application_routes = False variable initialization
• Simplified logging messages in publish_kraken_status() function

krkn/cerberus/setup.py


2. krkn/prometheus/client.py 🐞 Bug fix +32/-14

Add missing return statement and improve variable naming

• Added Apache 2.0 license header
• Added missing return firing_alerts statement in critical_alerts() function
• Improved variable naming in metrics iteration loops for clarity (e.g., `k, v` to `pod_type, pod_list`)
• Renamed loop variables to be more descriptive (pod_k, node_k, hc_k, vc_k)

krkn/prometheus/client.py


3. krkn/scenario_plugins/network_chaos_ng/network_chaos_ng_scenario_plugin.py 🐞 Bug fix +18/-6

Fix variable reference in scenario config check

• Added Apache 2.0 license header
• Fixed incorrect variable reference from len(config) to len(scenario_config)
• Simplified and corrected instance count comparison logic
• Removed unnecessary blank line for code consistency

krkn/scenario_plugins/network_chaos_ng/network_chaos_ng_scenario_plugin.py


4. krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py 🐞 Bug fix +18/-2

Add license header and reorganize imports

• Added Apache 2.0 license header
• Reordered imports to place base64 and json after standard library imports
• Maintained all import statements to prevent NameError during execution

krkn/scenario_plugins/zone_outage/zone_outage_scenario_plugin.py


5. krkn/utils/HealthChecker.py 🐞 Bug fix +24/-5

Fix boolean checks and add missing return statement

• Added Apache 2.0 license header
• Moved class variables current_iterations and ret_value to instance variables in the __init__ method
• Fixed boolean comparison from == True to is True for proper identity checking
• Fixed boolean comparison from != False to is not False for proper identity checking
• Added missing return self.ret_value statement at end of run_health_check() method
• Improved code formatting and spacing for readability

krkn/utils/HealthChecker.py




qodo-code-review bot commented Mar 27, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)



Action required

1. Cerberus key breaks tests 🐞 Bug ✓ Correctness
Description
set_url() now reads check_application_routes, but the existing unit test still uses the legacy
misspelled key check_applicaton_routes, causing the test to fail and potentially disabling route
checks silently for older user configs.
Code

krkn/cerberus/setup.py[R35-39]

        cerberus_url = get_yaml_item_value(config["cerberus"],"cerberus_url", "")
        global check_application_routes
        check_application_routes = \
-            get_yaml_item_value(config["cerberus"],"check_applicaton_routes","")
+            get_yaml_item_value(config["cerberus"],"check_application_routes","")
Evidence
The PR changes the config lookup key in set_url() to check_application_routes, while the repo's
unit test still provides check_applicaton_routes and asserts that the value propagates to
check_application_routes. This mismatch causes check_application_routes to remain at its default
and the test to fail (and older configs using the misspelled key will no longer take effect).

krkn/cerberus/setup.py[28-39]
tests/test_cerberus_setup.py[29-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`krkn/cerberus/setup.py:set_url()` was updated to read `check_application_routes`, but `tests/test_cerberus_setup.py` (and potentially existing user configs) still use the legacy misspelled key `check_applicaton_routes`. This breaks tests and can silently disable application-route checking for older configs.

### Issue Context
This PR intentionally fixes the typo, but dependent code wasn’t updated and there’s no compatibility fallback.

### Fix Focus Areas
- krkn/cerberus/setup.py[28-39]
- tests/test_cerberus_setup.py[29-46]

### Suggested fix
1. In `set_url()`, attempt to read `check_application_routes` first; if it’s unset/empty and the legacy key exists, fall back to `check_applicaton_routes` (optionally emit a deprecation warning).
2. Update `tests/test_cerberus_setup.py` to use the corrected key (and, if you add fallback support, add a second test ensuring the legacy key still works).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
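
The suggested backward-compatible lookup could look roughly like this. get_yaml_item_value is the helper name that appears in the diff above, but its body here is a simplified stand-in, and read_route_check_flag is a hypothetical wrapper invented for this sketch.

```python
# Sketch of the suggested fallback: prefer the corrected key, fall back
# to the legacy misspelling with a deprecation warning.
import logging

def get_yaml_item_value(cont, key, default):
    # simplified stand-in for krkn's helper: return the key's value,
    # or the default when the key is missing or empty
    value = cont.get(key, default)
    return value if value != "" else default

def read_route_check_flag(cerberus_config):
    value = get_yaml_item_value(cerberus_config, "check_application_routes", "")
    if value == "" and "check_applicaton_routes" in cerberus_config:
        logging.warning(
            "config key 'check_applicaton_routes' is deprecated; "
            "use 'check_application_routes'"
        )
        value = get_yaml_item_value(cerberus_config, "check_applicaton_routes", "")
    return value
```

With this shape, configs using either spelling keep working, and the deprecation warning nudges users toward the corrected key.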



Remediation recommended

2. Cerberus exit logic unreachable 🐞 Bug ⛯ Reliability
Description
get_status() calls sys.exit(1) on unhealthy Cerberus/route status, so publish_kraken_status()
cannot apply its exit_on_failure logic and its new unhealthy-status branches are effectively dead
in real execution.
Code

krkn/cerberus/setup.py[R97-106]

    if not cerberus_status:
        if exit_on_failure:
            logging.info(
-                "Cerberus status is not healthy and post action scenarios "
-                "are still failing, exiting kraken run"
-            )
-            sys.exit(1)
-        else:
-            logging.info(
-                "Cerberus status is not healthy and post action scenarios "
-                "are still failing"
-            )
-    else:
-        if exit_on_failure:
-            logging.info(
-                "Cerberus status is healthy but post action scenarios "
-                "are still failing, exiting kraken run"
+                "Cerberus status is not healthy, exiting kraken run"
            )
            sys.exit(1)
        else:
            logging.info(
-                "Cerberus status is healthy but post action scenarios "
-                "are still failing"
+                "Cerberus status is not healthy"
            )
Evidence
publish_kraken_status() now conditionally exits based on exit_on_failure, but get_status()
exits unconditionally when Cerberus or application routes are unhealthy, meaning
publish_kraken_status() never observes a returned False in real runs.

krkn/cerberus/setup.py[40-89]
krkn/cerberus/setup.py[92-106]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`get_status()` unconditionally terminates the process (`sys.exit(1)`) when Cerberus or application-route status is unhealthy. As a result, `publish_kraken_status()`’s conditional behavior (`exit_on_failure`) is ineffective, and recent edits to `publish_kraken_status()` don’t change runtime behavior.

### Issue Context
`cerberus.publish_kraken_status()` is invoked from scenario execution flow; if the intent is to make exiting conditional, the unconditional exit must be removed from `get_status()`.

### Fix Focus Areas
- krkn/cerberus/setup.py[40-89]
- krkn/cerberus/setup.py[92-106]

### Suggested fix
- Replace `sys.exit(1)` in `get_status()` with a return path (e.g., return `False` when either Cerberus is unhealthy or application routes are unavailable).
- Ensure `get_status()` returns a single boolean that represents overall health (e.g., `cerberus_status and application_routes_status`).
- Keep the decision to `sys.exit` only in `publish_kraken_status()` (or at a higher-level caller), so `exit_on_failure` can be honored.
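
The suggested control flow can be sketched as below. Both function bodies are illustrative stand-ins for the krkn originals; only the function names and the exit_on_failure behavior come from the review.

```python
# Sketch: get_status() reports health as a boolean, and only
# publish_kraken_status() decides whether to terminate the run.
import logging
import sys

def get_status(cerberus_status, application_routes_status):
    # return a single overall-health boolean instead of calling
    # sys.exit(1) here, so callers can honor exit_on_failure
    return cerberus_status and application_routes_status

def publish_kraken_status(status, exit_on_failure):
    if not status:
        if exit_on_failure:
            logging.info("Cerberus status is not healthy, exiting kraken run")
            sys.exit(1)
        logging.info("Cerberus status is not healthy")
```

This keeps the new unhealthy-status branches in publish_kraken_status() reachable, since get_status() no longer terminates the process before they can run.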



3. Pod metrics leak fields 🐞 Bug ✓ Correctness
Description
In metrics(), the same metric dict is reused for all pods of a pod_type, so keys from a
previous pod can persist into the next pod’s exported metric when the next pod dict lacks those
keys, producing incorrect telemetry-derived metrics.
Code

krkn/prometheus/client.py[R264-273]

+            for pod_type, pod_list in scenario["affected_pods"].items():
                metric_name = "affected_pods_recovery"
-                metric = {"metricName": metric_name, "type": k}
-                if type(v) is list:
-                    for pod in v:
-                        for k,v in pod.items():
-                            metric[k] = v
+                metric = {"metricName": metric_name, "type": pod_type}
+                if type(pod_list) is list:
+                    for pod in pod_list:
+                        for pod_k, pod_v in pod.items():
+                            metric[pod_k] = pod_v
                            metric['timestamp'] = str(datetime.datetime.now())
                        logging.debug("adding pod %s", metric)
                        metrics_list.append(metric.copy())
Evidence
The code initializes metric once per pod_type and then mutates it for each pod before appending
a copy; .copy() prevents retroactive changes to already-appended entries, but it does not prevent
stale keys from earlier pods being present in later pods’ copied dicts.

krkn/prometheus/client.py[262-274]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`krkn/prometheus/client.py:metrics()` reuses a single `metric` dict across multiple pods in the same `pod_type`. If pods have different key sets, later metrics can contain stale fields from earlier pods.

### Issue Context
This affects the `affected_pods_recovery` metric generation from `telemetry_json`.

### Fix Focus Areas
- krkn/prometheus/client.py[262-274]

### Suggested fix
- Move `metric = {"metricName": metric_name, "type": pod_type}` inside the `for pod in pod_list:` loop.
- Populate pod fields into that per-pod dict (e.g., `metric.update(pod)`), then set `timestamp` once per pod, then append (a copy is no longer necessary if you create a fresh dict per pod).
- (Optional) Apply the same pattern to affected_nodes/health_checks/virt_checks blocks for consistency.
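
A sketch of the per-pod dict fix, assuming the telemetry shape shown in the diff above (build_pod_metrics is a hypothetical helper name; the real logic lives inline in metrics()):

```python
# Build a fresh metric dict for each pod, so stale keys from a previous
# pod cannot leak into the next pod's exported metric.
import datetime

def build_pod_metrics(scenario):
    metrics_list = []
    for pod_type, pod_list in scenario["affected_pods"].items():
        if isinstance(pod_list, list):
            for pod in pod_list:
                # fresh dict per pod, as the review suggests
                metric = {"metricName": "affected_pods_recovery",
                          "type": pod_type}
                metric.update(pod)
                metric["timestamp"] = str(datetime.datetime.now())
                metrics_list.append(metric)
    return metrics_list
```

Because each iteration starts from a new dict, a pod that lacks a key present on an earlier pod simply omits it, rather than inheriting the earlier pod's stale value.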






Nitesh added 2 commits March 27, 2026 23:24
…eus return, HealthChecker bool flags, network_chaos_ng len check

Signed-off-by: Nitesh <nitesh@example.com>
…d fix test

Signed-off-by: Nitesh <nitesh@example.com>
@NETIZEN-11 force-pushed the fix/scenario-plugins-and-utils branch from 6c7b4ce to 5571b43 on March 27, 2026 17:55
NETIZEN-11 added a commit to NETIZEN-11/krkn that referenced this pull request Mar 30, 2026
Per paigerube14 review on PR krkn-chaos#1208, move these changes to separate PRs
for easier merge/test:

- client.py: revert pod_type/pod_list variable rename back to k,v
- network_chaos_ng_scenario_plugin.py: revert instance_count condition order
- HealthChecker.py: revert current_iterations/ret_value back to class-level attrs

Signed-off-by: NETIZEN-11 <niteshkumar121411@gmail.com>
@NETIZEN-11 force-pushed the fix/scenario-plugins-and-utils branch from c343bcb to 046d037 on March 30, 2026 14:04