Health check plugin #1176

Open

paigerube14 wants to merge 7 commits into krkn-chaos:main from paigerube14:health_check_plugin

Conversation

@paigerube14
Collaborator

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

Revamp health checks into a plugin architecture so that additional health check plugins can be added in the future.

Related Tickets & Documents

If there is no related issue, please create one and start a conversation about the desired change.

  • Related Issue #:
  • Closes #:

Documentation

  • Is documentation needed for this update?

If checked, a documentation PR must be created and merged in the website repository.

Related Documentation PR (if applicable)

<!-- Add the link to the corresponding documentation PR in the website repository -->

Checklist before requesting a review

  • Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See the contributing guidelines.
  • See testing your changes and run on any Kubernetes or OpenShift cluster to validate your changes.

  • I have performed a self-review of my code by running krkn and specific scenario
  • If it is a core feature, I have added thorough unit tests with above 80% coverage

REQUIRED:
Describe the combination of tests performed and include the output of the run.

python run_kraken.py
...
<---insert test results output--->

OR

python -m coverage run -a -m unittest discover -s tests -v
...
<---insert test results output--->

@qodo-code-review

Review Summary by Qodo

Implement health check plugin architecture with HTTP and KubeVirt plugins

✨ Enhancement


Walkthroughs

Description
• Implement health check plugin architecture replacing legacy classes
• Add HTTP and KubeVirt VM health check plugins with comprehensive features
• Create factory pattern for dynamic plugin discovery and instantiation
• Integrate new plugin system into main run loop with proper error handling
Diagram
flowchart LR
  A["Legacy HealthChecker<br/>VirtChecker Classes"] -->|"Replaced by"| B["AbstractHealthCheckPlugin<br/>Base Class"]
  B -->|"Discovered by"| C["HealthCheckFactory"]
  C -->|"Creates"| D["HttpHealthCheckPlugin"]
  C -->|"Creates"| E["VirtHealthCheckPlugin"]
  C -->|"Creates"| F["SimpleHealthCheckPlugin"]
  D -->|"Integrated in"| G["run_kraken.py"]
  E -->|"Integrated in"| G
  G -->|"Collects telemetry"| H["Telemetry Queues"]


File Changes

1. krkn/health_checks/__init__.py ✨ Enhancement +18/-0

Create health checks module with plugin exports

krkn/health_checks/__init__.py


2. krkn/health_checks/abstract_health_check_plugin.py ✨ Enhancement +83/-0

Define abstract base class for health check plugins

krkn/health_checks/abstract_health_check_plugin.py


3. krkn/health_checks/health_check_factory.py ✨ Enhancement +171/-0

Implement factory for dynamic plugin discovery and loading

krkn/health_checks/health_check_factory.py


4. krkn/health_checks/simple_health_check_plugin.py ✨ Enhancement +42/-0

Create minimal test health check plugin

krkn/health_checks/simple_health_check_plugin.py


5. krkn/health_checks/http_health_check_plugin.py ✨ Enhancement +229/-0

Implement HTTP endpoint health check plugin

krkn/health_checks/http_health_check_plugin.py


6. krkn/health_checks/virt_health_check_plugin.py ✨ Enhancement +505/-0

Implement KubeVirt VM health check plugin with SSH access

krkn/health_checks/virt_health_check_plugin.py


7. krkn/health_checks/README.md 📝 Documentation +349/-0

Document health check plugin system and architecture

krkn/health_checks/README.md


8. run_kraken.py ✨ Enhancement +77/-27

Integrate health check factory and plugin system

run_kraken.py


9. tests/test_health_check_factory.py 🧪 Tests +221/-0

Add comprehensive factory integration tests

tests/test_health_check_factory.py


10. tests/test_http_health_check_plugin.py 🧪 Tests +345/-0

Add HTTP health check plugin unit tests

tests/test_http_health_check_plugin.py


11. tests/test_virt_health_check_plugin.py 🧪 Tests +297/-0

Add virt health check plugin unit tests

tests/test_virt_health_check_plugin.py


12. krkn/utils/HealthChecker.py Additional files +0/-89

Remove legacy HealthChecker class (replaced by the plugin system)

krkn/utils/HealthChecker.py


13. krkn/utils/VirtChecker.py Additional files +0/-280

Remove legacy VirtChecker class (replaced by the plugin system)

krkn/utils/VirtChecker.py


14. tests/test_health_checker.py Additional files +0/-503

Remove legacy HealthChecker unit tests

tests/test_health_checker.py


15. tests/test_virt_checker.py Additional files +0/-598

Remove legacy VirtChecker unit tests

tests/test_virt_checker.py



@qodo-code-review

qodo-code-review bot commented Mar 3, 2026

Code Review by Qodo

🐞 Bugs (5) 📘 Rule violations (1) 📎 Requirement gaps (0)



Action required

1. Health failures return code 2 📘 Rule violation ⛯ Reliability
Description
Health check failures set/propagate exit code 2, but documented semantics require health check
failures to use exit code 3+. This can misclassify health failures as critical-alert exit code 2
for automation/CI consumers.
Code

run_kraken.py[R616-622]

+        if health_checker and health_checker.get_return_value() != 0:
           logging.error("Health check failed for the applications, Please check; exiting")
-            return health_checker.ret_value
+            return health_checker.get_return_value()

-        if kubevirt_checker.ret_value != 0:
+        if kubevirt_checker and kubevirt_checker.get_return_value() != 0:
           logging.error("Kubevirt check still had failed VMIs at end of run, Please check; exiting")
-            return kubevirt_checker.ret_value
+            return kubevirt_checker.get_return_value()
Evidence
Compliance requires health check failures to return 3+ (not 2). The new health check plugins set
ret_value to 2 on health check failure, and run_kraken.py returns the plugin’s return value
directly, which will propagate 2 for health check failures.

CLAUDE.md
run_kraken.py[616-622]
krkn/health_checks/http_health_check_plugin.py[171-176]
krkn/health_checks/virt_health_check_plugin.py[426-428]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Health check failures currently set/propagate exit code `2`, but compliance requires health check failures to use exit codes `3+`.
## Issue Context
- `HttpHealthCheckPlugin` and `VirtHealthCheckPlugin` set `self.ret_value = 2` on health failures.
- `run_kraken.py` returns the plugin return value directly, so the process may exit with `2` for health check failures.
## Fix Focus Areas
- krkn/health_checks/http_health_check_plugin.py[171-176]
- krkn/health_checks/virt_health_check_plugin.py[426-428]
- run_kraken.py[616-622]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
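
The exit-code promotion this issue asks for can be sketched as follows. This is an illustrative helper, not code from the PR: the constant and the function name are hypothetical, and the only rule taken from the review is that 0 stays success while health check failures must land in the 3+ range (2 is reserved for critical alerts).

```python
# Hypothetical mapping of a plugin's internal failure flag to the documented
# exit-code range. HEALTH_CHECK_FAILURE_EXIT_CODE and resolve_exit_code are
# illustrative names, not from the krkn codebase.
HEALTH_CHECK_FAILURE_EXIT_CODE = 3

def resolve_exit_code(plugin_ret_value: int) -> int:
    if plugin_ret_value == 0:
        return 0
    # Any non-zero health check failure is promoted into the 3+ range so it
    # cannot be mistaken for the critical-alert exit code 2.
    return max(plugin_ret_value, HEALTH_CHECK_FAILURE_EXIT_CODE)
```

With such a helper, `run_kraken.py` would return `resolve_exit_code(health_checker.get_return_value())` instead of the raw plugin value.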


2. Health threads can deadlock 🐞 Bug ⛯ Reliability
Description
Health check workers terminate only when current_iterations reaches iterations, but the main loop
can stop early (STOP/alerts) or use iterations=inf in daemon mode. run_kraken then join()s the
worker(s), which can block forever and hang the whole run.
Code

run_kraken.py[R449-456]

+        # Collect health check telemetry
+        if health_check_worker:
+            health_check_worker.join()
+            try:
+                chaos_telemetry.health_checks = health_check_telemetry_queue.get_nowait()
+            except queue.Empty:
+                chaos_telemetry.health_checks = None
+        else:
Evidence
In daemon mode, iterations is set to infinity, making the health check loop condition unreachable;
even without daemon mode, STOP can end the main loop before current_iterations reaches iterations.
Since plugins loop until current_iterations>=iterations and run_kraken unconditionally joins the
worker, the process can hang indefinitely.

run_kraken.py[274-285]
run_kraken.py[378-396]
run_kraken.py[449-457]
krkn/health_checks/http_health_check_plugin.py[132-133]
krkn/health_checks/virt_health_check_plugin.py[299-305]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Health check threads (HTTP + virt) can run indefinitely and block `run_kraken.py` during shutdown because their termination condition depends solely on `current_iterations >= iterations`. This breaks on early STOP/abort paths and is guaranteed to hang in daemon mode (`iterations = inf`).
### Issue Context
- Main loop can exit early (STOP/alert paths).
- Daemon mode sets iterations to infinity.
- Health check workers are joined at the end, requiring them to terminate.
### Fix Focus Areas
- run_kraken.py[274-285]
- run_kraken.py[378-396]
- run_kraken.py[449-457]
- krkn/health_checks/http_health_check_plugin.py[132-208]
- krkn/health_checks/virt_health_check_plugin.py[299-375]

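
One common remedy for this hang is a cooperative stop signal. The sketch below is hypothetical, since the current plugins accept no `stop_event` parameter; it only illustrates the pattern of an interruptible worker loop plus a bounded `join()`.

```python
import threading
import time

# Hypothetical worker loop that also honors a stop signal, so join() cannot
# block forever when the main loop exits early or iterations is infinite.
def health_check_loop(stop_event: threading.Event, interval: float = 0.01) -> None:
    while not stop_event.is_set():
        # ... perform one health check probe here ...
        stop_event.wait(interval)  # interruptible sleep instead of time.sleep

stop = threading.Event()
worker = threading.Thread(target=health_check_loop, args=(stop,), daemon=True)
worker.start()
time.sleep(0.05)        # stand-in for the chaos run
stop.set()              # STOP file seen, alert fired, or iterations exhausted
worker.join(timeout=5)  # bounded join as a second safety net
```

Setting the event from every exit path (normal completion, STOP, alert abort) guarantees the worker observes shutdown within one `interval`.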


3. Virt checks not initialized 🐞 Bug ✓ Correctness
Description
VirtHealthCheckPlugin is created but run_kraken calls batch_list() directly instead of
run_health_check(). Since batch_size/vm_list are only populated by _initialize_from_config(),
batch_list() becomes a no-op and virt health checking is effectively disabled.
Code

run_kraken.py[R366-374]

+        if kubevirt_check_config and kubevirt_check_config.get("namespace"):
+            try:
+                kubevirt_checker = health_check_factory.create_plugin(
+                    health_check_type="virt_health_check",
+                    iterations=iterations,
+                    krkn_lib=kubecli
+                )
+                kubevirt_checker.batch_list(kubevirt_check_telemetry_queue)
+            except HealthCheckPluginNotFound:
Evidence
Virt plugin’s run_health_check() performs initialization and sets batch_size, but run_kraken
bypasses it and calls batch_list() directly. batch_list() starts threads only when batch_size>0;
batch_size defaults to 0 until _initialize_from_config() runs.

run_kraken.py[366-374]
krkn/health_checks/virt_health_check_plugin.py[481-505]
krkn/health_checks/virt_health_check_plugin.py[266-273]
krkn/health_checks/virt_health_check_plugin.py[68-70]
krkn/health_checks/virt_health_check_plugin.py[172-173]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Virt health checks are never actually started because `run_kraken.py` calls `batch_list()` directly, skipping the plugin’s configuration initialization.
### Issue Context
`VirtHealthCheckPlugin.run_health_check()` is the entry point that calls `_initialize_from_config()` (populating `vm_list` and `batch_size`) and then starts worker threads via `batch_list()`.
### Fix Focus Areas
- run_kraken.py[366-374]
- krkn/health_checks/virt_health_check_plugin.py[481-505]
- krkn/health_checks/virt_health_check_plugin.py[266-273]

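
The no-op can be modeled in a few lines. The class and method names below mirror the review's description, but the bodies are stand-ins, not the plugin's real implementation.

```python
# Illustrative model of the bug: batch_list() does nothing until
# _initialize_from_config() has populated batch_size.
class VirtHealthCheckPluginSketch:
    def __init__(self):
        self.batch_size = 0        # only set by _initialize_from_config()
        self.batches_started = 0

    def _initialize_from_config(self):
        self.batch_size = 5        # stand-in for reading vm_list/config

    def batch_list(self):
        if self.batch_size > 0:    # threads start only when batch_size > 0
            self.batches_started += 1

    def run_health_check(self):
        # Documented entry point: initialize, then start the batches.
        self._initialize_from_config()
        self.batch_list()

broken = VirtHealthCheckPluginSketch()
broken.batch_list()                # what run_kraken.py does today: a no-op
fixed = VirtHealthCheckPluginSketch()
fixed.run_health_check()           # the fix: use the documented entry point
```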



Remediation recommended

4. HTTP telemetry status wrong 🐞 Bug ✓ Correctness
Description
HttpHealthCheckPlugin hard-codes status=False for every change record and status=True for the final
record regardless of the actual HTTP status_code. This produces misleading health_check telemetry
and downstream metrics.
Code

krkn/health_checks/http_health_check_plugin.py[R189-226]

+                        # Record the status change period
+                        change_record = {
+                            "url": url,
+                            "status": False,
+                            "status_code": previous_status_code,
+                            "start_timestamp": start_timestamp.isoformat(),
+                            "end_timestamp": end_timestamp.isoformat(),
+                            "duration": duration,
+                        }
+
+                        health_check_telemetry.append(HealthCheck(change_record))
+
+                        if response_tracker[url] != True:
+                            response_tracker[url] = True
+
+                        # Reset tracker with new status
+                        del health_check_tracker[url]
+
+            time.sleep(interval)
+
+        # Record final status for all tracked URLs
+        health_check_end_timestamp = datetime.now()
+        for url in health_check_tracker.keys():
+            duration = (
+                health_check_end_timestamp
+                - health_check_tracker[url]["start_timestamp"]
+            ).total_seconds()
+            final_record = {
+                "url": url,
+                "status": True,
+                "status_code": health_check_tracker[url]["status_code"],
+                "start_timestamp": health_check_tracker[url][
+                    "start_timestamp"
+                ].isoformat(),
+                "end_timestamp": health_check_end_timestamp.isoformat(),
+                "duration": duration,
+            }
+            health_check_telemetry.append(HealthCheck(final_record))
Evidence
Telemetry records are forwarded directly into metrics; hard-coding status fields makes failing
endpoints appear healthy in the final record and obscures up/down periods.

krkn/health_checks/http_health_check_plugin.py[189-197]
krkn/health_checks/http_health_check_plugin.py[216-225]
krkn/prometheus/client.py[263-270]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
HTTP health check telemetry is currently inconsistent because `status` is hard-coded rather than computed from HTTP response codes.
### Issue Context
Downstream metrics generation copies telemetry fields directly, so incorrect `status` values will yield misleading observability.
### Fix Focus Areas
- krkn/health_checks/http_health_check_plugin.py[160-226]
- krkn/prometheus/client.py[263-270]

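
A sketch of computing the `status` field from the HTTP response rather than hard-coding it. Treating 2xx/3xx as healthy is an assumption here; the real plugin may apply a different rule.

```python
from datetime import datetime, timedelta

def is_healthy(status_code) -> bool:
    # Assumption: 2xx/3xx responses count as healthy; None means no response.
    return status_code is not None and 200 <= status_code < 400

def make_record(url, status_code, start, end):
    return {
        "url": url,
        "status": is_healthy(status_code),   # computed, not hard-coded
        "status_code": status_code,
        "start_timestamp": start.isoformat(),
        "end_timestamp": end.isoformat(),
        "duration": (end - start).total_seconds(),
    }

t0 = datetime(2026, 3, 1, 12, 0, 0)
rec = make_record("https://example.com/healthz", 503, t0, t0 + timedelta(seconds=30))
```

Building both the change records and the final record through one such helper keeps telemetry consistent with what the endpoint actually returned.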


5. Factory state leaks 🐞 Bug ⛯ Reliability
Description
HealthCheckFactory keeps loaded_plugins and failed_plugins as class-level attributes and never
resets them in __init__. Multiple factory instantiations in the same process will accumulate stale
state (especially failed_plugins), leading to confusing behavior and test pollution.
Code

krkn/health_checks/health_check_factory.py[R22-34]

+    loaded_plugins: dict[str, Any] = {}
+    failed_plugins: list[Tuple[str, str, str]] = []
+    package_name = None
+
+    def __init__(self, package_name: str = "krkn.health_checks"):
+        """
+        Initializes the HealthCheckFactory and loads all available health check plugins.
+
+        :param package_name: the package to scan for health check plugins
+        """
+        self.package_name = package_name
+        self.__load_plugins(AbstractHealthCheckPlugin)
+
Evidence
Because these are class attributes, they are shared across all instances. __init__ calls
__load_plugins() but does not clear the shared dict/list first, so state persists across
instantiations.

krkn/health_checks/health_check_factory.py[22-34]
krkn/health_checks/health_check_factory.py[80-86]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
HealthCheckFactory uses shared class-level mutable state for `loaded_plugins` and `failed_plugins`, which can leak across factory instances.
### Issue Context
This can cause confusing logs and test pollution when multiple factories are created.
### Fix Focus Areas
- krkn/health_checks/health_check_factory.py[13-34]
- krkn/health_checks/health_check_factory.py[61-124]

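
The fix is to make the registries per-instance state. The attribute names below follow the review; plugin loading itself is elided from this sketch.

```python
# Sketch: fresh containers created in __init__ instead of shared class
# attributes, so one factory's failures cannot leak into the next.
class HealthCheckFactorySketch:
    def __init__(self, package_name: str = "krkn.health_checks"):
        self.loaded_plugins: dict = {}
        self.failed_plugins: list = []
        self.package_name = package_name

first = HealthCheckFactorySketch()
first.failed_plugins.append(("virt_health_check", "module", "import error"))
second = HealthCheckFactorySketch()   # starts with clean registries
```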


6. Shell=True in virt commands 🐞 Bug ⛨ Security
Description
VirtHealthCheckPlugin interpolates config- and cluster-derived values into shell command strings and
executes them via invoke_no_exit(), which uses subprocess.check_output(shell=True). This is a risky
pattern; harden by avoiding shell=True or quoting/sanitizing interpolated values.
Code

krkn/health_checks/virt_health_check_plugin.py[R186-233]

+        virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{ip_address}'"
+
+        all_out = invoke_no_exit(virtctl_vm_cmd)
+        logging.debug(
+            f"Checking disconnected access for {ip_address} on {worker_name} output: {all_out}"
+        )
+
+        virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
+        output = invoke_no_exit(virtctl_vm_cmd)
+
+        if "True" in output:
+            logging.debug(
+                f"Disconnected access for {ip_address} on {worker_name} is successful: {output}"
+            )
+            return True, None, None
+        else:
+            logging.debug(
+                f"Disconnected access for {ip_address} on {worker_name} failed: {output}"
+            )
+            vmi = self.kube_vm_plugin.get_vmi(vmi_name, self.namespace)
+            interfaces = vmi.get("status", {}).get("interfaces", [])
+            new_ip_address = interfaces[0].get("ipAddress") if interfaces else None
+            new_node_name = vmi.get("status", {}).get("nodeName")
+
+            # Check if VM restarted with new IP
+            if new_ip_address != ip_address:
+                virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
+                new_output = invoke_no_exit(virtctl_vm_cmd)
+                logging.debug(
+                    f"Disconnected access for {ip_address} on {worker_name}: {new_output}"
+                )
+                if "True" in new_output:
+                    return True, new_ip_address, None
+
+            # Check if VM migrated to new node
+            if new_node_name != worker_name:
+                virtctl_vm_cmd = f"ssh core@{new_node_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
+                new_output = invoke_no_exit(virtctl_vm_cmd)
+                logging.debug(
+                    f"Disconnected access for {ip_address} on {new_node_name}: {new_output}"
+                )
+                if "True" in new_output:
+                    return True, new_ip_address, new_node_name
+
+            # Try common SSH node as fallback
+            if self.ssh_node:
+                virtctl_vm_cmd = f"ssh core@{self.ssh_node} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
+                new_output = invoke_no_exit(virtctl_vm_cmd)
Evidence
The plugin pulls fields like namespace/ssh_node from config and uses them in f-string shell
commands. invoke_no_exit runs those commands with shell=True, so metacharacters could change command
meaning. Even if Kubernetes naming reduces risk, this is still a dangerous anti-pattern
(defense-in-depth).

krkn/health_checks/virt_health_check_plugin.py[107-113]
krkn/health_checks/virt_health_check_plugin.py[186-194]
krkn/health_checks/virt_health_check_plugin.py[250-252]
krkn/invoke/command.py[18-23]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Virt health checks execute interpolated shell commands with `shell=True`, which is a security hardening concern.
### Issue Context
Values such as `namespace` and `ssh_node` come from config and are interpolated into command strings.
### Fix Focus Areas
- krkn/health_checks/virt_health_check_plugin.py[107-113]
- krkn/health_checks/virt_health_check_plugin.py[186-239]
- krkn/health_checks/virt_health_check_plugin.py[250-259]
- krkn/invoke/command.py[18-24]

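
A hardening sketch along the lines the review suggests: build the probe as an argument list so the local side never goes through a shell, and quote the value interpolated into the remote command (which the worker's shell will still parse). This is illustrative; the host names and IPs are placeholders, not values from the PR.

```python
import shlex

def build_ssh_probe(worker_name: str, ip_address: str) -> list[str]:
    # The inner command is parsed by the remote shell, so the interpolated
    # address must be quoted there.
    inner = (
        "ssh -o BatchMode=yes -o ConnectTimeout=5 "
        f"-o StrictHostKeyChecking=no root@{shlex.quote(ip_address)}"
    )
    # Pass this list to subprocess.run(cmd) WITHOUT shell=True.
    return ["ssh", f"core@{worker_name}", "-o", "ConnectTimeout=5", inner]

cmd = build_ssh_probe("worker-0", "10.0.0.5")
evil = build_ssh_probe("worker-0", "10.0.0.5; rm -rf /")
# Injected metacharacters end up inside a quoted token on the remote side,
# not as a second command.
```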



@paigerube14 paigerube14 changed the title [WIP] Health check plugin Health check plugin Mar 6, 2026