Health check plugin #1176

Open

paigerube14 wants to merge 7 commits into krkn-chaos:main from paigerube14:health_check_plugin

Conversation

@paigerube14
Collaborator

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization

Description

Revamp health checks into a plugin architecture so that additional health check plugins can be added in the future.

Related Tickets & Documents

If there is no related issue, please create one and start a conversation about the desired change.

  • Related Issue #:
  • Closes #:

Documentation

  • Is documentation needed for this update?

If checked, a documentation PR must be created and merged in the website repository.

Related Documentation PR (if applicable)

<!-- Add the link to the corresponding documentation PR in the website repository -->

Checklist before requesting a review

  • Ensure the changes and proposed solution have been discussed in the relevant issue and have received acknowledgment from the community or maintainers. See the contributing guidelines.
  • See testing your changes and run on any Kubernetes or OpenShift cluster to validate your changes.

  • I have performed a self-review of my code by running krkn and specific scenario
  • If it is a core feature, I have added thorough unit tests with above 80% coverage

REQUIRED:
Describe the combination of tests performed and include the output of the run.

python run_kraken.py
...
<---insert test results output--->

OR

python -m coverage run -a -m unittest discover -s tests -v
...
<---insert test results output--->

@qodo-code-review

Review Summary by Qodo

Implement health check plugin architecture with HTTP and KubeVirt plugins

✨ Enhancement


Walkthroughs

Description
• Implement health check plugin architecture replacing legacy classes
• Add HTTP and KubeVirt VM health check plugins with comprehensive features
• Create factory pattern for dynamic plugin discovery and instantiation
• Integrate new plugin system into main run loop with proper error handling
Diagram
flowchart LR
  A["Legacy HealthChecker<br/>VirtChecker Classes"] -->|"Replaced by"| B["AbstractHealthCheckPlugin<br/>Base Class"]
  B -->|"Discovered by"| C["HealthCheckFactory"]
  C -->|"Creates"| D["HttpHealthCheckPlugin"]
  C -->|"Creates"| E["VirtHealthCheckPlugin"]
  C -->|"Creates"| F["SimpleHealthCheckPlugin"]
  D -->|"Integrated in"| G["run_kraken.py"]
  E -->|"Integrated in"| G
  G -->|"Collects telemetry"| H["Telemetry Queues"]


File Changes

1. krkn/health_checks/__init__.py ✨ Enhancement +18/-0

Create health checks module with plugin exports

krkn/health_checks/__init__.py


2. krkn/health_checks/abstract_health_check_plugin.py ✨ Enhancement +83/-0

Define abstract base class for health check plugins

krkn/health_checks/abstract_health_check_plugin.py


3. krkn/health_checks/health_check_factory.py ✨ Enhancement +171/-0

Implement factory for dynamic plugin discovery and loading

krkn/health_checks/health_check_factory.py


4. krkn/health_checks/simple_health_check_plugin.py ✨ Enhancement +42/-0

Create minimal test health check plugin

krkn/health_checks/simple_health_check_plugin.py


5. krkn/health_checks/http_health_check_plugin.py ✨ Enhancement +229/-0

Implement HTTP endpoint health check plugin

krkn/health_checks/http_health_check_plugin.py


6. krkn/health_checks/virt_health_check_plugin.py ✨ Enhancement +505/-0

Implement KubeVirt VM health check plugin with SSH access

krkn/health_checks/virt_health_check_plugin.py


7. krkn/health_checks/README.md 📝 Documentation +349/-0

Document health check plugin system and architecture

krkn/health_checks/README.md


8. run_kraken.py ✨ Enhancement +77/-27

Integrate health check factory and plugin system

run_kraken.py


9. tests/test_health_check_factory.py 🧪 Tests +221/-0

Add comprehensive factory integration tests

tests/test_health_check_factory.py


10. tests/test_http_health_check_plugin.py 🧪 Tests +345/-0

Add HTTP health check plugin unit tests

tests/test_http_health_check_plugin.py


11. tests/test_virt_health_check_plugin.py 🧪 Tests +297/-0

Add virt health check plugin unit tests

tests/test_virt_health_check_plugin.py


12. krkn/utils/HealthChecker.py Additional files +0/-89

Remove legacy HealthChecker class (replaced by the plugin system)

krkn/utils/HealthChecker.py


13. krkn/utils/VirtChecker.py Additional files +0/-280

Remove legacy VirtChecker class (replaced by the plugin system)

krkn/utils/VirtChecker.py


14. tests/test_health_checker.py Additional files +0/-503

Remove legacy HealthChecker unit tests

tests/test_health_checker.py


15. tests/test_virt_checker.py Additional files +0/-598

Remove legacy VirtChecker unit tests

tests/test_virt_checker.py



@qodo-code-review

qodo-code-review bot commented Mar 3, 2026

Code Review by Qodo

🐞 Bugs (5) 📘 Rule violations (1) 📎 Requirement gaps (0)



Action required

1. Health failures return code 2 📘 Rule violation ⛯ Reliability
Description
Health check failures set/propagate exit code 2, but documented semantics require health check
failures to use exit code 3+. This can misclassify health failures as critical-alert exit code 2
for automation/CI consumers.
Code

run_kraken.py[R616-622]

+        if health_checker and health_checker.get_return_value() != 0:
           logging.error("Health check failed for the applications, Please check; exiting")
-            return health_checker.ret_value
+            return health_checker.get_return_value()

-        if kubevirt_checker.ret_value != 0:
+        if kubevirt_checker and kubevirt_checker.get_return_value() != 0:
           logging.error("Kubevirt check still had failed VMIs at end of run, Please check; exiting")
-            return kubevirt_checker.ret_value
+            return kubevirt_checker.get_return_value()
Evidence
Compliance requires health check failures to return 3+ (not 2). The new health check plugins set
ret_value to 2 on health check failure, and run_kraken.py returns the plugin’s return value
directly, which will propagate 2 for health check failures.

CLAUDE.md
run_kraken.py[616-622]
krkn/health_checks/http_health_check_plugin.py[171-176]
krkn/health_checks/virt_health_check_plugin.py[426-428]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Health check failures currently set/propagate exit code `2`, but compliance requires health check failures to use exit codes `3+`.
## Issue Context
- `HttpHealthCheckPlugin` and `VirtHealthCheckPlugin` set `self.ret_value = 2` on health failures.
- `run_kraken.py` returns the plugin return value directly, so the process may exit with `2` for health check failures.
## Fix Focus Areas
- krkn/health_checks/http_health_check_plugin.py[171-176]
- krkn/health_checks/virt_health_check_plugin.py[426-428]
- run_kraken.py[616-622]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
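
The exit-code promotion this issue asks for can be sketched as follows. This is an illustrative helper, not code from the PR: the constant and the function name are hypothetical, and the only rule taken from the review is that 0 stays success while health check failures must land in the 3+ range (2 is reserved for critical alerts).

```python
# Hypothetical mapping of a plugin's internal failure flag to the documented
# exit-code range. HEALTH_CHECK_FAILURE_EXIT_CODE and resolve_exit_code are
# illustrative names, not from the krkn codebase.
HEALTH_CHECK_FAILURE_EXIT_CODE = 3

def resolve_exit_code(plugin_ret_value: int) -> int:
    if plugin_ret_value == 0:
        return 0
    # Any non-zero health check failure is promoted into the 3+ range so it
    # cannot be mistaken for the critical-alert exit code 2.
    return max(plugin_ret_value, HEALTH_CHECK_FAILURE_EXIT_CODE)
```

With such a helper, `run_kraken.py` would return `resolve_exit_code(health_checker.get_return_value())` instead of the raw plugin value.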


2. Health threads can deadlock 🐞 Bug ⛯ Reliability
Description
Health check workers terminate only when current_iterations reaches iterations, but the main loop
can stop early (STOP/alerts) or use iterations=inf in daemon mode. run_kraken then join()s the
worker(s), which can block forever and hang the whole run.
Code

run_kraken.py[R449-456]

+        # Collect health check telemetry
+        if health_check_worker:
+            health_check_worker.join()
+            try:
+                chaos_telemetry.health_checks = health_check_telemetry_queue.get_nowait()
+            except queue.Empty:
+                chaos_telemetry.health_checks = None
+        else:
Evidence
In daemon mode, iterations is set to infinity, making the health check loop condition unreachable;
even without daemon mode, STOP can end the main loop before current_iterations reaches iterations.
Since plugins loop until current_iterations>=iterations and run_kraken unconditionally joins the
worker, the process can hang indefinitely.

run_kraken.py[274-285]
run_kraken.py[378-396]
run_kraken.py[449-457]
krkn/health_checks/http_health_check_plugin.py[132-133]
krkn/health_checks/virt_health_check_plugin.py[299-305]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Health check threads (HTTP + virt) can run indefinitely and block `run_kraken.py` during shutdown because their termination condition depends solely on `current_iterations >= iterations`. This breaks on early STOP/abort paths and is guaranteed to hang in daemon mode (`iterations = inf`).
### Issue Context
- Main loop can exit early (STOP/alert paths).
- Daemon mode sets iterations to infinity.
- Health check workers are joined at the end, requiring them to terminate.
### Fix Focus Areas
- run_kraken.py[274-285]
- run_kraken.py[378-396]
- run_kraken.py[449-457]
- krkn/health_checks/http_health_check_plugin.py[132-208]
- krkn/health_checks/virt_health_check_plugin.py[299-375]

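
One common remedy for this hang is a cooperative stop signal. The sketch below is hypothetical, since the current plugins accept no `stop_event` parameter; it only illustrates the pattern of an interruptible worker loop plus a bounded `join()`.

```python
import threading
import time

# Hypothetical worker loop that also honors a stop signal, so join() cannot
# block forever when the main loop exits early or iterations is infinite.
def health_check_loop(stop_event: threading.Event, interval: float = 0.01) -> None:
    while not stop_event.is_set():
        # ... perform one health check probe here ...
        stop_event.wait(interval)  # interruptible sleep instead of time.sleep

stop = threading.Event()
worker = threading.Thread(target=health_check_loop, args=(stop,), daemon=True)
worker.start()
time.sleep(0.05)        # stand-in for the chaos run
stop.set()              # STOP file seen, alert fired, or iterations exhausted
worker.join(timeout=5)  # bounded join as a second safety net
```

Setting the event from every exit path (normal completion, STOP, alert abort) guarantees the worker observes shutdown within one `interval`.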


3. Virt checks not initialized 🐞 Bug ✓ Correctness
Description
VirtHealthCheckPlugin is created but run_kraken calls batch_list() directly instead of
run_health_check(). Since batch_size/vm_list are only populated by _initialize_from_config(),
batch_list() becomes a no-op and virt health checking is effectively disabled.
Code

run_kraken.py[R366-374]

+        if kubevirt_check_config and kubevirt_check_config.get("namespace"):
+            try:
+                kubevirt_checker = health_check_factory.create_plugin(
+                    health_check_type="virt_health_check",
+                    iterations=iterations,
+                    krkn_lib=kubecli
+                )
+                kubevirt_checker.batch_list(kubevirt_check_telemetry_queue)
+            except HealthCheckPluginNotFound:
Evidence
Virt plugin’s run_health_check() performs initialization and sets batch_size, but run_kraken
bypasses it and calls batch_list() directly. batch_list() starts threads only when batch_size>0;
batch_size defaults to 0 until _initialize_from_config() runs.

run_kraken.py[366-374]
krkn/health_checks/virt_health_check_plugin.py[481-505]
krkn/health_checks/virt_health_check_plugin.py[266-273]
krkn/health_checks/virt_health_check_plugin.py[68-70]
krkn/health_checks/virt_health_check_plugin.py[172-173]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Virt health checks are never actually started because `run_kraken.py` calls `batch_list()` directly, skipping the plugin’s configuration initialization.
### Issue Context
`VirtHealthCheckPlugin.run_health_check()` is the entry point that calls `_initialize_from_config()` (populating `vm_list` and `batch_size`) and then starts worker threads via `batch_list()`.
### Fix Focus Areas
- run_kraken.py[366-374]
- krkn/health_checks/virt_health_check_plugin.py[481-505]
- krkn/health_checks/virt_health_check_plugin.py[266-273]

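
The no-op can be modeled in a few lines. The class and method names below mirror the review's description, but the bodies are stand-ins, not the plugin's real implementation.

```python
# Illustrative model of the bug: batch_list() does nothing until
# _initialize_from_config() has populated batch_size.
class VirtHealthCheckPluginSketch:
    def __init__(self):
        self.batch_size = 0        # only set by _initialize_from_config()
        self.batches_started = 0

    def _initialize_from_config(self):
        self.batch_size = 5        # stand-in for reading vm_list/config

    def batch_list(self):
        if self.batch_size > 0:    # threads start only when batch_size > 0
            self.batches_started += 1

    def run_health_check(self):
        # Documented entry point: initialize, then start the batches.
        self._initialize_from_config()
        self.batch_list()

broken = VirtHealthCheckPluginSketch()
broken.batch_list()                # what run_kraken.py does today: a no-op
fixed = VirtHealthCheckPluginSketch()
fixed.run_health_check()           # the fix: use the documented entry point
```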



Remediation recommended

4. HTTP telemetry status wrong 🐞 Bug ✓ Correctness
Description
HttpHealthCheckPlugin hard-codes status=False for every change record and status=True for the final
record regardless of the actual HTTP status_code. This produces misleading health_check telemetry
and downstream metrics.
Code

krkn/health_checks/http_health_check_plugin.py[R189-226]

+                        # Record the status change period
+                        change_record = {
+                            "url": url,
+                            "status": False,
+                            "status_code": previous_status_code,
+                            "start_timestamp": start_timestamp.isoformat(),
+                            "end_timestamp": end_timestamp.isoformat(),
+                            "duration": duration,
+                        }
+
+                        health_check_telemetry.append(HealthCheck(change_record))
+
+                        if response_tracker[url] != True:
+                            response_tracker[url] = True
+
+                        # Reset tracker with new status
+                        del health_check_tracker[url]
+
+            time.sleep(interval)
+
+        # Record final status for all tracked URLs
+        health_check_end_timestamp = datetime.now()
+        for url in health_check_tracker.keys():
+            duration = (
+                health_check_end_timestamp
+                - health_check_tracker[url]["start_timestamp"]
+            ).total_seconds()
+            final_record = {
+                "url": url,
+                "status": True,
+                "status_code": health_check_tracker[url]["status_code"],
+                "start_timestamp": health_check_tracker[url][
+                    "start_timestamp"
+                ].isoformat(),
+                "end_timestamp": health_check_end_timestamp.isoformat(),
+                "duration": duration,
+            }
+            health_check_telemetry.append(HealthCheck(final_record))
Evidence
Telemetry records are forwarded directly into metrics; hard-coding status fields makes failing
endpoints appear healthy in the final record and obscures up/down periods.

krkn/health_checks/http_health_check_plugin.py[189-197]
krkn/health_checks/http_health_check_plugin.py[216-225]
krkn/prometheus/client.py[263-270]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
HTTP health check telemetry is currently inconsistent because `status` is hard-coded rather than computed from HTTP response codes.
### Issue Context
Downstream metrics generation copies telemetry fields directly, so incorrect `status` values will yield misleading observability.
### Fix Focus Areas
- krkn/health_checks/http_health_check_plugin.py[160-226]
- krkn/prometheus/client.py[263-270]

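
A sketch of computing the `status` field from the HTTP response rather than hard-coding it. Treating 2xx/3xx as healthy is an assumption here; the real plugin may apply a different rule.

```python
from datetime import datetime, timedelta

def is_healthy(status_code) -> bool:
    # Assumption: 2xx/3xx responses count as healthy; None means no response.
    return status_code is not None and 200 <= status_code < 400

def make_record(url, status_code, start, end):
    return {
        "url": url,
        "status": is_healthy(status_code),   # computed, not hard-coded
        "status_code": status_code,
        "start_timestamp": start.isoformat(),
        "end_timestamp": end.isoformat(),
        "duration": (end - start).total_seconds(),
    }

t0 = datetime(2026, 3, 1, 12, 0, 0)
rec = make_record("https://example.com/healthz", 503, t0, t0 + timedelta(seconds=30))
```

Building both the change records and the final record through one such helper keeps telemetry consistent with what the endpoint actually returned.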


5. Factory state leaks 🐞 Bug ⛯ Reliability
Description
HealthCheckFactory keeps loaded_plugins and failed_plugins as class-level attributes and never
resets them in __init__. Multiple factory instantiations in the same process will accumulate stale
state (especially failed_plugins), leading to confusing behavior and test pollution.
Code

krkn/health_checks/health_check_factory.py[R22-34]

+    loaded_plugins: dict[str, Any] = {}
+    failed_plugins: list[Tuple[str, str, str]] = []
+    package_name = None
+
+    def __init__(self, package_name: str = "krkn.health_checks"):
+        """
+        Initializes the HealthCheckFactory and loads all available health check plugins.
+
+        :param package_name: the package to scan for health check plugins
+        """
+        self.package_name = package_name
+        self.__load_plugins(AbstractHealthCheckPlugin)
+
Evidence
Because these are class attributes, they are shared across all instances. __init__ calls
__load_plugins() but does not clear the shared dict/list first, so state persists across
instantiations.

krkn/health_checks/health_check_factory.py[22-34]
krkn/health_checks/health_check_factory.py[80-86]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
HealthCheckFactory uses shared class-level mutable state for `loaded_plugins` and `failed_plugins`, which can leak across factory instances.
### Issue Context
This can cause confusing logs and test pollution when multiple factories are created.
### Fix Focus Areas
- krkn/health_checks/health_check_factory.py[13-34]
- krkn/health_checks/health_check_factory.py[61-124]

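
The fix is to make the registries per-instance state. The attribute names below follow the review; plugin loading itself is elided from this sketch.

```python
# Sketch: fresh containers created in __init__ instead of shared class
# attributes, so one factory's failures cannot leak into the next.
class HealthCheckFactorySketch:
    def __init__(self, package_name: str = "krkn.health_checks"):
        self.loaded_plugins: dict = {}
        self.failed_plugins: list = []
        self.package_name = package_name

first = HealthCheckFactorySketch()
first.failed_plugins.append(("virt_health_check", "module", "import error"))
second = HealthCheckFactorySketch()   # starts with clean registries
```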


6. Shell=True in virt commands 🐞 Bug ⛨ Security
Description
VirtHealthCheckPlugin interpolates config- and cluster-derived values into shell command strings and
executes them via invoke_no_exit(), which uses subprocess.check_output(shell=True). This is a risky
pattern; harden by avoiding shell=True or quoting/sanitizing interpolated values.
Code

krkn/health_checks/virt_health_check_plugin.py[R186-233]

+        virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{ip_address}'"
+
+        all_out = invoke_no_exit(virtctl_vm_cmd)
+        logging.debug(
+            f"Checking disconnected access for {ip_address} on {worker_name} output: {all_out}"
+        )
+
+        virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
+        output = invoke_no_exit(virtctl_vm_cmd)
+
+        if "True" in output:
+            logging.debug(
+                f"Disconnected access for {ip_address} on {worker_name} is successful: {output}"
+            )
+            return True, None, None
+        else:
+            logging.debug(
+                f"Disconnected access for {ip_address} on {worker_name} failed: {output}"
+            )
+            vmi = self.kube_vm_plugin.get_vmi(vmi_name, self.namespace)
+            interfaces = vmi.get("status", {}).get("interfaces", [])
+            new_ip_address = interfaces[0].get("ipAddress") if interfaces else None
+            new_node_name = vmi.get("status", {}).get("nodeName")
+
+            # Check if VM restarted with new IP
+            if new_ip_address != ip_address:
+                virtctl_vm_cmd = f"ssh core@{worker_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
+                new_output = invoke_no_exit(virtctl_vm_cmd)
+                logging.debug(
+                    f"Disconnected access for {ip_address} on {worker_name}: {new_output}"
+                )
+                if "True" in new_output:
+                    return True, new_ip_address, None
+
+            # Check if VM migrated to new node
+            if new_node_name != worker_name:
+                virtctl_vm_cmd = f"ssh core@{new_node_name} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
+                new_output = invoke_no_exit(virtctl_vm_cmd)
+                logging.debug(
+                    f"Disconnected access for {ip_address} on {new_node_name}: {new_output}"
+                )
+                if "True" in new_output:
+                    return True, new_ip_address, new_node_name
+
+            # Try common SSH node as fallback
+            if self.ssh_node:
+                virtctl_vm_cmd = f"ssh core@{self.ssh_node} -o ConnectTimeout=5 'ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no root@{new_ip_address} 2>&1 | grep Permission' && echo 'True' || echo 'False'"
+                new_output = invoke_no_exit(virtctl_vm_cmd)
Evidence
The plugin pulls fields like namespace/ssh_node from config and uses them in f-string shell
commands. invoke_no_exit runs those commands with shell=True, so metacharacters could change command
meaning. Even if Kubernetes naming reduces risk, this is still a dangerous anti-pattern
(defense-in-depth).

krkn/health_checks/virt_health_check_plugin.py[107-113]
krkn/health_checks/virt_health_check_plugin.py[186-194]
krkn/health_checks/virt_health_check_plugin.py[250-252]
krkn/invoke/command.py[18-23]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Virt health checks execute interpolated shell commands with `shell=True`, which is a security hardening concern.
### Issue Context
Values such as `namespace` and `ssh_node` come from config and are interpolated into command strings.
### Fix Focus Areas
- krkn/health_checks/virt_health_check_plugin.py[107-113]
- krkn/health_checks/virt_health_check_plugin.py[186-239]
- krkn/health_checks/virt_health_check_plugin.py[250-259]
- krkn/invoke/command.py[18-24]

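
A hardening sketch along the lines the review suggests: build the probe as an argument list so the local side never goes through a shell, and quote the value interpolated into the remote command (which the worker's shell will still parse). This is illustrative; the host names and IPs are placeholders, not values from the PR.

```python
import shlex

def build_ssh_probe(worker_name: str, ip_address: str) -> list[str]:
    # The inner command is parsed by the remote shell, so the interpolated
    # address must be quoted there.
    inner = (
        "ssh -o BatchMode=yes -o ConnectTimeout=5 "
        f"-o StrictHostKeyChecking=no root@{shlex.quote(ip_address)}"
    )
    # Pass this list to subprocess.run(cmd) WITHOUT shell=True.
    return ["ssh", f"core@{worker_name}", "-o", "ConnectTimeout=5", inner]

cmd = build_ssh_probe("worker-0", "10.0.0.5")
evil = build_ssh_probe("worker-0", "10.0.0.5; rm -rf /")
# Injected metacharacters end up inside a quoted token on the remote side,
# not as a second command.
```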



@paigerube14 paigerube14 changed the title [WIP] Health check plugin Health check plugin Mar 6, 2026