Open
Labels: bug (Something isn't working)
Description
Bug Description
Describe the bug
During pod scenarios, I want to be able to stop the run while it is waiting for pod recovery and have it exit cleanly with a warning, completing the run at that point rather than waiting out the full expected recovery time.
To Reproduce
Run `python run_kraken.py`, then press Ctrl+C during the wait for pod recovery. The error below is seen.
Scenario File
Scenario file(s) that were specified in your config file (confidential information can be starred (*)):
```yaml
kraken:
    kubeconfig_path: ~/.kube/config  # Path to kubeconfig
    exit_on_failure: False  # Exit when a post action scenario fails
    auto_rollback: True  # Enable auto rollback for scenarios
    rollback_versions_directory: /tmp/kraken-rollback  # Directory to store rollback version files
    publish_kraken_status: True  # Can be accessed at http://0.0.0.0:8081
    signal_state: RUN  # Will wait for the RUN signal when set to PAUSE before running the scenarios; refer to docs/signal.md for more details
    signal_address: 0.0.0.0  # Signal listening address
    port: 8081  # Signal port
    chaos_scenarios:  # List of policies/chaos scenarios to load
        - pod_disruption_scenarios:
            - scenarios/openshift/etcd.yml

cerberus:
    cerberus_enabled: False  # Enable it when cerberus is previously installed
    cerberus_url:  # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
    check_application_routes: False  # When enabled, looks for application unavailability using the routes specified in the cerberus config and fails the run

performance_monitoring:
    prometheus_url: ''  # The prometheus url/route is automatically obtained in case of OpenShift; set it when the distribution is Kubernetes
    prometheus_bearer_token:  # The bearer token is automatically obtained in case of OpenShift; set it when the distribution is Kubernetes. This is needed to authenticate with prometheus
    uuid:  # uuid for the run is generated by default if not set
    enable_alerts: False  # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
    enable_metrics: False
    alert_profile: config/alerts.yaml  # Path or URL to alert profile with the prometheus queries
    metrics_profile: config/metrics-report.yaml
    check_critical_alerts: False  # When enabled, checks prometheus for critical alerts firing post chaos

elastic:
    enable_elastic: False
    verify_certs: False
    elastic_url: ""  # To track results in elasticsearch, give url to server here; will post telemetry details when url and index are not blank
    elastic_port: 32766
    username: "elastic"
    password: "test"
    metrics_index: "krkn-metrics"
    alerts_index: "krkn-alerts"
    telemetry_index: "krkn-telemetry"

tunings:
    wait_duration: 1  # Duration to wait between each chaos scenario
    iterations: 1  # Number of times to execute the scenarios
    daemon_mode: False  # Iterations are set to infinity, which means that kraken will cause chaos forever

telemetry:
    enabled: False  # enables/disables the telemetry collection feature
    api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production  # telemetry service endpoint
    username: username  # telemetry service username
    password: password  # telemetry service password
    prometheus_backup: True  # enables/disables prometheus data collection
    prometheus_namespace: ""  # namespace where prometheus is deployed (if distribution is kubernetes)
    prometheus_container_name: ""  # name of the prometheus container (if distribution is kubernetes)
    prometheus_pod_name: ""  # name of the prometheus pod (if distribution is kubernetes)
    full_prometheus_backup: False  # if set to False, only the /prometheus/wal folder will be downloaded
    backup_threads: 5  # number of telemetry download/upload threads
    archive_path: /tmp  # local path where the archive files will be temporarily stored
    max_retries: 0  # maximum number of upload retries (if 0 will retry forever)
    run_tag: ''  # if set, this will be appended to the run folder in the bucket (useful to group the runs)
    archive_size: 500000  # the size of the prometheus data archive in KB. The lower the archive size,
                          # the higher the number of archive files produced and uploaded (and processed by
                          # backup_threads simultaneously). For an unstable/slow connection it is better to
                          # keep this value low and increase the number of backup_threads; that way, on upload
                          # failure, the retry happens only on the failed chunk without affecting the whole upload.
    telemetry_group: ''  # if set, archives the telemetry in the S3 bucket in a folder named after the value, otherwise uses "default"
    logs_backup: True
    logs_filter_patterns:
        - "(\\w{3}\\s\\d{1,2}\\s\\d{2}:\\d{2}:\\d{2}\\.\\d+).+"  # Sep 9 11:20:36.123425532
        - "kinit (\\d+/\\d+/\\d+\\s\\d{2}:\\d{2}:\\d{2})\\s+"  # kinit 2023/09/15 11:20:36 log
        - "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+Z).+"  # 2023-09-15T11:20:36.123425532Z log
    oc_cli_path: /usr/bin/oc  # optional; if not specified, will be searched for in $PATH
    events_backup: True  # enables/disables cluster events collection

health_checks:  # Uses health check endpoints to observe application behavior during chaos injection
    interval:  # Interval in seconds to perform health checks, default value is 2 seconds
    config:  # List of health check configurations for applications
        - url:  # Application endpoint
          bearer_token:  # Bearer token for authentication, if any
          auth:  # Authentication credentials (username, password) in tuple format, if any, e.g. ("admin","secretpassword")
          exit_on_failure:  # If True, exits when the health check fails for the application; values can be True/False

kubevirt_checks:  # Uses virt check endpoints to observe ssh reachability of VMIs during chaos injection
    interval: 2  # Interval in seconds to perform virt checks, default value is 2 seconds
    namespace:  # Namespace where to find VMIs
    name:  # Regex-style name of VMIs to watch; optional, will watch all VMI names in the namespace if left blank
    only_failures: False  # Whether to show all VMI failures and successful ssh connections (False), or only failure statuses (True)
    disconnected: False  # How to connect to the VMIs: if True, uses the ip_address to try ssh from within a node; if False, uses the name and virtctl to connect. Default is False
    ssh_node: ""  # If set, a backup way to ssh to a node. Should be set to a node that isn't targeted in chaos
    node_names: ""
    exit_on_failure:  # If True and VMIs are failing post chaos, returns failure; values can be True/False
```
Config File
Config file you used when error was seen (the default used is config/config.yaml)
```yaml
- id: kill-pods
  config:
    namespace_pattern: ^openshift-etcd$
    label_selector: k8s-app=etcd
    krkn_pod_recovery_time: 120
    exclude_label: ""  # excludes pods marked with this label from chaos
```
Expected behavior
On keyboard interrupt, the run should print `^C [WARNING] Signal SIGINT received without complete context, skipping rollback` once, then exit the scenario, output telemetry, and report a failure end status.
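The desired flow could be sketched as follows. This is a hedged illustration, not krkn's actual code: `wait_for_pod_recovery`, `handle_sigint`, and `stop_event` are hypothetical names, and the pod-status check is omitted. The idea is to have the SIGINT handler set an event that the recovery wait polls, so the run wakes up early and can finish cleanly instead of raising KeyboardInterrupt inside a worker thread:

```python
import signal
import threading
import time

stop_event = threading.Event()

def handle_sigint(signum, frame):
    # Log once and mark the run as interrupted instead of letting
    # KeyboardInterrupt unwind through the worker threads.
    print("[WARNING] Signal SIGINT received, finishing run early")
    stop_event.set()

signal.signal(signal.SIGINT, handle_sigint)

def wait_for_pod_recovery(timeout: float, poll: float = 1.0) -> bool:
    """Poll until the timeout elapses or the run is interrupted.

    Returns True if the wait was interrupted."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if stop_event.wait(poll):  # wakes up early when SIGINT arrives
            return True            # interrupted: skip the remaining wait
        # ...pod status would be checked here (omitted in this sketch)...
    return False

# Simulate an operator pressing Ctrl+C half a second into a 120 s wait.
threading.Timer(0.5, stop_event.set).start()
interrupted = wait_for_pod_recovery(timeout=120, poll=0.1)
print("interrupted:", interrupted)  # → interrupted: True
```

With this shape, the caller can emit telemetry and exit with a failure status as soon as `wait_for_pod_recovery` returns `True`, rather than blocking for the full `krkn_pod_recovery_time`.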
Krkn Output
```
% python run_kraken.py
 _              _
| | ___ __ __ _| | _____ _ __
| |/ / '__/ _` | |/ / _ \ '_ \
| <| |  | (_| |   <  __/ | | |
|_|\_\_|  \__,_|_|\_\___|_| |_|
2026-01-29 14:42:46,357 [INFO] Starting kraken
2026-01-29 14:42:46,369 [INFO] Initializing client to talk to the Kubernetes cluster
2026-01-29 14:42:46,369 [INFO] Generated a uuid for the run: 29fae477-1d2c-4f89-8208-c5540edf32de
2026-01-29 14:42:46,574 [INFO] Detected distribution openshift
2026-01-29 14:42:48,865 [INFO] Publishing kraken status at http://0.0.0.0:8081
2026-01-29 14:42:48,884 [INFO] Starting http server at http://0.0.0.0:8081
2026-01-29 14:42:48,885 [INFO] Fetching cluster info
2026-01-29 14:42:49,225 [INFO] 4.20.4
2026-01-29 14:42:49,225 [INFO] Server URL: https://api.***openshift.com:6443
2026-01-29 14:42:49,226 [INFO] Daemon mode not enabled, will run through 1 iterations
2026-01-29 14:42:50,897 [INFO] 📣 `ScenarioPluginFactory`: types from config.yaml mapped to respective classes for execution:
2026-01-29 14:42:50,897 [INFO] ✅ type: application_outages_scenarios ➡️ `ApplicationOutageScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: container_scenarios ➡️ `ContainerScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: hog_scenarios ➡️ `HogsScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: kubevirt_vm_outage ➡️ `KubevirtVmOutageScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: managedcluster_scenarios ➡️ `ManagedClusterScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ types: [pod_network_scenarios, ingress_node_scenarios] ➡️ `NativeScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: network_chaos_scenarios ➡️ `NetworkChaosScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: network_chaos_ng_scenarios ➡️ `NetworkChaosNgScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: node_scenarios ➡️ `NodeActionsScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: pod_disruption_scenarios ➡️ `PodDisruptionScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: pvc_scenarios ➡️ `PvcScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: service_disruption_scenarios ➡️ `ServiceDisruptionScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: service_hijacking_scenarios ➡️ `ServiceHijackingScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: cluster_shut_down_scenarios ➡️ `ShutDownScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: syn_flood_scenarios ➡️ `SynFloodScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: time_scenarios ➡️ `TimeActionsScenarioPlugin`
2026-01-29 14:42:50,897 [INFO] ✅ type: zone_outages_scenarios ➡️ `ZoneOutageScenarioPlugin`
2026-01-29 14:42:50,897 [INFO]
2026-01-29 14:42:50,898 [INFO] health checks config is not defined, skipping them
2026-01-29 14:42:50,898 [INFO] kube virt checks config is not defined, skipping them
2026-01-29 14:42:50,898 [INFO] Executing scenarios for iteration 0
2026-01-29 14:42:50,898 [INFO] connection set up
127.0.0.1 - - [29/Jan/2026 14:42:50] "GET / HTTP/1.1" 200 -
2026-01-29 14:42:50,899 [INFO] response RUN
2026-01-29 14:42:50,901 [INFO] Signal handlers registered globally
2026-01-29 14:42:50,901 [INFO] Running PodDisruptionScenarioPlugin: ['pod_disruption_scenarios'] -> scenarios/openshift/etcd.yml
2026-01-29 14:42:51,136 [INFO] waiting up to 120 seconds for pod recovery, pod label pattern: k8s-app=etcd namespace pattern: ^openshift-etcd$
2026-01-29 14:42:51,493 [INFO] ('etcd-prubenda1111-ck5s7-master-1.c.chaos-438115.internal', 'openshift-etcd')
2026-01-29 14:42:51,493 [INFO] Deleting pod etcd-prubenda1111-ck5s7-master-1.c.chaos-438115.internal
^C2026-01-29 14:43:33,142 [INFO] Performing rollback for signal SIGINT with run_uuid=29fae477-1d2c-4f89-8208-c5540edf32de, scenario_type=pod_disruption_scenarios
2026-01-29 14:43:33,143 [WARNING] Skip execution for run_uuid=29fae477-1d2c-4f89-8208-c5540edf32de, scenario_type=pod_disruption_scenarios
2026-01-29 14:43:33,143 [INFO] Calling original handler for SIGINT
Traceback (most recent call last):
  File "/Users/prubenda/Github/kraken/run_kraken.py", line 717, in <module>
    retval = main(options, command)
  File "/Users/prubenda/Github/kraken/run_kraken.py", line 367, in main
    scenario_plugin.run_scenarios(
  File "/Users/prubenda/Github/kraken/krkn/scenario_plugins/abstract_scenario_plugin.py", line 104, in run_scenarios
    return_value = self.run(
  File "/Users/prubenda/Github/kraken/krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py", line 65, in run
    snapshot = future_snapshot.result()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 440, in result
    self._condition.wait(timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
  File "/Users/prubenda/Github/kraken/krkn/rollback/signal.py", line 64, in _signal_handler
    original_handler(signum, frame)
KeyboardInterrupt
^C2026-01-29 14:43:33,350 [WARNING] Signal SIGINT received without complete context, skipping rollback.
^C2026-01-29 14:43:33,549 [WARNING] Signal SIGINT received without complete context, skipping rollback.
```
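The traceback shows the rollback signal handler calling the original SIGINT handler, which re-raises `KeyboardInterrupt` while the plugin is blocked in `future_snapshot.result()`, where nothing catches it. A minimal sketch of the clean-exit alternative (hypothetical `snapshot_task`, not krkn's actual plugin code): catch the interrupt around the blocking wait, cancel the outstanding work, and return a failure status instead of crashing:

```python
import concurrent.futures
import time

def snapshot_task():
    # Stand-in for the pod-recovery snapshot work; in krkn this waits up to
    # krkn_pod_recovery_time seconds for the deleted pod to come back.
    time.sleep(0.2)
    return "recovered"

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(snapshot_task)

try:
    result = future.result()  # a re-raised SIGINT lands here as KeyboardInterrupt
    status = "success"
except KeyboardInterrupt:
    future.cancel()                # best effort; an already-running task keeps going
    executor.shutdown(wait=False)  # don't block on remaining work
    status = "failure"             # report a failure end status instead of a traceback
print("status:", status)
```

Note that `Future.cancel()` cannot stop a task that has already started, so the worker itself would still need a cooperative stop check (as in the event-based sketch above the output) to actually cut the recovery wait short.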
Additional context
Rollback functionality can't be added for this scenario, as the pod should come back on its own.