[BUG] Pod scenarios (not configured for rollback) stuck #1137

@paigerube14

Bug Description

During pod scenarios, interrupting the run while it waits for pod recovery should stop the run and exit cleanly with a warning, completing the run at that point rather than waiting out the full expected recovery time.

To Reproduce

1. Run python run_kraken.py
2. Press Ctrl+C to stop the run while it is waiting for pod recovery
3. The error below is seen

Config File

Config file used when the error was seen (the default is config/config.yaml)

kraken:
    kubeconfig_path: ~/.kube/config                     # Path to kubeconfig
    exit_on_failure: False                                 # Exit when a post action scenario fails
    auto_rollback: True                                    # Enable auto rollback for scenarios.
    rollback_versions_directory: /tmp/kraken-rollback      # Directory to store rollback version files.
    publish_kraken_status: True                            # Can be accessed at http://0.0.0.0:8081
    signal_state: RUN                                      # Will wait for the RUN signal when set to PAUSE before running the scenarios, refer docs/signal.md for more details
    signal_address: 0.0.0.0                                # Signal listening address
    port: 8081                                             # Signal port
    chaos_scenarios:
       # List of policies/chaos scenarios to load
       - pod_disruption_scenarios:
           - scenarios/openshift/etcd.yml

cerberus:
    cerberus_enabled: False                                # Enable it when cerberus is previously installed
    cerberus_url:                                          # When cerberus_enabled is set to True, provide the url where cerberus publishes go/no-go signal
    check_application_routes: False                         # When enabled will look for application unavailability using the routes specified in the cerberus config and fails the run

performance_monitoring:
    prometheus_url: ''                                    # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
    prometheus_bearer_token:                              # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
    uuid:                                                 # uuid for the run is generated by default if not set
    enable_alerts: False                                  # Runs the queries specified in the alert profile and displays the info or exits 1 when severity=error
    enable_metrics: False
    alert_profile: config/alerts.yaml                          # Path or URL to alert profile with the prometheus queries
    metrics_profile: config/metrics-report.yaml
    check_critical_alerts: False                          # When enabled will check prometheus for critical alerts firing post chaos
elastic:
    enable_elastic: False
    verify_certs: False
    elastic_url: ""                                         # To track results in elasticsearch, give url to server here; will post telemetry details when url and index not blank
    elastic_port: 32766
    username: "elastic"
    password: "test"
    metrics_index: "krkn-metrics"
    alerts_index: "krkn-alerts"
    telemetry_index: "krkn-telemetry"

tunings:
    wait_duration: 1                                      # Duration to wait between each chaos scenario
    iterations: 1                                          # Number of times to execute the scenarios
    daemon_mode: False                                     # Iterations are set to infinity which means that the kraken will cause chaos forever
telemetry:
    enabled: False                                           # enable/disables the telemetry collection feature
    api_url: https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production #telemetry service endpoint
    username: username                                      # telemetry service username
    password: password                                    # telemetry service password
    prometheus_backup: True                                 # enables/disables prometheus data collection
    prometheus_namespace: ""                                # namespace where prometheus is deployed (if distribution is kubernetes)
    prometheus_container_name: ""                           # name of the prometheus container name (if distribution is kubernetes)
    prometheus_pod_name: ""                                 # name of the prometheus pod (if distribution is kubernetes)
    full_prometheus_backup: False                           # if is set to False only the /prometheus/wal folder will be downloaded.
    backup_threads: 5                                       # number of telemetry download/upload threads
    archive_path: /tmp                                      # local path where the archive files will be temporarily stored
    max_retries: 0                                          # maximum number of upload retries (if 0 will retry forever)
    run_tag: ''                                             # if set, this will be appended to the run folder in the bucket (useful to group the runs)
    archive_size: 500000                                    # the size of each prometheus data archive chunk in KB. The lower the archive size,
                                                            # the higher the number of archive files produced and uploaded (and processed by backup_threads
                                                            # simultaneously).
                                                            # For unstable/slow connections it is better to keep this value low
                                                            # and increase the number of backup_threads; on upload failure the retry then happens only on the
                                                            # failed chunk without affecting the whole upload.
    telemetry_group: ''                                     # if set will archive the telemetry in the S3 bucket in a folder named after the value, otherwise will use "default"
    logs_backup: True
    logs_filter_patterns:
     - "(\\w{3}\\s\\d{1,2}\\s\\d{2}:\\d{2}:\\d{2}\\.\\d+).+"         # Sep 9 11:20:36.123425532
     - "kinit (\\d+/\\d+/\\d+\\s\\d{2}:\\d{2}:\\d{2})\\s+"          # kinit 2023/09/15 11:20:36 log
     - "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d+Z).+"      # 2023-09-15T11:20:36.123425532Z log
    oc_cli_path: /usr/bin/oc                                # optional, if not specified it will be searched for in $PATH
    events_backup: True                                     # enables/disables cluster events collection

health_checks:                                              # Utilizing health check endpoints to observe application behavior during chaos injection.
    interval:                                               # Interval in seconds to perform health checks, default value is 2 seconds
    config:                                                 # Provide list of health check configurations for applications
        - url:                                              # Provide application endpoint
          bearer_token:                                     # Bearer token for authentication if any
          auth:                                             # Provide authentication credentials (username , password) in tuple format if any, ex:("admin","secretpassword")
          exit_on_failure:                                  # If value is True exits when health check failed for application, values can be True/False

kubevirt_checks:                                            # Utilizing virt check endpoints to observe ssh ability to VMI's during chaos injection.
    interval: 2                                             # Interval in seconds to perform virt checks, default value is 2 seconds
    namespace:                                              # Namespace where to find VMI's
    name:                                                   # Regex Name style of VMI's to watch, optional, will watch all VMI names in the namespace if left blank
    only_failures: False                                    # Boolean; when False, show all VMI ssh results (failures and successes); when True, show only failures
    disconnected: False                                     # Boolean controlling how to connect to the VMIs; if True, uses the ip_address to ssh from within a node; if False, uses the name and virtctl to connect. Default is False
    ssh_node: ""                                            # If set, will be a backup way to ssh to a node. Will want to set to a node that isn't targeted in chaos
    node_names: ""
    exit_on_failure:                                        # If value is True and VMI's are failing post chaos returns failure, values can be True/False

Scenario File

Scenario file(s) specified in the config file (can be starred (*) for confidential information)

- id: kill-pods
  config:
    namespace_pattern: ^openshift-etcd$
    label_selector: k8s-app=etcd
    krkn_pod_recovery_time: 120
    exclude_label: "" # excludes pods marked with this label from chaos

Expected behavior

On keyboard interrupt, the run should print "^C [WARNING] Signal SIGINT received without complete context, skipping rollback" once, then exit the scenario, output telemetry, and report a failure end status.
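A minimal sketch of the expected behavior, assuming a stop-event approach: the SIGINT handler sets a flag instead of re-raising, and the recovery wait checks that flag so it can abandon the wait with a warning. Names here (wait_for_pod_recovery, stop_event, the recovered callable) are illustrative, not krkn's actual API.

```python
# Hypothetical sketch: exit a pod-recovery wait early on SIGINT instead of
# letting KeyboardInterrupt propagate out of a blocking wait.
import signal
import threading

stop_event = threading.Event()

def _sigint_handler(signum, frame):
    # Mark the run as interrupted; the wait loop below notices and exits.
    stop_event.set()

signal.signal(signal.SIGINT, _sigint_handler)

def wait_for_pod_recovery(recovered, timeout_s, poll_s=0.1):
    """Return True if recovery completed, False if interrupted or timed out.

    `recovered` is a zero-argument callable standing in for the real
    pod-readiness check.
    """
    waited = 0.0
    while waited < timeout_s:
        if stop_event.is_set():
            print("[WARNING] SIGINT received, abandoning recovery wait")
            return False
        if recovered():
            return True
        stop_event.wait(poll_s)  # sleeps, but wakes immediately when set
        waited += poll_s
    return False
```

With this shape the caller sees a plain False on interrupt and can still run telemetry output and set the failure end status before exiting.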

Krkn Output

% python run_kraken.py 
 _              _              
| | ___ __ __ _| | _____ _ __  
| |/ / '__/ _` | |/ / _ \ '_ \ 
|   <| | | (_| |   <  __/ | | |
|_|\_\_|  \__,_|_|\_\___|_| |_|
                               

2026-01-29 14:42:46,357 [INFO] Starting kraken
2026-01-29 14:42:46,369 [INFO] Initializing client to talk to the Kubernetes cluster
2026-01-29 14:42:46,369 [INFO] Generated a uuid for the run: 29fae477-1d2c-4f89-8208-c5540edf32de
2026-01-29 14:42:46,574 [INFO] Detected distribution openshift
2026-01-29 14:42:48,865 [INFO] Publishing kraken status at http://0.0.0.0:8081
2026-01-29 14:42:48,884 [INFO] Starting http server at http://0.0.0.0:8081

2026-01-29 14:42:48,885 [INFO] Fetching cluster info
2026-01-29 14:42:49,225 [INFO] 4.20.4
2026-01-29 14:42:49,225 [INFO] Server URL: https://api.***openshift.com:6443
2026-01-29 14:42:49,226 [INFO] Daemon mode not enabled, will run through 1 iterations

2026-01-29 14:42:50,897 [INFO] 📣 `ScenarioPluginFactory`: types from config.yaml mapped to respective classes for execution:
2026-01-29 14:42:50,897 [INFO]   ✅ type: application_outages_scenarios ➡️ `ApplicationOutageScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: container_scenarios ➡️ `ContainerScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: hog_scenarios ➡️ `HogsScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: kubevirt_vm_outage ➡️ `KubevirtVmOutageScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: managedcluster_scenarios ➡️ `ManagedClusterScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ types: [pod_network_scenarios, ingress_node_scenarios] ➡️ `NativeScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: network_chaos_scenarios ➡️ `NetworkChaosScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: network_chaos_ng_scenarios ➡️ `NetworkChaosNgScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: node_scenarios ➡️ `NodeActionsScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: pod_disruption_scenarios ➡️ `PodDisruptionScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: pvc_scenarios ➡️ `PvcScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: service_disruption_scenarios ➡️ `ServiceDisruptionScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: service_hijacking_scenarios ➡️ `ServiceHijackingScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: cluster_shut_down_scenarios ➡️ `ShutDownScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: syn_flood_scenarios ➡️ `SynFloodScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: time_scenarios ➡️ `TimeActionsScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO]   ✅ type: zone_outages_scenarios ➡️ `ZoneOutageScenarioPlugin` 
2026-01-29 14:42:50,897 [INFO] 

2026-01-29 14:42:50,898 [INFO] health checks config is not defined, skipping them
2026-01-29 14:42:50,898 [INFO] kube virt checks config is not defined, skipping them
2026-01-29 14:42:50,898 [INFO] Executing scenarios for iteration 0
2026-01-29 14:42:50,898 [INFO] connection set up
127.0.0.1 - - [29/Jan/2026 14:42:50] "GET / HTTP/1.1" 200 -
2026-01-29 14:42:50,899 [INFO] response RUN
2026-01-29 14:42:50,901 [INFO] Signal handlers registered globally
2026-01-29 14:42:50,901 [INFO] Running PodDisruptionScenarioPlugin: ['pod_disruption_scenarios'] -> scenarios/openshift/etcd.yml
2026-01-29 14:42:51,136 [INFO] waiting up to 120 seconds for pod recovery, pod label pattern: k8s-app=etcd namespace pattern: ^openshift-etcd$
2026-01-29 14:42:51,493 [INFO] ('etcd-prubenda1111-ck5s7-master-1.c.chaos-438115.internal', 'openshift-etcd')
2026-01-29 14:42:51,493 [INFO] Deleting pod etcd-prubenda1111-ck5s7-master-1.c.chaos-438115.internal
^C2026-01-29 14:43:33,142 [INFO] Performing rollback for signal SIGINT with run_uuid=29fae477-1d2c-4f89-8208-c5540edf32de, scenario_type=pod_disruption_scenarios
2026-01-29 14:43:33,143 [WARNING] Skip execution for run_uuid=29fae477-1d2c-4f89-8208-c5540edf32de, scenario_type=pod_disruption_scenarios
2026-01-29 14:43:33,143 [INFO] Calling original handler for SIGINT
Traceback (most recent call last):
  File "/Users/prubenda/Github/kraken/run_kraken.py", line 717, in <module>
    retval = main(options, command)
  File "/Users/prubenda/Github/kraken/run_kraken.py", line 367, in main
    scenario_plugin.run_scenarios(
  File "/Users/prubenda/Github/kraken/krkn/scenario_plugins/abstract_scenario_plugin.py", line 104, in run_scenarios
    return_value = self.run(
  File "/Users/prubenda/Github/kraken/krkn/scenario_plugins/pod_disruption/pod_disruption_scenario_plugin.py", line 65, in run
    snapshot = future_snapshot.result()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 440, in result
    self._condition.wait(timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
  File "/Users/prubenda/Github/kraken/krkn/rollback/signal.py", line 64, in _signal_handler
    original_handler(signum, frame)
KeyboardInterrupt
^C2026-01-29 14:43:33,350 [WARNING] Signal SIGINT received without complete context, skipping rollback.
^C2026-01-29 14:43:33,549 [WARNING] Signal SIGINT received without complete context, skipping rollback.

Additional context

The rollback functionality can't be added for this scenario as the pod should come back on its own
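The traceback shows the interrupt landing inside a blocking future_snapshot.result() call. One possible shape for the fix, sketched here with hypothetical names (result_or_interrupt, interrupted), is to poll the future with a short timeout so an interrupt flag set by the signal handler can cut the wait short; this is an assumption about the fix, not krkn's current implementation.

```python
# Hypothetical sketch: poll a Future instead of blocking indefinitely in
# result(), so a SIGINT-set flag can end the wait between polls.
import concurrent.futures
import threading

interrupted = threading.Event()

def result_or_interrupt(future, poll_s=0.5):
    """Return the future's result, or None if the run was interrupted."""
    while True:
        try:
            return future.result(timeout=poll_s)
        except concurrent.futures.TimeoutError:
            if interrupted.is_set():
                future.cancel()  # best effort; a running task keeps running
                return None
```

On interrupt the scenario plugin would then get None back, log the warning once, and fall through to telemetry output and a failure end status instead of raising KeyboardInterrupt from inside threading.Condition.wait.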

Labels

bug