diff --git a/modules/get-started/pages/whats-new.adoc b/modules/get-started/pages/whats-new.adoc
index a3fd440b1c..5cabe2b36d 100644
--- a/modules/get-started/pages/whats-new.adoc
+++ b/modules/get-started/pages/whats-new.adoc
@@ -112,7 +112,7 @@ The following `rpk` commands are new in this version:
 * xref:reference:rpk/rpk-cluster/rpk-cluster-storage-status-mount.adoc[`rpk cluster storage status mount`]
 * xref:reference:rpk/rpk-cluster/rpk-cluster-storage-unmount.adoc[`rpk cluster storage unmount`]
 
-== New properties
+== New cluster properties
 
 The following cluster properties are new in this version:
 
@@ -133,3 +133,9 @@ The following cluster properties are new in this version:
 * xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_request_timeout_ms[`iceberg_rest_catalog_request_timeout_ms`]
 * xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_token[`iceberg_rest_catalog_token`]
 * xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_trust_file[`iceberg_rest_catalog_trust_file`]
+
+== New broker properties
+
+The following broker properties are new in this version:
+
+- xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`]
diff --git a/modules/manage/pages/cluster-maintenance/configure-availability.adoc b/modules/manage/pages/cluster-maintenance/configure-availability.adoc
index f90d51cd9c..b0d6da05de 100644
--- a/modules/manage/pages/cluster-maintenance/configure-availability.adoc
+++ b/modules/manage/pages/cluster-maintenance/configure-availability.adoc
@@ -46,7 +46,7 @@ See also: xref:develop:produce-data/configure-producers.adoc[Configure Producers
 
 A Redpanda broker may create log segments at startup. If a broker crashes after startup, and if it gets stuck in a crash loop, it could produce progressively more stored state that uses more disk space and takes more time for each restart to process.
 
-To prevent infinite crash loops, the Redpanda node property xref:reference:node-properties.adoc#crash_loop_limit[`crash_loop_limit`] sets an upper limit on the number of consecutive crashes that can happen within one hour of each other. After it reaches the limit, a broker cannot restart until its internal consecutive crash counter is reset to zero by one of the following conditions:
+To prevent infinite crash loops, the Redpanda broker property xref:reference:node-properties.adoc#crash_loop_limit[`crash_loop_limit`] sets an upper limit on the number of consecutive crashes that can happen within one hour of each other. After it reaches the limit, a broker cannot restart until its internal consecutive crash counter is reset to zero by one of the following conditions:
 
 * The `redpanda.yaml` configuration file is updated.
 * The `startup_log` file in the broker's xref:reference:node-properties.adoc#data_directory[data_directory] is manually deleted.
@@ -58,3 +58,5 @@ To prevent infinite crash loops, the Redpanda node property xref:reference:node-
 * The `crash_loop_limit` property is disabled by default. You must manually enable it by setting it to a non-zero value.
 * If the limit is less than two, the broker is blocked from restarting after every crash, until one of the reset conditions is met.
 ====
+
+To facilitate debugging in environments where a broker is stuck in a crash loop, set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec` configuration]. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. The window during which the broker remains available allows you to troubleshoot the issue. This setting is most useful when xref:troubleshoot:errors-solutions/k-resolve-errors.adoc[troubleshooting in Kubernetes environments].
diff --git a/modules/reference/pages/properties/broker-properties.adoc b/modules/reference/pages/properties/broker-properties.adoc
index 5497a7ad21..9263e251f2 100644
--- a/modules/reference/pages/properties/broker-properties.adoc
+++ b/modules/reference/pages/properties/broker-properties.adoc
@@ -53,7 +53,7 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo
 * The broker shuts down cleanly.
 * One hour passes since the last crash.
 * The `redpanda.yaml` broker configuration file is updated.
-* The `startup_log` file in the broker's <> is manually deleted.
+* The `startup_log` file in the broker's <> is manually deleted.
 
 *Unit*: number of consecutive crashes of a broker
 
@@ -67,6 +67,28 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo
 
 ---
 
+=== crash_loop_sleep_sec
+
+*Introduced in v24.3.4*
+
+The amount of time the broker sleeps before terminating when the limit on consecutive broker crashes (<>) is reached. This property provides a debugging window for you to access the broker before it terminates, and is particularly useful in Kubernetes environments.
+
+If `null`, the property is disabled, and the broker terminates immediately after reaching the crash loop limit.
+
+For information about how to reset the crash loop limit, see `crash_loop_limit`.
+
+*Unit:* seconds
+
+*Visibility:* `user`
+
+*Type:* integer or null
+
+*Accepted values:* [`0`, `4294967295`] or `null`
+
+*Default:* `null`
+
+---
+
 === data_directory
 
 Path to the directory for storing Redpanda's streaming data files.
diff --git a/modules/troubleshoot/partials/errors-and-solutions.adoc b/modules/troubleshoot/partials/errors-and-solutions.adoc
index 4ae8a3e078..f161e184e1 100644
--- a/modules/troubleshoot/partials/errors-and-solutions.adoc
+++ b/modules/troubleshoot/partials/errors-and-solutions.adoc
@@ -397,9 +397,22 @@ endif::[]
 ifdef::env-kubernetes[]
 === Crash loop backoffs
 
-If a broker crashes after startup, or gets stuck in a crash loop, it could produce progressively more stored state that uses additional disk space and takes more time for each restart to process.
+If a broker crashes after startup, or gets stuck in a crash loop, it can accumulate an increasing amount of stored state. This accumulated state not only consumes additional disk space but also prolongs the time required for each subsequent restart to process it.
 
-To prevent infinite crash loops, the Redpanda Helm chart sets the `crash_loop_limit` node property to 5. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. After Redpanda reaches this limit, it will not start until its internal consecutive crash counter is reset to zero. In Kubernetes, the Pod running Redpanda remains in a `CrashLoopBackoff` state until its internal consecutive crash counter is reset to zero.
+To prevent infinite crash loops, the Redpanda Helm chart sets the xref:reference:properties/broker-properties.adoc#crash_loop_limit[`crash_loop_limit`] broker configuration property to `5`. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. By default, the broker terminates immediately after hitting the `crash_loop_limit`. The Pod running Redpanda remains in a `CrashLoopBackoff` state until its internal consecutive crash counter is reset to zero.
+
+To facilitate debugging in environments where a broker is stuck in a crash loop, you can also set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`] broker configuration property. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. By providing a window during which the Pod remains available, you can SSH into it and troubleshoot the issue.
+
+Example configuration:
+
+```yaml
+config:
+  node:
+    crash_loop_limit: 5
+    crash_loop_sleep_sec: 60
+```
+
+In this example, when the broker hits the `crash_loop_limit` of 5, it will sleep for 60 seconds before terminating the process. This delay allows administrators to access the Pod and troubleshoot.
 
 To troubleshoot a crash loop backoff: