From 903e948464a9447654973dfc8120ec6cc360754b Mon Sep 17 00:00:00 2001 From: JakeSCahill Date: Mon, 27 Jan 2025 11:27:13 +0000 Subject: [PATCH 1/4] DOC-975 crash_loop_sleep_sec broker config --- .../configure-availability.adoc | 4 +++- .../pages/properties/broker-properties.adoc | 22 ++++++++++++++++++- .../partials/errors-and-solutions.adoc | 17 ++++++++++++-- 3 files changed, 39 insertions(+), 4 deletions(-) diff --git a/modules/manage/pages/cluster-maintenance/configure-availability.adoc b/modules/manage/pages/cluster-maintenance/configure-availability.adoc index f90d51cd9c..7aa053ed38 100644 --- a/modules/manage/pages/cluster-maintenance/configure-availability.adoc +++ b/modules/manage/pages/cluster-maintenance/configure-availability.adoc @@ -46,7 +46,7 @@ See also: xref:develop:produce-data/configure-producers.adoc[Configure Producers A Redpanda broker may create log segments at startup. If a broker crashes after startup, and if it gets stuck in a crash loop, it could produce progressively more stored state that uses more disk space and takes more time for each restart to process. -To prevent infinite crash loops, the Redpanda node property xref:reference:node-properties.adoc#crash_loop_limit[`crash_loop_limit`] sets an upper limit on the number of consecutive crashes that can happen within one hour of each other. After it reaches the limit, a broker cannot restart until its internal consecutive crash counter is reset to zero by one of the following conditions: +To prevent infinite crash loops, the Redpanda broker property xref:reference:node-properties.adoc#crash_loop_limit[`crash_loop_limit`] sets an upper limit on the number of consecutive crashes that can happen within one hour of each other. After it reaches the limit, a broker cannot restart until its internal consecutive crash counter is reset to zero by one of the following conditions: * The `redpanda.yaml` configuration file is updated. 
* The `startup_log` file in the broker's xref:reference:node-properties.adoc#data_directory[data_directory] is manually deleted. @@ -58,3 +58,5 @@ To prevent infinite crash loops, the Redpanda node property xref:reference:node- * The `crash_loop_limit` property is disabled by default. You must manually enable it by setting it to a non-zero value. * If the limit is less than two, the broker is blocked from restarting after every crash, until one of the reset conditions is met. ==== + +To facilitate debugging in environments where a broker is stuck in a crash loop, set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec` configuration]. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. The window during which the broker remains available allows you to troubleshoot the issue. This setting is mostly useful when xref:troubleshoot:errors-solutions/k-resolve-errors.adoc[troubleshooting in Kubernetes environments]. diff --git a/modules/reference/pages/properties/broker-properties.adoc b/modules/reference/pages/properties/broker-properties.adoc index 5497a7ad21..3a3ab6c4ba 100644 --- a/modules/reference/pages/properties/broker-properties.adoc +++ b/modules/reference/pages/properties/broker-properties.adoc @@ -53,7 +53,7 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo * The broker shuts down cleanly. * One hour passes since the last crash. * The `redpanda.yaml` broker configuration file is updated. -* The `startup_log` file in the broker's <> is manually deleted. +* The `startup_log` file in the broker's <> is manually deleted. *Unit*: number of consecutive crashes of a broker @@ -67,6 +67,26 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo --- +=== crash_loop_sleep_sec + +The amount of time the broker sleeps before terminating when the limit on consecutive broker crashes (<>) is reached. 
This property provides a debugging window for you to access the broker before it terminates, particularly useful in Kubernetes environments. + +If `null`, the property is disabled, and the broker terminates immediately after reaching the crash loop limit. + +This setting does not affect the reset of the crash-tracking logic. The conditions for resetting remain the same as those specified for `crash_loop_limit`. + +*Unit:* seconds + +*Visibility:* `user` + +*Type:* integer or null + +*Accepted values:* [`0`, `4294967295`] or `null` + +*Default:* `null` + +--- + === data_directory Path to the directory for storing Redpanda's streaming data files. diff --git a/modules/troubleshoot/partials/errors-and-solutions.adoc b/modules/troubleshoot/partials/errors-and-solutions.adoc index 4ae8a3e078..62bbdb1435 100644 --- a/modules/troubleshoot/partials/errors-and-solutions.adoc +++ b/modules/troubleshoot/partials/errors-and-solutions.adoc @@ -397,9 +397,22 @@ endif::[] ifdef::env-kubernetes[] === Crash loop backoffs -If a broker crashes after startup, or gets stuck in a crash loop, it could produce progressively more stored state that uses additional disk space and takes more time for each restart to process. +If a broker crashes after startup, or gets stuck in a crash loop, it can accumulate an increasing amount of stored state. This accumulated state not only consumes additional disk space but also prolongs the time required for each subsequent restart to process it. -To prevent infinite crash loops, the Redpanda Helm chart sets the `crash_loop_limit` node property to 5. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. After Redpanda reaches this limit, it will not start until its internal consecutive crash counter is reset to zero. In Kubernetes, the Pod running Redpanda remains in a `CrashLoopBackoff` state until its internal consecutive crash counter is reset to zero. 
+To prevent infinite crash loops, the Redpanda Helm chart sets the xref:reference:properties/broker-properties.adoc#crash_loop_limit[`crash_loop_limit`] broker configuration property to `5`. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. By default, the broker terminates immediately after hitting the `crash_loop_limit`. The Pod running Redpanda remains in a `CrashLoopBackoff` state until its internal consecutive crash counter is reset to zero. + +To facilitate debugging in environments where a broker is stuck in a crash loop, you can also set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec` configuration] configuration. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. By providing a window during which the Pod remains in a paused state, you can SSH into the Pod and troubleshoot the issue. + +Example configuration: + +```yaml +config: + node: + crash_loop_limit: 5 + crash_loop_sleep_sec: 60 +``` + +In this example, when the broker hits the `crash_loop_limit` of 5, it will sleep for 60 seconds before terminating the process. This delay allows administrators to access the Pod and troubleshoot. 
To troubleshoot a crash loop backoff: From a4d4064df9b2d25e262942bba4864d11c85bd6e1 Mon Sep 17 00:00:00 2001 From: JakeSCahill Date: Mon, 27 Jan 2025 11:39:14 +0000 Subject: [PATCH 2/4] Add to what's new --- modules/get-started/pages/whats-new.adoc | 8 +++++++- modules/reference/pages/properties/broker-properties.adoc | 2 ++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/modules/get-started/pages/whats-new.adoc b/modules/get-started/pages/whats-new.adoc index a3fd440b1c..5cabe2b36d 100644 --- a/modules/get-started/pages/whats-new.adoc +++ b/modules/get-started/pages/whats-new.adoc @@ -112,7 +112,7 @@ The following `rpk` commands are new in this version: * xref:reference:rpk/rpk-cluster/rpk-cluster-storage-status-mount.adoc[`rpk cluster storage status mount`] * xref:reference:rpk/rpk-cluster/rpk-cluster-storage-unmount.adoc[`rpk cluster storage unmount`] -== New properties +== New cluster properties The following cluster properties are new in this version: @@ -133,3 +133,9 @@ The following cluster properties are new in this version: * xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_request_timeout_ms[`iceberg_rest_catalog_request_timeout_ms`] * xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_token[`iceberg_rest_catalog_token`] * xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_trust_file[`iceberg_rest_catalog_trust_file`] + +== New broker properties + +The following broker properties are new in this version: + +- xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`] diff --git a/modules/reference/pages/properties/broker-properties.adoc b/modules/reference/pages/properties/broker-properties.adoc index 3a3ab6c4ba..ed3969686b 100644 --- a/modules/reference/pages/properties/broker-properties.adoc +++ b/modules/reference/pages/properties/broker-properties.adoc @@ -69,6 +69,8 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of 
the fo === crash_loop_sleep_sec +*Introduced in v24.3.4* + The amount of time the broker sleeps before terminating when the limit on consecutive broker crashes (<>) is reached. This property provides a debugging window for you to access the broker before it terminates, particularly useful in Kubernetes environments. If `null`, the property is disabled, and the broker terminates immediately after reaching the crash loop limit. From 8502727a92130e299efa5d814ec78e648d73c9e2 Mon Sep 17 00:00:00 2001 From: Jake Cahill <45230295+JakeSCahill@users.noreply.github.com> Date: Mon, 27 Jan 2025 17:03:01 +0000 Subject: [PATCH 3/4] Apply suggestions from code review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Gellért Peresztegi-Nagy --- .../pages/cluster-maintenance/configure-availability.adoc | 2 +- modules/reference/pages/properties/broker-properties.adoc | 2 +- modules/troubleshoot/partials/errors-and-solutions.adoc | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/modules/manage/pages/cluster-maintenance/configure-availability.adoc b/modules/manage/pages/cluster-maintenance/configure-availability.adoc index 7aa053ed38..b0d6da05de 100644 --- a/modules/manage/pages/cluster-maintenance/configure-availability.adoc +++ b/modules/manage/pages/cluster-maintenance/configure-availability.adoc @@ -59,4 +59,4 @@ To prevent infinite crash loops, the Redpanda broker property xref:reference:nod * If the limit is less than two, the broker is blocked from restarting after every crash, until one of the reset conditions is met. ==== -To facilitate debugging in environments where a broker is stuck in a crash loop, set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec` configuration]. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. 
The window during which the broker remains available allows you to troubleshoot the issue. This setting is mostly useful when xref:troubleshoot:errors-solutions/k-resolve-errors.adoc[troubleshooting in Kubernetes environments]. +To facilitate debugging in environments where a broker is stuck in a crash loop, set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec` configuration]. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. The window during which the broker remains available allows you to troubleshoot the issue. This setting is most useful when xref:troubleshoot:errors-solutions/k-resolve-errors.adoc[troubleshooting in Kubernetes environments]. diff --git a/modules/reference/pages/properties/broker-properties.adoc b/modules/reference/pages/properties/broker-properties.adoc index ed3969686b..1141cb9824 100644 --- a/modules/reference/pages/properties/broker-properties.adoc +++ b/modules/reference/pages/properties/broker-properties.adoc @@ -75,7 +75,7 @@ The amount of time the broker sleeps before terminating when the limit on consec If `null`, the property is disabled, and the broker terminates immediately after reaching the crash loop limit. -This setting does not affect the reset of the crash-tracking logic. The conditions for resetting remain the same as those specified for `crash_loop_limit`. +For information about how to reset the crash loop limit, see `crash_loop_limit`. 
*Unit:* seconds diff --git a/modules/troubleshoot/partials/errors-and-solutions.adoc b/modules/troubleshoot/partials/errors-and-solutions.adoc index 62bbdb1435..f161e184e1 100644 --- a/modules/troubleshoot/partials/errors-and-solutions.adoc +++ b/modules/troubleshoot/partials/errors-and-solutions.adoc @@ -401,7 +401,7 @@ If a broker crashes after startup, or gets stuck in a crash loop, it can accumul To prevent infinite crash loops, the Redpanda Helm chart sets the xref:reference:properties/broker-properties.adoc#crash_loop_limit[`crash_loop_limit`] broker configuration property to `5`. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. By default, the broker terminates immediately after hitting the `crash_loop_limit`. The Pod running Redpanda remains in a `CrashLoopBackoff` state until its internal consecutive crash counter is reset to zero. -To facilitate debugging in environments where a broker is stuck in a crash loop, you can also set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec` configuration] configuration. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. By providing a window during which the Pod remains in a paused state, you can SSH into the Pod and troubleshoot the issue. +To facilitate debugging in environments where a broker is stuck in a crash loop, you can also set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`] broker configuration property. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. By providing a window during which the Pod remains available, you can SSH into it and troubleshoot the issue. 
Example configuration: From e55a8ceaa6fa9eaebfee6236f16e50ab66804450 Mon Sep 17 00:00:00 2001 From: Jake Cahill <45230295+JakeSCahill@users.noreply.github.com> Date: Mon, 27 Jan 2025 19:52:25 +0000 Subject: [PATCH 4/4] Apply suggestions from code review Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com> --- modules/reference/pages/properties/broker-properties.adoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/modules/reference/pages/properties/broker-properties.adoc b/modules/reference/pages/properties/broker-properties.adoc index 1141cb9824..9263e251f2 100644 --- a/modules/reference/pages/properties/broker-properties.adoc +++ b/modules/reference/pages/properties/broker-properties.adoc @@ -71,7 +71,7 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo *Introduced in v24.3.4* -The amount of time the broker sleeps before terminating when the limit on consecutive broker crashes (<>) is reached. This property provides a debugging window for you to access the broker before it terminates, particularly useful in Kubernetes environments. +The amount of time the broker sleeps before terminating when the limit on consecutive broker crashes (<>) is reached. This property provides a debugging window for you to access the broker before it terminates, and is particularly useful in Kubernetes environments. If `null`, the property is disabled, and the broker terminates immediately after reaching the crash loop limit.
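
For deployments outside Kubernetes, the same two settings could be sketched in `redpanda.yaml` as shown here. The patch itself only shows the Helm `config.node` form, so the top-level `redpanda:` key is an assumption about where broker properties live, and the values `5` and `60` are copied from the Helm example, not recommendations.

```yaml
# Sketch only: assumes broker properties are set under the top-level
# `redpanda:` key of redpanda.yaml. Values mirror the Helm example in this patch.
redpanda:
  crash_loop_limit: 5        # a broker that reaches 5 consecutive crashes cannot restart until the counter resets
  crash_loop_sleep_sec: 60   # sleep 60 seconds before terminating once the limit is reached
```

According to the reset conditions listed for `crash_loop_limit`, updating `redpanda.yaml` itself resets the consecutive crash counter, so editing these values on a broker that is already blocked also clears the block.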