8 changes: 7 additions & 1 deletion modules/get-started/pages/whats-new.adoc
@@ -112,7 +112,7 @@ The following `rpk` commands are new in this version:
* xref:reference:rpk/rpk-cluster/rpk-cluster-storage-status-mount.adoc[`rpk cluster storage status mount`]
* xref:reference:rpk/rpk-cluster/rpk-cluster-storage-unmount.adoc[`rpk cluster storage unmount`]

== New properties
== New cluster properties

The following cluster properties are new in this version:

@@ -133,3 +133,9 @@ The following cluster properties are new in this version:
* xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_request_timeout_ms[`iceberg_rest_catalog_request_timeout_ms`]
* xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_token[`iceberg_rest_catalog_token`]
* xref:reference:properties/cluster-properties.adoc#iceberg_rest_catalog_trust_file[`iceberg_rest_catalog_trust_file`]

== New broker properties

The following broker properties are new in this version:

* xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`]
@@ -46,7 +46,7 @@ See also: xref:develop:produce-data/configure-producers.adoc[Configure Producers

A Redpanda broker may create log segments at startup. If a broker crashes after startup, and if it gets stuck in a crash loop, it could produce progressively more stored state that uses more disk space and takes more time for each restart to process.

To prevent infinite crash loops, the Redpanda node property xref:reference:node-properties.adoc#crash_loop_limit[`crash_loop_limit`] sets an upper limit on the number of consecutive crashes that can happen within one hour of each other. After it reaches the limit, a broker cannot restart until its internal consecutive crash counter is reset to zero by one of the following conditions:
To prevent infinite crash loops, the Redpanda broker property xref:reference:node-properties.adoc#crash_loop_limit[`crash_loop_limit`] sets an upper limit on the number of consecutive crashes that can happen within one hour of each other. After it reaches the limit, a broker cannot restart until its internal consecutive crash counter is reset to zero by one of the following conditions:

* The `redpanda.yaml` configuration file is updated.
* The `startup_log` file in the broker's xref:reference:node-properties.adoc#data_directory[data_directory] is manually deleted.
@@ -58,3 +58,5 @@ To prevent infinite crash loops, the Redpanda node property xref:reference:node-
* The `crash_loop_limit` property is disabled by default. You must manually enable it by setting it to a non-zero value.
* If the limit is less than two, the broker is blocked from restarting after every crash, until one of the reset conditions is met.
====

To facilitate debugging in environments where a broker is stuck in a crash loop, set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`] broker configuration property. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. The broker process stays alive during this window, which gives you time to troubleshoot the issue. This setting is most useful when xref:troubleshoot:errors-solutions/k-resolve-errors.adoc[troubleshooting in Kubernetes environments].
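
For a self-managed broker, the two crash loop properties could be set together in `redpanda.yaml`. The following is a minimal sketch, not a verified configuration: it assumes that broker properties are placed under the top-level `redpanda` key, and the values `5` and `60` are illustrative only.

```yaml
# Hypothetical redpanda.yaml fragment (assumes broker properties live under `redpanda:`)
redpanda:
  # Block restarts after 5 consecutive crashes (illustrative value)
  crash_loop_limit: 5
  # Sleep for 60 seconds before terminating once the limit is reached (illustrative value)
  crash_loop_sleep_sec: 60
```

Because updating `redpanda.yaml` is one of the reset conditions listed above, saving a change like this also resets the consecutive crash counter.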
24 changes: 23 additions & 1 deletion modules/reference/pages/properties/broker-properties.adoc
@@ -53,7 +53,7 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo
* The broker shuts down cleanly.
* One hour passes since the last crash.
* The `redpanda.yaml` broker configuration file is updated.
* The `startup_log` file in the broker's <<data_directory,data_directory>> is manually deleted.
* The `startup_log` file in the broker's <<data_directory, data_directory>> is manually deleted.

*Unit*: number of consecutive crashes of a broker

@@ -67,6 +67,28 @@ The crash-tracking logic is reset (to zero consecutive crashes) by any of the fo

---

=== crash_loop_sleep_sec

*Introduced in v24.3.4*

The amount of time the broker sleeps before terminating when the limit on consecutive broker crashes (<<crash_loop_limit, `crash_loop_limit`>>) is reached. This property provides a debugging window for you to access the broker before it terminates, and is particularly useful in Kubernetes environments.

If `null`, the property is disabled, and the broker terminates immediately after reaching the crash loop limit.

For information about how to reset the consecutive crash counter, see <<crash_loop_limit, `crash_loop_limit`>>.

*Unit:* seconds

*Visibility:* `user`

*Type:* integer or null

*Accepted values:* [`0`, `4294967295`] or `null`

*Default:* `null`

---

=== data_directory

Path to the directory for storing Redpanda's streaming data files.
17 changes: 15 additions & 2 deletions modules/troubleshoot/partials/errors-and-solutions.adoc
@@ -397,9 +397,22 @@ endif::[]
ifdef::env-kubernetes[]
=== Crash loop backoffs

If a broker crashes after startup, or gets stuck in a crash loop, it could produce progressively more stored state that uses additional disk space and takes more time for each restart to process.
If a broker crashes after startup, or gets stuck in a crash loop, it can accumulate an increasing amount of stored state. This accumulated state not only consumes additional disk space but also prolongs the time required for each subsequent restart to process it.

To prevent infinite crash loops, the Redpanda Helm chart sets the `crash_loop_limit` node property to 5. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. After Redpanda reaches this limit, it will not start until its internal consecutive crash counter is reset to zero. In Kubernetes, the Pod running Redpanda remains in a `CrashLoopBackoff` state until its internal consecutive crash counter is reset to zero.
To prevent infinite crash loops, the Redpanda Helm chart sets the xref:reference:properties/broker-properties.adoc#crash_loop_limit[`crash_loop_limit`] broker configuration property to `5`. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. By default, the broker terminates immediately after hitting the `crash_loop_limit`. The Pod running Redpanda remains in a `CrashLoopBackOff` state until the broker's internal consecutive crash counter is reset to zero.

To facilitate debugging in environments where a broker is stuck in a crash loop, you can also set the xref:reference:properties/broker-properties.adoc#crash_loop_sleep_sec[`crash_loop_sleep_sec`] broker configuration property. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. The delay provides a window during which the Pod remains available, so you can connect to it and troubleshoot the issue.

Example configuration:

```yaml
config:
  node:
    crash_loop_limit: 5
    crash_loop_sleep_sec: 60
```

In this example, when the broker hits the `crash_loop_limit` of 5, it will sleep for 60 seconds before terminating the process. This delay allows administrators to access the Pod and troubleshoot.

To troubleshoot a crash loop backoff:
