-
Check that all brokers are healthy:
rpk cluster health
Example output:
CLUSTER HEALTH OVERVIEW ======================= Healthy: true (1) Controller ID: 0 All nodes: [0 1 2] (2) Nodes down: [] (3) Leaderless partitions: [] (3) Under-replicated partitions: [1] (3)-
The cluster is either healthy (
true) or unhealthy (false). -
The node IDs of all brokers in the cluster.
-
If the cluster is unhealthy, these fields will contain data.
-
-
Optional: You can use the Admin API (default port: 9644) to perform additional checks for potential risks with restarting a specific broker.
curl -X GET "http://<broker-address>:<admin-api-port>/v1/broker/pre_restart_probe" | jq .
Example output:// Returns tuples of partitions (in the format {namespace}/{topic_name}/{partition_id}) affected by the broker restart. { "risks": { "rf1_offline": [ "kafka/topic_a/0" ], "full_acks_produce_unavailable": [], "unavailable": [], "acks1_data_loss": [] } }
In this example, the restart probe indicates that there is an under-replicated partition
kafka/topic_a/0(with a replication factor of 1) at risk of going offline if the broker is restarted.See the Admin API reference for more details on the restart probe endpoint.
rpk cluster maintenance enable <node-id> --wait
The
--waitoption tells the command to wait until a given broker, 0 in this example, finishes draining all partitions it originally served. After the partition draining completes, the command completes.Expected output:Successfully enabled maintenance mode for node 0 Waiting for node to drain...
-
Verify that the broker is in maintenance mode:
rpk cluster maintenance status
Expected output:
NODE-ID DRAINING FINISHED ERRORS PARTITIONS ELIGIBLE TRANSFERRING FAILED 0 true true false 3 0 2 0 1 false false false 0 0 0 0 2 false false false 0 0 0 0
The
Finishedcolumn should readtruefor the broker that you put into maintenance mode. -
Validate the health of the cluster again:
rpk cluster health --watch --exit-when-healthy
The combination of the
--watchand--exit-when-healthyflags tell rpk to monitor the cluster health and exit only when the cluster is back in a healthy state.Noterpk cluster maintenance disable <node-id>