This proposal is about integrating the remove_disks endpoint from Cruise Control into the Strimzi cluster operator.
This endpoint will allow us to move the data between two JBOD disks.
We have received multiple requests from community users for the ability to move all Kafka logs between two disks in a JBOD storage array. This feature can be useful in the following scenarios:
- The current disk is too small and the user wants to use a bigger one, or vice versa.
- When we want to use a different Storage Class with different parameters or different storage types.
- In case of disk removal to reduce the total storage.
For now, we can do this using the Kafka CLI tool kafka-reassign-partitions.sh, but it requires many manual steps, which is time-consuming and not user-friendly.
We should introduce the logic to Strimzi to leverage Cruise Control integration and make it possible to move the data between two JBOD disks. This feature will also allow us to remove the disks without the loss of data.
Cruise Control provides the remove_disks HTTP REST endpoint to move replicas from a specified disk to other disks for the same broker. The operation is only for intra-broker rebalancing, not moving data between brokers.
This endpoint triggers a rebalancing operation that moves replicas, starting with the largest and proceeding to the smallest, to the remaining disks while ensuring the following constraint is met:
1 - (remainingUsageAfterRemoval / remainingCapacity) > errorMargin

where:
- remainingUsageAfterRemoval = current usage for remaining disks + additional usage from removed disks
- remainingCapacity = sum of capacities of the remaining disks
- errorMargin = configurable property (default 0.1); it makes sure that a percentage of disk space is always free when moving replicas

To use the remove_disks endpoint in the Strimzi cluster operator, it should be added to the CruiseControlApi interface, and the corresponding implementation developed.
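As a worked illustration of the capacity constraint described above, the following minimal sketch evaluates the inequality for one broker. The function name and arguments are hypothetical and chosen only for this example; this is not Cruise Control's implementation, only the arithmetic from the formula.

```python
def can_remove_disks(remaining_usage, removed_usage, remaining_capacity, error_margin=0.1):
    """Evaluate 1 - (remainingUsageAfterRemoval / remainingCapacity) > errorMargin.

    remaining_usage:    current usage of the disks that stay on the broker
    removed_usage:      usage that must be moved off the removed disks
    remaining_capacity: sum of the capacities of the disks that stay
    All three values must use the same unit (for example GiB).
    """
    if remaining_capacity <= 0:
        # No disk is left to receive replicas, so the removal cannot proceed.
        return False
    remaining_usage_after_removal = remaining_usage + removed_usage
    free_fraction = 1 - remaining_usage_after_removal / remaining_capacity
    return free_fraction > error_margin
```

For example, with two remaining disks of 100 GiB each holding 110 GiB in total, moving 40 GiB off the removed disk leaves 25% free and satisfies the default margin, while moving 80 GiB leaves only 5% free and does not.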
To implement this feature, we will be adding a new mode to the KafkaRebalanceMode class.
`remove-disks`: It moves replicas from a specified disk to other disks of the same broker. It always uses intra-broker rebalancing. You can use this mode by changing the `spec.mode` to `remove-disks` in the `KafkaRebalance` resource.
A KafkaRebalance custom resource would look like this.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
spec:
  # setting the mode as `remove-disks` to move data between the JBOD disks
  mode: remove-disks
  # providing the list of brokers, and the corresponding volumes from which you want to move the replicas
  moveReplicasOffVolumes:
    - brokerId: 0
      volumeIds: [1, 2]
    - brokerId: 2
      volumeIds: [1]
  # ...

- The user should be using the `Kafka` resource with JBOD configured, making sure that they have more than one disk configured on the brokers.
- When the Kafka cluster is ready, the user creates a `KafkaRebalance` custom resource with the `spec.mode` field set to `remove-disks` and provides, in the `spec.moveReplicasOffVolumes` field, a list of the brokers and the corresponding volumes from which they want to move the replicas. If the `spec.moveReplicasOffVolumes` field is not set, the `KafkaRebalance` resource moves to the `NotReady` state, reporting that the `spec.moveReplicasOffVolumes` field is missing.
- The `KafkaRebalanceAssemblyOperator` interacts with Cruise Control via the `/remove_disks` endpoint to generate an optimization proposal (by using the dryrun feature).
- You can use the `strimzi.io/rebalance-auto-approval:true` annotation on the `KafkaRebalance` resource to approve the proposal automatically. To approve it manually, apply the `strimzi.io/rebalance=approve` annotation to the resource.
- The `KafkaRebalanceAssemblyOperator` interacts with Cruise Control via the `/remove_disks` endpoint to perform the actual rebalancing.
NOTE: The optimization proposal will not show the load before optimization; it will only show the load after optimization. This is because upstream Cruise Control does not have the verbose tag enabled for the `remove_disks` endpoint.
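The flow above can be sketched as a small helper that turns the `spec.moveReplicasOffVolumes` list into request parameters, once with dryrun enabled for the proposal and once without it for the actual rebalance. This is a rough sketch under stated assumptions, not the operator's code: the function names, the `brokerid_and_logdirs` parameter name, the `brokerId-logDir` pair format, and the log directory path layout are all assumptions made for illustration and should be checked against the Cruise Control documentation and the Strimzi sources.

```python
from urllib.parse import urlencode


def log_dir_for(broker_id, volume_id):
    # ASSUMPTION: illustrative log directory layout for a JBOD volume;
    # the operator may derive the path differently.
    return f"/var/lib/kafka/data-{volume_id}/kafka-log{broker_id}"


def build_remove_disks_params(move_replicas_off_volumes, dryrun=True):
    """Flatten the brokerId/volumeIds entries into one parameter value."""
    pairs = []
    for entry in move_replicas_off_volumes:
        broker_id = entry["brokerId"]
        for volume_id in entry["volumeIds"]:
            pairs.append(f"{broker_id}-{log_dir_for(broker_id, volume_id)}")
    return {
        # ASSUMPTION: parameter name and pair format are not taken from the proposal.
        "brokerid_and_logdirs": ",".join(pairs),
        # dryrun=true asks only for a proposal; dryrun=false executes it.
        "dryrun": "true" if dryrun else "false",
    }


def build_remove_disks_path(move_replicas_off_volumes, dryrun=True):
    """Return the request path with URL-encoded query parameters."""
    params = build_remove_disks_params(move_replicas_off_volumes, dryrun)
    return "/remove_disks?" + urlencode(params)
```

Calling `build_remove_disks_path(volumes, dryrun=True)` corresponds to the proposal step, and the same call with `dryrun=False` corresponds to the step that runs after approval.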
- If the user is not using JBOD storage and tries to generate the optimization proposal, the `KafkaRebalance` resource moves to the `NotReady` state, reporting that invalid log dirs were provided for the broker.
- If JBOD is used with a single disk configured on the brokers, the `KafkaRebalance` resource moves to the `NotReady` state, reporting that there are not enough log dirs to move the replicas for that broker.
- If the disk capacity is exceeded for the broker, the `KafkaRebalance` resource moves to the `NotReady` state, reporting that not enough capacity remains to move the replicas for that broker.
- This feature works with `KafkaNodePool` resources.
- This feature works with KRaft only if the Kafka version is greater than 3.7.0, as that version supports multiple JBOD disks on brokers.
Errors for these scenarios are reported by Cruise Control.
Based on these errors, we transition the KafkaRebalance resource to the NotReady state and update its status with the corresponding error message.
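The error handling described above can be illustrated with a minimal sketch that decides which condition to record on the `KafkaRebalance` status. The `Condition` class, the `resolve_condition` function, and the `ProposalReady` outcome used for the success case are assumptions for this example only; the proposal itself specifies just the `NotReady` transitions and that the error message is copied into the status.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Condition:
    type: str          # for example "NotReady"
    status: str        # "True" when the condition applies
    message: str = ""  # human-readable reason shown in the resource status


def resolve_condition(spec: dict, cc_error: Optional[str]) -> Condition:
    """Pick the status condition for a remove-disks rebalance request.

    spec:     the KafkaRebalance spec as a plain dictionary
    cc_error: error text reported by Cruise Control, or None on success
    """
    # Operator-side check: the volume list is mandatory in remove-disks mode.
    if spec.get("mode") == "remove-disks" and not spec.get("moveReplicasOffVolumes"):
        return Condition("NotReady", "True", "spec.moveReplicasOffVolumes is missing")
    # Cruise Control-side errors (invalid log dirs, too few log dirs,
    # insufficient capacity) are passed through as the status message.
    if cc_error is not None:
        return Condition("NotReady", "True", cc_error)
    # ASSUMPTION: the success outcome name is illustrative.
    return Condition("ProposalReady", "True")
```

Keeping the operator-side validation ahead of the Cruise Control call means a missing field is reported without sending a request at all.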
This change impacts the Cruise Control API-related classes and the KafkaRebalanceAssemblyOperator class.
No rejected alternatives.