-
Notifications
You must be signed in to change notification settings - Fork 1
Voldemort Donor Based Rebalancing
The Donor Based Rebalancing improves the overall rebalancing speed by reducing the amount of disk scans. The original rebalancing algorithm requires the entire data set of a donor node to be scanned at least once per each stealer node. So the total rebalancing cost for rebalacning from one node to N new nodes is the scan cost of the original node times N. Because disk scans are costly, we want to minimize the amount of scans during the rebalancing.
In the Donor Based Rebalancing algorithm, the cost of rebalancing from one node to N new nodes is just the scan cost of the original node. The rebalancing cost is not subject to the number of new nodes. Basically, each Donor node scans its data set once and places the entries into their corresponding destination queues depending on the partitions they are hashed to. Noted that all Donor nodes can scan their data set and send the entries to the corresponding Stealer nodes in parallel.
In the case of a Stealer node failure, the scan processes on all Donor nodes will pause until either the Stealer node recovers or the rebalancing job is cancelled. When a Stealer node is down, it does not make sense for other Stealer nodes to make progress. Because when the failed Stealer node recovers, all Donor nodes have to repeat the scan process anyways.
We ran the Donor Based Rebalancing on one of our clusters to go from 8 to 12 nodes. With the combination of the new rebalacing algorithm and the use of SSD on voldemort servers, the overall expansion time was reduced from 500+ hours (last year) to less than 30 hours. While the total amount of data moved as part of this rebalancing was about 3 times more than that during last year’s rebalancing, the rebalancing process was 15+ times faster.
Noted that the above performance number is based on the BDB Storage Engine.
Command to start Donor Based rebalancing:
bin/voldemort-rebalance.sh --current-cluster [initial cluster.xml file] --current-stores [store.xml file]
--target-cluster [final cluster.xml file] --batch 100 --parallelism 10 --stealer-based false
--output-dir [location for intermittent cluster.xml files during rebalancing]
"--batch" controls the number of partitions moved in each step of the rebalancing process. In the Donor Based rebalancing, it is highly recommended to move all partitions out of a Donor node in one step (to minimize the amount of disk scans). Therefore, the value specified for this option shall be greater than the total number of partitions to be moved off any Donor node. (The value can be bigger than the total amount of partitions to be moved.)
"--parallelism" controls the number of Donor nodes allowed to work in parallel. It shall be set to the number of new nodes.
Command to start the Repair Job post rebalancing:
The Rebalance operation has a built in functionality to delete the migrated data. However, a more conservative approach is to retain the data until the process is complete and we have verified that everything works fine. In this case, a 'Repair Job' tool is required to do the cleanup. This tool can be invoked like so
bin/voldemort-admin-tool.sh --repair-job --url tcp://[cluster-url]:[port] --node [NODE ID]
This will kick off an Asynchronous job on the server indicated by that NODE ID. A NODE ID of -1 will start the repair job on all the nodes in the cluster The steps involved in the cleanup are as follows:
- Acquire a scan permit (since repair job also involves a scan of the underlying database)
- For each store specified in stores.xml, iterate through all the entries.
- For each such entry compute if this entry should belong on this node. If no, then simply delete it using the StorageEngine API.
Just like the rebalance operation, it gives periodic updates of the number of keys deleted thus far.