Description
What is the problem you're trying to solve
I have the following requirement: in elastic training, I want to keep global_batch constant to ensure computational consistency.
The formula is: global_batch = local_batch * acc_steps * world_size.
global_batch is the effective batch size over which gradients are all-reduced. local_batch is the batch size of a single worker's training iteration. acc_steps is the number of gradient-accumulation steps, i.e. gradients are all-reduced once every acc_steps training steps. world_size is the number of workers.
Because of OutOfMemoryError (OOM) risk and Batch Normalization behavior, local_batch is fixed. world_size, however, changes with the available resources, so acc_steps must be adjusted to keep global_batch constant.
However, for some world_size values there is no positive integer acc_steps. Therefore, I want the world_size of the training task to be scheduled according to a set of expected sizes. For example, if global_batch=4 and local_batch=1, then we would like world_size=[1,2,4], so that the corresponding acc_steps=[4,2,1] are all positive integers.
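To make the constraint concrete, here is a minimal Go sketch (illustration only, not part of the proposal) that lists the world_size values for which acc_steps = global_batch / (local_batch * world_size) is a positive integer; the function name validWorldSizes is hypothetical.

```go
package main

import "fmt"

// validWorldSizes returns, for each world_size that yields a positive integer
// acc_steps, the value acc_steps = global_batch / (local_batch * world_size).
func validWorldSizes(globalBatch, localBatch int) map[int]int {
	result := map[int]int{}
	for worldSize := 1; localBatch*worldSize <= globalBatch; worldSize++ {
		if globalBatch%(localBatch*worldSize) == 0 {
			result[worldSize] = globalBatch / (localBatch * worldSize)
		}
	}
	return result
}

func main() {
	// global_batch=4, local_batch=1 -> world_size 1, 2, 4 with acc_steps 4, 2, 1.
	fmt.Println(validWorldSizes(4, 1)) // map[1:4 2:2 4:1]
}
```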
Although we could make the extra workers wait idle at the training-framework level, that is an unnecessary waste of resources.
Describe the solution you'd like
The current gang scheduling cannot meet this requirement. I suggest supporting a sub-job tree, which can then be customized per workload to achieve more flexible scheduling.
In this example, the desired placement is easy to construct with a sub-job tree. Suppose PartitionSize=2 and ExpectedPartitions=[1,2,4]. Then we get the sub-job tree shown in the following diagram.
Consider the SubJob-2 node: the -2 suffix indicates that its MatchIndex (PartitionGroupId) at level 0 is 2. It has two children, SubJob-2-3 and SubJob-2-4. In SubJob-2-3, the -3 suffix indicates that its MatchIndex (PartitionId) at level 1 is 3. For leaf nodes, MinAvailable equals PartitionSize, which is 2. For non-leaf nodes such as SubJob-2, we set MinAvailable to 4, meaning that the two partitions with PartitionIds 3 and 4 must be allocated simultaneously.
Complex scheduling policies can be implemented easily with sub-job trees: the controller only has to set the MinAvailable property on the sub-job tree, and the scheduler only needs to support sub-group trees to satisfy this example.
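For illustration, here is a minimal Go sketch of how the controller might build such a tree for PartitionSize=2 and ExpectedPartitions=[1,2,4]. The SubJob struct, its fields, and the layout of group 1 are assumptions made for this sketch; only the naming scheme and the MinAvailable values of SubJob-2 and its leaves follow the description above.

```go
package main

import "fmt"

// SubJob is a hypothetical in-memory representation of a sub-job tree node;
// it is not the Volcano API, only an illustration of the proposed structure.
type SubJob struct {
	Name         string
	MinAvailable int
	Children     []*SubJob
}

// printTree dumps the tree with indentation, as a quick sanity check.
func printTree(n *SubJob, indent string) {
	fmt.Printf("%s%s (MinAvailable=%d)\n", indent, n.Name, n.MinAvailable)
	for _, c := range n.Children {
		printTree(c, indent+"  ")
	}
}

func main() {
	const partitionSize = 2

	// Leaf node: one per partition, MinAvailable == PartitionSize.
	leaf := func(group, partition int) *SubJob {
		return &SubJob{
			Name:         fmt.Sprintf("SubJob-%d-%d", group, partition),
			MinAvailable: partitionSize,
		}
	}

	// Tree for PartitionSize=2, ExpectedPartitions=[1,2,4].
	// SubJob-2 groups partitions 3 and 4, which must be allocated together,
	// so its MinAvailable is 2 * PartitionSize = 4 (as in the description).
	// The MinAvailable values of the root and SubJob-1 are assumptions:
	// they allow growth from 1 partition (world_size=1) to 2, one at a time.
	root := &SubJob{
		Name:         "Job",
		MinAvailable: 1 * partitionSize,
		Children: []*SubJob{
			{
				Name:         "SubJob-1",
				MinAvailable: 1 * partitionSize,
				Children:     []*SubJob{leaf(1, 1), leaf(1, 2)},
			},
			{
				Name:         "SubJob-2",
				MinAvailable: 2 * partitionSize,
				Children:     []*SubJob{leaf(2, 3), leaf(2, 4)},
			},
		},
	}

	printTree(root, "")
}
```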
About release
The above describes how to schedule according to the expected partitions. Regarding preempt/reclaim, let's wait for #4374 first.
Additional context
No response
Documentation Updates
- This feature requires design or user documentation changes.
- If documentation changes are required, I will ensure the relevant documents are updated and published to the Volcano official website (https://volcano.sh) via the volcano-sh/website repository.