
Supports scheduling based on the expected partitions. #4907

@zhengchenyu

What is the problem you're trying to solve

I have the following requirement: in elastic training, I want to keep global_batch constant to ensure computational consistency.

The formula is: global_batch = local_batch * acc_steps * world_size.

global_batch is the total batch size behind each gradient allreduce. local_batch is the batch size of a single worker's training iteration. acc_steps is the number of training steps between gradient allreduces, i.e. an allreduce is performed every acc_steps steps. world_size is the number of worker nodes.

Because of the risk of OutOfMemoryError (OOM) and the effect on Batch Normalization, local_batch is set to a fixed value. world_size, however, changes according to the available resources. Therefore, we need to adjust acc_steps to keep global_batch constant.

However, for some world_size values, no positive integer value of acc_steps exists. Therefore, I want the world_size of the training task to be scheduled according to an expected set of sizes. For example, if global_batch=4 and local_batch=1, then we would like world_size=[1,2,4], so that the corresponding acc_steps=[4,2,1] are all positive integers. A world_size of 3 would require acc_steps=4/3, which is not an integer.
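The constraint above can be checked mechanically. The following is a minimal sketch, not part of Volcano; the function names `acc_steps_for` and `valid_world_sizes` are hypothetical and only illustrate the formula global_batch = local_batch * acc_steps * world_size.

```python
from typing import Dict, Optional


def acc_steps_for(global_batch: int, local_batch: int, world_size: int) -> Optional[int]:
    """Return acc_steps if it is a positive integer, otherwise None."""
    per_step = local_batch * world_size  # samples processed per training step
    if per_step <= 0 or global_batch % per_step != 0:
        return None
    return global_batch // per_step


def valid_world_sizes(global_batch: int, local_batch: int) -> Dict[int, int]:
    """Map every feasible world_size to its acc_steps."""
    result = {}
    for world_size in range(1, global_batch // local_batch + 1):
        steps = acc_steps_for(global_batch, local_batch, world_size)
        if steps is not None:
            result[world_size] = steps
    return result


print(valid_world_sizes(4, 1))  # {1: 4, 2: 2, 4: 1}
```

With global_batch=4 and local_batch=1, world_size=3 is rejected because 4 is not divisible by 3.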

Although we can make the extra nodes wait at the training framework level, this is an unnecessary waste of resources.
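To make the waste concrete, here is a small sketch of what a training framework would have to do when the scheduler grants a world_size that is not in the expected set. The helper name `split_granted_workers` is hypothetical, and picking the largest expected size that fits is an assumption about the framework's behavior.

```python
from typing import List, Tuple


def split_granted_workers(granted: int, expected_sizes: List[int]) -> Tuple[int, int]:
    """Return (usable, idle) workers for a granted world_size."""
    # Assumption: the framework uses the largest expected size that fits.
    usable = max((s for s in expected_sizes if s <= granted), default=0)
    return usable, granted - usable


print(split_granted_workers(3, [1, 2, 4]))  # (2, 1)
```

Here three workers are granted, but only two can train; the third holds resources while idle.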

Describe the solution you'd like

The current gang scheduling cannot meet this requirement. I suggest supporting a sub-job tree that can be customized according to requirements, to achieve more flexible scheduling.

This example is easy to express with a sub-job tree. Suppose PartitionSize=2 and ExpectedPartitions=[1,2,4]. Then we get the sub-job tree shown in the following diagram.

[Image: sub-job tree diagram]

Consider the SubJob-2 node. The suffix -2 indicates that its level-0 MatchIndex (PartitionGroupId) is 2. It has two children: SubJob-2-3 and SubJob-2-4. In SubJob-2-3, the suffix -3 indicates that its level-1 MatchIndex (PartitionId) is 3. For leaf nodes, MinAvailable equals PartitionSize, which is 2. For a non-leaf node such as SubJob-2, we set MinAvailable to 4, meaning that the two partitions with PartitionIds 3 and 4 must be allocated simultaneously.
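The tree described above can be sketched as follows. This is an illustration only, not Volcano code: the `SubJob` class and `build_subjob_tree` function are hypothetical. The grouping rule is inferred from the SubJob-2 description, and it assumes that PartitionGroupIds start at 0, that PartitionIds start at 1, and that each group holds the partitions added when growing from one expected count to the next.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubJob:
    name: str
    min_available: int
    children: List["SubJob"] = field(default_factory=list)


def build_subjob_tree(partition_size: int, expected_partitions: List[int]) -> List[SubJob]:
    """Build one level-0 sub-job per entry in expected_partitions.

    Assumption: expected_partitions is strictly increasing, group IDs
    start at 0, and partition IDs start at 1.
    """
    groups = []
    previous_total = 0
    for group_id, total in enumerate(expected_partitions):
        if total <= previous_total:
            raise ValueError("expected_partitions must be strictly increasing")
        # Partitions added when growing from previous_total to total.
        leaves = [
            SubJob(f"SubJob-{group_id}-{pid}", partition_size)
            for pid in range(previous_total + 1, total + 1)
        ]
        # A group is only useful if all of its partitions are allocated together.
        groups.append(SubJob(f"SubJob-{group_id}", partition_size * len(leaves), leaves))
        previous_total = total
    return groups


for group in build_subjob_tree(2, [1, 2, 4]):
    children = [child.name for child in group.children]
    print(group.name, group.min_available, children)
```

Under these assumptions, the last group is SubJob-2 with MinAvailable 4 and children SubJob-2-3 and SubJob-2-4, matching the description.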

Complex scheduling can be easily implemented using sub-job trees. The controller only needs to set the MinAvailable property on the nodes of the sub-job tree, and the scheduler only needs to support sub-group trees to meet the requirement of this example.

About release

The above describes how to schedule according to expected partitions. Regarding preempt/reclaim, let's wait for #4374 first.

Additional context

No response

Documentation Updates

  • This feature requires design or user documentation changes.
  • If documentation changes are required, I will ensure the relevant documents are updated and published to the Volcano official website (https://volcano.sh) via the volcano-sh/website repository.

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
