
Discussion: DCP APIs and broader contracts for rescalability #1456

@daviswer

Description


After much discussion, it was decided that the best approach to rescalability is to implement rescaling in the base file reader, keeping overhead low and avoiding a proliferation of logical shard objects (see #1372, #1455, torchtitan PR). However, this approach requires every node above the base node to become rescaling-aware: we must decide which behaviors to support and how to make specifying those behaviors friendly to the user.

I have identified four behaviors that I believe a fully capable rescalable pipeline should support, with some correspondence to the existing placement behaviors of DTensors (a rough code sketch of how these might map onto DCP state dicts follows the list):

  1. Drop on rescale. Certain values, such as scalars and RNG states, cannot be repartitioned and it makes no sense to try. These values should be dropped when rescaling but kept otherwise.
  2. Sharded save, sharded load. Large buffers (for example, a local shuffling buffer) can be pooled into a single DTensor, which is then resharded over a new number of workers when rescaling. DCP is largely built around supporting this particular behavior, but note that we must now handle cases where the number of workers may not divide the length of the buffer evenly, and we also may not know the length of the buffer in advance.
  3. Replicated values. This encompasses any expensive metadata objects that we may want to construct (slowly) once, but load from checkpoint afterwards. These values would ideally be saved from rank 0 only, but loaded back to all workers. DCP supports this behavior for non-DTensor objects.
  4. Sharded save, global load. Any state that cannot simply be resharded via (2), such as logical shard state, which must first be accumulated/divided into global pools of visited vs. unvisited shards. Local values are saved from each rank but accumulated globally on load. DCP supports this behavior for non-DTensor objects by assigning a unique rank-based key to each such object and reassembling them manually on load.

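To make these behaviors concrete, here is a minimal sketch of how each might surface as an entry in a per-rank DCP state dict. Assumptions: `torch.distributed` is already initialized, the public `torch.distributed.tensor` module is available (recent PyTorch), and the `build_state_dict`/`save_checkpoint` helpers plus all key names (`epoch`, `rng_state`, `shuffle_buffer`, `visited_shards`, `dataset_metadata`) are purely illustrative, not anything that exists in the codebase:

```python
# Sketch only: shows how the four behaviors could surface as DCP state_dict
# entries. The helpers and all key names here are hypothetical.
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard


def build_state_dict(local_shuffle_buffer, visited_shards, dataset_metadata):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    mesh = init_device_mesh("cpu", (world_size,))

    state = {
        # (1) Drop on rescale: plain per-rank values that a rescaling-aware
        #     loader would simply skip when the world size changes.
        "epoch": 3,
        "rng_state": torch.get_rng_state(),

        # (2) Sharded save, sharded load: pool the local shuffle buffers into
        #     one DTensor sharded on dim 0, so DCP can reshard it onto a new
        #     world size. Uneven or unknown buffer lengths are the open issue.
        "shuffle_buffer": DTensor.from_local(
            local_shuffle_buffer, mesh, [Shard(0)]
        ),

        # (4) Sharded save, global load: rank-keyed entries that every rank
        #     reads back on load and merges into global visited/unvisited pools.
        f"visited_shards.rank{rank}": visited_shards,
    }

    # (3) Replicated values: expensive metadata written from rank 0 only but
    #     loaded back onto every worker.
    if rank == 0:
        state["dataset_metadata"] = dataset_metadata

    return state


def save_checkpoint(state, path):
    # One coordinated checkpoint across ranks; DTensors are stored in a form
    # DCP can reshard on load, other values as plain serialized objects.
    dcp.save(state, checkpoint_id=path)
```

On load, the resharding for (2) falls out of passing a template state dict to `dcp.load()`, while (1), (3), and (4) all need rescaling-aware handling around the DCP call, which is the gap this issue is about.
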
Note that while the four behaviors above raise some questions about DCP support, the larger question is how we want to expose these options to users and/or incorporate them into existing Datasets or Nodes.
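
On that front, one possible direction (purely hypothetical: `RescalePolicy`, `RescalableStateful`, and `rescale_policies` do not exist anywhere in the codebase) would be to let each node tag its state_dict keys with the behavior it needs, so a checkpoint wrapper can translate those tags into the DCP patterns sketched above without every node reimplementing them:

```python
# Hypothetical sketch of a per-entry rescale policy a node could declare; the
# enum, the protocol, and the method names are illustrative only.
from enum import Enum, auto
from typing import Any, Dict, Protocol


class RescalePolicy(Enum):
    DROP_ON_RESCALE = auto()   # (1) scalars, RNG state
    RESHARD = auto()           # (2) sharded save, sharded load (DTensor)
    REPLICATE = auto()         # (3) save from rank 0, load everywhere
    GLOBAL_MERGE = auto()      # (4) rank-keyed save, merged globally on load


class RescalableStateful(Protocol):
    """What a rescaling-aware node might expose to the checkpoint layer."""

    def state_dict(self) -> Dict[str, Any]: ...

    def rescale_policies(self) -> Dict[str, RescalePolicy]:
        """Map each state_dict key to the behavior it needs on rescale."""
        ...


# Example: a shuffler node might declare
#   {"buffer": RescalePolicy.RESHARD, "rng": RescalePolicy.DROP_ON_RESCALE}
# and the checkpoint wrapper would translate these tags into DCP calls.
```

Whether something like this lives on the nodes themselves or in a separate wrapper is exactly the kind of API question this issue is meant to settle.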
