
Discussion: DCP APIs and broader contracts for rescalability #1456

@daviswer

Description


After much discussion, it was decided that the best approach to rescalability is to implement rescaling in the base file reader, keeping overhead low and avoiding a proliferation of logical shard objects (see #1372, #1455, torchtitan PR). However, this approach requires every node above the base node to become rescaling-aware: we must decide which behaviors to support and how to make specifying those behaviors friendly to the user.

I have identified four behaviors that I believe a fully capable rescalable pipeline should support, with some correspondence to the existing placement behaviors of DTensors (a rough code sketch of how these might map onto DCP state dicts follows the list):

  1. Drop on rescale. Certain values, such as scalars and RNG states, cannot be repartitioned and it makes no sense to try. These values should be dropped when rescaling but kept otherwise.
  2. Sharded save, sharded load. Large buffers (for example, a local shuffling buffer) can be pooled into a single DTensor, which is then resharded over a new number of workers when rescaling. DCP is largely built around supporting this particular behavior, but note that we must now handle cases where the number of workers may not divide the length of the buffer evenly, and we also may not know the length of the buffer in advance.
  3. Replicated values. This encompasses any expensive metadata objects that we may want to construct (slowly) once, but load from checkpoint afterwards. These values would ideally be saved from rank 0 only, but loaded back to all workers. DCP supports this behavior for non-DTensor objects.
  4. Sharded save, global load. Any state that cannot simply be resharded via (2), such as logical shard state, which must first be accumulated/divided into global pools of visited vs. unvisited shards. Local values are saved from each rank but accumulated globally on load. DCP supports this behavior for non-DTensor objects by assigning a unique rank-based key to each such object and reassembling them manually on load.

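To make these behaviors concrete, here is a minimal sketch of how each might surface as an entry in a per-rank DCP state dict. Assumptions: `torch.distributed` is already initialized, the public `torch.distributed.tensor` module is available (recent PyTorch), and the `build_state_dict`/`save_checkpoint` helpers plus all key names (`epoch`, `rng_state`, `shuffle_buffer`, `visited_shards`, `dataset_metadata`) are purely illustrative, not anything that exists in the codebase:

```python
# Sketch only: shows how the four behaviors could surface as DCP state_dict
# entries. The helpers and all key names here are hypothetical.
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard


def build_state_dict(local_shuffle_buffer, visited_shards, dataset_metadata):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    mesh = init_device_mesh("cpu", (world_size,))

    state = {
        # (1) Drop on rescale: plain per-rank values that a rescaling-aware
        #     loader would simply skip when the world size changes.
        "epoch": 3,
        "rng_state": torch.get_rng_state(),

        # (2) Sharded save, sharded load: pool the local shuffle buffers into
        #     one DTensor sharded on dim 0, so DCP can reshard it onto a new
        #     world size. Uneven or unknown buffer lengths are the open issue.
        "shuffle_buffer": DTensor.from_local(
            local_shuffle_buffer, mesh, [Shard(0)]
        ),

        # (4) Sharded save, global load: rank-keyed entries that every rank
        #     reads back on load and merges into global visited/unvisited pools.
        f"visited_shards.rank{rank}": visited_shards,
    }

    # (3) Replicated values: expensive metadata written from rank 0 only but
    #     loaded back onto every worker.
    if rank == 0:
        state["dataset_metadata"] = dataset_metadata

    return state


def save_checkpoint(state, path):
    # One coordinated checkpoint across ranks; DTensors are stored in a form
    # DCP can reshard on load, other values as plain serialized objects.
    dcp.save(state, checkpoint_id=path)
```

On load, the resharding for (2) falls out of passing a template state dict to `dcp.load()`, while (1), (3), and (4) all need rescaling-aware handling around the DCP call, which is the gap this issue is about.
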
Note that while the four behaviors above raise some questions about DCP support, the larger question is how we want to expose these options to users and/or incorporate them into existing Datasets or Nodes.
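
On that front, one possible direction (purely hypothetical: `RescalePolicy`, `RescalableStateful`, and `rescale_policies` do not exist anywhere in the codebase) would be to let each node tag its state_dict keys with the behavior it needs, so a checkpoint wrapper can translate those tags into the DCP patterns sketched above without every node reimplementing them:

```python
# Hypothetical sketch of a per-entry rescale policy a node could declare; the
# enum, the protocol, and the method names are illustrative only.
from enum import Enum, auto
from typing import Any, Dict, Protocol


class RescalePolicy(Enum):
    DROP_ON_RESCALE = auto()   # (1) scalars, RNG state
    RESHARD = auto()           # (2) sharded save, sharded load (DTensor)
    REPLICATE = auto()         # (3) save from rank 0, load everywhere
    GLOBAL_MERGE = auto()      # (4) rank-keyed save, merged globally on load


class RescalableStateful(Protocol):
    """What a rescaling-aware node might expose to the checkpoint layer."""

    def state_dict(self) -> Dict[str, Any]: ...

    def rescale_policies(self) -> Dict[str, RescalePolicy]:
        """Map each state_dict key to the behavior it needs on rescale."""
        ...


# Example: a shuffler node might declare
#   {"buffer": RescalePolicy.RESHARD, "rng": RescalePolicy.DROP_ON_RESCALE}
# and the checkpoint wrapper would translate these tags into DCP calls.
```

Whether something like this lives on the nodes themselves or in a separate wrapper is exactly the kind of API question this issue is meant to settle.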
