
Conversation

Ritesh1905 (Contributor)

A split version of this RFC

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 15, 2025
@Ritesh1905 Ritesh1905 marked this pull request as ready for review September 15, 2025 18:55
@LucasLLC (Contributor) left a comment


@Ritesh1905 (Contributor, Author) commented Sep 15, 2025

same as https://github.com/meta-pytorch/forge/pull/159/files ?

Yes, I thought it would create a stacked diff, but it seems this is cumulative. :(

Let me know if there is an easy way to fix this.

@pbontrager (Contributor) left a comment


I'm mostly on board with the trainer interface; my main concern is whether compile and pipeline parallelism work well with it.

```python
pass

@abstractmethod
def apply_gradients(self) -> None:
```
Contributor

In general this looks fine; my main concern is whether this would be compatible with the compile and pipeline-parallel APIs. @H-Huang

Contributor (Author)

I don't have a concern if we want to expose another step API for the trainer.
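For illustration, a fused step API like the one mentioned above could be a default method layered on the abstract primitives. This is only a sketch: `forward_backward`, `step`, and `DummyTrainer` are hypothetical names for this example, not part of the RFC.

```python
from abc import ABC, abstractmethod
from typing import Any


class Trainer(ABC):
    """Sketch of a trainer interface, assuming the RFC's apply_gradients."""

    @abstractmethod
    def forward_backward(self, batch: Any) -> float:
        """Run forward and backward passes, accumulating gradients."""

    @abstractmethod
    def apply_gradients(self) -> None:
        """Apply the accumulated gradients to the model parameters."""

    def step(self, batch: Any) -> float:
        """Hypothetical fused step: one forward/backward plus gradient apply."""
        loss = self.forward_backward(batch)
        self.apply_gradients()
        return loss


class DummyTrainer(Trainer):
    """Minimal concrete trainer that just records the call order."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def forward_backward(self, batch: Any) -> float:
        self.calls.append("forward_backward")
        return 0.5

    def apply_gradients(self) -> None:
        self.calls.append("apply_gradients")
```

A fused `step` keeps the common path convenient while still letting callers drive `forward_backward` and `apply_gradients` separately (e.g. for gradient accumulation).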

```python
pass

@abstractmethod
def snapshot_weights(self) -> WeightsBuffer:
```
Contributor

This would likely push weights to the store, so that checkpoint handling and weight sync can take over from there.

Contributor

Similar to update_weights in the policy, this will be somewhat dependent on the internal state of apply_gradients: you want to call it right after apply_gradients finishes (without awaiting it), and then not call apply_gradients again until the snapshot has completed. Not as complex as the policy side, but something to keep in mind.
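The sequencing described above (fire off the snapshot without awaiting it, but block the next apply_gradients until the snapshot is done) could be sketched with a simple asyncio gate. All names here (`SnapshotGate`, `start_snapshot`) are illustrative, not from the RFC.

```python
import asyncio
from typing import Awaitable, Callable


class SnapshotGate:
    """Sketch: serialize apply_gradients against an in-flight weight snapshot.

    start_snapshot() is fire-and-forget; apply_gradients() waits until any
    in-flight snapshot has finished reading the weights before mutating them.
    """

    def __init__(self) -> None:
        self._snapshot_done = asyncio.Event()
        self._snapshot_done.set()  # no snapshot in flight initially

    async def apply_gradients(self, apply_fn: Callable[[], None]) -> None:
        # Block until any in-flight snapshot has finished with the weights.
        await self._snapshot_done.wait()
        apply_fn()

    def start_snapshot(
        self, snapshot_fn: Callable[[], Awaitable[None]]
    ) -> "asyncio.Future[None]":
        # Mark a snapshot in flight; reopen the gate once it completes,
        # even if the snapshot raises.
        self._snapshot_done.clear()

        async def _run() -> None:
            try:
                await snapshot_fn()
            finally:
                self._snapshot_done.set()

        return asyncio.ensure_future(_run())
```

This keeps the trainer loop simple: it can kick off the snapshot immediately after applying gradients and keep going, paying the wait only if it loops back to apply_gradients before the snapshot finishes.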

```python
# TODO: This file should NOT be in the data_models folder/package


class Store(ABC):
```
Contributor

I left my comment in 3/N, but to repeat it here: is it valuable to abstract the buffer too? It's as core to the library as Monarch.

@Ritesh1905 (Contributor, Author), Sep 16, 2025

It's as core to the library as Monarch.

The buffer is just a wrapper on top of the store, hence I did not do that. Can you elaborate on your reasoning for abstracting the buffer?

[EDIT]: I don't have a strong opinion, but it does not hurt to abstract the buffer too.

```python
pass


class WeightsBuffer:
```
Contributor

I don't follow the reason for this extra layer. Also, a buffer usually holds an individual piece of data, whereas this would be the entire store?

Contributor (Author)

At a high level:

Store is a generic key-value storage abstraction. It can store any kind of data (strings, tensors, configs, etc.), not just model weights.

WeightsBuffer is a specialized abstraction that captures the logic and conventions for storing and retrieving model weights. It may add domain-specific features, validation, serialization, or metadata handling that are unique to weights.
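The layering described above could be sketched as follows. This is a minimal illustration only: the `put`/`get`/`push`/`pull` method names and the `InMemoryStore` backend are assumptions for the example, not the RFC's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Store(ABC):
    """Generic key-value storage abstraction; values can be anything."""

    @abstractmethod
    def put(self, key: str, value: Any) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Any: ...


class InMemoryStore(Store):
    """Trivial dict-backed Store, just for illustration."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Any:
        return self._data[key]


class WeightsBuffer:
    """Thin weights-specific wrapper over a Store.

    It adds conventions a raw Store doesn't have: a key namespace and a
    weight version, which is where weight-specific validation or
    serialization would also live.
    """

    def __init__(self, store: Store, prefix: str = "weights") -> None:
        self._store = store
        self._prefix = prefix

    def push(self, version: int, state_dict: Dict[str, Any]) -> None:
        self._store.put(f"{self._prefix}/{version}", state_dict)

    def pull(self, version: int) -> Dict[str, Any]:
        return self._store.get(f"{self._prefix}/{version}")
```

Under this layering, swapping the storage backend (in-memory, RDMA, object store) never touches the weight-handling conventions, and vice versa.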

@vidhyav vidhyav closed this Sep 18, 2025