[2/N] Core trainer abstraction #158
Conversation
Yes, I thought it would create a stacked diff, but it seems like this is cumulative. :( Let me know if there is an easy way to fix this.
I'm mostly on board with the trainer interface; I just have concerns about whether compile and pipeline work well with these.
    pass

    @abstractmethod
    def apply_gradients(self) -> None:
In general this looks fine; my main concern is whether this would be compatible with the Compile and Pipeline parallel APIs? @H-Huang
I don't have a concern if we want to expose another Step API for the trainer.
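For context, a minimal sketch of how such a Step API could compose with the abstract methods in this PR; the Trainer class name and the forward_backward method below are illustrative assumptions, not the PR's actual signatures.

from abc import ABC, abstractmethod


class Trainer(ABC):
    # Illustrative sketch only; every name here except apply_gradients is assumed.

    @abstractmethod
    def forward_backward(self, batch) -> None:
        # Run the forward and backward passes, accumulating gradients.
        ...

    @abstractmethod
    def apply_gradients(self) -> None:
        # Apply the accumulated gradients to the model parameters.
        ...

    def step(self, batch) -> None:
        # A possible "Step" API composing the two phases, as suggested above.
        self.forward_backward(batch)
        self.apply_gradients()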
    pass

    @abstractmethod
    def snapshot_weights(self) -> WeightsBuffer:
This would likely push weights to the store for checkpoint handling and weight sync to take over.
Similar to update_weights in the policy, this will be somewhat dependent on the internal state of apply_gradients: you want to call it right after apply_gradients is done (without awaiting it), and then not call apply_gradients again until the snapshot has completed. Not as complex as the policy side, but something to keep in mind.
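A rough sketch of that ordering constraint, assuming snapshot_weights is awaitable and that a forward_backward helper exists (both assumptions for illustration; the PR's actual signatures may be synchronous):

import asyncio


async def train_loop(trainer, batches):
    snapshot = None
    for batch in batches:
        if snapshot is not None:
            # Do not call apply_gradients again until the previous
            # snapshot has finished reading the weights it depends on.
            await snapshot
        trainer.forward_backward(batch)  # hypothetical helper
        trainer.apply_gradients()
        # Kick off the snapshot right after apply_gradients completes,
        # without awaiting it, so the weight push overlaps with the
        # next forward/backward.
        snapshot = asyncio.create_task(trainer.snapshot_weights())
    if snapshot is not None:
        await snapshot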
# TODO: This file should NOT be in the data_models folder/package


class Store(ABC):
I left my comment in 3/N, but to repeat it here: is it valuable to abstract the buffer too? It's as core to the library as Monarch.
It's as core to the library as Monarch.
The buffer is just a wrapper on top of the store, hence I did not do that. Can you elaborate on your reasoning for abstracting the buffer?
[EDIT]: I don't have an opinion, but it does not hurt to abstract the buffer too.
    pass


class WeightsBuffer:
I don't follow the reason to have this extra layer. Also, a buffer is what holds some individual data, whereas this would be the entire store?
At a high level:
Store is a generic key-value storage abstraction. It can store any kind of data (strings, tensors, configs, etc.), not just model weights.
WeightsBuffer is a specialized abstraction focused on the logic and conventions for storing and retrieving model weights. It may add domain-specific features, validation, serialization, or metadata handling that are unique to weights.
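A hedged sketch of that relationship; the method names (put/get, push/pull) and the versioned-key convention are assumptions for illustration, not the actual interface in this PR:

from abc import ABC, abstractmethod
from typing import Any, Mapping


class Store(ABC):
    # Generic key-value storage; values are not weight-specific.

    @abstractmethod
    def put(self, key: str, value: Any) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Any: ...


class WeightsBuffer:
    # Thin weights-focused wrapper on top of a Store: key naming,
    # versioning, and any weight-specific validation would live here.

    def __init__(self, store: Store, prefix: str = "weights") -> None:
        self._store = store
        self._prefix = prefix

    def push(self, version: int, state_dict: Mapping[str, Any]) -> None:
        # Store the whole state dict under a versioned key.
        self._store.put(f"{self._prefix}/{version}", dict(state_dict))

    def pull(self, version: int) -> Mapping[str, Any]:
        return self._store.get(f"{self._prefix}/{version}")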
A split version of this RFC