Conversation

svekars
Contributor

@svekars svekars commented Oct 7, 2025

  • The docs build workflow now installs torchmonarch instead of a custom monarch wheel, and sets up library and CUDA paths to ensure all dependencies (especially native ones) are available.
  • The ForgeActor class now includes detailed docstrings for its resource attributes (procs, hosts, with_gpus, num_replicas, mesh_name), and the options method provides example usage in the docstring. This makes it easier for users to understand resource configuration for distributed training.
  • Adds a docstring for the Service.stop method.
  • Docstrings from Service methods are copied to ServiceActor methods, ensuring complete documentation for Sphinx autodoc.
  • Sphinx configuration changes improve navigation depth in the rendered docs.
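The docstring propagation mentioned in the bullets above can be sketched with plain Python. This is a minimal, hypothetical illustration: the class and method names mirror the PR's `Service`/`ServiceActor` pair, but the docstring text and the copy loop are invented for the example, not the actual TorchForge implementation.

```python
# Minimal sketch: copy docstrings from Service methods onto their
# ServiceActor counterparts so Sphinx autodoc renders them on both.
class Service:
    def stop(self):
        """Stop the service and release its resources."""

class ServiceActor:
    def stop(self):  # proxies Service.stop; initially undocumented
        pass

# Copy the docstring for every shared method that lacks one.
for name in ("stop",):
    src = getattr(Service, name)
    dst = getattr(ServiceActor, name)
    if src.__doc__ and not dst.__doc__:
        dst.__doc__ = src.__doc__
```

After this runs, `help(ServiceActor.stop)` and Sphinx autodoc both show the `Service.stop` docstring.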

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 7, 2025
@codecov-commenter

codecov-commenter commented Oct 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@7be455d). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #336   +/-   ##
=======================================
  Coverage        ?   64.72%           
=======================================
  Files           ?       79           
  Lines           ?     7707           
  Branches        ?        0           
=======================================
  Hits            ?     4988           
  Misses          ?     2719           
  Partials        ?        0           

@svekars svekars marked this pull request as ready for review October 10, 2025 20:27
Contributor

@allenwang28 allenwang28 left a comment


looks good! I just had some nits and minor suggestions for filling in docstrings that we didn't include before.

@dataclass
class ReferenceModel(ForgeActor):
"""
Reference model implementation for the TorchForge service.

suggested docstring:

    """A reference model actor for reinforcement learning (RL) training.

    Based on TorchTitan's engine architecture, this actor provides a frozen model that only
    runs forward passes without gradient computation. It is typically used to maintain
    algorithmic consistency in policy optimization methods such as GRPO (Group Relative
    Policy Optimization) or PPO (Proximal Policy Optimization), where it serves as a
    fixed reference point to compute KL divergence penalties against the training policy.

    The reference model is loaded from a checkpoint and runs in evaluation mode with
    inference_mode enabled to optimize memory and compute efficiency.

    Attributes:
        model (Model): Model configuration (architecture, vocab size, etc.)
        parallelism (Parallelism): Parallelism strategy configuration (TP, PP, CP, DP)
        checkpoint (Checkpoint): Checkpoint loading configuration
        compile (Compile): Torch compilation settings
        comm (Comm): Communication backend configuration
        training (Training): Training-related settings (dtype, garbage collection, etc.)
    """

@dataclass
class RLTrainer(ForgeActor):
"""
RL Trainer implementation for the TorchForge service.

suggested docstring:

    """A reinforcement learning trainer actor for policy optimization training.

    Built on top of TorchTitan's training engine, this actor provides a complete training
    loop for reinforcement learning. It performs forward and backward passes with
    gradient computation, optimization steps, and checkpoint management.

    Unlike the ReferenceModel actor which only runs forward passes, RLTrainer actively
    updates the policy model parameters through gradient descent.

    The trainer supports the same distributed training strategies as TorchTitan,
    including but not limited to tensor parallelism, data parallelism, and
    FSDP (Fully Sharded Data Parallel).
    
    It is typically used in conjunction with ReferenceModel for policy optimization algorithms
    like GRPO (Group Relative Policy Optimization), where it optimizes the policy against a 
    loss that includes KL divergence penalties from the reference model.

    The trainer handles:
    - Forward and backward propagation with automatic mixed precision (AMP)
    - Optimizer steps with learning rate scheduling
    - Distributed checkpoint saving and loading
    - Weight synchronization via torchstore for distributed inference
    - Memory management with garbage collection

    Attributes:
        job (Job): Job configuration (name, dump path, etc.)
        model (Model): Model configuration (architecture, vocab size, etc.)
        optimizer (Optimizer): Optimizer configuration (type, learning rate, etc.)
        lr_scheduler (LRScheduler): Learning rate scheduler configuration
        training (Training): Training settings (steps, batch size, dtype, etc.)
        parallelism (Parallelism): Parallelism strategy configuration (TP, PP, CP, DP)
        checkpoint (Checkpoint): Checkpoint loading and saving configuration
        activation_checkpoint (ActivationCheckpoint): Activation checkpointing settings
        compile (Compile): Torch compilation settings
        quantize (Quantize): Quantization settings
        comm (Comm): Communication backend configuration
        memory_estimation (MemoryEstimation): Memory profiling configuration
        loss (Callable): Loss function to compute training loss from logits and targets
        state_dict_key (str): Key for state dict storage in torchstore
        use_dcp (bool): Whether to use distributed checkpoint (DCP) format
        dcp_path (str): Path for DCP storage
    """
