[Not for submission] Example of 4 node Trainer #434

allenwang28 · 2025-10-16T00:54:01Z

For Hangoo's reference.

Key changes:

- Set data_parallel_shard_degree: -1, which enables "full FSDP" (sharding across every GPU in the trainer).
  - This is not the optimal parallelism for this config, but it works well enough for fitting a big model.
  - For some reason I could not set tensor_parallel: 8 and data_parallel_replicate_degree: 4 in Titan (may be user error).
- As a result, the batch size we need to pass to the replay buffer becomes 64, which is local_batch_size * data_parallel_degree = 2 * 32 (since we have 32 GPUs in the trainer).
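
To make the batch-size arithmetic above concrete, here is a minimal sketch. The helper name `replay_buffer_batch_size` and its parameters are hypothetical and are not part of Titan or this PR. It assumes 8 GPUs per node (32 GPUs over 4 nodes) and that every rank not used for tensor parallelism counts toward the data-parallel degree.

```python
def replay_buffer_batch_size(
    local_batch_size: int,
    num_nodes: int,
    gpus_per_node: int,
    tensor_parallel_degree: int = 1,
) -> int:
    """Hypothetical helper: batch size the replay buffer must hand to the trainer.

    Assumes the data-parallel degree is the number of trainer GPUs divided by
    the tensor-parallel degree, so "full FSDP" (no tensor parallelism) makes
    every GPU a data-parallel rank.
    """
    world_size = num_nodes * gpus_per_node
    if world_size % tensor_parallel_degree != 0:
        raise ValueError("tensor_parallel_degree must divide the number of GPUs")
    data_parallel_degree = world_size // tensor_parallel_degree
    return local_batch_size * data_parallel_degree


# Full FSDP on 4 nodes x 8 GPUs with local_batch_size = 2, as in this PR.
full_fsdp = replay_buffer_batch_size(local_batch_size=2, num_nodes=4, gpus_per_node=8)

# The layout that could not be set (tensor parallel 8, data parallel 4),
# shown only for comparison under the same assumptions.
tp8_dp4 = replay_buffer_batch_size(
    local_batch_size=2, num_nodes=4, gpus_per_node=8, tensor_parallel_degree=8
)

print(full_fsdp, tp8_dp4)
```

Under these assumptions the full FSDP layout needs a much larger replay buffer batch than the tensor-parallel layout would at the same local_batch_size, which is why the value changes in this example.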

I also accidentally added my staged Qwen MoE 30B, which should also work lol

codecov-commenter · 2025-10-16T00:57:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.71%. Comparing base (633b219) to head (77bb612).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #434      +/-   ##
==========================================
+ Coverage   64.69%   64.71%   +0.01%     
==========================================
  Files          79       79              
  Lines        7775     7776       +1     
==========================================
+ Hits         5030     5032       +2     
+ Misses       2745     2744       -1


example 4 host

77bb612

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 16, 2025

Jack-Khuu added the NOT_FOR_REVIEW PR's from Core Maintainers, not intended for review or landing label Oct 16, 2025
