allenwang28
Contributor

@allenwang28 allenwang28 commented Oct 16, 2025

for Hangoo reference

Key changes:

  • Set data_parallel_shard_degree: -1, which enables "full FSDP".
    • This is not the optimal parallelism for this config, but it works well enough for fitting a big model.
    • For some reason I could not set tensor_parallel: 8 and data_parallel_replicate_degree: 4 in Titan (may be user error).
  • As a result, the batch size we need to pass to the replay buffer becomes 64, which is local_batch_size * data_parallel_degree = 2 * 32 (since we have 32 GPUs in the trainer).
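
The batch-size arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper names `resolve_dp_shard_degree` and `replay_buffer_batch_size` are hypothetical and are not part of Titan or this repo, and the sketch assumes that a shard degree of -1 means "shard across every GPU not already claimed by another parallelism dimension".

```python
def resolve_dp_shard_degree(world_size, dp_shard=-1, dp_replicate=1, tp=1):
    """Hypothetical helper: resolve a shard degree of -1 to the remaining GPUs.

    Assumption: -1 means "use whatever is left after the replicate and
    tensor-parallel degrees have been accounted for".
    """
    if dp_shard != -1:
        return dp_shard
    other = dp_replicate * tp
    if world_size % other != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by dp_replicate * tp = {other}"
        )
    return world_size // other


def replay_buffer_batch_size(local_batch_size, dp_shard, dp_replicate=1):
    """Hypothetical helper: total samples needed per step across all data-parallel ranks."""
    return local_batch_size * dp_shard * dp_replicate


# Values from the PR description: 32 trainer GPUs, local batch size 2.
trainer_gpus = 32
local_batch_size = 2

dp_shard = resolve_dp_shard_degree(trainer_gpus)  # full FSDP, no TP or replication
global_batch = replay_buffer_batch_size(local_batch_size, dp_shard)

print(dp_shard, global_batch)
```

With the PR's numbers, the resolved shard degree is 32 and the replay buffer batch size is 2 * 32 = 64, matching the value stated above.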

I also accidentally included my staged Qwen MoE 30B changes, which should also work lol

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 16, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.71%. Comparing base (633b219) to head (77bb612).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #434      +/-   ##
==========================================
+ Coverage   64.69%   64.71%   +0.01%     
==========================================
  Files          79       79              
  Lines        7775     7776       +1     
==========================================
+ Hits         5030     5032       +2     
+ Misses       2745     2744       -1     

@Jack-Khuu Jack-Khuu added the NOT_FOR_REVIEW PR's from Core Maintainers, not intended for review or landing label Oct 16, 2025