Tensor statistic collection via NVIDIA-DLFW-Inspect & TE debug modules #3811
paul-gibbons wants to merge 8 commits into NVIDIA:main from
Signed-off-by: Paul Gibbons <pgibbons@nvidia.com>
How does this PR relate to what's in bridge? https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/6d8a1c547d2fc0a9d2812c3d6477fb0fca4d3714/src/megatron/bridge/training/tensor_inspect.py#L4
It brings the same functionality to the MLM training loop. It is needed in MLM for further development of an additional set of MRs that add tensor inspect modules inside mcore.
Can we consolidate this, @maanug-nv? Maybe bridge gets this from mcore?
deepakn94 left a comment:
@paul-gibbons Can you add some documentation on what the output of this looks like? You can add it somewhere here: https://github.com/NVIDIA/Megatron-LM/tree/main/docs. Screenshots, etc. would be useful!
What does this PR do?
Adds support for NVIDIA DLFW Inspect to the Megatron training loop, which enables the user to collect tensor statistics from TransformerEngine's debug modules.
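To make the integration pattern concrete, here is a minimal, library-free sketch of what "collecting tensor statistics from inside a training loop" can look like: an inspector is created before training, tensors are offered to it each iteration, and its step counter is advanced once per iteration so that statistics are only recorded at a configured frequency. `ToyInspector`, `log_tensor`, `train`, and the `freq` parameter are hypothetical names used for illustration only; they are not the NVIDIA-DLFW-Inspect API and not the functions added by this PR.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class ToyInspector:
    """Hypothetical stand-in for a tensor-inspection backend (illustration only)."""

    freq: int = 2                      # record statistics every `freq` iterations
    iteration: int = 0                 # inspector's view of the current iteration
    records: list = field(default_factory=list)

    def log_tensor(self, name, values):
        # Skip iterations that do not match the configured frequency.
        if self.iteration % self.freq != 0:
            return
        self.records.append({
            "iteration": self.iteration,
            "tensor": name,
            "mean": statistics.fmean(values),
            "min": min(values),
            "max": max(values),
        })

    def step(self):
        # Advance the iteration counter; called once per training iteration.
        self.iteration += 1


def train(num_iters, inspector):
    for it in range(num_iters):
        # Placeholder "activation"; a real loop would run forward/backward here.
        activation = [float(it), float(it) + 1.0, float(it) + 2.0]
        inspector.log_tensor("activation", activation)
        inspector.step()
    return inspector.records
```

Calling `train(5, ToyInspector(freq=2))` records statistics on iterations 0, 2, and 4 only. The point of the sketch is the placement of the two hooks (one-time setup before the loop, one `step()` per iteration), which keeps the collection logic out of the model code itself.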
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
`.github/CODEOWNERS`. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.