
[doc] refactor: split technique guidance into docs, skills, and knowledge #2809

Open

yaoyu-33 wants to merge 7 commits into main from yuya/public-perf-techniques-knowledge

Conversation

@yaoyu-33 (Contributor) commented Mar 15, 2026

Summary

  • move technique content to a clearer three-layer model: stable docs in docs/training, operational guides in skills/perf-techniques, and machine-readable cards in knowledge/techniques
  • rewrite the training docs to keep stable guidance there while moving enablement details, code anchors, and verification commands out of the docs pages
  • add missing knowledge cards and a dedicated hybrid context parallel doc so the technique set no longer depends on duplicated validated-knowledge pages

Test plan

  • Run `pre-commit run --files` on the staged docs, skills, and knowledge files
  • Verify the resulting staged tree is docs-only and does not include the roadmap proposal

Made with Cursor

Summary by CodeRabbit

  • Documentation
    • Added strategy selection guide for choosing parallelism approaches based on model type, hardware topology, and sequence length.
    • Introduced comprehensive guides for CUDA graphs optimization, hybrid context parallelism, and Megatron FSDP.
    • Simplified and reorganized communication overlap and packed sequences documentation with clearer usage guidelines.
    • Consolidated resiliency feature documentation with stable constraints and implementation guidance.
    • Established structured technique documentation with operational guides and validation cards for performance optimization strategies.

Clarify which training guidance is stable documentation versus operational enablement so technique knowledge can be maintained without duplicated docs.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
@yaoyu-33 added labels Mar 15, 2026: `docs` (Documentation-only updates or documentation debt), `area:misc` (Cross-cutting utilities, logging, helpers, and other changes), `docs-only` (With great power comes great responsibility.)
@yaoyu-33 (Contributor Author)

/ok to test e397f32


copy-pr-bot bot commented Mar 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Add docs/skills/knowledge split for parallelism strategy selection,
based on real recipe layouts from DeepSeek-V2/V3, Qwen3, Kimi-K2, etc.
Key insight: MoE models size TP by active params (often TP=1-2) and use
EP as the primary scaling dimension, unlike dense models.

Verified TP=2+PP=2+SP on cluster with llama32_1b_pretrain_config.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
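To make the dense-vs-MoE sizing note above concrete, here is a minimal sketch. It is hedged: the MoE recipe helper and the exact field paths are assumptions written in Megatron-style naming, not verified Megatron-Bridge API; only the `llama32_1b_pretrain_config` TP=2+PP=2+SP layout is taken from the commit message.

```python
# Illustrative only: the MoE recipe helper is hypothetical, and field
# paths follow Megatron-style config naming rather than a verified API.

# Dense model: TP is sized against total parameters.
dense = llama32_1b_pretrain_config()            # recipe verified in this PR
dense.model.tensor_model_parallel_size = 2      # TP=2
dense.model.pipeline_model_parallel_size = 2    # PP=2
dense.model.sequence_parallel = True            # SP pairs with TP > 1

# MoE model: TP is sized against *active* parameters (often TP=1-2),
# and expert parallelism (EP) is the primary scaling dimension.
moe = deepseek_v3_pretrain_config()             # hypothetical recipe name
moe.model.tensor_model_parallel_size = 1
moe.model.expert_model_parallel_size = 8        # EP carries the scaling
```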
cuichenx previously approved these changes Mar 16, 2026
Move perf-technique skills from flat .md files to subfolders with
SKILL.md + card.yaml, consolidating knowledge/techniques/ YAML cards
alongside their skills. Add resiliency skill and card. Refactor
docs/training/resiliency.md to be concise conceptual docs, offloading
operational details to skills/resiliency/SKILL.md.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Add stable docs page, operational SKILL.md, and machine-readable
card.yaml for CUDA graph capture in Megatron Bridge. Update
performance-guide.md to reference current cuda_graph_impl/scope
config fields instead of the deprecated enable_cuda_graph.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
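The rename described above, as a minimal before/after sketch (the field names come from this PR's own docs; the scope value here is illustrative):

```python
# Deprecated field (per the commit message above):
# cfg.model.enable_cuda_graph = True

# Current fields; the scope value is illustrative:
cfg.model.cuda_graph_impl = "transformer_engine"  # or "local"
cfg.model.cuda_graph_scope = ["attn"]
```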
@yaoyu-33 (Contributor Author)

/ok to test 5d8044c

@yaoyu-33 marked this pull request as ready for review March 17, 2026 21:30
@coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Documentation restructuring and expansion across training guides and knowledge organization. Consolidates verbose operational guidance into stable, concise overviews in docs/training/ while establishing a new skills/perf-techniques/ directory with structured SKILL.md and card.yaml pairs for performance techniques and resiliency features.

Changes

**Training Documentation Rewrites** (`docs/training/communication-overlap.md`, `docs/training/megatron-fsdp.md`, `docs/training/packed-sequences.md`, `docs/training/resiliency.md`)
Replaces granular, mode-specific guidance and extensive code examples with high-level overviews, stable constraints, and "When to Use It" sections. Removes detailed configuration snippets and procedural walkthroughs; consolidates into streamlined feature descriptions and references to related documentation.

**New Training Documentation** (`docs/training/cuda-graphs.md`, `docs/training/hybrid-context-parallel.md`)
Introduces new stable documentation pages covering CUDA graphs support (overview, implementations, configuration, constraints) and hierarchical context parallelism (the Bridge-facing a2a+p2p transport path and `hierarchical_context_parallel_sizes`).

**Documentation Index Updates** (`docs/parallelisms.md`, `docs/performance-guide.md`, `docs/training/README.md`)
Adds a parallelism strategy selection guide, updates CUDA graph references, and inserts new table entries for the Hybrid Context Parallel and Megatron FSDP features.

**Skills Infrastructure** (`knowledge/techniques/README.md`, `skills/perf-techniques/README.md`)
Establishes skills directory structure documentation. Technique knowledge cards move to co-locate with operational guides in `skills/perf-techniques/<technique>/` as `card.yaml`, with stable human-facing docs remaining in `docs/training/`.

**Performance Technique Skills** (`skills/perf-techniques/cuda-graphs/...`, `skills/perf-techniques/expert-parallel-overlap/...`, `skills/perf-techniques/hybrid-context-parallel/...`, `skills/perf-techniques/megatron-fsdp/...`, `skills/perf-techniques/parallelism-strategies/...`, `skills/perf-techniques/sequence-packing/...`, `skills/perf-techniques/tp-dp-comm-overlap/...`)
Introduces seven new technique skill directories, each containing `SKILL.md` (an operational guide with enablement steps, code anchors, pitfalls, and verification procedures) and `card.yaml` (structured metadata with validation status, feature meanings, constraints, and evidence references).

**Resiliency Skills** (`skills/resiliency/SKILL.md`, `skills/resiliency/card.yaml`)
Adds comprehensive resiliency documentation covering fault tolerance, straggler detection, in-process restart, preemption, and re-run state-machine features, with configuration details, code anchors, pitfalls, and verification guidance; complemented by a structured `card.yaml` with validation status and constraints.
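For orientation, a minimal hypothetical `card.yaml` for one technique directory; the field names (`summary`, `known_constraints`, `evidence`) are taken from the cards quoted later in this review, but the full schema is an assumption:

```yaml
# Hypothetical minimal card; field names mirror the cards quoted in
# this review, but the full schema is an assumption.
technique: tp-dp-comm-overlap
summary: >
  Overlap TP/DP communication with compute during training.
known_constraints:
  - TP overlap requires tensor_model_parallel_size > 1.
  - TP overlap requires sequence_parallel=True.
evidence:
  - src/megatron/bridge/training/comm_overlap.py
```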

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The PR title '[doc] refactor: split technique guidance into docs, skills, and knowledge' accurately and concisely describes the main refactoring objective—reorganizing technique guidance into a three-layer model (docs, skills, knowledge) as confirmed by commit messages and file structure changes throughout the changeset.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Test Results For Major Changes: ✅ Passed. PR contains only documentation (.md) and configuration (.yaml) files with no code changes affecting numerics, convergence, or performance.


@coderabbitai bot left a comment

Actionable comments posted: 19

🧹 Nitpick comments (3)
skills/perf-techniques/cuda-graphs/SKILL.md (1)

217-221: Consider hyphenating compound modifier.

Static analysis suggests "TE-scoped" for clarity when used as a compound adjective before a noun.

📝 Optional fix
 4. **Memory overhead**: CUDA graphs pin all intermediate buffers for the
-   graph's lifetime (no memory reuse). TE scoped graphs add a few GB;
+   graph's lifetime (no memory reuse). TE-scoped graphs add a few GB;
    full-iteration graphs can increase peak memory by 1.5–2×. `PP > 1`
    compounds overhead since each stage holds its own graph.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/cuda-graphs/SKILL.md` around lines 217 - 221, Change
the compound modifier "TE scoped graphs" to a hyphenated form "TE-scoped graphs"
in the paragraph discussing memory overhead (the line that currently reads "TE
scoped graphs add a few GB;"), so the compound adjective is clear when used
before the noun; update the occurrence(s) of "TE scoped" to "TE-scoped" to
maintain consistent style.
skills/resiliency/SKILL.md (1)

263-265: Consider more concise phrasing (optional).

Minor style suggestion from static analysis: "is not compatible with" could be "is incompatible with" for conciseness.

📝 Optional fix
-3. **IPR + NeMo-Run**: In-process restart is not compatible with NeMo-Run
+3. **IPR + NeMo-Run**: In-process restart is incompatible with NeMo-Run
    or Slurm preemption plugins. Requires specific PyTorch/NCCL versions
    and env vars.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/resiliency/SKILL.md` around lines 263 - 265, Replace the phrase "is
not compatible with" with the more concise "is incompatible with" in the bullet
that begins "IPR + NeMo-Run" so the line reads "IPR + NeMo-Run: In-process
restart is incompatible with NeMo-Run or Slurm preemption plugins." Keep the
rest of the sentence (the note about specific PyTorch/NCCL versions and env
vars) unchanged; update only the quoted wording in the SKILL.md bullet under
"IPR + NeMo-Run".
skills/perf-techniques/sequence-packing/card.yaml (1)

39-50: Avoid pseudo-values in this machine-readable block.

match_seq_length, ">1", and 2 * context_parallel_size are prose placeholders inside scalar YAML values. That makes recommended_path hard for tooling to consume consistently and undercuts the machine-readable-card goal. Consider moving these into explicit constraint fields or structured metadata instead.

As per coding guidelines, "Do not add arbitrary defaults for configs, be as explicit as possible".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/sequence-packing/card.yaml` around lines 39 - 50, The
YAML uses prose placeholders inside scalar values (e.g.,
recommended_path.llm_packed_sft.dataset.packed_sequence_specs.packed_sequence_size:
match_seq_length, recommended_path.vlm_in_batch_packing.train.micro_batch_size:
">1", and
recommended_path.cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult: 2
* context_parallel_size); replace these pseudo-values with explicit,
machine-readable fields or structured constraints: for example add a constraint
object like packed_sequence_specs.packed_sequence_size_mode: "match_seq_length"
or packed_sequence_specs.packed_sequence_size_constraint: {mode:
"match_seq_length"}, replace the string range with a min/max field such as
vlm_in_batch_packing.train.micro_batch_size_min: 2, and replace arithmetic with
a factor field such as
cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult_factor: 2 plus a
reference name context_parallel_size_variable so tooling can compute
pad_seq_to_mult at runtime—update llm_packed_sft, cp_finetuning, and
vlm_in_batch_packing accordingly so all values are explicit and
machine-readable.
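A sketch of the machine-readable restructuring this prompt proposes; the key names (`packed_sequence_size_mode`, `micro_batch_size_min`, `pad_seq_to_mult_factor`, the context-parallel-size reference) are the prompt's own suggestions, not an existing schema:

```yaml
# Suggested restructuring; key names are the prompt's proposals, not an
# existing schema.
llm_packed_sft:
  dataset:
    packed_sequence_specs:
      packed_sequence_size_mode: match_seq_length   # replaces prose value
vlm_in_batch_packing:
  train:
    micro_batch_size_min: 2                         # replaces ">1"
cp_finetuning:
  dataset:
    packed_sequence_specs:
      pad_seq_to_mult_factor: 2                     # replaces "2 * context_parallel_size"
      pad_seq_to_mult_variable: context_parallel_size  # reference for tooling
```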
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/parallelisms.md`:
- Around line 508-512: Update the broken cross-reference in
docs/parallelisms.md: replace the non-existent
../knowledge/techniques/parallelism_strategies.yaml link with the relocated card
path under skills/perf-techniques (e.g.,
../skills/perf-techniques/parallelism-strategies/card.yaml) so the "Knowledge
card" link in the "Related Artifacts" section points to the correct card.yaml;
edit the link text that currently references parallelism_strategies.yaml and
ensure it matches the directory name used by the existing skill file
(parallelism-strategies) and the card.yaml filename.

In `@docs/training/communication-overlap.md`:
- Around line 12-13: Update the stale skill file references that point to
pre-split markdown files by changing them to the new per-technique SKILL.md
layout: replace `skills/perf-techniques/tp-dp-comm-overlap.md` with
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` and replace
`skills/perf-techniques/expert-parallel-overlap.md` with
`skills/perf-techniques/expert-parallel-overlap/SKILL.md`; also apply the same
path updates to the other occurrences noted (101-102) so all references point to
the new SKILL.md locations.

In `@docs/training/cuda-graphs.md`:
- Around line 45-56: The docs use two different RNG-tracker
names—cfg.rng.te_rng_tracker in the example and use_te_rng_tracker in the
constraints—so pick one and make them consistent; update the constraint line to
refer to the same symbol used in the example (cfg.rng.te_rng_tracker) or
conversely change the example to use use_te_rng_tracker, and ensure the
constraint referencing cuda_graph_impl still reads "cfg.rng.te_rng_tracker must
be True when cfg.model.cuda_graph_impl != 'none'".
- Line 1: The new document docs/training/cuda-graphs.md is orphaned; update the
relevant toctree by adding a reference to "training/cuda-graphs.md" in the docs
index—specifically edit docs/index.md (or the appropriate docs/training index)
and insert the relative path or title entry into the toctree/list so the page is
included in the site navigation; ensure the entry matches the existing toctree
style and run a local docs build to verify the warning is resolved.

In `@docs/training/hybrid-context-parallel.md`:
- Around line 7-8: Update the opening pointer string that currently references
"skills/perf-techniques/hybrid-context-parallel.md" to use the new layout path
"skills/perf-techniques/hybrid-context-parallel/SKILL.md" so the link matches
the rest of the file; search for the literal reference
"skills/perf-techniques/hybrid-context-parallel.md" and replace it with
"skills/perf-techniques/hybrid-context-parallel/SKILL.md".
- Around line 1-8: The new doc "Hybrid / Hierarchical Context Parallel"
(hybrid-context-parallel.md) is not listed in any Sphinx toctree and will be
orphaned; update the relevant toctree (e.g., add an entry in docs/index.md or
the training toctree that aggregates docs/training/*.md) to include
"training/hybrid-context-parallel.md" so the page is reachable in the built docs
and the toc.not_included warning is resolved.

In `@docs/training/megatron-fsdp.md`:
- Line 10: The docs reference the old path
"skills/perf-techniques/megatron-fsdp.md" which is stale; update all internal
links in docs/training/megatron-fsdp.md that point to
"skills/perf-techniques/megatron-fsdp.md" (including the occurrences around the
top and near line 100) to the new path
"skills/perf-techniques/megatron-fsdp/SKILL.md" so internal navigation works;
ensure you replace every exact match of the old filename string with the new
"SKILL.md" path.

In `@docs/training/packed-sequences.md`:
- Around line 8-10: The docs reference
'skills/perf-techniques/sequence-packing.md' which no longer exists; update
every occurrence (including the one at lines ~83-89) to point to the relocated
guide 'skills/perf-techniques/sequence-packing/SKILL.md' so links resolve
correctly; search the file for the string "sequence-packing.md" and replace it
with "sequence-packing/SKILL.md" in the content and any link markup (the
affected file is docs/training/packed-sequences.md).

In `@docs/training/resiliency.md`:
- Around line 15-24: The table entry for "Local checkpointing" is missing its
corresponding documentation sections; add a matching "Local checkpointing"
feature block with "When to Use It" and "Stable Constraints" (and any other
sections used for other features like motivation/usage) following the exact
structure and tone used for entries such as "Async checkpoint save" and
"Preemption", or if this feature should not be documented yet, remove the "Local
checkpointing" row from the table to keep the table consistent; locate
references to "Local checkpointing" in the doc and implement the new sections
with clear guidance on motivation, technical approach, architecture, usage
examples, and implementation details to satisfy the project documentation
requirements.

In `@skills/perf-techniques/expert-parallel-overlap/card.yaml`:
- Around line 27-33: Update the card's known_constraints to mirror the runtime
checks in src/megatron/bridge/training/comm_overlap.py: add a constraint that
moe_shared_expert_overlap must be disabled/false (referencing the
moe_shared_expert_overlap check), require torch version >= 2.6.0 (referencing
the torch version check), and enumerate the recompute-related constraints
enforced in the file (recompute_granularity, recompute_method,
recompute_num_layers) plus the mtp_num_layers constraint so configurations
failing those checks are rejected by the card; ensure each constraint references
the exact setting names used in the code (moe_shared_expert_overlap,
recompute_granularity, recompute_method, recompute_num_layers, mtp_num_layers,
torch version) and phrase them consistently with the existing items.

In `@skills/perf-techniques/expert-parallel-overlap/SKILL.md`:
- Around line 106-110: Update the "Success criteria" in SKILL.md to remove the
hard-coded "41 passed" count and instead require that the targeted test files
pass with zero failures; specifically replace the "41 passed" bullet with a
statement that tests under test_comm_overlap.py (MoE overlap and delayed-wgrad
validation) and test_deepep.py (DeepEP/HybridEP helper activation and GPU
gating) must pass with no failures, so the criteria track those files rather
than a total pytest number.
- Around line 41-49: Add the missing hard requirement that
moe_shared_expert_overlap = False to the "Required constraints" list because the
code asserts this when overlap_moe_expert_parallel_comm is enabled; update
SKILL.md's constraints block to include the exact constraint
"moe_shared_expert_overlap = False" alongside the other items and mention it
applies whenever overlap_moe_expert_parallel_comm is true so readers know it's
not optional.

In `@skills/perf-techniques/hybrid-context-parallel/card.yaml`:
- Around line 51-60: The evidence list in card.yaml references five files that
don't exist; either add the missing files (create
scripts/verify_hybrid_context_parallel.py and the four Megatron-LM stubs under
3rdparty/Megatron-LM/megatron/core and megatron/training) or remove those
entries from the evidence array to keep the card accurate; locate the evidence
key in skills/perf-techniques/hybrid-context-parallel/card.yaml and update it so
only existing files (src/megatron/bridge/training/initialize.py, config.py,
model_load_save.py, and docs/performance-guide.md) remain or add the missing
files with the expected verification script and Megatron modules.

In `@skills/perf-techniques/hybrid-context-parallel/SKILL.md`:
- Around line 94-103: The documentation references a missing verification script
scripts/verify_hybrid_context_parallel.py in SKILL.md which causes "No such
file" errors; either add the missing script or remove/update the verification
commands. Locate the SKILL.md section that runs the verification (the
CUDA_VISIBLE_DEVICES... uv run python -m torch.distributed.run commands) and (a)
if the verification logic exists elsewhere, point the commands to the correct
script/module name and update the examples, or (b) if the verification is
intended to be provided, implement scripts/verify_hybrid_context_parallel.py
matching the CLI flags (--mode mpu/decentralized) and expected behavior, or (c)
remove the verification subsection entirely; update the README references and
any mentions of verify_hybrid_context_parallel.py accordingly.

In `@skills/perf-techniques/megatron-fsdp/card.yaml`:
- Around line 3-6: The torch_fsdp2_runtime status in the card.yaml is
inconsistent with the documented failure in known_limitations; update the
torch_fsdp2_runtime field (in skills/perf-techniques/megatron-fsdp/card.yaml)
from "unclear" to an explicit status reflecting the runtime failure (e.g.,
"failed" or "runtime_failure") and ensure the value aligns with the
known_limitations note about the Torch FSDP2 TypeError from pg_collection so
readers see the runtime failure state consistently.

In `@skills/perf-techniques/megatron-fsdp/SKILL.md`:
- Around line 27-35: The example config block for Megatron FSDP is missing the
cfg.ddp.average_in_collective = False override and thus diverges from the
recommended minimal override and perf-harness overrides; update the snippet that
sets cfg.dist.use_megatron_fsdp, cfg.ddp.use_megatron_fsdp,
cfg.ddp.data_parallel_sharding_strategy, cfg.checkpoint.ckpt_format,
cfg.checkpoint.save, and cfg.checkpoint.load to also set
cfg.ddp.average_in_collective = False so the copy-paste example matches the
recommended override set.

In `@skills/perf-techniques/parallelism-strategies/SKILL.md`:
- Around line 207-208: The cross-reference path in the paragraph mentioning
`skills/perf-techniques/tp-dp-comm-overlap.md` is incorrect; update that
reference to the correct target
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` so the link points to the
existing SKILL.md file (edit the occurrence of
`skills/perf-techniques/tp-dp-comm-overlap.md` in the current `SKILL.md` file
and replace it with `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`).

In `@skills/perf-techniques/sequence-packing/SKILL.md`:
- Around line 39-44: The doc block must mention that enabling pad_cu_seqlens
requires packed-sequence metadata to be present; update the text to explicitly
state that when cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True you must
supply packed-sequence metadata for your dataset (or disable pad_cu_seqlens),
because the dataset code (see src.megatron.bridge.data.datasets.sft assertion in
the packed-sequence handling) will assert that metadata exists; reference
pad_cu_seqlens, cfg.dataset.dataset_kwargs["pad_to_max_length"], and the
assertion in the sft dataset so users with custom packed datasets know to
generate/provide the required packed metadata.

In `@skills/perf-techniques/tp-dp-comm-overlap/card.yaml`:
- Around line 31-36: Add a new known_constraints entry stating that TP overlap
requires Transformer Engine to be present; specifically mention that TP overlap
is disabled when HAVE_TE is false (see the gating in comm_overlap where the code
checks elif not HAVE_TE), so include "TP overlap requires Transformer Engine
(HAVE_TE)". Keep the same YAML list style and place it alongside the existing TP
overlap constraints referencing tensor_model_parallel_size and
sequence_parallel.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d69fecaa-5a5e-4c83-8fbc-5de00f65b2bf

📥 Commits

Reviewing files that changed from the base of the PR and between bb5790f and 5d8044c.

📒 Files selected for processing (27)
  • docs/parallelisms.md
  • docs/performance-guide.md
  • docs/training/README.md
  • docs/training/communication-overlap.md
  • docs/training/cuda-graphs.md
  • docs/training/hybrid-context-parallel.md
  • docs/training/megatron-fsdp.md
  • docs/training/packed-sequences.md
  • docs/training/resiliency.md
  • knowledge/techniques/README.md
  • skills/perf-techniques/README.md
  • skills/perf-techniques/cuda-graphs/SKILL.md
  • skills/perf-techniques/cuda-graphs/card.yaml
  • skills/perf-techniques/expert-parallel-overlap/SKILL.md
  • skills/perf-techniques/expert-parallel-overlap/card.yaml
  • skills/perf-techniques/hybrid-context-parallel/SKILL.md
  • skills/perf-techniques/hybrid-context-parallel/card.yaml
  • skills/perf-techniques/megatron-fsdp/SKILL.md
  • skills/perf-techniques/megatron-fsdp/card.yaml
  • skills/perf-techniques/parallelism-strategies/SKILL.md
  • skills/perf-techniques/parallelism-strategies/card.yaml
  • skills/perf-techniques/sequence-packing/SKILL.md
  • skills/perf-techniques/sequence-packing/card.yaml
  • skills/perf-techniques/tp-dp-comm-overlap/SKILL.md
  • skills/perf-techniques/tp-dp-comm-overlap/card.yaml
  • skills/resiliency/SKILL.md
  • skills/resiliency/card.yaml

Comment on lines +508 to +512
## Related Artifacts

- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification
- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status


⚠️ Potential issue | 🟡 Minor

Fix broken cross-reference to knowledge card.

The pipeline failure indicates the path ../knowledge/techniques/parallelism_strategies.yaml does not exist. Based on the README at knowledge/techniques/README.md, cards have been relocated to skills/perf-techniques/<technique>/card.yaml. Update the reference to point to the correct location.

🔧 Proposed fix
 ## Related Artifacts
 
 - **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification
-- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status
+- **Knowledge card**: [skills/perf-techniques/parallelism-strategies/card.yaml](../skills/perf-techniques/parallelism-strategies/card.yaml) — structured metadata and validation status
🧰 Tools
🪛 GitHub Actions: Build docs

[warning] 511-511: Sphinx: cross-reference target not found: '../knowledge/techniques/parallelism_strategies.yaml' [myst.xref_missing]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/parallelisms.md` around lines 508 - 512, Update the broken
cross-reference in docs/parallelisms.md: replace the non-existent
../knowledge/techniques/parallelism_strategies.yaml link with the relocated card
path under skills/perf-techniques (e.g.,
../skills/perf-techniques/parallelism-strategies/card.yaml) so the "Knowledge
card" link in the "Related Artifacts" section points to the correct card.yaml;
edit the link text that currently references parallelism_strategies.yaml and
ensure it matches the directory name used by the existing skill file
(parallelism-strategies) and the card.yaml filename.

Comment on lines +12 to +13
- `skills/perf-techniques/tp-dp-comm-overlap.md`
- `skills/perf-techniques/expert-parallel-overlap.md`

⚠️ Potential issue | 🟡 Minor

Update these skill references to the new .../SKILL.md layout.

These pointers still use the pre-split skills/perf-techniques/*.md paths, but this PR is moving skills into per-technique directories with SKILL.md/card.yaml pairs. At least skills/perf-techniques/expert-parallel-overlap/SKILL.md already follows that layout, so the current file references will be stale immediately.

📝 Suggested path fix
- `skills/perf-techniques/tp-dp-comm-overlap.md`
- `skills/perf-techniques/expert-parallel-overlap.md`
+ `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`
+ `skills/perf-techniques/expert-parallel-overlap/SKILL.md`
...
- `skills/perf-techniques/tp-dp-comm-overlap.md`
- `skills/perf-techniques/expert-parallel-overlap.md`
+ `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`
+ `skills/perf-techniques/expert-parallel-overlap/SKILL.md`

Also applies to: 101-102

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/training/communication-overlap.md` around lines 12 - 13, Update the
stale skill file references that point to pre-split markdown files by changing
them to the new per-technique SKILL.md layout: replace
`skills/perf-techniques/tp-dp-comm-overlap.md` with
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` and replace
`skills/perf-techniques/expert-parallel-overlap.md` with
`skills/perf-techniques/expert-parallel-overlap/SKILL.md`; also apply the same
path updates to the other occurrences noted (101-102) so all references point to
the new SKILL.md locations.

@@ -0,0 +1,84 @@
# CUDA Graphs

⚠️ Potential issue | 🟠 Major

Add this page to the docs toctree.

The docs build is already warning that this page is not included in any toctree, so the new document is currently orphaned and easy to miss. Please wire it into the relevant docs index/toctree before merging.

As per coding guidelines, "Ensure docs/index.md is updated when new markdown docs are added under docs/**/*.md or markdown files are renamed".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/training/cuda-graphs.md` at line 1, The new document
docs/training/cuda-graphs.md is orphaned; update the relevant toctree by adding
a reference to "training/cuda-graphs.md" in the docs index—specifically edit
docs/index.md (or the appropriate docs/training index) and insert the relative
path or title entry into the toctree/list so the page is included in the site
navigation; ensure the entry matches the existing toctree style and run a local
docs build to verify the warning is resolved.

Comment on lines +45 to +56
```python
cfg.model.cuda_graph_impl = "transformer_engine" # or "local"
cfg.model.cuda_graph_scope = ["attn", "moe_router"] # scope list
cfg.model.cuda_graph_warmup_steps = 3 # warmup before capture
cfg.rng.te_rng_tracker = True # required
```

### Key constraints

- `use_te_rng_tracker` must be `True` when `cuda_graph_impl != "none"`.
- `full_iteration` scope requires `cuda_graph_impl = "local"` and
`rerun_state_machine.check_for_nan_in_loss = False`.

⚠️ Potential issue | 🟡 Minor

Use one RNG-tracker setting name here.

The example uses cfg.rng.te_rng_tracker, but the constraint list switches to use_te_rng_tracker. One of those names is wrong, so the setup instructions are currently self-contradictory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/training/cuda-graphs.md` around lines 45 - 56, The docs use two
different RNG-tracker names—cfg.rng.te_rng_tracker in the example and
use_te_rng_tracker in the constraints—so pick one and make them consistent;
update the constraint line to refer to the same symbol used in the example
(cfg.rng.te_rng_tracker) or conversely change the example to use
use_te_rng_tracker, and ensure the constraint referencing cuda_graph_impl still
reads "cfg.rng.te_rng_tracker must be True when cfg.model.cuda_graph_impl !=
'none'".

Comment on lines +1 to +8
# Hybrid / Hierarchical Context Parallel

This page covers the stable Bridge-facing meaning of hierarchical context
parallelism, especially the `a2a+p2p` transport path and
`hierarchical_context_parallel_sizes`.

For operational setup, code anchors, and verification commands, see
`skills/perf-techniques/hybrid-context-parallel.md`.

⚠️ Potential issue | 🟠 Major

Add this new page to a Sphinx toctree before merging.

The docs pipeline is already warning "document isn't included in any toctree" [toc.not_included], so this page will be orphaned in the built docs until docs/index.md or the relevant training toctree includes it. Based on learnings, "Ensure docs/index.md is updated when new markdown docs are added under docs/**/*.md or markdown files are renamed".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/training/hybrid-context-parallel.md` around lines 1 - 8, The new doc
"Hybrid / Hierarchical Context Parallel" (hybrid-context-parallel.md) is not
listed in any Sphinx toctree and will be orphaned; update the relevant toctree
(e.g., add an entry in docs/index.md or the training toctree that aggregates
docs/training/*.md) to include "training/hybrid-context-parallel.md" so the page
is reachable in the built docs and the toc.not_included warning is resolved.
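A minimal sketch of the toctree fix, assuming docs/index.md uses MyST fenced directives (the build's `myst.xref_missing` warnings suggest MyST); the `:maxdepth:` option and neighboring entries are illustrative:

````markdown
```{toctree}
:maxdepth: 1

training/cuda-graphs.md
training/hybrid-context-parallel.md
```
````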

Comment on lines +3 to +6
summary: >
Megatron FSDP is the practical FSDP path in Megatron-Bridge today. PyTorch
FSDP2 exists in code, but remains experimental and failed at runtime during
live validation on the current branch.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat skills/perf-techniques/megatron-fsdp/card.yaml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2172


Update torch_fsdp2_runtime status to reflect the documented runtime failure.

The field is marked unclear but known_limitations documents a specific FSDP2 runtime failure: "Live validation of the current branch hit a Torch FSDP2 runtime TypeError from pg_collection." Change the status from unclear to explicitly reflect this failure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/megatron-fsdp/card.yaml` around lines 3 - 6, The
torch_fsdp2_runtime status in the card.yaml is inconsistent with the documented
failure in known_limitations; update the torch_fsdp2_runtime field (in
skills/perf-techniques/megatron-fsdp/card.yaml) from "unclear" to an explicit
status reflecting the runtime failure (e.g., "failed" or "runtime_failure") and
ensure the value aligns with the known_limitations note about the Torch FSDP2
TypeError from pg_collection so readers see the runtime failure state
consistently.
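The follow-up commit later in this thread makes exactly this change; as a one-line sketch:

```yaml
torch_fsdp2_runtime: known_failure  # was "unclear"; matches the pg_collection TypeError note
```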

Comment on lines +27 to +35
```python
cfg = llama3_8b_pretrain_config()
cfg.dist.use_megatron_fsdp = True
cfg.ddp.use_megatron_fsdp = True
cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
cfg.checkpoint.ckpt_format = "fsdp_dtensor"
cfg.checkpoint.save = "/tmp/fsdp_ckpts"
cfg.checkpoint.load = None
```

⚠️ Potential issue | 🟡 Minor

Keep the copy-paste recipe aligned with the recommended override set.

The example on Lines 27-35 drops cfg.ddp.average_in_collective = False, even though the minimal override above and the perf-harness overrides on Lines 91-98 both include it. As written, users who copy the example end up testing a different configuration than the one this skill recommends.

Suggested fix
 cfg = llama3_8b_pretrain_config()
 cfg.dist.use_megatron_fsdp = True
 cfg.ddp.use_megatron_fsdp = True
 cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
+cfg.ddp.average_in_collective = False
 cfg.checkpoint.ckpt_format = "fsdp_dtensor"
 cfg.checkpoint.save = "/tmp/fsdp_ckpts"
 cfg.checkpoint.load = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/megatron-fsdp/SKILL.md` around lines 27 - 35, The
example config block for Megatron FSDP is missing the
cfg.ddp.average_in_collective = False override and thus diverges from the
recommended minimal override and perf-harness overrides; update the snippet that
sets cfg.dist.use_megatron_fsdp, cfg.ddp.use_megatron_fsdp,
cfg.ddp.data_parallel_sharding_strategy, cfg.checkpoint.ckpt_format,
cfg.checkpoint.save, and cfg.checkpoint.load to also set
cfg.ddp.average_in_collective = False so the copy-paste example matches the
recommended override set.

Comment on lines +207 to +208
7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with
overlap settings. See `skills/perf-techniques/tp-dp-comm-overlap.md`.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the correct path for tp-dp-comm-overlap documentation
fd -t f "SKILL.md" skills/perf-techniques/tp-dp-comm-overlap/
fd -t f "tp-dp-comm-overlap" skills/

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 121


Fix incorrect cross-reference path.

Line 208 references skills/perf-techniques/tp-dp-comm-overlap.md, but the file is located at skills/perf-techniques/tp-dp-comm-overlap/SKILL.md.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/parallelism-strategies/SKILL.md` around lines 207 -
208, The cross-reference path in the paragraph mentioning
`skills/perf-techniques/tp-dp-comm-overlap.md` is incorrect; update that
reference to the correct target
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` so the link points to the
existing SKILL.md file (edit the occurrence of
`skills/perf-techniques/tp-dp-comm-overlap.md` in the current `SKILL.md` file
and replace it with `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`).

Comment on lines +39 to +44
If CUDA graphs are enabled for this packed path:

```python
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
```

⚠️ Potential issue | 🟡 Minor

Call out the metadata prerequisite for pad_cu_seqlens.

This block currently implies those two flags are sufficient, but src/megatron/bridge/data/datasets/sft.py:765-772 also asserts that packed-sequence metadata is present when pad_cu_seqlens=True. Without that note, users with custom packed datasets will hit an assertion after following this section.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/sequence-packing/SKILL.md` around lines 39 - 44, The
doc block must mention that enabling pad_cu_seqlens requires packed-sequence
metadata to be present; update the text to explicitly state that when
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True you must supply
packed-sequence metadata for your dataset (or disable pad_cu_seqlens), because
the dataset code (see src.megatron.bridge.data.datasets.sft assertion in the
packed-sequence handling) will assert that metadata exists; reference
pad_cu_seqlens, cfg.dataset.dataset_kwargs["pad_to_max_length"], and the
assertion in the sft dataset so users with custom packed datasets know to
generate/provide the required packed metadata.

Comment on lines +31 to +36
known_constraints:
- TP overlap requires tensor_model_parallel_size > 1.
- TP overlap requires sequence_parallel=True.
- DP overlap is tied to the distributed-optimizer path.
- PP overlap behavior depends on the pipeline schedule and is not identical for every PP setup.
- Launch-time environment tuning is part of practical overlap behavior.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the comm_overlap.py file and check around lines 421-465
find . -type f -name "comm_overlap.py" | head -5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 117


🏁 Script executed:

# Find the card.yaml file mentioned in the review
find . -type f -path "*/tp-dp-comm-overlap/card.yaml"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 124


🏁 Script executed:

# Check if the line range 421-465 exists and what's in it
if [ -f "src/megatron/bridge/training/comm_overlap.py" ]; then
  wc -l src/megatron/bridge/training/comm_overlap.py
  echo "=== Lines 421-465 ===" 
  sed -n '421,465p' src/megatron/bridge/training/comm_overlap.py
fi

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2311


🏁 Script executed:

# Search for Transformer Engine checks in the comm_overlap.py file
rg "TransformerEngine|TE|transformer.engine" src/megatron/bridge/training/comm_overlap.py -A 3 -B 3 || echo "No matches found"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2366


🏁 Script executed:

# Check the card.yaml file content to see the current known_constraints
cat src/megatron/bridge/skills/perf-techniques/tp-dp-comm-overlap/card.yaml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 174


🏁 Script executed:

# Use the correct path found earlier
cat ./skills/perf-techniques/tp-dp-comm-overlap/card.yaml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2113


Add the Transformer Engine prerequisite to known_constraints.

The card marks TP overlap gating as code-verified, but known_constraints omits a gate that the code enforces. In src/megatron/bridge/training/comm_overlap.py:444-446, TP overlap is disabled when Transformer Engine is unavailable (elif not HAVE_TE). This constraint should be documented alongside the tensor-parallel size and sequence-parallel requirements.

📝 Suggested addition
 known_constraints:
   - TP overlap requires tensor_model_parallel_size > 1.
   - TP overlap requires sequence_parallel=True.
+  - TP overlap requires Transformer Engine to be available.
   - DP overlap is tied to the distributed-optimizer path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/tp-dp-comm-overlap/card.yaml` around lines 31 - 36,
Add a new known_constraints entry stating that TP overlap requires Transformer
Engine to be present; specifically mention that TP overlap is disabled when
HAVE_TE is false (see the gating in comm_overlap where the code checks elif not
HAVE_TE), so include "TP overlap requires Transformer Engine (HAVE_TE)". Keep
the same YAML list style and place it alongside the existing TP overlap
constraints referencing tensor_model_parallel_size and sequence_parallel.

…h HCP docs

- Add skills/perf-techniques/moe-comm-overlap/ (SKILL.md + card.yaml)
- Add skills/perf-techniques/packed-sequences-long-context/ (SKILL.md + card.yaml)
- Enrich hybrid-context-parallel SKILL.md with implementation map and pitfalls
- Move conceptual content (what/when) from skills into docs per source-of-truth split:
  - docs/training/hybrid-context-parallel.md: CP backends table, KV-head limit, decision tree
  - docs/training/communication-overlap.md: MoE dispatch/combine overlap details
  - docs/training/packed-sequences.md: three-path table (offline SFT, VLM, CP)
- Note SP interaction with CP packing (pad_seq_to_mult = 2 * CP * TP)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
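A worked example of the padding rule noted in the last bullet above, under the stated semantics (packed sequences pad to a multiple of 2 * CP * TP when SP interacts with CP packing):

```python
# Assumed semantics of the rule "pad_seq_to_mult = 2 * CP * TP".
context_parallel_size = 2
tensor_model_parallel_size = 4
pad_seq_to_mult = 2 * context_parallel_size * tensor_model_parallel_size
assert pad_seq_to_mult == 16  # packed sequences pad up to a multiple of 16
```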
@yaoyu-33 (Contributor Author)

/ok to test 75c9995

@yaoyu-33 (Contributor Author)

/claude review

For operational details on configuring combined parallelism, troubleshooting
layouts, and memory estimation, see the
[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies.md).


Bug: This links to ../skills/perf-techniques/parallelism-strategies.md, but the actual file is skills/perf-techniques/parallelism-strategies/SKILL.md. The link will 404.

Suggested change
[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies/SKILL.md).

Comment on lines +509 to +510

- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification

Two broken links here:

  1. ../skills/perf-techniques/parallelism-strategies.md should be ../skills/perf-techniques/parallelism-strategies/SKILL.md
  2. ../knowledge/techniques/parallelism_strategies.yaml — this file doesn't exist in the repo and isn't added by this PR

Comment on lines +127 to +131

Verify the supported MPU path:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
The verification section references scripts/verify_hybrid_context_parallel.py, but this script doesn't exist in the repo and isn't added by this PR. Either add the script or replace this with a verification method that uses existing files.

- Fix broken cross-references: update all stale *.md skill paths to
  the new */SKILL.md layout used throughout this PR
- Add cuda-graphs.md and hybrid-context-parallel.md to docs/index.md
  toctree so they are no longer orphaned in the Sphinx build
- Fix RNG tracker naming: use cfg.rng.te_rng_tracker consistently
  instead of mixing with use_te_rng_tracker in cuda-graphs.md
- Add Local Checkpointing section to resiliency.md with When to Use It
  and Stable Constraints so the feature table entry is not incomplete
- Add moe_shared_expert_overlap = False as a hard required constraint
  in expert-parallel-overlap/SKILL.md (it is asserted at runtime)
- Replace hard-coded "41 passed" with "zero failures" wording in
  expert-parallel-overlap/SKILL.md so it does not drift
- Add cfg.ddp.average_in_collective = False to FSDP example recipe
  to match the minimal override set documented above it
- Update megatron-fsdp/card.yaml torch_fsdp2_runtime from unclear to
  known_failure to reflect the documented runtime TypeError
- Replace non-existent scripts/verify_hybrid_context_parallel.py in
  hybrid-context-parallel/SKILL.md and card.yaml with existing
  test_decentralized_pg.py and a manual smoke-check recipe invocation
- Add pad_cu_seqlens metadata prerequisite note in
  sequence-packing/SKILL.md (asserted in sft.py:765-772)
- Fix cross-reference in parallelism-strategies/SKILL.md line 208

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33 (Contributor Author)

/ok to test 4f6cca7

All skill and cross-doc references in docs/training/ were plain
backtick literals. Convert them to [text](relative-path) hyperlinks
so they are clickable in GitHub and rendered Sphinx docs.

Covers communication-overlap.md, cuda-graphs.md, hybrid-context-parallel.md,
megatron-fsdp.md, packed-sequences.md, and resiliency.md.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
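An illustrative before/after of that conversion; the relative-path depth is an assumption and depends on the page's location under docs/:

```markdown
<!-- Before: plain backtick literal -->
See `skills/perf-techniques/cuda-graphs/SKILL.md` for operational details.

<!-- After: clickable relative link (path depth assumed) -->
See the [CUDA graphs skill](../../skills/perf-techniques/cuda-graphs/SKILL.md) for operational details.
```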
@yaoyu-33 (Contributor Author)

/ok to test f543ff5

@malay-nagda malay-nagda requested a review from erhoo82 March 18, 2026 07:24

Labels

  • `area:misc`: Cross-cutting utilities, logging, helpers, and other changes
  • `docs`: Documentation-only updates or documentation debt
  • `docs-only`: With great power comes great responsibility.
