
Commit 4ec4001

refactor: move NTP config to separate NonuniformTPConfig class
- Create NonuniformTPConfig dataclass in nonuniform_tp.py
- Remove NTP fields from DistributedDataParallelConfig (non-intrusive)
- Update all NTP functions/classes to use NonuniformTPConfig
- Update all tests to use NonuniformTPConfig
- Update CLAUDE.md documentation

This makes the NTP implementation completely self-contained, with zero modifications to core Megatron files.
1 parent 3708a47 commit 4ec4001

File tree

3 files changed (+125 additions, −96 deletions)

megatron/core/distributed/distributed_data_parallel_config.py

Lines changed: 0 additions & 22 deletions
@@ -162,28 +162,6 @@ class DistributedDataParallelConfig:
     delay_wgrad_compute: bool = False
     """Delay the weight gradient computation to improve batch-level communication overlapping"""

-    tp_base: int = 8
-    """Base for tensor parallelism. This is the number of ranks in healthy tensor parallel groups.
-    Used for nonuniform tensor parallelism."""
-
-    tp_spares: int = 0
-    """Number of spares for nonuniform tensor parallelism. When > 0, enables nonuniform TP mode
-    where (tp_base - tp_spares) ranks handle computation and tp_spares ranks provide fault tolerance."""
-
-    num_reduced_tp_dp_ranks: int = 1
-    """Number of DP ranks that use reduced TP (tp_base - tp_spares). The remaining DP ranks use
-    full tp_base. Reduced TP ranks are assumed to come first in the global rank ordering."""
-
-    non_active_ranks_per_dp: Optional[Dict[Tuple[int, int, int], List[int]]] = None
-    """Mapping of (DP rank, CP rank, PP rank) to list of non-active (spare) local TP rank IDs.
-    This allows specifying arbitrary GPU failures across all parallelism dimensions.
-    Example: {(0,0,0): [0,3], (0,1,0): [1,2], (1,0,0): [0,3]} means:
-    - DP rank 0, CP rank 0, PP rank 0 has local TP ranks 0,3 as spares
-    - DP rank 0, CP rank 1, PP rank 0 has local TP ranks 1,2 as spares
-    - DP rank 1, CP rank 0, PP rank 0 has local TP ranks 0,3 as spares
-    The number of non-active ranks must be consistent across CP replicas within each DP rank.
-    If None, defaults to last tp_spares ranks as non-active."""
-
     def __post_init__(self):
         import os

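The commit message says the fields removed above now live in a standalone NonuniformTPConfig dataclass in nonuniform_tp.py. That file is not part of this diff, so the following is only a minimal sketch of what such a dataclass might look like, reconstructed from the removed fields; the field names and defaults come from the diff, while the `__post_init__` validation is an assumption added here for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class NonuniformTPConfig:
    """Sketch of the standalone NTP config; fields mirror the ones
    removed from DistributedDataParallelConfig in this commit."""

    tp_base: int = 8
    """Number of ranks in healthy tensor parallel groups."""

    tp_spares: int = 0
    """When > 0, (tp_base - tp_spares) ranks compute and
    tp_spares ranks provide fault tolerance."""

    num_reduced_tp_dp_ranks: int = 1
    """Number of DP ranks using reduced TP; assumed to come first
    in the global rank ordering."""

    non_active_ranks_per_dp: Optional[Dict[Tuple[int, int, int], List[int]]] = None
    """Mapping of (DP rank, CP rank, PP rank) to non-active local TP rank IDs.
    If None, the last tp_spares ranks default to non-active."""

    def __post_init__(self):
        # Assumed sanity check (not shown in the diff): spares must
        # leave at least one active rank in each TP group.
        if not 0 <= self.tp_spares < self.tp_base:
            raise ValueError("tp_spares must satisfy 0 <= tp_spares < tp_base")
```

With a separate config object, NTP call sites would take a `NonuniformTPConfig` instead of reaching into `DistributedDataParallelConfig`, which is what makes the feature self-contained.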