Commit 43fbb46 (parent: 3a7dd94)

refactor: move NTP config to separate NonuniformTPConfig class

- Create `NonuniformTPConfig` dataclass in `nonuniform_tp.py`
- Remove NTP fields from `DistributedDataParallelConfig` (non-intrusive)
- Update all NTP functions/classes to use `NonuniformTPConfig`
- Update all tests to use `NonuniformTPConfig`
- Update CLAUDE.md documentation

This makes the NTP implementation completely self-contained, with zero modifications to core Megatron files.

File tree

4 files changed: +132 additions, -107 deletions

CLAUDE.md

Lines changed: 7 additions & 11 deletions

````diff
@@ -17,20 +17,16 @@ This branch implements **Nonuniform Tensor Parallelism (NTP)**, a fault toleranc
 
 ### Key Changes
 
-**New Module**: `megatron/core/distributed/nonuniform_tp.py` (404 lines)
+**New Module**: `megatron/core/distributed/nonuniform_tp.py` (699 lines)
 - Implements nonuniform TP where a subset of TP ranks ("spares") provide fault tolerance
 - Supports arbitrary non-contiguous GPU failures across all parallelism dimensions (DP, CP, PP)
 - Core ranks handle computation; spare ranks enable recovery from failures
+- Defines `NonuniformTPConfig` dataclass for NTP configuration
+- Contains all NTP logic in subclasses: `NonuniformTPDistributedDataParallel`, `NonuniformTPParamAndGradBuffer`, `NonuniformTPOptimizer`
+- **Non-intrusive design**: All NTP functionality is self-contained; no modifications to core Megatron files required
 
 **Modified Files**:
-- `megatron/core/parallel_state.py`: Added NTP configuration support to `initialize_model_parallel()`
-- `megatron/core/distributed/distributed_data_parallel_config.py`: New fields for NTP config
-  - `tp_base`: Base tensor parallel size (e.g., 8)
-  - `tp_spares`: Number of spare ranks (e.g., 2 for reduced TP=6)
-  - `num_reduced_tp_dp_ranks`: How many DP ranks use reduced TP
-  - `non_active_ranks_per_dp`: Mapping of (DP, CP, PP) rank to list of non-active local TP ranks
-- `megatron/core/distributed/param_and_grad_buffer.py`: Parameter resharding for NTP
-- `megatron/core/optimizer/optimizer.py`: Optimizer integration
+- **None** - All NTP code is self-contained in `nonuniform_tp.py`
 
 ### NTP Concepts

@@ -44,10 +40,10 @@ This branch implements **Nonuniform Tensor Parallelism (NTP)**, a fault toleranc
 ### Example NTP Configuration
 
 ```python
-from megatron.core.distributed import DistributedDataParallelConfig
+from megatron.core.distributed.nonuniform_tp import NonuniformTPConfig
 
 # Configure NTP with 2 spare ranks out of 8
-ddp_config = DistributedDataParallelConfig(
+ntp_config = NonuniformTPConfig(
     tp_base=8,                  # Original TP size
     tp_spares=2,                # 2 spares = 6 active ranks
     num_reduced_tp_dp_ranks=1,  # First DP rank uses reduced TP
````
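The diff view truncates the example above. As a companion sketch, the helper below is hypothetical (not part of this commit) and only illustrates the `tp_base`/`tp_spares` arithmetic and the documented default that the last `tp_spares` local ranks act as spares:

```python
from typing import List, Optional

def active_tp_ranks(tp_base: int, tp_spares: int,
                    non_active: Optional[List[int]] = None) -> List[int]:
    """Local TP ranks that stay active in a reduced-TP group.

    Mirrors the documented default: when no explicit spare list is
    given, the last `tp_spares` local ranks become spares.
    """
    if non_active is None:
        non_active = list(range(tp_base - tp_spares, tp_base))
    assert len(non_active) == tp_spares, "spare list must match tp_spares"
    return [r for r in range(tp_base) if r not in non_active]

# tp_base=8, tp_spares=2 -> 6 active ranks; spares default to local ranks 6, 7
print(active_tp_ranks(8, 2))          # [0, 1, 2, 3, 4, 5]
# Arbitrary non-contiguous failures, e.g. local ranks 0 and 3 lost
print(active_tp_ranks(8, 2, [0, 3]))  # [1, 2, 4, 5, 6, 7]
```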

megatron/core/distributed/distributed_data_parallel_config.py

Lines changed: 0 additions & 22 deletions

```diff
@@ -162,28 +162,6 @@ class DistributedDataParallelConfig:
     delay_wgrad_compute: bool = False
     """Delay the weight gradient computation to improve batch-level communication overlapping"""
 
-    tp_base: int = 8
-    """Base for tensor parallelism. This is the number of ranks in healthy tensor parallel groups.
-    Used for nonuniform tensor parallelism."""
-
-    tp_spares: int = 0
-    """Number of spares for nonuniform tensor parallelism. When > 0, enables nonuniform TP mode
-    where (tp_base - tp_spares) ranks handle computation and tp_spares ranks provide fault tolerance."""
-
-    num_reduced_tp_dp_ranks: int = 1
-    """Number of DP ranks that use reduced TP (tp_base - tp_spares). The remaining DP ranks use
-    full tp_base. Reduced TP ranks are assumed to come first in the global rank ordering."""
-
-    non_active_ranks_per_dp: Optional[Dict[Tuple[int, int, int], List[int]]] = None
-    """Mapping of (DP rank, CP rank, PP rank) to list of non-active (spare) local TP rank IDs.
-    This allows specifying arbitrary GPU failures across all parallelism dimensions.
-    Example: {(0,0,0): [0,3], (0,1,0): [1,2], (1,0,0): [0,3]} means:
-    - DP rank 0, CP rank 0, PP rank 0 has local TP ranks 0,3 as spares
-    - DP rank 0, CP rank 1, PP rank 0 has local TP ranks 1,2 as spares
-    - DP rank 1, CP rank 0, PP rank 0 has local TP ranks 0,3 as spares
-    The number of non-active ranks must be consistent across CP replicas within each DP rank.
-    If None, defaults to last tp_spares ranks as non-active."""
-
     def __post_init__(self):
         import os
```
