
Commit fd23080

wwwjn and BioGeek authored
Fix Typo (#1611)
We want to include this PR in our next release ASAP. Created another branch and revert CODE_OF_CONDUCT.md from @BioGeek 's #1583 . Much appreciated for @BioGeek's contribution! --------- Co-authored-by: Jeroen Van Goey <[email protected]>
1 parent 82d6c3b commit fd23080

32 files changed: +52 -52 lines changed


docs/composability.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Example ([PR #322](https://github.com/pytorch/torchtitan/pull/322)):
 We decided to actually reuse the top-level model object on every PP stage, just delete the layers we don't want, and make sure that the top-level forward would do the right thing. This means we don't have to make a separate runtime pp_forward that glues together child modules per stage. The first change was using a moduledict instead of modulelist to store layers. This preserves layer Fully Qualified Names (FQNs) even when deleting some layers - e.g. layers.1 stays layers.1 even if you remove layers.0, which isn't true for a list- this matters for checkpoint save/load. Preserving FQNs is a requirement for using Distributed Checkpointing (DCP) since it uses FQNs as globally unique IDs for sharding metadata. The second change was making the input and output layers optional- if the layer exists, we run it, otherwise we feed the input through to bypass it. With these two changes, we can just (meta)-initialize the whole model, delete the unused parts per stage, then materialize the remaining part on GPU before loading a checkpoint.

 ## Using a seed checkpoint for init
-Initializing the pipeline-parallel model is challenging becuase we assume the model could be so large as to not fit on local GPU (or possibly, even on CPU), and we also want to use the (bitwise) same initialization as we use for 1D or 2D parallel models, to ease debugging or comparisons between runs. It's not that easy to rewrite the original model's `init_weights` function to be tolerant of initializing only some layers, and also serializing initialization operations globally for consistent RNG order.
+Initializing the pipeline-parallel model is challenging because we assume the model could be so large as to not fit on local GPU (or possibly, even on CPU), and we also want to use the (bitwise) same initialization as we use for 1D or 2D parallel models, to ease debugging or comparisons between runs. It's not that easy to rewrite the original model's `init_weights` function to be tolerant of initializing only some layers, and also serializing initialization operations globally for consistent RNG order.

 For now, we sidestep all these problems with a simple but brutal solution: Initialize the whole model on some CPU instance, save a checkpoint file, and then lean on Distributed Checkpointing's "load" functionality to initialize the FQNs that are present on a given PP stage after stage creation. For future work, we consider adding a more elaborate initialization scheme to `torch.pipelining`.
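Aside on the hunk above: the ModuleDict-vs-ModuleList point about preserved FQNs can be verified with a tiny standalone sketch (not torchtitan code; the layer shapes are arbitrary):

```python
import torch.nn as nn

# ModuleList: deleting entry 0 renumbers the survivor, so its parameter
# FQN silently changes from "1.weight" to "0.weight".
layers_list = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 4)])
del layers_list[0]
print([name for name, _ in layers_list.named_parameters()])  # ['0.weight', '0.bias']

# ModuleDict: keys are stable, so "1" keeps its FQN after "0" is deleted,
# which is what DCP needs when it uses FQNs as globally unique IDs.
layers_dict = nn.ModuleDict({"0": nn.Linear(4, 4), "1": nn.Linear(4, 4)})
del layers_dict["0"]
print([name for name, _ in layers_dict.named_parameters()])  # ['1.weight', '1.bias']
```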

docs/debugging.md

Lines changed: 1 addition & 1 deletion
@@ -116,4 +116,4 @@ Here's a typical comparison setup (maintaining an overall DP degree of 4):

 To reproduce loss curves across above runs, you'll need to create a seed checkpoint, and then load the same seed checkpoint for all runs to ensure consistent model initialization on each rank. You might also need to set the `deterministic` mode to ensure consistent training behavior.

-We also provided an example of verifying the numerical consistency across parallism plans configs on Llama 3 in https://github.com/pytorch/torchtitan/blob/main/docs/converging.md.
+We also provided an example of verifying the numerical consistency across parallelism plans configs on Llama 3 in https://github.com/pytorch/torchtitan/blob/main/docs/converging.md.
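For readers unfamiliar with what a "deterministic" mode usually entails, here is a generic PyTorch illustration; this is not torchtitan's implementation, just the standard knobs such a mode typically sets:

```python
import torch

torch.manual_seed(0)                      # identical RNG streams across runs
torch.use_deterministic_algorithms(True)  # fail loudly on nondeterministic ops
torch.backends.cudnn.benchmark = False    # disable autotuning that can vary run-to-run
```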

tests/unit_tests/test_activation_checkpoint.py

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ def get_act_mem(model_fn):
         self.assertEqual(mem_with_force_last, 1.0)
         self.assertEqual(mem_full_ac, 0.0)
         # Note: SAC > no-AC here because it unnecessarily saves "output"
-        # even that is not needed for recomputaion and output is double
+        # even that is not needed for recomputation and output is double
         # the size of the other two mms.

     def test_correctness(self):

tests/unit_tests/test_lr_scheduler.py

Lines changed: 1 addition & 1 deletion
@@ -256,7 +256,7 @@ def test_warmup_stable_only(self):
     def test_warmup_plus_decay_exceeds_training(self):
         """Test when warmup + decay steps exceed training steps."""
         # Create a job config where warmup + decay steps > training steps
-        # Expected behaviro: warmup steps = 5, decay steps = 5
+        # Expected behavior: warmup steps = 5, decay steps = 5
         config = self.create_job_config(
             training_steps=10,
             warmup_steps=5,

torchtitan/components/checkpoint.py

Lines changed: 5 additions & 5 deletions
@@ -138,20 +138,20 @@ class CheckpointManager:

     We solve this in the Model and Optimizer wrapper classes by flattening the state dicts
     from each object into one state dict before saving/loading. We rely on the individual
-    state_dicts to not collide, which is gauranteed for the model by correct pipeline
+    state_dicts to not collide, which is guaranteed for the model by correct pipeline
     splitting and for the optimizer by the flattening support described in (1).

     3. LR schedulers also index model states like optimizers. Here we flatten the lr_schedulers
     with the assumption that all lr_schedulers have the same state_dict.

     Note: TorchFT checkpointing flow

-    There are two types of checkpoints: when TorchFT is enabled: 1) the full perisistent
+    There are two types of checkpoints: when TorchFT is enabled: 1) the full persistent
     checkpoint, 2) the per-replica checkpoint.

-    The full perisistent checkpoint is saved by the replica with
+    The full persistent checkpoint is saved by the replica with
     ``ft_manager.participating_rank() == 0``. It contains everything including the model,
-    optimizer, lr_scheduler, dataloader, and train_state. Right now the full perisistent
+    optimizer, lr_scheduler, dataloader, and train_state. Right now the full persistent
     checkpoint is loaded by all replicas. However, we can optimize it to only load if
     there are no other alive replicas.

@@ -294,7 +294,7 @@ def load_state_dict(state_dict):
             self.async_mode = AsyncMode.ASYNC_WITH_PINNED_MEM
         else:
             raise ValueError(
-                f"Unkown checkpoint async_mode {checkpoint_config.async_mode}"
+                f"Unknown checkpoint async_mode {checkpoint_config.async_mode}"
             )

         logger.info(
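The docstring in the first hunk describes flattening per-object state dicts into one dict before saving/loading. A minimal, hypothetical sketch of that idea (illustrative only; the real CheckpointManager wraps this differently and relies on the per-object keys already being unique):

```python
from typing import Any


def flatten_state_dicts(parts: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Merge several state dicts (e.g. model, optimizer) into one flat dict.

    Mirrors the idea above: we rely on the per-object keys not colliding
    and fail loudly if they do.
    """
    flat: dict[str, Any] = {}
    for owner, sd in parts.items():
        for key, value in sd.items():
            if key in flat:
                raise ValueError(f"key {key!r} from {owner!r} collides with an earlier entry")
            flat[key] = value
    return flat
```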

torchtitan/components/ft/config/job_config.py

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ class FaultTolerance(BaseFaultTolerance):
     Determines how to mix the local and global optimized parameters

     By default, we just use the global parameters. This ensures all
-    DDP replicas have the same parameters after syncrhonizing on
+    DDP replicas have the same parameters after synchronizing on
     the fragment. Tuning this can also affect the model quality.

     This is only used when "semi_sync_method" is set.

torchtitan/components/ft/manager.py

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ def __init__(
         elif ft_config.process_group == "nccl":
             pg = ft.ProcessGroupNCCL(timeout=process_group_timeout)
         else:
-            raise ValueError(f"Unsuported process group: {ft_config.process_group}")
+            raise ValueError(f"Unsupported process group: {ft_config.process_group}")

         # If the training method is specific, then the quorum should be synchronous
         self.use_async_quorum = ft_config.semi_sync_method is None

torchtitan/components/lr_scheduler.py

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ def build_lr_schedulers(
             decay_steps = training_steps - warmup_steps
         else:
             decay_steps = training_steps - warmup_steps
-        # Add a vitual last step to prevent the learning rate from dropping to 0
+        # Add a virtual last step to prevent the learning rate from dropping to 0
         stable_steps = training_steps + 1 - warmup_steps - decay_steps
         lr_decay_type = lr_scheduler_config.decay_type
         min_lr_factor = lr_scheduler_config.min_lr_factor
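A quick sanity check of the arithmetic in this hunk, with made-up numbers; the "+ 1" is the virtual step the comment refers to, so the last real step keeps a nonzero learning rate:

```python
training_steps, warmup_steps = 10, 2
decay_steps = training_steps - warmup_steps                      # 8
stable_steps = training_steps + 1 - warmup_steps - decay_steps   # 1 (the virtual step)
assert warmup_steps + stable_steps + decay_steps == training_steps + 1
```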

torchtitan/config/job_config.py

Lines changed: 2 additions & 2 deletions
@@ -35,7 +35,7 @@ class Profiling:
     """Trace files location"""

     profile_freq: int = 10
-    """How often to collect profile traces, in interations"""
+    """How often to collect profile traces, in iterations"""

     enable_memory_snapshot: bool = False
     """Whether to dump memory snapshot"""

@@ -381,7 +381,7 @@ class Parallelism:
     - cp * tp <= ep <= dp_shard * cp * tp
     - ep % (cp * tp) == 0
     - dp_shard * cp * tp % ep == 0
-    Note that this is still an experimental feature. Some contrains will be
+    Note that this is still an experimental feature. Some constraints will be
     relaxed soon when we have more flexible DeviceMesh support.
     """
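The three constraints in the second hunk are straightforward to check numerically; a small illustrative check with arbitrarily chosen degrees (not torchtitan code):

```python
dp_shard, cp, tp, ep = 4, 1, 2, 4  # hypothetical parallelism degrees

assert cp * tp <= ep <= dp_shard * cp * tp   # 2 <= 4 <= 8
assert ep % (cp * tp) == 0                   # 4 % 2 == 0
assert dp_shard * cp * tp % ep == 0          # 8 % 4 == 0
```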

torchtitan/config/manager.py

Lines changed: 1 addition & 1 deletion
@@ -204,7 +204,7 @@ def _validate_config(self) -> None:
 def register_tyro_rules(registry: tyro.constructors.ConstructorRegistry) -> None:
     @registry.primitive_rule
     def list_str_rule(type_info: tyro.constructors.PrimitiveTypeInfo):
-        """Support for comma seperated string parsing"""
+        """Support for comma separate string parsing"""
         if type_info.type != list[str]:
             return None
         return tyro.constructors.PrimitiveConstructorSpec(
