
Commit fd23080

wwwjn and BioGeek authored
Fix Typo (#1611)
We want to include this PR in our next release ASAP. Created another branch and revert CODE_OF_CONDUCT.md from @BioGeek 's #1583 . Much appreciated for @BioGeek's contribution! --------- Co-authored-by: Jeroen Van Goey <[email protected]>
1 parent 82d6c3b commit fd23080

32 files changed: +52 -52 lines changed


docs/composability.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Example ([PR #322](https://github.com/pytorch/torchtitan/pull/322)):
 We decided to actually reuse the top-level model object on every PP stage, just delete the layers we don't want, and make sure that the top-level forward would do the right thing. This means we don't have to make a separate runtime pp_forward that glues together child modules per stage. The first change was using a moduledict instead of modulelist to store layers. This preserves layer Fully Qualified Names (FQNs) even when deleting some layers - e.g. layers.1 stays layers.1 even if you remove layers.0, which isn't true for a list- this matters for checkpoint save/load. Preserving FQNs is a requirement for using Distributed Checkpointing (DCP) since it uses FQNs as globally unique IDs for sharding metadata. The second change was making the input and output layers optional- if the layer exists, we run it, otherwise we feed the input through to bypass it. With these two changes, we can just (meta)-initialize the whole model, delete the unused parts per stage, then materialize the remaining part on GPU before loading a checkpoint.

 ## Using a seed checkpoint for init
-Initializing the pipeline-parallel model is challenging becuase we assume the model could be so large as to not fit on local GPU (or possibly, even on CPU), and we also want to use the (bitwise) same initialization as we use for 1D or 2D parallel models, to ease debugging or comparisons between runs. It's not that easy to rewrite the original model's `init_weights` function to be tolerant of initializing only some layers, and also serializing initialization operations globally for consistent RNG order.
+Initializing the pipeline-parallel model is challenging because we assume the model could be so large as to not fit on local GPU (or possibly, even on CPU), and we also want to use the (bitwise) same initialization as we use for 1D or 2D parallel models, to ease debugging or comparisons between runs. It's not that easy to rewrite the original model's `init_weights` function to be tolerant of initializing only some layers, and also serializing initialization operations globally for consistent RNG order.

 For now, we sidestep all these problems with a simple but brutal solution: Initialize the whole model on some CPU instance, save a checkpoint file, and then lean on Distributed Checkpointing's "load" functionality to initialize the FQNs that are present on a given PP stage after stage creation. For future work, we consider adding a more elaborate initialization scheme to `torch.pipelining`.
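Aside on the hunk above: the ModuleDict-vs-ModuleList point about preserved FQNs can be verified with a tiny standalone sketch (not torchtitan code; the layer shapes are arbitrary):

```python
import torch.nn as nn

# ModuleList: deleting entry 0 renumbers the survivor, so its parameter
# FQN silently changes from "1.weight" to "0.weight".
layers_list = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 4)])
del layers_list[0]
print([name for name, _ in layers_list.named_parameters()])  # ['0.weight', '0.bias']

# ModuleDict: keys are stable, so "1" keeps its FQN after "0" is deleted,
# which is what DCP needs when it uses FQNs as globally unique IDs.
layers_dict = nn.ModuleDict({"0": nn.Linear(4, 4), "1": nn.Linear(4, 4)})
del layers_dict["0"]
print([name for name, _ in layers_dict.named_parameters()])  # ['1.weight', '1.bias']
```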

docs/debugging.md

Lines changed: 1 addition & 1 deletion
@@ -116,4 +116,4 @@ Here's a typical comparison setup (maintaining an overall DP degree of 4):

 To reproduce loss curves across above runs, you'll need to create a seed checkpoint, and then load the same seed checkpoint for all runs to ensure consistent model initialization on each rank. You might also need to set the `deterministic` mode to ensure consistent training behavior.

-We also provided an example of verifying the numerical consistency across parallism plans configs on Llama 3 in https://github.com/pytorch/torchtitan/blob/main/docs/converging.md.
+We also provided an example of verifying the numerical consistency across parallelism plans configs on Llama 3 in https://github.com/pytorch/torchtitan/blob/main/docs/converging.md.
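For readers unfamiliar with what a "deterministic" mode usually entails, here is a generic PyTorch illustration; this is not torchtitan's implementation, just the standard knobs such a mode typically sets:

```python
import torch

torch.manual_seed(0)                      # identical RNG streams across runs
torch.use_deterministic_algorithms(True)  # fail loudly on nondeterministic ops
torch.backends.cudnn.benchmark = False    # disable autotuning that can vary run-to-run
```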

tests/unit_tests/test_activation_checkpoint.py

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ def get_act_mem(model_fn):
         self.assertEqual(mem_with_force_last, 1.0)
         self.assertEqual(mem_full_ac, 0.0)
         # Note: SAC > no-AC here because it unnecessarily saves "output"
-        # even that is not needed for recomputaion and output is double
+        # even that is not needed for recomputation and output is double
         # the size of the other two mms.

     def test_correctness(self):

tests/unit_tests/test_lr_scheduler.py

Lines changed: 1 addition & 1 deletion
@@ -256,7 +256,7 @@ def test_warmup_stable_only(self):
     def test_warmup_plus_decay_exceeds_training(self):
         """Test when warmup + decay steps exceed training steps."""
         # Create a job config where warmup + decay steps > training steps
-        # Expected behaviro: warmup steps = 5, decay steps = 5
+        # Expected behavior: warmup steps = 5, decay steps = 5
         config = self.create_job_config(
             training_steps=10,
             warmup_steps=5,

torchtitan/components/checkpoint.py

Lines changed: 5 additions & 5 deletions
@@ -138,20 +138,20 @@ class CheckpointManager:

     We solve this in the Model and Optimizer wrapper classes by flattening the state dicts
     from each object into one state dict before saving/loading. We rely on the individual
-    state_dicts to not collide, which is gauranteed for the model by correct pipeline
+    state_dicts to not collide, which is guaranteed for the model by correct pipeline
     splitting and for the optimizer by the flattening support described in (1).

     3. LR schedulers also index model states like optimizers. Here we flatten the lr_schedulers
     with the assumption that all lr_schedulers have the same state_dict.

     Note: TorchFT checkpointing flow

-    There are two types of checkpoints: when TorchFT is enabled: 1) the full perisistent
+    There are two types of checkpoints: when TorchFT is enabled: 1) the full persistent
     checkpoint, 2) the per-replica checkpoint.

-    The full perisistent checkpoint is saved by the replica with
+    The full persistent checkpoint is saved by the replica with
     ``ft_manager.participating_rank() == 0``. It contains everything including the model,
-    optimizer, lr_scheduler, dataloader, and train_state. Right now the full perisistent
+    optimizer, lr_scheduler, dataloader, and train_state. Right now the full persistent
     checkpoint is loaded by all replicas. However, we can optimize it to only load if
     there are no other alive replicas.

@@ -294,7 +294,7 @@ def load_state_dict(state_dict):
             self.async_mode = AsyncMode.ASYNC_WITH_PINNED_MEM
         else:
             raise ValueError(
-                f"Unkown checkpoint async_mode {checkpoint_config.async_mode}"
+                f"Unknown checkpoint async_mode {checkpoint_config.async_mode}"
             )

         logger.info(
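The docstring in the first hunk describes flattening per-object state dicts into one dict before saving/loading. A minimal, hypothetical sketch of that idea (illustrative only; the real CheckpointManager wraps this differently and relies on the per-object keys already being unique):

```python
from typing import Any


def flatten_state_dicts(parts: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Merge several state dicts (e.g. model, optimizer) into one flat dict.

    Mirrors the idea above: we rely on the per-object keys not colliding
    and fail loudly if they do.
    """
    flat: dict[str, Any] = {}
    for owner, sd in parts.items():
        for key, value in sd.items():
            if key in flat:
                raise ValueError(f"key {key!r} from {owner!r} collides with an earlier entry")
            flat[key] = value
    return flat
```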

torchtitan/components/ft/config/job_config.py

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ class FaultTolerance(BaseFaultTolerance):
     Determines how to mix the local and global optimized parameters

     By default, we just use the global parameters. This ensures all
-    DDP replicas have the same parameters after syncrhonizing on
+    DDP replicas have the same parameters after synchronizing on
     the fragment. Tuning this can also affect the model quality.

     This is only used when "semi_sync_method" is set.

torchtitan/components/ft/manager.py

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ def __init__(
         elif ft_config.process_group == "nccl":
             pg = ft.ProcessGroupNCCL(timeout=process_group_timeout)
         else:
-            raise ValueError(f"Unsuported process group: {ft_config.process_group}")
+            raise ValueError(f"Unsupported process group: {ft_config.process_group}")

         # If the training method is specific, then the quorum should be synchronous
         self.use_async_quorum = ft_config.semi_sync_method is None

torchtitan/components/lr_scheduler.py

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ def build_lr_schedulers(
             decay_steps = training_steps - warmup_steps
         else:
             decay_steps = training_steps - warmup_steps
-        # Add a vitual last step to prevent the learning rate from dropping to 0
+        # Add a virtual last step to prevent the learning rate from dropping to 0
         stable_steps = training_steps + 1 - warmup_steps - decay_steps
         lr_decay_type = lr_scheduler_config.decay_type
         min_lr_factor = lr_scheduler_config.min_lr_factor
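A quick sanity check of the arithmetic in this hunk, with made-up numbers; the "+ 1" is the virtual step the comment refers to, so the last real step keeps a nonzero learning rate:

```python
training_steps, warmup_steps = 10, 2
decay_steps = training_steps - warmup_steps                      # 8
stable_steps = training_steps + 1 - warmup_steps - decay_steps   # 1 (the virtual step)
assert warmup_steps + stable_steps + decay_steps == training_steps + 1
```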

torchtitan/config/job_config.py

Lines changed: 2 additions & 2 deletions
@@ -35,7 +35,7 @@ class Profiling:
     """Trace files location"""

     profile_freq: int = 10
-    """How often to collect profile traces, in interations"""
+    """How often to collect profile traces, in iterations"""

     enable_memory_snapshot: bool = False
     """Whether to dump memory snapshot"""

@@ -381,7 +381,7 @@ class Parallelism:
     - cp * tp <= ep <= dp_shard * cp * tp
     - ep % (cp * tp) == 0
     - dp_shard * cp * tp % ep == 0
-    Note that this is still an experimental feature. Some contrains will be
+    Note that this is still an experimental feature. Some constraints will be
     relaxed soon when we have more flexible DeviceMesh support.
     """
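The three constraints in the second hunk are straightforward to check numerically; a small illustrative check with arbitrarily chosen degrees (not torchtitan code):

```python
dp_shard, cp, tp, ep = 4, 1, 2, 4  # hypothetical parallelism degrees

assert cp * tp <= ep <= dp_shard * cp * tp   # 2 <= 4 <= 8
assert ep % (cp * tp) == 0                   # 4 % 2 == 0
assert dp_shard * cp * tp % ep == 0          # 8 % 4 == 0
```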

torchtitan/config/manager.py

Lines changed: 1 addition & 1 deletion
@@ -204,7 +204,7 @@ def _validate_config(self) -> None:
 def register_tyro_rules(registry: tyro.constructors.ConstructorRegistry) -> None:
     @registry.primitive_rule
     def list_str_rule(type_info: tyro.constructors.PrimitiveTypeInfo):
-        """Support for comma seperated string parsing"""
+        """Support for comma separate string parsing"""
         if type_info.type != list[str]:
             return None
         return tyro.constructors.PrimitiveConstructorSpec(
