Commit 9810191
[BC Breaking] Config System Refactor: TOML to Python Dataclass Registry (#2386)
**NOTE**: This PR is a large refactor of the codebase. https://github.com/pytorch/torchtitan/releases/tag/v0.2.2 is the last release cut right before this PR was merged.

# author's note

This refactor mainly tries to address two issues:
- bad encapsulation: previously the monolithic `JobConfig` was leaked everywhere
- difficulty iterating and experimenting on model architecture and training components

The main changes are:
- Strict encapsulation, even at the cost of a (hopefully temporary) bloated interface when calling subcomponents (e.g. validator). We should try to find the right abstraction for cross-component visibility.
- Each `Configurable` component owns its own `Config`, which builds the owner component. This achieves modularization via polymorphism and inheritance, both classic concepts in OOP.
  - This is partly inspired by repos like [AXLearn](https://github.com/apple/axlearn) (in particular, @ruomingp's [ML API Styles](https://github.com/apple/axlearn/blob/main/docs/ml_api_style.md)), GitHub issues (e.g. #1055), and offline discussions (with @Chillee, @ailzhang etc.).
  - Similar functionality can alternatively be achieved in other ways, e.g. `_target_` in [Hydra](https://hydra.cc/docs/advanced/instantiate_objects/overview/), but there are opinions against coupling with Hydra's other offerings. See #1415.
- The main entry point switches from TOML files to Python functions (a.k.a. `config_registry.py` in each model).
  - TOML has the constraint that everything needs to be registered explicitly before it can be used; e.g. our quantization components needed to be registered with string names. Python's language-level implicit registration is, we believe, more minimal, and should be fairly easy to extend/modify to support TOML/YAML when users build upon / fork torchtitan.
  - That said, Python config provides much more power, e.g. one can use arbitrary logic to create (the config of) a component, which is hard to express in TOML/YAML, creating extra difficulty when users want to migrate to their own favorite config system. The only thing we can do is stay conservative in the use of such power.
- We still use [tyro](https://github.com/brentyi/tyro) to convert config dataclasses to a CLI, still with the limitation that users need to construct customized config classes all the way from the root level (`Trainer.Config` now, `JobConfig` in the past).
  - If a CLI is not needed, a new trainer (or any high-level) config is not required.
  - To support "polymorphic construction" from the CLI without the hassle, check out [chz](https://github.com/openai/chz/blob/main/docs/04_command_line.md#polymorphic-construction).

This PR also
- updates the docs -- there might be remaining outdated docs; please raise issues or help fix them
- moves ft to experiments, continuing the effort in #2311

Remaining work
- [AutoParallel CI failure](https://github.com/pytorch/torchtitan/actions/runs/22165425254/job/64091572780?pr=2386) seems caused by the way RoPE is authored, and needs a change in autoparallel (cc @xmfan) - being fixed in meta-pytorch/autoparallel#321
- [CompilerToolkit CI failure](https://github.com/pytorch/torchtitan/actions/runs/22168015737/job/64099486707?pr=2386) `TypeError: forward() missing 1 required positional argument: 'fwd_rng_state_2'` - cc @yiming0416 please help take a look
- [SimpleFSDP CI failure](https://github.com/pytorch/torchtitan/actions/runs/22168015749/job/64099486149?pr=2386) is the same as #2312, around dynamic shapes for for-loop MoE experts computation (cc @pianpwk) - being fixed in #2399
- Fix integration scripts for MAST, Zoomer, etc.
- organize docs from `docs/` into subfolders, as we have more content to cover in general
- generate and store serialized configs (maybe not in the repo)
- continue the SAC refactor in #2357, but somehow keep the every-other-mm policy (cc @mori360)
- refactor RoPE in general, at least resolving the following TODOs in code (cc @shuhuayu)
  - having to set / no validation that rope dim == decoder dim // attention n_heads
  - consolidate `apply_rotary_emb_complex` and `apply_rotary_emb_single_complex`
- address #2417

Longer-term issues
- More careful design of what goes in config vs. runtime build kwargs. (thanks @ailzhang)
- ModelSpec is not serializable. There may be multiple solutions, but we can potentially consolidate `model.py` and `parallelize.py` by
  - sharing AC, compile, DP application across all Decoder models
  - putting the per-module TP/CP/EP sharding plan inside the model itself
- Right now `BaseModel.update_from_config` violates encapsulation by passing the Trainer config into the Model config. This could be avoided with Python logic, either at config construction time or in the trainer.
- Refactor `init_weights` into `Module.Config` instead of keeping it in `Module`.
  - The benefit is that param init becomes configurable; otherwise we are coupling module implementation with its weight init.
  - This may require a refactor of the current TransformerBlock and its config. E.g. `weight_init_std` may need to be put in config, with `__post_init__` determining its value. (See related complaints / discussions on `__post_init__` by [chz](https://github.com/openai/chz/blob/main/docs/21_post_init.md).)

Note to reviewers: Although I believe the changes in this PR come naturally as a bundle, you may (or may not) find the stack of 16 commits easier to review, as I tried to split the changes in some logical manner. I apologize for the giant PR.

# claude-generated summary

## Summary

This PR refactors torchtitan's configuration and training infrastructure in 15 incremental, backwards-incompatible commits.
The central change replaces TOML config files and a monolithic `JobConfig` parser with **typed Python dataclass configs**, a **`Configurable` base class** pattern, and a **`config_registry`** module per model. **270 files changed, 10,025 insertions, 11,418 deletions.**

---

## Motivation

The previous system used TOML files parsed by a custom `ConfigManager` that layered CLI overrides on top. While simple, this had several friction points:

1. **No type safety at config boundaries.** TOML values are strings/ints/floats parsed at runtime. A typo in a key name (e.g., `training.stpes`) silently becomes a default value.
2. **Flat namespace.** All config sections (`[model]`, `[training]`, `[optimizer]`, `[checkpoint]`, ...) lived in a single `JobConfig` class. Every component received the full `JobConfig` even when it only needed a few fields.
3. **Experiment extension was ad hoc.** Experiments that needed custom config fields (e.g., SimpleFSDP's `compile.graph_passes` or FaultTolerant's `fault_tolerance.*`) required a `custom_config_module` TOML key and a runtime `_merge_configs` call to graft new fields onto `JobConfig`.
4. **Model args were disconnected from model code.** A `ModelArgs` dataclass in `args.py` defined hyperparameters, but the `TrainSpec` that bundled model + parallelization + loss was registered separately, with no type-level link between them.

---

## What Changed

### 1. `Configurable` Base Class

A new `Configurable` base class (`torchtitan/config/configurable.py`) establishes a universal pattern:

```python
class Configurable:
    @dataclass(kw_only=True, slots=True)
    class Config:
        def build(self, **kwargs):
            return self._owner(config=self, **kwargs)

    def __init_subclass__(cls, **kwargs):
        # Auto-wires Config.build() -> cls(config=..., **kwargs)
        # Enforces @dataclass(kw_only=True, slots=True) on every Config
```

Every configurable component (Trainer, model, optimizer, tokenizer, dataloader, checkpoint manager, metrics, validators, quantization converters, ...) follows this pattern. Calling `config.build()` constructs the owning class.

### 2. `Trainer.Config` Replaces `JobConfig`

The monolithic `JobConfig` is replaced by `Trainer.Config`, a nested dataclass that aggregates typed sub-configs:

```python
class Trainer(Stateful, Configurable):
    @dataclass(kw_only=True, slots=True)
    class Config(Configurable.Config):
        model_spec: ModelSpec | None = None  # set by config_registry, suppressed from CLI
        job: JobConfig = ...
        training: TrainingConfig = ...
        parallelism: ParallelismConfig = ...
        optimizer: OptimizersContainer.Config = ...
        lr_scheduler: LRSchedulersContainer.Config = ...
        checkpoint: CheckpointManager.Config = ...
        dataloader: BaseDataLoader.Config = ...
        metrics: MetricsProcessor.Config = ...
        # ... etc.
```

Each sub-config is the `Config` class of the component that consumes it (e.g., `CheckpointManager.Config` is defined inside `CheckpointManager`). Components receive only their own config, not the entire training config.

### 3. `config_registry.py` Replaces TOML Files

Each model defines a `config_registry.py` with functions that return complete `Trainer.Config` instances:

```python
# torchtitan/models/llama3/config_registry.py

def llama3_debugmodel() -> Trainer.Config:
    return Trainer.Config(
        job=JobConfig(description="Llama 3 debug training", ...),
        model_spec=model_registry("debugmodel"),
        optimizer=OptimizersContainer.Config(lr=8e-4),
        training=TrainingConfig(local_batch_size=8, seq_len=2048, steps=10),
        dataloader=HuggingFaceTextDataLoader.Config(dataset="c4_test"),
        # ...
    )

def llama3_debugmodel_float8() -> Trainer.Config:
    config = llama3_debugmodel()
    config.model_converters = ModelConvertersContainer.Config(
        converters=[Float8LinearConverter.Config(enable_fsdp_float8_all_gather=True)]
    )
    return config
```

### 4. `TrainSpec` -> `ModelSpec`

`TrainSpec` is renamed to `ModelSpec` with a narrower scope: it holds only model-specific concerns (model config, parallelization function, loss function, state dict adapter). All training-level concerns (optimizer, LR scheduler, checkpointing, etc.) live in `Trainer.Config`.

### 5. Model Configs: Flat `ModelArgs` -> Nested Dataclass Hierarchy

Model hyperparameters move from a flat `ModelArgs` dataclass into a nested `Config` hierarchy that mirrors the module tree:

```python
# Before (main): flat args.py
@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    # ... 20+ flat fields

# After (this PR): nested Config in model class
class Llama3Model(Decoder):
    @dataclass(kw_only=True, slots=True)
    class Config(Decoder.Config):
        layer: Llama3TransformerBlock.Config  # contains attention + FFN configs
        rope: RoPE.Config  # contains RoPE-specific params
```

### 6. `train.py` Split

The monolithic `train.py` (~800 lines) is split into:

- `train.py` (~60 lines): a thin entry point that calls `ConfigManager.parse_args()` and `config.build()`
- `trainer.py` (~850 lines): the `Trainer` class with the training-loop logic

### 7. Experiment Extension via Inheritance

Experiments extend the config system through dataclass subclassing instead of runtime config merging:

```python
# torchtitan/experiments/simple_fsdp/configs.py
@dataclass(kw_only=True, slots=True)
class SimpleFSDPConfig(Trainer.Config):
    compile: SimpleFSDPCompileConfig = field(default_factory=SimpleFSDPCompileConfig)
```

Their `config_registry.py` returns the subclassed config type, and `tyro` auto-generates CLI parsing for the extended fields.
---

## UX Comparison

### Launching Training

```bash
# Before (main)
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh

# After (this PR)
MODULE=llama3 CONFIG=llama3_8b ./run_train.sh
```

### CLI Overrides

```bash
# Before (main)
CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh \
    --training.steps 100 --parallelism.tensor_parallel_degree 2

# After (this PR)
./run_train.sh --training.steps 100 --parallelism.tensor_parallel_degree 2
# (defaults to MODULE=llama3, CONFIG=llama3_debugmodel via run_train.sh)
```

CLI override syntax is unchanged (`--section.field value`), but `tyro` now provides typed `--help` output generated from the dataclass tree.

### Defining a New Model Config

```bash
# Before: create a new TOML file, copy-paste sections, edit values
cp train_configs/debug_model.toml train_configs/my_experiment.toml
vim train_configs/my_experiment.toml
```

```python
# After: write a Python function that mutates an existing config
def my_experiment() -> Trainer.Config:
    config = llama3_debugmodel()
    config.training.steps = 100
    config.optimizer.lr = 1e-4
    return config
```

### Adding Experiment-Specific Config Fields

```python
# Before (main): custom_config_module in TOML + runtime _merge_configs
# Requires: TOML key pointing to a Python module, dynamic dataclass creation

# After (this PR): dataclass inheritance
@dataclass(kw_only=True, slots=True)
class MyExperimentConfig(Trainer.Config):
    my_custom_field: str = "default"
```

### Float8 / Quantization Configuration

```python
# Before (main): TOML section
# [quantize.linear.float8]
# enable_fsdp_float8_all_gather = true
# precompute_float8_dynamic_scale_for_fsdp = true

# After (this PR): typed config object
model_converters=ModelConvertersContainer.Config(
    converters=[
        Float8LinearConverter.Config(
            enable_fsdp_float8_all_gather=True,
            precompute_float8_dynamic_scale_for_fsdp=True,
        ),
    ],
),
```

---

## Limitations and Trade-offs

### 1. Configs are no longer declarative text files

TOML files were readable by anyone without Python knowledge. The new `config_registry` functions are Python code, which requires understanding imports, function calls, and dataclass construction. For users who only need to tweak hyperparameters, the CLI override syntax (`--training.steps 100`) works the same, but understanding the full config requires reading Python.

### 2. Steeper learning curve for contributors

Adding a new model now requires understanding the `Configurable` protocol, the nested `Config` dataclass hierarchy, and the `config_registry` pattern. The old approach of copying a TOML file and editing values had a lower barrier to entry.

### 3. Config serialization is more complex

TOML files were trivially serializable and diffable. The new system supports `to_dict()` + JSON serialization, but configs containing callables (e.g., `ModelSpec.parallelize_fn`) cannot be fully round-tripped. The `model_spec` field is excluded from serialization and suppressed from CLI parsing.

### 4. tyro dependency

CLI parsing now depends on `tyro`, a third-party library. While `tyro` is well maintained and provides typed CLI generation from dataclasses, it is an additional dependency that must be kept compatible with the dataclass patterns used here.

### 5. `@dataclass(slots=True)` constraints

The `Configurable` base class enforces `@dataclass(kw_only=True, slots=True)` on all `Config` classes. While this provides memory efficiency and prevents accidental attribute assignment, `slots=True` prevents dynamic attribute addition and makes multiple inheritance with other slotted classes more constrained. Each `Config` subclass in a deep hierarchy must repeat the `@dataclass(kw_only=True, slots=True)` decorator.

### 6. Two-level indirection for model selection

The old system required one identifier: `--job.config_file path/to/file.toml`. The new system requires two: `--module llama3 --config llama3_8b`. While this separates model identity from the training recipe, it adds an extra argument.

---

## Numerics Verification

All model configs were verified for numerical equivalence against the main branch (commit `10d8a306`).

NOTE:
- only models that can fit on 8 GPUs are tested
- only a subset of parallelism combinations are tested

| Model | Status | Notes |
|-------|--------|-------|
| llama3 (debugmodel, 8B) | Bitwise match | |
| llama3 (debugmodel_flex_attn) | Bitwise match | |
| qwen3 (0.6B, 1.7B, 32B, MoE debugmodel) | Bitwise match | |
| deepseek_v3 (debugmodel, 16B) | Close (max diff 0.00014) | Pre-existing main branch bug: missing `eps` in final RMSNorm |
| llama4 debugmodel | Bitwise match | `_irope` variants don't work on main (FlexAttn `'dict' object has no attribute 'BLOCK_SIZE'`) but work after this PR |
| **gpt_oss** debugmodel | `--debug.deterministic` causes the loss to be NaN; otherwise first-step loss matches, with minor differences after (likely caused by flex?) | |
| flux | Bitwise match | |

---

## Migration Guide

| Old (main) | New (this PR) |
|---|---|
| `CONFIG_FILE="path/to/config.toml" ./run_train.sh` | `MODULE=llama3 CONFIG=llama3_8b ./run_train.sh` |
| `--job.config_file path.toml` | `--module llama3 --config llama3_8b` |
| `train_configs/*.toml` | `config_registry.py` functions |
| `TrainSpec` | `ModelSpec` |
| `ModelArgs` / `args.py` | Nested `Model.Config` dataclass |
| `custom_config_module` + `_merge_configs()` | Subclass `Trainer.Config` |
| `build_model_converters()` free function | `ModelConvertersContainer.Config.build()` |
| `build_metrics_processor()` free function | `MetricsProcessor.Config.build()` |
1 parent 238fa99 commit 9810191

File tree: 280 files changed, +10502 −12043 lines changed


.ci/docker/requirements-dev.txt

Lines changed: 0 additions & 1 deletion
```diff
@@ -3,5 +3,4 @@ pytest==7.3.2
 pytest-cov
 pre-commit
 pyrefly==0.45.1
-tomli-w >= 1.1.0
 transformers
```

.ci/docker/requirements.txt

Lines changed: 0 additions & 2 deletions
```diff
@@ -1,8 +1,6 @@
 torchdata >= 0.8.0
 datasets >= 3.6.0
-tomli >= 1.1.0 ; python_version < "3.11"
 tensorboard
-tabulate
 wandb
 fsspec
 tyro
```

.github/workflows/integration_test_8gpu_torchft.yaml

Lines changed: 5 additions & 3 deletions
```diff
@@ -6,11 +6,13 @@ on:
     tags:
       - ciflow/8gpu/*
     paths:
-      - 'torchtitan/components/ft.py'
+      - 'torchtitan/experiments/ft/**'
+      - 'torchtitan/components/checkpoint.py'
       - '.github/workflows/integration_test_8gpu_torchft.yaml'
   pull_request:
     paths:
-      - 'torchtitan/components/ft.py'
+      - 'torchtitan/experiments/ft/**'
+      - 'torchtitan/components/checkpoint.py'
       - '.github/workflows/integration_test_8gpu_torchft.yaml'
   schedule:
     # Runs every 6 hours
@@ -71,6 +73,6 @@ jobs:
         RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000 > /dev/null 2>&1 &
         echo "ft_integration_test"
         # Getting error - Cuda failure 217 'peer access is not supported between these two devices'
-        python -m tests.integration_tests.ft $RUNNER_TEMP/artifacts-to-be-uploaded --ngpu 8
+        python -m torchtitan.experiments.ft.tests.integration_tests $RUNNER_TEMP/artifacts-to-be-uploaded --ngpu 8
         # pkill -9 torchft_lighthouse
         rm -rf $RUNNER_TEMP/artifacts-to-be-uploaded/*/checkpoint
```

CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -51,7 +51,7 @@ Note: To accelerate contributions to and innovations around `torchtitan`, we are
 - After the model change, it should still load the original checkpoint correctly.
 - Document the reasons for the code change, similar to [composability.md](docs/composability.md).
 - Keep code modularized, especially for [train.py](train.py), so that it remains easy to copy-paste into a minimal code example. If necessary:
-  - Introduce new config options/category in [job_config.py](torchtitan/config/job_config.py).
+  - Introduce new config options/category in [configs.py](torchtitan/config/configs.py).
   - Create separate functions/files.

 ### Proof of Value
@@ -75,7 +75,7 @@ When appropriate, one should consider

 - Adding CPU/GPU unit/integration tests.
   - To add a unit test, put it in the [tests](tests/) folder and follow the existing test files.
-  - To add a GPU integration test, create a new `OverrideDefinitions` in [integration_tests.py](tests/integration_tests.py). It will override the default config to run on the Llama 3 [debug model](torchtitan/models/llama/train_configs/debug_model.toml).
+  - To add a GPU integration test, create a new `OverrideDefinitions` in [integration_tests](tests/integration_tests/). It will override the default config to run on the Llama 3 debug model (see [config_registry.py](torchtitan/models/llama3/config_registry.py)).
 - Updating [README](README.md) and writing a new note in the [docs](docs/) folder on installation and usage, similar to [float8.md](docs/float8.md).
 - Updating [performance.md](docs/performance.md) with new performance results.
 - Creating GitHub issues for things that cannot be addressed at the moment.
```

README.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -68,11 +68,11 @@ We look forward to your contributions!
 7. DDP and HSDP
 8. [TorchFT](https://github.com/pytorch/torchft) integration
 9. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries) and support for [custom datasets](docs/datasets.md)
-10. Gradient accumulation, enabled by giving an additional `--training.global_batch_size` argument in configuration
+10. Gradient accumulation, enabled by giving an additional `--training.global_batch_size` argument on the CLI
 11. Flexible learning rate scheduler (warmup-stable-decay)
 12. Loss, GPU memory, throughput (tokens/sec), TFLOPs, and MFU displayed and logged via [Tensorboard or Weights & Biases](/docs/metrics.md)
 13. [Debugging tools](docs/debugging.md) including CPU/GPU profiling, memory profiling, Flight Recorder, etc.
-14. All options easily configured via [toml files](torchtitan/models/llama3/train_configs/)
+14. All options easily configured via [Python config registry](torchtitan/models/llama3/config_registry.py) with `--module` and `--config` CLI flags
 15. [Helper scripts](scripts/) to
     - download tokenizers from Hugging Face
     - convert original Llama 3 checkpoints into the expected DCP format
@@ -142,7 +142,7 @@ python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets
 Llama 3 8B model locally on 8 GPUs

 ```bash
-CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
+MODULE=llama3 CONFIG=llama3_8b ./run_train.sh
 ```

 ### Multi-Node Training
````

benchmarks/README.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -8,8 +8,8 @@ A submission should be a file / files including the following information
 2. The model or theme of benchmarking, e.g. Llama 3.1, Async TP.
 3. The hardware setup, including the types of GPUs, interconnections, etc.
 4. The actual performance report with training configs, e.g. via
-   - `.toml` files / commandline arguments
-   - complete configs, which can be found in the log with [`--print_config`](https://github.com/pytorch/torchtitan/blob/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc/torchtitan/config_manager.py#L47) turned on (preferred as the default value not shown in `.toml` or specified in commandline could change from time to time)
+   - Python config files / commandline arguments
+   - complete configs, which can be found in the log with [`--print_config`](https://github.com/pytorch/torchtitan/blob/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc/torchtitan/config_manager.py#L47) turned on (preferred as the default value not shown in config files or specified in commandline could change from time to time)
 5. The versions and date/time of `torchtitan`, `torch`, `torchao`, or any relevant dependencies.
 6. Other notes which could help reproduce the results.
```

benchmarks/llama3-8b_h200_202506_trainy-whitefiber.md

Lines changed: 1 addition & 2 deletions
````diff
@@ -16,8 +16,7 @@ Each host has

 Runs were invoked with the following, where `NUM_NODES` was `4` and `8`.

-**Warning**: the command here has been updated to use the latest version of torchtitan, which has had API changes since this benchmark was ran.
-To reproduce the results using the original torchtitan commit, change all instances of `quantize.linear.float8` to `float8` in the command below.
+**Warning**: the command below reflects the original invocation at the time of this benchmark. The torchtitan CLI has since changed to use `--module` and `--config` flags instead of `--job.config-file`. See the current [README](/README.md) for up-to-date usage.
 ```
 torchrun \
   --nnodes $NUM_NODES \
````

docs/checkpoint.md

Lines changed: 39 additions & 39 deletions
````diff
@@ -5,52 +5,53 @@ You may want to enable checkpointing in `torchtitan` for better fault tolerance
 ## A general guide to use checkpoints during training

 1. ENABLE CHECKPOINTING
-In your `torchtitan` training config, ensure that under `[checkpoint]`, `enable` is set to True.
-```
-[checkpoint]
-enable = true
-folder = "checkpoint"
-interval = 500
+In your config_registry function, configure the checkpoint settings:
+```python
+checkpoint=CheckpointManager.Config(
+    interval=500,
+),
 ```
+Or via CLI: `--checkpoint.interval 500`

 2. SAVE MODEL ONLY
 By setting `last_save_model_only` to `True`, the checkpoint will only contain the model and exclude the optimizer state and extra train states, resulting in a smaller checkpoint size.
-```
-[checkpoint]
-enable = true
-last_save_model_only = true
+```python
+checkpoint=CheckpointManager.Config(
+    interval=500,
+    last_save_model_only=True,
+),
 ```

 3. CHOOSE DESIRED EXPORT PRECISION
 The default model states are in `float32`. You can choose to export the checkpoint in a lower precision format such as `bfloat16`.
-```
-[checkpoint]
-enable = true
-last_save_model_only = true
-export_dtype = "bfloat16"
+```python
+checkpoint=CheckpointManager.Config(
+    interval=500,
+    last_save_model_only=True,
+    export_dtype="bfloat16",
+),
 ```

 4. EXCLUDING SPECIFIC KEYS FROM CHECKPOINT LOADING
 In some cases, you may want to partially load from a previous-trained checkpoint and modify certain settings, such as the number of GPUs or the current step. To achieve this, you can use the `exclude_from_loading` parameter to specify which keys should be excluded from loading.
-This parameter takes a list of string that should be excluded from loading.
+```python
+checkpoint=CheckpointManager.Config(
+    exclude_from_loading=["data_loader", "lr_scheduler"],
+),
 ```
-[checkpoint]
-enable = true
-exclude_from_loading = ["data_loader", "lr_scheduler"]
-```
-When used in command line, the parameter should be a comma-separated list of strings. For example: `--checkpoint.exclude_from_loading data_loader,lr_scheduler`.
+When used in command line: `--checkpoint.exclude_from_loading data_loader,lr_scheduler`.

 5. EXAMPLE CHECKPOINT CONFIGURATION
-```
-[checkpoint]
-enable = true
-folder = "checkpoint"
-interval = 10
-load_step = 5
-last_save_model_only = true
-export_dtype = "bfloat16"
+```python
+checkpoint=CheckpointManager.Config(
+    interval=10,
+    load_step=5,
+    last_save_model_only=True,
+    export_dtype="bfloat16",
+),
 ```

-A more exhaustive and up-to-date list of checkpoint config options can be found in `torchtitan/config/job_config.py`
+A more exhaustive and up-to-date list of checkpoint config options can be found in `torchtitan/components/checkpoint.py` (`CheckpointManager.Config`).

 ## Creating a seed checkpoint
 Sometimes one needs to create a seed checkpoint to initialize a model from step 0.
@@ -60,15 +61,15 @@ A seed checkpoint does initialization of the model on a single CPU, and can be l
 To create a seed checkpoint, use the same model config as you use for training.
 e.g.
 ```bash
-NGPU=1 CONFIG_FILE=<path_to_model_config> ./run_train.sh --checkpoint.enable --checkpoint.create_seed_checkpoint --parallelism.data_parallel_replicate_degree 1 --parallelism.data_parallel_shard_degree 1 --parallelism.tensor_parallel_degree 1 --parallelism.pipeline_parallel_degree 1 --parallelism.context_parallel_degree 1 --parallelism.expert_parallel_degree 1
+NGPU=1 ./run_train.sh --module <module_name> --config <config_name> --checkpoint.create_seed_checkpoint --parallelism.data_parallel_replicate_degree 1 --parallelism.data_parallel_shard_degree 1 --parallelism.tensor_parallel_degree 1 --parallelism.pipeline_parallel_degree 1 --parallelism.context_parallel_degree 1 --parallelism.expert_parallel_degree 1
 ```

 ## Conversion support

 ### HuggingFace
 `torchtitan` offers two ways to work with Hugging Face models: either by directly saving and loading a Hugging Face checkpoint during training, or by using an example conversion script to directly reformat the model weights on cpu.

-1. You can directly save huggingface model weights during training by using the `--checkpoint.last_save_in_hf` and `--checkpoint.last_save_model_only` options together. To directly load a `torchtitan` training session from a huggingface safetensors file, enable `--checkpoint.initial_load_in_hf`, and set either `--model.hf_assets_path` or `--checkpoint.initial_load_path` to the directory containing the huggingface checkpoint. `--checkpoint.initial_load_path` overrides `--model.hf_assets_path` if both are set.
+1. You can directly save huggingface model weights during training by using the `--checkpoint.last_save_in_hf` and `--checkpoint.last_save_model_only` options together. To directly load a `torchtitan` training session from a huggingface safetensors file, enable `--checkpoint.initial_load_in_hf`, and set either `--hf_assets_path` or `--checkpoint.initial_load_path` to the directory containing the huggingface checkpoint. `--checkpoint.initial_load_path` overrides `--hf_assets_path` if both are set.

 2. To directly reformat the weights without the need to run a training loop, run the corresponding conversion script. The naming scheme is `torchtitan`-centric, e.g. convert_from_hf means convert hf->tt.

@@ -84,13 +85,12 @@ python ./scripts/convert_from_hf.py ~/.cache/huggingface/hub/models--meta-llama-
 This guide will walk you through the steps required to convert a checkpoint from `torchtitan` so that it can be loaded into pt format.

 1. CHECKPOINT CONFIGURATION
-```
-[checkpoint]
-enable = true
-folder = "checkpoint"
-interval = 10
-last_save_model_only = true
-export_dtype = "bfloat16"
+```python
+checkpoint=CheckpointManager.Config(
+    interval=10,
+    last_save_model_only=True,
+    export_dtype="bfloat16",
+),
 ```

 2. SAVE THE FINAL CHECKPOINT\
````

docs/converging.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -12,10 +12,10 @@ This note clarifies the recommended practices to follow when testing the loss co

 ## Guidelines

-To validate the correctness of a distributed training technique, one should try to **keep the determinism in the input data to minimize the differences it could cause**. To make sure the global batch size and in general #tokens per iteration stay the same, one can fix the local batch size (`training.local_batch_size`) in the toml config, and at the same time fix the data parallel degree.
+To validate the correctness of a distributed training technique, one should try to **keep the determinism in the input data to minimize the differences it could cause**. To make sure the global batch size and in general #tokens per iteration stay the same, one can fix the local batch size (`training.local_batch_size`) in the config_registry function, and at the same time fix the data parallel degree.

 If the technique is a parallelism (TP/PP/CP/etc)
-- The control set is a 1D FSDP job on `dp` GPUs (or any other verified setups), with a trusted training config (e.g. those under train_configs).
+- The control set is a 1D FSDP job on `dp` GPUs (or any other verified setups), with a trusted training config (e.g. those in config_registry.py).
 - The minimal test set is a 2D job on `dp*p` GPUs, where `p >= 2` is the degree of the experimented parallelism.
 - For some parallelisms, larger `p` may cause larger discrepancies in numeric due to various reasons. For example, current implementation of CP uses `torch.bfloat16` (under default mixed precision training configs) when accumulating intermediate results. A higher `p` is desired to ensure the parallelism works properly, at the cost of more hardware resources.
 - Certain parallelisms may impose additional requirements on the batch size. For instance, PP requires local batch size to be at least the number of microbatches (or equivalently, the number of pipeline stages) to reduce bubbles. A valid comparison example would be 1D FSDP on N GPUs with local batch size 8, and 2D FSDP + PP on 4N GPUs (DP N, PP 4) with Interleaved 1F1B schedule (also with local batch size 8), where each PP rank gets two pipeline stages.
@@ -39,7 +39,7 @@ This is a series of loss-converging tests on Llama 3.1, covering both parallelis
 Results are obtained on 2025/01/21, with the latest `torch`, `torchao`, and `torchtitan`.

 ### Setup
-- Base config: [torchtitan/models/llama3/train_configs/llama3_8b.toml](../torchtitan/models/llama3/train_configs/llama3_8b.toml)
+- Base config: `llama3_8b` (from [config_registry.py](../torchtitan/models/llama3/config_registry.py))
 - `training.local_batch_size = 4`, which is a minimum for Pipeline Parallel with `pipeline_parallel_degree = 2` and `pipeline_parallel_schedule = "Interleaved1F1B"`
 - `training.data_parallel_shard_degree = 8`, resulting in global batch size 32
 - `training.steps = 3000`, `lr_scheduler.warmup_steps = 600`
```

docs/datasets.md

Lines changed: 5 additions & 3 deletions
````diff
@@ -54,10 +54,12 @@ DATASETS = {
 ```

 ### 4. Configure Your Training
-In your training configuration file (`.toml`), set your dataset:
+In your config_registry function, set your dataset:

-```toml
-dataset = "wikipedia"
+```python
+dataloader=HuggingFaceTextDataLoader.Config(
+    dataset="wikipedia",
+),
 ```

 That's it! Your custom dataset is now ready to use with `torchtitan`.
````
