Commit 514f1e6

Merge branch 'InfiniTensor:master' into test0105

2 parents e701194 + 83d11cc · commit 514f1e6

31 files changed: +1447 -651 lines changed

README.md

Lines changed: 160 additions & 9 deletions

# InfiniTrain

[![CI](https://github.com/InfiniTensor/InfiniTrain/actions/workflows/format-check.yaml/badge.svg)](https://github.com/InfiniTensor/InfiniTrain/actions)
[![Issues](https://img.shields.io/github/issues/InfiniTensor/InfiniTrain)](https://github.com/InfiniTensor/InfiniTrain/issues)
[![PR](https://img.shields.io/github/issues-pr/InfiniTensor/InfiniTrain)](https://github.com/InfiniTensor/InfiniTrain/pulls)
[![License](https://img.shields.io/github/license/InfiniTensor/InfiniTrain)](https://github.com/InfiniTensor/InfiniTrain/blob/master/LICENSE)

A from-scratch C++ training framework for large-scale models with multi-dimensional distributed parallelism.
## 🚀 Quick Start

### System Requirements

#### Hardware Requirements

- **Recommended**: NVIDIA Ampere-class GPUs (A100/A800) or newer

#### Software Requirements

- **CUDA / NCCL**: Latest stable versions
- **gcc / g++**: Version **13+**
- **CMake**: Version **3.13+**
### Installation

```bash
mkdir build
cd build
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
make -j
```
Build Options:

- `USE_CUDA=ON`

  Enable CUDA backend support.

- `USE_NCCL=ON`

  Enable NCCL-based distributed communication.

> Both options are optional and can be disabled for CPU-only builds.
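A CPU-only build simply leaves both options off. The following is a minimal sketch, assuming the two flags default to `OFF` as the note above implies:

```bash
# CPU-only build sketch: USE_CUDA and USE_NCCL are left at their defaults (OFF)
mkdir build && cd build
cmake ..
make -j
```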
## ✨ InfiniTrain Overview

### ✔ Support Matrix

| Category | Feature | Description | Status |
| ------------------------- | ------------------------------- | ---------------------------------------------------- | -------------- |
| Model Support | GPT-2 | Decoder-only Transformer language model | ✔ Supported |
| | LLaMA 3 | Modern LLaMA-family Transformer architecture | ✔ Supported |
| | DeepSeek-V3 | Large-scale MoE-based language model | 🗓 Planned |
| Precision | Multiple Data Types | FP32, BF16 | ✔ Supported |
| | Mixed Precision | Autocast-based BF16 compute with FP32 accumulation | ✔ Supported |
| Distributed Training | Data Parallel (DP) | Parameter-server-style data parallelism | ✔ Supported |
| | Distributed Data Parallel (DDP) | Collective-based data parallelism | ✔ Supported |
| | Tensor Parallelism (TP) | Intra-layer tensor sharding | ✔ Supported |
| | Sequence Parallelism (SP) | Sequence-dimension sharding | ✔ Supported |
| | Pipeline Parallelism (PP) | GPipe, 1F1B scheduling, Virtual Pipeline (vPP) | ✔ Supported |
| | Hybrid Parallelism | Arbitrary combination of DDP + TP + SP + PP | ✔ Supported |
| Core Components | Multi-backend | CPU and CUDA execution backends | ✔ Supported |
| | Multi-node Distributed Training | Distributed execution across multiple nodes | ✔ Supported |
| | Kernel Dispatcher | Kernel registration and dynamic dispatch mechanism | ✔ Supported |
| | Autograd | Automatic differentiation engine | ✔ Supported |
| | Autocast | Automatic mixed-precision runtime | ✔ Supported |
| Performance Optimizations | Compute–Comm Overlap | Explicit scheduling to hide communication latency | ✔ Supported |
| | DDP Gradient Bucketing | Deferred and bucketed gradient synchronization | ✔ Supported |
| | ZeRO-DP | DistributedOptimizer-based ZeRO-1 | 🚧 In Progress |
| Execution Mode | Training Mode | Full forward–backward training with autograd | ✔ Supported |
| | `no_grad` Inference | Forward-only execution without gradient tracking | ✔ Supported |
| Debugging & Tooling | Built-in Profiler | Kernel-level performance profiling | ✔ Supported |
| | Automated Benchmarking | One-click execution, log analysis, and Feishu export | ✔ Supported |
## 🏋️ Training

Each model in the `example/` directory is compiled into an independent executable.
For example, the `llama3` example produces a binary named `llama3`.

To view available runtime options:

```bash
./llama3 --help
```
### Getting Started

The following examples demonstrate **LLaMA 3 supervised fine-tuning (SFT)** using InfiniTrain.

#### Single-node Training Example

```bash
./llama3 \
    --device cuda \
    --input_bin [training_data_path] \
    --llmc_filepath [model_path] \
    --num_iteration 10
```
#### Multi-node Training Example (3D Parallelism)

```bash
./infini_run \
    --nnodes=2 \
    --nproc_per_node=1 \
    --node_rank=[rank_id] \
    -- ./llama3 \
    --device cuda \
    --input_bin [training_data_path] \
    --llmc_filepath [model_path] \
    --num_iteration 10 \
    --nthread_per_process 8 \
    --batch_size 40 \
    --total_batch_size 10240 \
    --tensor_parallel 2 \
    --pipeline_parallel 2 \
    --sequence_parallel
```
### Parallelism Strategies

#### Distributed Data Parallelism (DDP)

```bash
--nthread_per_process 8   # ddp_size = nthread_per_process / (tensor_parallel × pipeline_parallel)
```
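As a worked example of the formula above (hypothetical values): 8 worker threads split across 2-way TP and 2-way PP leave a 2-way DDP group.

```bash
--nthread_per_process 8   # total worker threads in this process
--tensor_parallel 2       # 2-way TP
--pipeline_parallel 2     # 2-way PP
# ddp_size = 8 / (2 × 2) = 2  →  2-way data parallelism
```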
#### Tensor Parallelism (TP)

```bash
--tensor_parallel 4   # 4-way tensor parallelism
--sequence_parallel   # Enable sequence parallelism (requires TP > 1)
```

#### Pipeline Parallelism (PP)

```bash
--pipeline_parallel 8           # 8 pipeline stages
--virtual_pipeline_parallel 4   # Virtual pipeline (vPP) for better load balancing
```
#### Combining Parallelism Strategies

Multiple parallelism strategies (DDP, TP, SP, PP) can be freely combined to scale training across devices and nodes.
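For instance, a single-node hybrid run might combine all four strategies. The command below is a sketch with illustrative values, reusing the flags shown above:

```bash
# Hypothetical single-node layout on 8 devices: TP = 2, PP = 2, SP enabled,
# leaving ddp_size = 8 / (2 × 2) = 2 for data parallelism
./llama3 \
    --device cuda \
    --input_bin [training_data_path] \
    --llmc_filepath [model_path] \
    --nthread_per_process 8 \
    --tensor_parallel 2 \
    --sequence_parallel \
    --pipeline_parallel 2 \
    --num_iteration 10
```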
## 🗺 Roadmap

- **2025/03/10** — InfiniTrain **v0.1.0**

  Initial framework prototype with MNIST CPU training.

- **2025/04/30** — InfiniTrain **v0.3.0**

  Added Autograd support and GPT-2 training on CPU/CUDA.

- **2025/07/09** — InfiniTrain **v0.4.0**

  Introduced kernel registration, LLaMA training on CPU/CUDA, BF16 precision, and Data Parallelism.

- **2025/12/31** — InfiniTrain **v0.5.0**

  Added Autocast, multi-dimensional distributed parallelism (DDP, TP, SP, PP with GPipe / 1F1B / vPP), multi-node training, `no_grad` mode, and communication–computation overlap with bucketed gradient synchronization.

example/gpt2/main.cc

Lines changed: 29 additions & 21 deletions
@@ -64,6 +64,7 @@ DEFINE_int32(
 DEFINE_uint32(tensor_parallel, 1, "Tensor Parallel world size");
 DEFINE_bool(sequence_parallel, false, "Whether to enable Sequence Parallel");
 DEFINE_uint32(pipeline_parallel, 1, "Pipeline Parallel world size, specified the number of PP stages.");
+DEFINE_uint32(virtual_pipeline_parallel, 1, "Number of chunks in PP stage.");

 // precision
 DEFINE_string(dtype, "float32", "precision used in training (float32/bfloat16)");
@@ -187,15 +188,35 @@ void Train(const nn::parallel::Rank &rank) {
         LOG(FATAL) << "Rank " << rank.GlobalRank() << ": Datatype " << FLAGS_dtype << " not supported.";
     }

-    // NOTE(dcj): Complete all device (.to(device)) and dtype (.to(dtype)) conversions
-    // before wrapping the model with DistributedDataParallel (DDP).
-    // Otherwise, DDP’s gradient hooks may be lost because new parameter tensors
-    // are created during the conversion.
-    if (ddp_world_size > 1) {
+    auto num_micro_batches = FLAGS_total_batch_size / (FLAGS_batch_size * FLAGS_sequence_length * ddp_world_size);
+
+    // TODO(dcj): support more complex optimizer later
+    auto optimizer = optimizers::SGD(model->Parameters(), FLAGS_learning_rate);
+
+    if (pp_world_size > 1) {
+        // NOTE(dcj): To ensure that the tensor shapes at the pipeline stage boundaries remain correct
+        // when sequence parallelism (SP) is enabled, we need to divide by sp_world_size.
+        auto shapes = std::vector<std::vector<int64_t>>{
+            {FLAGS_batch_size, FLAGS_sequence_length / sp_world_size, model_config.n_embd}};
+
+        model = std::make_shared<nn::parallel::PipelineParallel>(
+            model, pp_world_size, num_micro_batches, shapes, pp_rank, std::make_shared<optimizers::SGD>(optimizer),
+            rank.thread_rank(), std::dynamic_pointer_cast<GPT2>(model)->GetChunkSize());
+        if (ddp_world_size > 1) {
+            auto *mutable_chunks = dynamic_cast<nn::parallel::PipelineParallel *>(model.get())->mutable_chunks();
+            for (int chunk_id = 0; chunk_id < mutable_chunks->size(); ++chunk_id) {
+                (*mutable_chunks)[chunk_id]
+                    = std::make_shared<DistributedDataParallel>(mutable_chunks->at(chunk_id), rank.thread_rank());
+            }
+        }
+    } else if (ddp_world_size > 1) {
+        // NOTE(dcj): Complete all device (.to(device)) and dtype (.to(dtype)) conversions
+        // before wrapping the model with DistributedDataParallel (DDP).
+        // Otherwise, DDP’s gradient hooks may be lost because new parameter tensors
+        // are created during the conversion.
         model = std::make_shared<DistributedDataParallel>(model, rank.thread_rank());
     }

-    auto num_micro_batches = FLAGS_total_batch_size / (FLAGS_batch_size * FLAGS_sequence_length * ddp_world_size);
     DistributedDataLoader train_loader(std::make_shared<TinyShakespeareDataset>(FLAGS_input_bin, FLAGS_sequence_length),
                                        pp_world_size > 1 ? FLAGS_batch_size * num_micro_batches : FLAGS_batch_size,
                                        ddp_rank, ddp_world_size);
@@ -216,9 +237,6 @@ void Train(const nn::parallel::Rank &rank) {
         tokenizer = std::make_unique<Tokenizer>(FLAGS_tokenizer_bin);
     }

-    // TODO(dcj): support more complex optimizer later
-    auto optimizer = optimizers::SGD(model->Parameters(), FLAGS_learning_rate);
-
     auto train_iter = train_loader.begin();
     std::shared_ptr<nn::Module> loss_fn
         = (tp_world_size > 1) ? std::static_pointer_cast<nn::Module>(
@@ -227,17 +245,6 @@ void Train(const nn::parallel::Rank &rank) {
     loss_fn->To(device);
     LOG(INFO) << "Rank " << rank.GlobalRank() << ": start training";

-    if (pp_world_size > 1) {
-        // NOTE(dcj): To ensure that the tensor shapes at the pipeline stage boundaries remain correct
-        // when sequence parallelism (SP) is enabled, we need to divide by sp_world_size.
-        auto shapes = std::vector<std::vector<int64_t>>{
-            {FLAGS_batch_size, FLAGS_sequence_length / sp_world_size, model_config.n_embd}};
-
-        model = std::make_shared<nn::parallel::PipelineParallel>(model, pp_world_size, num_micro_batches, shapes,
-                                                                 pp_rank, std::make_shared<optimizers::SGD>(optimizer),
-                                                                 rank.thread_rank());
-    }
-
     LOG(INFO) << "start training";

     for (int step = 0; step < FLAGS_num_iteration + 1; ++step) {
@@ -293,6 +300,7 @@ void Train(const nn::parallel::Rank &rank) {
             auto logits = model->Forward({x, y})[0];
             LOG(INFO) << "Rank " << rank.GlobalRank() << ": finish model forward, start loss forward";
             auto loss = loss_fn->Forward({logits, y})[0];
+            // FIXME(jym): verify gradient accumulation precision
             loss = loss / grad_accum_steps;

             // disable autocast for the current step (backward is not under autocast)
@@ -356,7 +364,7 @@ int main(int argc, char *argv[]) {
     google::InitGoogleLogging(argv[0]);

     nn::parallel::global::InitAllEnv(FLAGS_nthread_per_process, FLAGS_tensor_parallel, FLAGS_sequence_parallel,
-                                     FLAGS_pipeline_parallel);
+                                     FLAGS_pipeline_parallel, FLAGS_virtual_pipeline_parallel);

     LOG(INFO) << nn::parallel::global::ProcessGroupOverview();
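The commit also threads the new `virtual_pipeline_parallel` gflag through `InitAllEnv`, so the GPT-2 example can be launched with virtual pipeline chunks. A hypothetical invocation is sketched below, assuming the `example/gpt2` target builds a `gpt2` binary following the naming convention described in the README; flag values are illustrative only:

```bash
# Sketch: 8 worker threads, 2 pipeline stages, each stage split into 2 virtual chunks (vPP);
# ddp_size = nthread_per_process / (tensor_parallel × pipeline_parallel) = 8 / (1 × 2) = 4
./gpt2 \
    --input_bin [training_data_path] \
    --nthread_per_process 8 \
    --pipeline_parallel 2 \
    --virtual_pipeline_parallel 2 \
    --num_iteration 10
```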