Commit 514f1e6

Merge branch 'InfiniTensor:master' into test0105

2 parents e701194 + 83d11cc · commit 514f1e6

31 files changed: +1447 -651 lines changed

README.md

Lines changed: 160 additions & 9 deletions

# InfiniTrain

[![CI](https://github.com/InfiniTensor/InfiniTrain/actions/workflows/format-check.yaml/badge.svg)](https://github.com/InfiniTensor/InfiniTrain/actions)
[![Issues](https://img.shields.io/github/issues/InfiniTensor/InfiniTrain)](https://github.com/InfiniTensor/InfiniTrain/issues)
[![PR](https://img.shields.io/github/issues-pr/InfiniTensor/InfiniTrain)](https://github.com/InfiniTensor/InfiniTrain/pulls)
[![License](https://img.shields.io/github/license/InfiniTensor/InfiniTrain)](https://github.com/InfiniTensor/InfiniTrain/blob/master/LICENSE)

A from-scratch C++ training framework for large-scale models with multi-dimensional distributed parallelism.
## 🚀 Quick Start

### System Requirements

#### Hardware Requirements

- **Recommended**: NVIDIA Ampere-class GPUs (A100/A800) or newer

#### Software Requirements

- **CUDA / NCCL**: Latest stable versions
- **gcc / g++**: Version **13+**
- **CMake**: Version **3.13+**
### Installation

```bash
mkdir build
cd build
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
make -j
```
Build Options:

- `USE_CUDA=ON`

  Enable CUDA backend support.

- `USE_NCCL=ON`

  Enable NCCL-based distributed communication.

> Both options are optional and can be disabled for CPU-only builds.
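A CPU-only build simply leaves both options off. The following is a minimal sketch, assuming the two flags default to `OFF` as the note above implies:

```bash
# CPU-only build sketch: USE_CUDA and USE_NCCL are left at their defaults (OFF)
mkdir build && cd build
cmake ..
make -j
```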
## ✨ InfiniTrain Overview

### ✔ Support Matrix

| Category | Feature | Description | Status |
| ------------------------- | ------------------------------- | ---------------------------------------------------- | -------------- |
| Model Support | GPT-2 | Decoder-only Transformer language model | ✔ Supported |
| | LLaMA 3 | Modern LLaMA-family Transformer architecture | ✔ Supported |
| | DeepSeek-V3 | Large-scale MoE-based language model | 🗓 Planned |
| Precision | Multiple Data Types | FP32, BF16 | ✔ Supported |
| | Mixed Precision | Autocast-based BF16 compute with FP32 accumulation | ✔ Supported |
| Distributed Training | Data Parallel (DP) | Parameter-server-style data parallelism | ✔ Supported |
| | Distributed Data Parallel (DDP) | Collective-based data parallelism | ✔ Supported |
| | Tensor Parallelism (TP) | Intra-layer tensor sharding | ✔ Supported |
| | Sequence Parallelism (SP) | Sequence-dimension sharding | ✔ Supported |
| | Pipeline Parallelism (PP) | GPipe, 1F1B scheduling, Virtual Pipeline (vPP) | ✔ Supported |
| | Hybrid Parallelism | Arbitrary combination of DDP + TP + SP + PP | ✔ Supported |
| Core Components | Multi-backend | CPU and CUDA execution backends | ✔ Supported |
| | Multi-node Distributed Training | Distributed execution across multiple nodes | ✔ Supported |
| | Kernel Dispatcher | Kernel registration and dynamic dispatch mechanism | ✔ Supported |
| | Autograd | Automatic differentiation engine | ✔ Supported |
| | Autocast | Automatic mixed-precision runtime | ✔ Supported |
| Performance Optimizations | Compute–Comm Overlap | Explicit scheduling to hide communication latency | ✔ Supported |
| | DDP Gradient Bucketing | Deferred and bucketed gradient synchronization | ✔ Supported |
| | ZeRO-DP | DistributedOptimizer-based ZeRO-1 | 🚧 In Progress |
| Execution Mode | Training Mode | Full forward–backward training with autograd | ✔ Supported |
| | `no_grad` Inference | Forward-only execution without gradient tracking | ✔ Supported |
| Debugging & Tooling | Built-in Profiler | Kernel-level performance profiling | ✔ Supported |
| | Automated Benchmarking | One-click execution, log analysis, and Feishu export | ✔ Supported |
## 🏋️ Training

Each model in the `example/` directory is compiled into an independent executable.
For example, the `llama3` example produces a binary named `llama3`.

To view available runtime options:

```bash
./llama3 --help
```
### Getting Started

The following examples demonstrate **LLaMA 3 supervised fine-tuning (SFT)** using InfiniTrain.

#### Single-node Training Example

```bash
./llama3 \
    --device cuda \
    --input_bin [training_data_path] \
    --llmc_filepath [model_path] \
    --num_iteration 10
```
#### Multi-node Training Example (3D Parallelism)

```bash
./infini_run \
    --nnodes=2 \
    --nproc_per_node=1 \
    --node_rank=[rank_id] \
    -- ./llama3 \
    --device cuda \
    --input_bin [training_data_path] \
    --llmc_filepath [model_path] \
    --num_iteration 10 \
    --nthread_per_process 8 \
    --batch_size 40 \
    --total_batch_size 10240 \
    --tensor_parallel 2 \
    --pipeline_parallel 2 \
    --sequence_parallel
```
### Parallelism Strategies

#### Distributed Data Parallelism (DDP)

```bash
--nthread_per_process 8   # ddp_size = nthread_per_process / (tensor_parallel × pipeline_parallel)
```
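As a worked example of the formula above (hypothetical values): 8 worker threads split across 2-way TP and 2-way PP leave a 2-way DDP group.

```bash
--nthread_per_process 8   # total worker threads in this process
--tensor_parallel 2       # 2-way TP
--pipeline_parallel 2     # 2-way PP
# ddp_size = 8 / (2 × 2) = 2  →  2-way data parallelism
```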
#### Tensor Parallelism (TP)

```bash
--tensor_parallel 4   # 4-way tensor parallelism
--sequence_parallel   # Enable sequence parallelism (requires TP > 1)
```

#### Pipeline Parallelism (PP)

```bash
--pipeline_parallel 8           # 8 pipeline stages
--virtual_pipeline_parallel 4   # Virtual pipeline (vPP) for better load balancing
```
#### Combining Parallelism Strategies

Multiple parallelism strategies (DDP, TP, SP, PP) can be freely combined to scale training across devices and nodes.
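For instance, a single-node hybrid run might combine all four strategies. The command below is a sketch with illustrative values, reusing the flags shown above:

```bash
# Hypothetical single-node layout on 8 devices: TP = 2, PP = 2, SP enabled,
# leaving ddp_size = 8 / (2 × 2) = 2 for data parallelism
./llama3 \
    --device cuda \
    --input_bin [training_data_path] \
    --llmc_filepath [model_path] \
    --nthread_per_process 8 \
    --tensor_parallel 2 \
    --sequence_parallel \
    --pipeline_parallel 2 \
    --num_iteration 10
```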
## 🗺 Roadmap

- **2025/03/10** — InfiniTrain **v0.1.0**

  Initial framework prototype with MNIST CPU training.

- **2025/04/30** — InfiniTrain **v0.3.0**

  Added Autograd support and GPT-2 training on CPU/CUDA.

- **2025/07/09** — InfiniTrain **v0.4.0**

  Introduced kernel registration, LLaMA training on CPU/CUDA, BF16 precision, and Data Parallelism.

- **2025/12/31** — InfiniTrain **v0.5.0**

  Added Autocast, multi-dimensional distributed parallelism (DDP, TP, SP, PP with GPipe / 1F1B / vPP), multi-node training, `no_grad` mode, and communication–computation overlap with bucketed gradient synchronization.

example/gpt2/main.cc

Lines changed: 29 additions & 21 deletions
@@ -64,6 +64,7 @@ DEFINE_int32(
 DEFINE_uint32(tensor_parallel, 1, "Tensor Parallel world size");
 DEFINE_bool(sequence_parallel, false, "Whether to enable Sequence Parallel");
 DEFINE_uint32(pipeline_parallel, 1, "Pipeline Parallel world size, specified the number of PP stages.");
+DEFINE_uint32(virtual_pipeline_parallel, 1, "Number of chunks in PP stage.");

 // precision
 DEFINE_string(dtype, "float32", "precision used in training (float32/bfloat16)");
@@ -187,15 +188,35 @@ void Train(const nn::parallel::Rank &rank) {
         LOG(FATAL) << "Rank " << rank.GlobalRank() << ": Datatype " << FLAGS_dtype << " not supported.";
     }

-    // NOTE(dcj): Complete all device (.to(device)) and dtype (.to(dtype)) conversions
-    // before wrapping the model with DistributedDataParallel (DDP).
-    // Otherwise, DDP’s gradient hooks may be lost because new parameter tensors
-    // are created during the conversion.
-    if (ddp_world_size > 1) {
+    auto num_micro_batches = FLAGS_total_batch_size / (FLAGS_batch_size * FLAGS_sequence_length * ddp_world_size);
+
+    // TODO(dcj): support more complex optimizer later
+    auto optimizer = optimizers::SGD(model->Parameters(), FLAGS_learning_rate);
+
+    if (pp_world_size > 1) {
+        // NOTE(dcj): To ensure that the tensor shapes at the pipeline stage boundaries remain correct
+        // when sequence parallelism (SP) is enabled, we need to divide by sp_world_size.
+        auto shapes = std::vector<std::vector<int64_t>>{
+            {FLAGS_batch_size, FLAGS_sequence_length / sp_world_size, model_config.n_embd}};
+
+        model = std::make_shared<nn::parallel::PipelineParallel>(
+            model, pp_world_size, num_micro_batches, shapes, pp_rank, std::make_shared<optimizers::SGD>(optimizer),
+            rank.thread_rank(), std::dynamic_pointer_cast<GPT2>(model)->GetChunkSize());
+        if (ddp_world_size > 1) {
+            auto *mutable_chunks = dynamic_cast<nn::parallel::PipelineParallel *>(model.get())->mutable_chunks();
+            for (int chunk_id = 0; chunk_id < mutable_chunks->size(); ++chunk_id) {
+                (*mutable_chunks)[chunk_id]
+                    = std::make_shared<DistributedDataParallel>(mutable_chunks->at(chunk_id), rank.thread_rank());
+            }
+        }
+    } else if (ddp_world_size > 1) {
+        // NOTE(dcj): Complete all device (.to(device)) and dtype (.to(dtype)) conversions
+        // before wrapping the model with DistributedDataParallel (DDP).
+        // Otherwise, DDP’s gradient hooks may be lost because new parameter tensors
+        // are created during the conversion.
         model = std::make_shared<DistributedDataParallel>(model, rank.thread_rank());
     }

-    auto num_micro_batches = FLAGS_total_batch_size / (FLAGS_batch_size * FLAGS_sequence_length * ddp_world_size);
     DistributedDataLoader train_loader(std::make_shared<TinyShakespeareDataset>(FLAGS_input_bin, FLAGS_sequence_length),
                                        pp_world_size > 1 ? FLAGS_batch_size * num_micro_batches : FLAGS_batch_size,
                                        ddp_rank, ddp_world_size);
@@ -216,9 +237,6 @@ void Train(const nn::parallel::Rank &rank) {
         tokenizer = std::make_unique<Tokenizer>(FLAGS_tokenizer_bin);
     }

-    // TODO(dcj): support more complex optimizer later
-    auto optimizer = optimizers::SGD(model->Parameters(), FLAGS_learning_rate);
-
     auto train_iter = train_loader.begin();
     std::shared_ptr<nn::Module> loss_fn
         = (tp_world_size > 1) ? std::static_pointer_cast<nn::Module>(
@@ -227,17 +245,6 @@ void Train(const nn::parallel::Rank &rank) {
     loss_fn->To(device);
     LOG(INFO) << "Rank " << rank.GlobalRank() << ": start training";

-    if (pp_world_size > 1) {
-        // NOTE(dcj): To ensure that the tensor shapes at the pipeline stage boundaries remain correct
-        // when sequence parallelism (SP) is enabled, we need to divide by sp_world_size.
-        auto shapes = std::vector<std::vector<int64_t>>{
-            {FLAGS_batch_size, FLAGS_sequence_length / sp_world_size, model_config.n_embd}};
-
-        model = std::make_shared<nn::parallel::PipelineParallel>(model, pp_world_size, num_micro_batches, shapes,
-                                                                 pp_rank, std::make_shared<optimizers::SGD>(optimizer),
-                                                                 rank.thread_rank());
-    }
-
     LOG(INFO) << "start training";

     for (int step = 0; step < FLAGS_num_iteration + 1; ++step) {
@@ -293,6 +300,7 @@ void Train(const nn::parallel::Rank &rank) {
             auto logits = model->Forward({x, y})[0];
             LOG(INFO) << "Rank " << rank.GlobalRank() << ": finish model forward, start loss forward";
             auto loss = loss_fn->Forward({logits, y})[0];
+            // FIXME(jym): verify gradient accumulation precision
             loss = loss / grad_accum_steps;

             // disable autocast for the current step (backward is not under autocast)
@@ -356,7 +364,7 @@ int main(int argc, char *argv[]) {
     google::InitGoogleLogging(argv[0]);

     nn::parallel::global::InitAllEnv(FLAGS_nthread_per_process, FLAGS_tensor_parallel, FLAGS_sequence_parallel,
-                                     FLAGS_pipeline_parallel);
+                                     FLAGS_pipeline_parallel, FLAGS_virtual_pipeline_parallel);

     LOG(INFO) << nn::parallel::global::ProcessGroupOverview();
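The commit also threads the new `virtual_pipeline_parallel` gflag through `InitAllEnv`, so the GPT-2 example can be launched with virtual pipeline chunks. A hypothetical invocation is sketched below, assuming the `example/gpt2` target builds a `gpt2` binary following the naming convention described in the README; flag values are illustrative only:

```bash
# Sketch: 8 worker threads, 2 pipeline stages, each stage split into 2 virtual chunks (vPP);
# ddp_size = nthread_per_process / (tensor_parallel × pipeline_parallel) = 8 / (1 × 2) = 4
./gpt2 \
    --input_bin [training_data_path] \
    --nthread_per_process 8 \
    --pipeline_parallel 2 \
    --virtual_pipeline_parallel 2 \
    --num_iteration 10
```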