-| Training Speed | 2.93s/it | 6.02s/it | 24.30s/it |
-| GPU Memory Usage | 8\*66GB | 8\*72GB | 8\*50GB |
+| Training Speed | 2.95s/it | 6.02s/it | 24.30s/it |
+| GPU Memory Usage | 8\*57GB | 8\*72GB | 8\*50GB |

## Command Line Arguments
@@ -239,8 +242,8 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
 - 🔥sequence_parallel: Enable sequence parallel optimization. Default is False.
 - 🔥context_parallel_size: CP (Context Parallelism) size, default is 1.
 - tp_comm_overlap: Overlap tensor parallel communication with GEMM (General Matrix Multiplication) kernels (to reduce communication time). Default is False.
-- overlap_grad_reduce: Overlap grad reduction operations in DDP (to reduce DP communication time). Default is False.
-- overlap_param_gather: Overlap all-gather of parameters in the distributed optimizer (to reduce DP communication time). Default is False.
+- 🔥overlap_grad_reduce: Overlap grad reduction operations in DDP (to reduce DP communication time). Default is False.
+- 🔥overlap_param_gather: Overlap all-gather of parameters in the distributed optimizer (to reduce DP communication time). Default is False.
 - distributed_timeout_minutes: The timeout duration for torch.distributed (in minutes). This parameter is deprecated and is now controlled by `ddp_timeout` in the [Base Arguments](./Command-line-parameters.md#base-arguments), with a default value of 300000 minutes.
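As a rough illustration of how the overlap and parallelism flags touched by this hunk might be combined in a launch command, here is a minimal sketch. It is not taken from this PR: the `megatron sft` entry point, the `NPROC_PER_NODE` variable, and the `--load`/`--dataset` placeholders are assumptions; only the parallelism, overlap, and timeout flags mirror the argument list documented above.

```shell
# Sketch only: entry point, checkpoint path, and dataset are placeholders.
# The flags below correspond to the arguments documented in this section.
NPROC_PER_NODE=8 \
megatron sft \
    --load <mcore_checkpoint_path> \
    --dataset <dataset> \
    --sequence_parallel true \
    --context_parallel_size 1 \
    --tp_comm_overlap true \
    --overlap_grad_reduce true \
    --overlap_param_gather true \
    --ddp_timeout 300000
```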