docs/training.md
@@ -427,6 +427,27 @@ For MoE models, expert parallel with MoE kernels can be enabled using the `--fas
2. The data parallel degree multiplied by the context parallel degree should equal the total number of GPUs being used.
3. The context parallel degree determines the number of chunks a sequence is divided into and distributed across GPUs; therefore it should be chosen as the minimum needed to accommodate the sequence length (see the sketch below).
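
The sketch below is a minimal illustration of constraints 2 and 3, not part of the training CLI; the helper names and the per-GPU sequence-length limit are hypothetical. It picks the smallest context parallel degree that fits the sequence, then derives the data parallel degree so the product matches the GPU count.

```python
import math


def pick_cp_degree(seq_len: int, max_seq_len_per_gpu: int) -> int:
    """Smallest context parallel degree that fits the sequence (constraint 3)."""
    return math.ceil(seq_len / max_seq_len_per_gpu)


def validate_parallelism(world_size: int, dp_degree: int, cp_degree: int) -> None:
    """Data parallel degree x context parallel degree must equal the GPU count (constraint 2)."""
    if dp_degree * cp_degree != world_size:
        raise ValueError(
            f"dp_degree ({dp_degree}) * cp_degree ({cp_degree}) != world_size ({world_size})"
        )


# Example (hypothetical numbers): 16 GPUs, 64k-token sequences, at most 16k tokens per GPU.
cp = pick_cp_degree(seq_len=65536, max_seq_len_per_gpu=16384)  # -> 4
dp = 16 // cp                                                  # -> 4
validate_parallelism(world_size=16, dp_degree=dp, cp_degree=cp)
```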
Further, the ablations below can be used as reference configurations.
#### Ablations
##### Parity Experiments
| model | experiment setting | loss | tps per gpu |
| --- | --- | --- | --- |
1. Load balancing is removed given the limited support in the Mamba CP implementation. This could lead to potential throughput drops for trainings using a causal mask.