Commit 80d25b9

docs: add documentation
Signed-off-by: Mehant Kammakomati <[email protected]>
1 parent c9fcdfe commit 80d25b9

File tree

1 file changed: +21 -0 lines changed

docs/training.md

Lines changed: 21 additions & 0 deletions
@@ -427,6 +427,27 @@ For MoE models, expert parallel with MoE kernels can be enabled using the `--fas
2. Data parallel degree multiplied by context parallel degree should be equal to the total number of GPUs being used.
3. Context parallel degree determines the number of chunks the sequence is divided into and distributed across the GPUs; therefore it should be chosen as the minimum needed to accommodate the sequence length (a minimal sketch of this choice is shown below).

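As a rough illustration of these constraints, the sketch below checks that the data parallel and context parallel degrees multiply to the world size, and picks the smallest context parallel degree whose per-GPU chunk fits an assumed token budget. The helper names and the `max_tokens_per_gpu` budget are illustrative assumptions, not part of the training CLI.

```python
# Illustrative sketch only: these helpers and the per-GPU token budget are
# assumptions for illustration, not part of the actual training tooling.

def validate_degrees(world_size: int, dp: int, cp: int) -> None:
    # Constraint above: data parallel degree * context parallel degree == total GPUs.
    if dp * cp != world_size:
        raise ValueError(f"dp ({dp}) * cp ({cp}) must equal world size ({world_size})")

def choose_cp_degree(seq_len: int, max_tokens_per_gpu: int, world_size: int) -> int:
    # cp splits the sequence into cp chunks, one per GPU, so pick the smallest
    # divisor of world_size whose chunk fits the assumed per-GPU token budget.
    for cp in (d for d in range(1, world_size + 1) if world_size % d == 0):
        if seq_len / cp <= max_tokens_per_gpu:
            return cp
    raise ValueError("sequence does not fit even with cp == world_size")

# Example: 8 GPUs and 131072-token sequences with an assumed 16K-token budget
# per GPU -> cp = 8, which forces dp = 1.
cp = choose_cp_degree(seq_len=131072, max_tokens_per_gpu=16384, world_size=8)
validate_degrees(world_size=8, dp=8 // cp, cp=cp)
print(cp, 8 // cp)  # -> 8 1
```
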
Further, the ablations below can be used as reference configurations.

#### Ablations

##### Parity Experiments

| model | experiment setting | loss | tps (tokens/s) per GPU |
| -------- | -------- | ------- | ------- |
| ibm-granite/granite-4.0-h-tiny | cp8-ebs4-s8192-gas1 | 0.8059140625 | 973.6 |
| ibm-granite/granite-4.0-h-tiny | cp8-ebs4-s8192-gas1-ep8 | 0.80224609375 | 2367.6 |
| ibm-granite/granite-4.0-h-tiny | cp8-ebs4-s8192-gas2 | 0.8059765625 | NA |
| ibm-granite/granite-4.0-h-tiny | cp4-dp2-ebs4-s8192-gas1 | 0.802953125 | 953.4 |
| ibm-granite/granite-4.0-h-tiny | cp1-dp4-ep4-ebs4-s8192-gas1 | 0.7967056884765625 | 2576 |

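The experiment setting strings appear to be hyphen-separated key/number tokens, where cp, dp, and ep are read here as the context, data, and expert parallel degrees, ebs as the effective batch size, s as the sequence length, and gas as gradient accumulation steps. That reading is an assumption from the surrounding docs, and the small parser below is only a sketch of it.

```python
import re

# Assumed key meanings (not confirmed elsewhere in the docs): cp/dp/ep are the
# context/data/expert parallel degrees, ebs is the effective batch size,
# s is the sequence length, gas is gradient accumulation steps.
KNOWN_KEYS = {"cp", "dp", "ep", "ebs", "s", "gas"}

def parse_setting(setting: str) -> dict:
    # Split e.g. "cp4-dp2-ebs4-s8192-gas1" into {"cp": 4, "dp": 2, "ebs": 4, "s": 8192, "gas": 1}.
    parsed = {}
    for token in setting.split("-"):
        match = re.fullmatch(r"([a-z]+)(\d+)", token)
        if match is None or match.group(1) not in KNOWN_KEYS:
            raise ValueError(f"unrecognized token in setting string: {token}")
        parsed[match.group(1)] = int(match.group(2))
    return parsed

print(parse_setting("cp1-dp4-ep4-ebs4-s8192-gas1"))
# -> {'cp': 1, 'dp': 4, 'ep': 4, 'ebs': 4, 's': 8192, 'gas': 1}
```
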
##### Long Context (sequence length 131072, i.e. 128K)

| model | experiment setting | tps (tokens/s) per GPU | GPU memory utilization ratio |
| -------- | -------- | ------- | ------- |
| ibm-granite/granite-4.0-h-tiny | cp8-ebs1-s131072-gas1-ep8 | 1462.8 | 0.5140136719 |
| ibm-granite/granite-4.0-h-small | cp8-ebs1-s131072-gas1-ep8 | 682.7 | 0.9887207031 |

### Known Limitations

1. Load balancing is removed given the limited support in the Mamba context parallel implementation. This could lead to potential throughput drops for training runs that use a causal mask.
