`docs/software/ml/pytorch.md`: 7 additions, 3 deletions
@@ -176,11 +176,11 @@ For further details on execution logic, job monitoring and data management, plea
Use of `--environment` is currently only recommended for the `srun` command.
!!! note "Optimizing large-scale training jobs"
-    The following settings were established to **improve compute throughput** of LLM training in `Megatron-LM`:
+    The following settings were established to **improve compute throughput** of LLM training in [Megatron-LM](https://github.com/NVIDIA/Megatron-LM):
-    * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available. In `Megatron-LM`, avoid using the option `--defer-embedding-wgrad-compute` to defer the embedding gradient computation. Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark.
+    * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available. Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark.
-    * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature has been observed to improve performance in large-scale LLM training in `Megatron-LM`. This can be achieved by adding these constraints to the sbatch script:
+    * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature has been observed to improve performance in large-scale LLM training in Megatron-LM. This can be achieved by adding these constraints to the sbatch script:
```bash
#SBATCH -C thp_never&nvidia_vboost_enabled
```
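For illustration only, a minimal sketch of how these constraints could be combined with a parallelization sweep in the sbatch script; the node counts, environment name, and Megatron-LM flag values below are placeholders to be tuned per model and machine, not recommendations:

```bash
#!/bin/bash
# Constraints from the note above: disable transparent huge pages, enable vboost
#SBATCH -C thp_never&nvidia_vboost_enabled
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4

# Placeholder parallelism settings: sweep data-, tensor-, pipeline- and
# virtual-pipeline parallelism systematically instead of copying these values.
srun --environment=my-pytorch-env python pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 4 \
    --num-layers-per-virtual-pipeline-stage 2 \
    --micro-batch-size 1 \
    --global-batch-size 1024
```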
@@ -206,6 +206,8 @@ For further details on execution logic, job monitoring and data management, plea
### Known Issues
+The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) of every NGC PyTorch container contain a selected list of known issues.
+
??? info "Errors hidden by failures in UCX signal handler"
Application errors may trigger the UCX signal handler in the NGC container, which has caused secondary failures in the past, shadowing the initial error trace. These secondary failures may be significantly harder to fix than the initial problem.
@@ -228,6 +230,8 @@ For further details on execution logic, job monitoring and data management, plea
export UCX_HANDLE_ERRORS=none
```
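As an illustrative sketch, the variable would typically be exported in the sbatch script before the training step is launched (the environment and script names below are placeholders):

```bash
# Disable the UCX error handler so the original PyTorch traceback is not shadowed
export UCX_HANDLE_ERRORS=none
srun --environment=my-pytorch-env python train.py
```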
+??? info "Avoid `--defer-embedding-wgrad-compute` in Megatron-LM"
+    In Megatron-LM, avoid using the option `--defer-embedding-wgrad-compute` to delay the embedding gradient computation, as it can lead to an incorrect gradient norm that changes upon resuming at a different scale.
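As a hypothetical sanity check, the launch scripts can be grepped to confirm the flag is not passed anywhere (the `scripts/` path is a placeholder):

```bash
# Warn if any launch script still passes the problematic flag
grep -rn -- "--defer-embedding-wgrad-compute" scripts/ \
    && echo "WARNING: remove --defer-embedding-wgrad-compute from the affected script(s)"
```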