
Commit 2c7904c

Updating known issues
1 parent 205d4b5 commit 2c7904c

File tree

1 file changed: +7 −3 lines changed

docs/software/ml/pytorch.md

Lines changed: 7 additions & 3 deletions
@@ -176,11 +176,11 @@ For further details on execution logic, job monitoring and data management, plea
 Use of `--environment` is currently only recommended for the `srun` command.
 
 !!! note "Optimizing large-scale training jobs"
-    The following settings were established to **improve compute throughput** of LLM training in `Megatron-LM`:
+    The following settings were established to **improve compute throughput** of LLM training in [Megatron-LM](https://github.com/NVIDIA/Megatron-LM):
 
-    * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available. In `Megatron-LM`, avoid using the option '--defer-embedding-wgrad-compute` to defer the embedding gradient computation. Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark.
+    * Extensively evaluate all possible parallelization dimensions, including data-, tensor- and pipeline parallelism (including virtual pipeline parallelism) and more, when available. Identify storage-related bottlenecks by isolating data loading/generation operations into a separate benchmark.
 
-    * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature has been observed to improve performance in large-scale LLM training in `Megatron-LM`. This can be achieved by adding these constraints to the sbatch script:
+    * Disabling transparent huge pages and enabling the Nvidia [vboost](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#gpu-core-clock-optimization) feature has been observed to improve performance in large-scale LLM training in Megatron-LM. This can be achieved by adding these constraints to the sbatch script:
     ```bash
     #SBATCH -C thp_never&nvidia_vboost_enabled
     ```
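For illustration, a minimal sbatch sketch of how the constraint above and the `srun --environment` recommendation could fit together; the job name, node counts, time limit, environment name (`ngc-pytorch`), and training command are placeholder assumptions, not values taken from this documentation:

```bash
#!/bin/bash
# Minimal sketch only: job geometry, environment name and training command
# are placeholders, not values from this documentation.
#SBATCH --job-name=megatron-train
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4
#SBATCH --time=01:00:00
# Disable transparent huge pages and enable Nvidia vboost (see note above).
#SBATCH -C thp_never&nvidia_vboost_enabled

# Pass the container environment to srun only, as recommended above.
srun --environment=ngc-pytorch python pretrain_gpt.py
```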
@@ -206,6 +206,8 @@ For further details on execution logic, job monitoring and data management, plea
 
 ### Known Issues
 
+The [Release Notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) of every NGC PyTorch container contain a selected list of known issues.
+
 ??? info "Errors hidden by failures in UCX signal handler"
     Application errors may trigger the UCX signal handler in the NGC container, which has caused secondary failures in the past, shadowing the initial error trace. These secondary failures may be significantly harder to fix than the initial problem.
 
@@ -228,6 +230,8 @@ For further details on execution logic, job monitoring and data management, plea
     export UCX_HANDLE_ERRORS=none
     ```
 
+??? info "Avoid `--defer-embedding-wgrad-compute` in Megatron-LM"
+    In Megatron-LM, avoid using the option `--defer-embedding-wgrad-compute` to delay the embedding gradient computation, as it can lead to an incorrect gradient norm that changes when resuming at a different scale.
 
 [](){#ref-uenv-pytorch}
 ## Running PyTorch with a uenv
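As a sketch of how the two known-issue entries in this hunk could be applied in a job script: only `UCX_HANDLE_ERRORS` and the `--defer-embedding-wgrad-compute` flag name come from the documentation; the launcher line and the remaining training arguments below are assumptions for illustration.

```bash
# Sketch only: surface the original error trace instead of a secondary
# failure from the UCX signal handler by disabling UCX error handling.
export UCX_HANDLE_ERRORS=none

# When launching Megatron-LM, simply do not pass --defer-embedding-wgrad-compute.
# The launcher and the arguments below are placeholders.
torchrun --nproc-per-node=4 pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 1
```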
