
Commit b1d2573 (1 parent: eb64179)

Fixing error in deployment config for in-framework deployment (#381)

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

File tree

1 file changed: +2, -14 lines

docs/llm/nemo_models/in-framework-ray.md

Lines changed: 2 additions & 14 deletions
@@ -173,19 +173,7 @@ Deploy multiple replicas of your NeMo model for increased throughput:
 
 Enable performance optimizations for faster inference:
 
-1. **CUDA Graphs**: Reduces kernel launch overhead:
-
-```shell
-python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
-    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
-    --model_id llama \
-    --enable_cuda_graphs \
-    --num_gpus 2 \
-    --tensor_model_parallel_size 2 \
-    --cuda_visible_devices "0,1"
-```
-
-2. **Flash Attention Decode**: Optimizes attention computation:
+1. **Flash Attention Decode**: Optimizes attention computation:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
@@ -197,7 +185,7 @@ Enable performance optimizations for faster inference:
     --cuda_visible_devices "0,1"
 ```
 
-3. **Combined Optimizations**:
+2. **Flash Attention Decode and Cuda Graphs**:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
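For context, the commit folds the deleted "CUDA Graphs" item into the renamed "Flash Attention Decode and Cuda Graphs" item. A sketch of what that combined invocation could look like, using only the checkpoint path and flags visible elsewhere in this diff; the flag that enables flash attention decode does not appear in the shown context, so it is deliberately omitted rather than guessed:

```shell
# Hedged sketch, not the exact command from the docs page: every flag below is
# taken from the removed CUDA Graphs hunk in this diff. The flash-attention-
# decode flag is not visible in the diff context, so it is left out here.
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --enable_cuda_graphs \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --cuda_visible_devices "0,1"
```

This is a deployment invocation, not a runnable snippet: it assumes the Export-Deploy container layout (`/opt/Export-Deploy`, `/opt/checkpoints`) and two visible GPUs for tensor parallelism.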

0 commit comments
