@@ -173,19 +173,7 @@ Deploy multiple replicas of your NeMo model for increased throughput:
 
 Enable performance optimizations for faster inference:
 
-1. **CUDA Graphs**: Reduces kernel launch overhead:
-
-   ```shell
-   python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
-       --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
-       --model_id llama \
-       --enable_cuda_graphs \
-       --num_gpus 2 \
-       --tensor_model_parallel_size 2 \
-       --cuda_visible_devices "0,1"
-   ```
-
-2. **Flash Attention Decode**: Optimizes attention computation:
+1. **Flash Attention Decode**: Optimizes attention computation:
 
    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
@@ -197,7 +185,7 @@ Enable performance optimizations for faster inference:
        --cuda_visible_devices "0,1"
    ```
 
-3. **Combined Optimizations**:
+2. **Flash Attention Decode and Cuda Graphs**:
 
    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
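The renamed step 2 combines flash-attention decode with CUDA graphs, but its command is cut off at the end of this diff view. As a minimal sketch only, reusing the flags visible in the removed hunk: the `--enable_flash_decode` flag name is an assumption not shown in this diff, so verify both flag names against the script's `--help` before use.

```shell
# Hypothetical combined invocation (sketch; not shown in full in this diff).
# --enable_cuda_graphs is taken from the removed hunk above;
# --enable_flash_decode is an ASSUMED flag name for the flash-attention
# decode option and must be checked against the script's --help output.
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --enable_cuda_graphs \
    --enable_flash_decode \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --cuda_visible_devices "0,1"
```

This is a deployment command requiring a GPU host with the Export-Deploy container; it is not runnable as-is outside that environment.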