#### Install TensorRT-LLM, vLLM, or TRT-ONNX backend
Starting with version 25.07, the NeMo FW container no longer includes TensorRT-LLM and vLLM pre-installed; run the corresponding install command inside the container.
For containerized development, use our Dockerfile to build your own container. There are three flavors: `INFERENCE_FRAMEWORK=inframework`, `INFERENCE_FRAMEWORK=trtllm`, and `INFERENCE_FRAMEWORK=vllm`:
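As a sketch, the three flavors can each be built by passing the framework name as a build argument. The image tags, the Dockerfile location (repo root), and the exact build-arg usage below are assumptions; adjust them to your checkout layout:

```shell
# Hypothetical sketch: print one `docker build` invocation per flavor.
# Remove the `echo` to actually run the builds.
for flavor in inframework trtllm vllm; do
  echo docker build --build-arg INFERENCE_FRAMEWORK="${flavor}" -t nemo-export-deploy:"${flavor}" .
done
```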
2. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
3. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
4. In a separate terminal, access the running container as follows:
   ```shell
   docker exec -it nemo-fw bash
   ```
5. To send a query to the Triton server, run the following script:
   ```shell
   python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
   ```
**Note:** The documentation for Automodel LLM deployment using vLLM is nearly identical to that for NeMo 2.0. Please check the [NeMo 2.0 documentation here](../../nemo_2/optimized/vllm.md).
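The "increase gradually by 50%" advice in step 3 can be sketched as a simple retry loop. The 1024 MB starting size and the three-attempt cap are illustrative assumptions, not values from the documentation:

```shell
# Grow --shm-size by 50% per retry until the deployment test passes.
# The starting size is an assumption; tune it for your model.
# Remove the `echo` (and fill in the rest of the command) to actually run docker.
shm_mb=1024
for attempt in 1 2 3; do
  echo "attempt ${attempt}: docker run --shm-size=${shm_mb}m ..."
  shm_mb=$(( shm_mb * 3 / 2 ))
done
```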
3. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to TensorRT-LLM and subsequently serves it on the Triton server:
3. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
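Export-and-deploy steps like the one above are typically driven by a script under `/opt/Export-Deploy/scripts/deploy/nlp/` inside the container (the query script shown elsewhere in these docs lives in the same directory). The script name, checkpoint path, and flags below are assumptions for illustration only; verify them inside the container:

```shell
# Hypothetical invocation; confirm the actual script name and flags with
# `ls /opt/Export-Deploy/scripts/deploy/nlp/` and the script's `--help`.
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --nemo_checkpoint /opt/checkpoints/llama.nemo \
    --model_name llama
```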
`tutorials/onnx_tensorrt/embedding/llama_embedding.ipynb`
#### Launch the NeMo Framework container as follows:

1. Run the following command in the NeMo Framework container in a terminal before starting the jupyter notebook if you are using container version 25.07 or above.
`tutorials/onnx_tensorrt/reranker/llama_reranker.ipynb`
#### Launch the NeMo Framework container as follows:

1. Run the following command in the NeMo Framework container in a terminal before starting the jupyter notebook if you are using container version 25.07 or above.