Commit 7872cef

docs: Remove uv sync with uv_args (#586)
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
1 parent: 2a22a76 · commit: 7872cef
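For context, every install step this commit removes follows one pattern, varying only the `uv` extra (`trtllm`, `vllm`, or `trt-onnx`). The sketch below only prints the removed commands for reference; actually running them requires the NeMo FW container, where `/opt/Export-Deploy` and `/opt/uv_args.txt` exist.

```shell
# Print (not run) the three removed install commands; the paths and flags are
# taken verbatim from the deleted doc lines below.
for extra in trtllm vllm trt-onnx; do
  echo "cd /opt/Export-Deploy && uv sync --inexact --link-mode symlink --locked --extra ${extra} \$(cat /opt/uv_args.txt)"
done
```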

File tree

7 files changed: +23 additions, −94 deletions


README.md

Lines changed: 0 additions & 26 deletions
```
@@ -97,32 +97,6 @@ docker run --rm -it -w /workdir -v $(pwd):/workdir \
 nvcr.io/nvidia/nemo:${TAG}
 ```
 
-<a id="install-tensorrt-llm-vllm-or-trt-onnx-backend"></a>
-#### Install TensorRT-LLM, vLLM, or TRT-ONNX backend
-
-Starting with version 25.07, the NeMo FW container no longer includes TensorRT-LLM and vLLM pre-installed. Please run the following command inside the container:
-
-For TensorRT-LLM:
-
-```bash
-cd /opt/Export-Deploy
-uv sync --inexact --link-mode symlink --locked --extra trtllm $(cat /opt/uv_args.txt)
-```
-
-For vLLM:
-
-```bash
-cd /opt/Export-Deploy
-uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
-```
-
-For TRT-ONNX:
-
-```bash
-cd /opt/Export-Deploy
-uv sync --inexact --link-mode symlink --locked --extra trt-onnx $(cat /opt/uv_args.txt)
-```
-
 ### Build with Dockerfile
 
 For containerized development, use our Dockerfile for building your own container. There are three flavors: `INFERENCE_FRAMEWORK=inframework`, `INFERENCE_FRAMEWORK=trtllm` and `INFERENCE_FRAMEWORK=vllm`:
```
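The README names three Dockerfile flavors but this diff does not show how `INFERENCE_FRAMEWORK` is passed in. A hypothetical sketch, assuming it is a Docker build argument (the variable name is from the README; the `--build-arg` interface and the `export-deploy` image tag are assumptions), printed as a dry run:

```shell
# Hypothetical: build one flavor of the container. FLAVOR must be one of
# inframework, trtllm, vllm (the three flavors the README lists).
FLAVOR="trtllm"
build_cmd="docker build --build-arg INFERENCE_FRAMEWORK=${FLAVOR} -t export-deploy:${FLAVOR} ."
echo "${build_cmd}"   # dry run: print the command instead of invoking docker
```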

docs/llm/automodel/optimized/automodel-trtllm.md

Lines changed: 4 additions & 12 deletions
```
@@ -27,23 +27,15 @@ This section shows how to use scripts and APIs to export a [NeMo AutoModel](http
 --tensor_parallelism_size 1
 ```
 
-3. Install TensorRT-LLM by executing the following command inside the container:
+3. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
 
-```shell
-cd /opt/Export-Deploy
-uv sync --inexact --link-mode symlink --locked --extra trtllm $(cat /opt/uv_args.txt)
-
-```
-
-4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
-
-5. In a separate terminal, access the running container as follows:
+4. In a separate terminal, access the running container as follows:
 
 ```shell
 docker exec -it nemo-fw bash
 ```
 
-6. To send a query to the Triton server, run the following script:
+5. To send a query to the Triton server, run the following script:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/query.py -mn llama -p "What is the color of a banana?" -mol 5
@@ -307,4 +299,4 @@ You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Trit
 nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="llama", http_port=8000)
 nm.deploy()
 nm.serve()
-```
+```
```

docs/llm/automodel/optimized/automodel-vllm.md

Lines changed: 5 additions & 12 deletions
```
@@ -15,34 +15,27 @@ This section shows how to use scripts and APIs to export a [NeMo AutoModel](http
 nvcr.io/nvidia/nemo:vr
 ```
 
-2. Install vLLM by executing the following command inside the container:
-
-```shell
-cd /opt/Export-Deploy
-uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
-```
-
-3. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
+2. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
 --model_path_id meta-llama/Llama-3.2-1B \
 --triton_model_name llama
 ```
 
-5. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
+3. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
 
-6. In a separate terminal, access the running container as follows:
+4. In a separate terminal, access the running container as follows:
 
 ```shell
 docker exec -it nemo-fw bash
 ```
 
-7. To send a query to the Triton server, run the following script:
+5. To send a query to the Triton server, run the following script:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
 ```
 
 
-**Note:** The documentation for Automodel LLM deployment using vLLM is almost the same with the one for NeMo 2.0. Please check the [NeMo 2.0 documentation here](../../nemo_2/optimized/vllm.md).
+**Note:** The documentation for Automodel LLM deployment using vLLM is almost the same with the one for NeMo 2.0. Please check the [NeMo 2.0 documentation here](../../nemo_2/optimized/vllm.md).
```

docs/llm/nemo_2/optimized/tensorrt-llm.md

Lines changed: 5 additions & 13 deletions
```
@@ -18,15 +18,7 @@ This section shows how to use scripts and APIs to export a NeMo 2.0 LLM to Tenso
 nvcr.io/nvidia/nemo:vr
 ```
 
-3. Install TensorRT-LLM by executing the following command inside the container:
-
-```shell
-cd /opt/Export-Deploy
-uv sync --inexact --link-mode symlink --locked --extra trtllm $(cat /opt/uv_args.txt)
-
-```
-
-4. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to TensorRT-LLM and subsequently serves it on the Triton server:
+3. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to TensorRT-LLM and subsequently serves it on the Triton server:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/deploy_triton.py \
@@ -36,15 +28,15 @@ This section shows how to use scripts and APIs to export a NeMo 2.0 LLM to Tenso
 --tensor_parallelism_size 1
 ```
 
-5. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
+4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
 
-6. In a separate terminal, access the running container as follows:
+5. In a separate terminal, access the running container as follows:
 
 ```shell
 docker exec -it nemo-fw bash
 ```
 
-7. To send a query to the Triton server, run the following script:
+6. To send a query to the Triton server, run the following script:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/query.py -mn llama -p "What is the color of a banana?" -mol 5
@@ -295,4 +287,4 @@ output = nq.query_llm(
 temperature=1.0,
 )
 print(output)
-```
+```
```
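After this change, the renumbered steps in the doc above still form the same deploy, exec, query loop, just without the install step. A dry-run sketch of that sequence (the `deploy_triton.py` argument list is abbreviated to what these hunks show; the doc's full invocation has additional arguments outside the hunk context):

```shell
# Dry run: print each step rather than executing it. The commands assume a
# running NeMo FW container named nemo-fw, as in the doc above.
run() { echo "+ $*"; }   # swap the echo for real execution inside the container

run python /opt/Export-Deploy/scripts/deploy/nlp/deploy_triton.py --tensor_parallelism_size 1
run docker exec -it nemo-fw bash
run python /opt/Export-Deploy/scripts/deploy/nlp/query.py -mn llama -p "What is the color of a banana?" -mol 5
```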

docs/llm/nemo_2/optimized/vllm.md

Lines changed: 5 additions & 13 deletions
```
@@ -18,15 +18,7 @@ This section shows how to use scripts and APIs to export a NeMo LLM to vLLM and
 nvcr.io/nvidia/nemo:vr
 ```
 
-3. Install vLLM by executing the following command inside the container:
-
-```shell
-cd /opt/Export-Deploy
-uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
-
-```
-
-4. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
+3. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
@@ -35,15 +27,15 @@ This section shows how to use scripts and APIs to export a NeMo LLM to vLLM and
 --tensor_parallelism_size 1
 ```
 
-5. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
+4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
 
-6. In a separate terminal, access the running container as follows:
+5. In a separate terminal, access the running container as follows:
 
 ```shell
 docker exec -it nemo-fw bash
 ```
 
-7. To send a query to the Triton server, run the following script:
+6. To send a query to the Triton server, run the following script:
 
 ```shell
 python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
@@ -185,4 +177,4 @@ output = nq.query_llm(
 temperature=1.0,
 )
 print("output: ", output)
-```
+```
```

tutorials/onnx_tensorrt/embedding/llama_embedding.ipynb

Lines changed: 2 additions & 9 deletions
```
@@ -21,20 +21,13 @@
 "source": [
 "#### Launch the NeMo Framework container as follows:\n",
 "\n",
-"1. Run the following command in the NeMo Framework container in a terminal before starting the jupyter notebook if you are using the container version 25.07 and above.\n",
-"\n",
-"```\n",
-"cd /opt/Export-Deploy\n",
-"uv sync --inexact --link-mode symlink --locked --extra trt-onnx $(cat /opt/uv_args.txt)\n",
-"```\n",
-"\n",
-"2. Depending on the number of gpus, `--gpus` might need to adjust accordingly:\n",
+"1. Depending on the number of gpus, `--gpus` might need to adjust accordingly:\n",
 "\n",
 "```\n",
 "docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus device=0 --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.07\n",
 "```\n",
 "\n",
-"3. Launch Jupyter Notebook as follows:\n",
+"2. Launch Jupyter Notebook as follows:\n",
 "```\n",
 "jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''\n",
 "\n",
```

tutorials/onnx_tensorrt/reranker/llama_reranker.ipynb

Lines changed: 2 additions & 9 deletions
```
@@ -21,20 +21,13 @@
 "source": [
 "#### Launch the NeMo Framework container as follows: \n",
 "\n",
-"1. Run the following command in the NeMo Framework container in a terminal before starting the jupyter notebook if you are using the container version 25.07 and above.\n",
-"\n",
-"```\n",
-"cd /opt/Export-Deploy\n",
-"uv sync --inexact --link-mode symlink --locked --extra trt-onnx $(cat /opt/uv_args.txt)\n",
-"```\n",
-"\n",
-"2. Depending on the number of gpus, `--gpus` might need to adjust accordingly:\n",
+"1. Depending on the number of gpus, `--gpus` might need to adjust accordingly:\n",
 "\n",
 "```\n",
 "docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus device=0 --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.07\n",
 "```\n",
 "\n",
-"3. Launch Jupyter Notebook as follows:\n",
+"2. Launch Jupyter Notebook as follows:\n",
 "```\n",
 "jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''\n",
 "\n",
```
