#### Install TensorRT-LLM, vLLM, or TRT-ONNX backend
Starting with version 25.07, the NeMo FW container no longer includes TensorRT-LLM and vLLM pre-installed; run the corresponding install command inside the container.
For containerized development, use our Dockerfile to build your own container. There are three flavors: `INFERENCE_FRAMEWORK=inframework`, `INFERENCE_FRAMEWORK=trtllm`, and `INFERENCE_FRAMEWORK=vllm`:
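As a sketch, the three flavors can each be built by passing the framework name as a build argument. The image tags, the Dockerfile location (repo root), and the exact build-arg usage below are assumptions; adjust them to your checkout layout:

```shell
# Hypothetical sketch: print one `docker build` invocation per flavor.
# Remove the `echo` to actually run the builds.
for flavor in inframework trtllm vllm; do
  echo docker build --build-arg INFERENCE_FRAMEWORK="${flavor}" -t nemo-export-deploy:"${flavor}" .
done
```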
2. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
3. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example).
4. In a separate terminal, access the running container as follows:
   ```shell
   docker exec -it nemo-fw bash
   ```
5. To send a query to the Triton server, run the following script:
   ```shell
   python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
   ```
**Note:** The documentation for Automodel LLM deployment using vLLM is nearly identical to that for NeMo 2.0. Please check the [NeMo 2.0 documentation here](../../nemo_2/optimized/vllm.md).
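The "increase gradually by 50%" advice in step 3 can be sketched as a simple retry loop. The 1024 MB starting size and the three-attempt cap are illustrative assumptions, not values from the documentation:

```shell
# Grow --shm-size by 50% per retry until the deployment test passes.
# The starting size is an assumption; tune it for your model.
# Remove the `echo` (and fill in the rest of the command) to actually run docker.
shm_mb=1024
for attempt in 1 2 3; do
  echo "attempt ${attempt}: docker run --shm-size=${shm_mb}m ..."
  shm_mb=$(( shm_mb * 3 / 2 ))
done
```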
3. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to TensorRT-LLM and subsequently serves it on the Triton server:
3. Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
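Export-and-deploy steps like the one above are typically driven by a script under `/opt/Export-Deploy/scripts/deploy/nlp/` inside the container (the query script shown elsewhere in these docs lives in the same directory). The script name, checkpoint path, and flags below are assumptions for illustration only; verify them inside the container:

```shell
# Hypothetical invocation; confirm the actual script name and flags with
# `ls /opt/Export-Deploy/scripts/deploy/nlp/` and the script's `--help`.
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --nemo_checkpoint /opt/checkpoints/llama.nemo \
    --model_name llama
```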
`tutorials/onnx_tensorrt/embedding/llama_embedding.ipynb`
#### Launch the NeMo Framework container as follows:

1. Run the following command in the NeMo Framework container in a terminal before starting the jupyter notebook if you are using container version 25.07 or above.
`tutorials/onnx_tensorrt/reranker/llama_reranker.ipynb`
#### Launch the NeMo Framework container as follows:

1. Run the following command in the NeMo Framework container in a terminal before starting the jupyter notebook if you are using container version 25.07 or above.