1 file changed: 4 additions, 3 deletions

@@ -144,8 +144,9 @@ Once you have the model repository set up, it is time to launch the Triton server
We will use the [pre-built Triton container with vLLM backend](#option-1-use-the-pre-built-docker-container) from
[NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) in this example.

+Run the following command inside the `vllm_backend` directory:
```
-docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-repository ./model_repository
+docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-repository ./samples/model_repository
```

Replace \<xx.yy\> with the version of Triton that you want to use.
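For example, assuming you wanted the 24.01 release (the tag here is only illustrative; pick whichever release fits your setup), the fully substituted command would look like this:

```
# 24.01 is an example tag only; substitute the Triton release you actually want
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:24.01-vllm-python-py3 tritonserver --model-repository ./samples/model_repository
```
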
@@ -171,10 +172,10 @@ with the
you can quickly run your first inference request with the
[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).

-Try out the command below.
+Try out the command below from another terminal:

```
-$ curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
+curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
```

Upon success, you should see a response from the server like this one:
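The generated text will vary with the model and sampling settings; as a rough sketch of the shape of the reply (field names follow the generate extension, and the completion below is purely illustrative):

```
{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is ... (illustrative, model-generated continuation)"}
```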