@@ -114,10 +114,12 @@ cd server
114114 --upstream-container-version=${TRITON_CONTAINER_VERSION}
115115 --backend=python:r${TRITON_CONTAINER_VERSION}
116116 --backend=vllm:r${TRITON_CONTAINER_VERSION}
117+ --backend=ensemble
117118 --vllm-version=${VLLM_VERSION}
118119# Build Triton Server
119120cd build
120121bash -x ./docker_build
122+
121123```
122124
123125### Option 3. Add the vLLM Backend to the Default Triton Container
@@ -129,7 +131,8 @@ container with the following commands:
129131
130132```
131133mkdir -p /opt/tritonserver/backends/vllm
132- wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/src/model.py
134+ git clone https://github.com/triton-inference-server/vllm_backend.git /tmp/vllm_backend
135+ cp -r /tmp/vllm_backend/src/* /opt/tritonserver/backends/vllm
133136```
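After the backend files are in place, Triton can be launched from inside the container as usual. A minimal sketch, assuming a model repository at `/opt/tritonserver/model_repository` that already contains a vLLM model (the path is illustrative):

```bash
# Triton discovers the vLLM backend from /opt/tritonserver/backends/vllm;
# only the model repository path needs to be supplied.
tritonserver --model-repository=/opt/tritonserver/model_repository
```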
134137
135138## Using the vLLM Backend
@@ -212,14 +215,121 @@ starting from 23.10 release.
212215
213216You can use `pip install ...` within the container to upgrade the vLLM version.
214217
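For example, a minimal sketch run from a shell inside the container (the exact version pin is your choice and only a placeholder here):

```bash
# Upgrade to the latest released vLLM, or pin a specific version instead,
# e.g. pip install "vllm==<version>" (placeholder).
pip install --upgrade vllm
```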
215-
216218## Running Multiple Instances of Triton Server
217219
218220If you are running multiple instances of Triton server with a Python-based backend,
219221you need to specify a different `shm-region-prefix-name` for each server. See
220222[here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
221223for more information.
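As an illustration, a sketch of two servers sharing one machine, assuming distinct ports, model repositories, and shared-memory prefixes (all names and paths below are placeholders):

```bash
# First instance.
tritonserver --model-repository=/models_a \
    --backend-config=python,shm-region-prefix-name=prefix0 \
    --http-port=8000 --grpc-port=8001 --metrics-port=8002

# Second instance: different shared-memory prefix and different ports.
tritonserver --model-repository=/models_b \
    --backend-config=python,shm-region-prefix-name=prefix1 \
    --http-port=9000 --grpc-port=9001 --metrics-port=9002
```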
222224
225+ ## Triton Metrics
226+ Starting with the 24.08 release of Triton, users can now obtain specific
227+ vLLM metrics by querying the Triton metrics endpoint (see complete vLLM metrics
228+ [here](https://docs.vllm.ai/en/latest/serving/metrics.html)). This can be
229+ accomplished by launching a Triton server in any of the ways described above
230+ (ensuring the build code / container is 24.08 or later) and sending inference
231+ requests to the server. Once a request completes successfully, you can query the
232+ metrics endpoint:
233+ ```bash
234+ curl localhost:8002/metrics
235+ ```
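If the endpoint is up, this returns the full Prometheus-format report. To narrow the output to the vLLM-specific fields described next, you can filter the response, for example:

```bash
# Keep only the lines (including HELP/TYPE comments) that mention vLLM metrics.
curl -s localhost:8002/metrics | grep "vllm:"
```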
236+ vLLM stats are reported by the metrics endpoint in fields that are prefixed with
237+ `vllm:`. Triton currently supports reporting of the following metrics from vLLM.
238+ ```bash
239+ # Number of prefill tokens processed.
240+ counter_prompt_tokens
241+ # Number of generation tokens processed.
242+ counter_generation_tokens
243+ # Histogram of time to first token in seconds.
244+ histogram_time_to_first_token
245+ # Histogram of time per output token in seconds.
246+ histogram_time_per_output_token
247+ # Histogram of end to end request latency in seconds.
248+ histogram_e2e_time_request
249+ # Histogram of number of prefill tokens processed per request.
250+ histogram_num_prompt_tokens_request
251+ # Histogram of number of generation tokens processed per request.
252+ histogram_num_generation_tokens_request
253+ # Histogram of the best_of request parameter.
254+ histogram_best_of_request
255+ # Histogram of the n request parameter.
256+ histogram_n_request
257+ ```
258+ Your output for these fields should look similar to the following:
259+ ```bash
260+ # HELP vllm:prompt_tokens_total Number of prefill tokens processed.
261+ # TYPE vllm:prompt_tokens_total counter
262+ vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
263+ # HELP vllm:generation_tokens_total Number of generation tokens processed.
264+ # TYPE vllm:generation_tokens_total counter
265+ vllm:generation_tokens_total{model="vllm_model",version="1"} 16
266+ # HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
267+ # TYPE vllm:time_to_first_token_seconds histogram
268+ vllm:time_to_first_token_seconds_count{model="vllm_model",version="1"} 1
269+ vllm:time_to_first_token_seconds_sum{model="vllm_model",version="1"} 0.03233122825622559
270+ vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.001"} 0
271+ ...
272+ vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
273+ # HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
274+ # TYPE vllm:time_per_output_token_seconds histogram
275+ vllm:time_per_output_token_seconds_count{model="vllm_model",version="1"} 15
276+ vllm:time_per_output_token_seconds_sum{model="vllm_model",version="1"} 0.04501533508300781
277+ vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.01"} 14
278+ ...
279+ vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 15
280+ # HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
281+ # TYPE vllm:e2e_request_latency_seconds histogram
282+ vllm:e2e_request_latency_seconds_count{model="vllm_model",version="1"} 1
283+ vllm:e2e_request_latency_seconds_sum{model="vllm_model",version="1"} 0.08686184883117676
284+ vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="1"} 1
285+ ...
286+ vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
287+ # HELP vllm:request_prompt_tokens Number of prefill tokens processed.
288+ # TYPE vllm:request_prompt_tokens histogram
289+ vllm:request_prompt_tokens_count{model="vllm_model",version="1"} 1
290+ vllm:request_prompt_tokens_sum{model="vllm_model",version="1"} 10
291+ vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="1"} 0
292+ ...
293+ vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
294+ # HELP vllm:request_generation_tokens Number of generation tokens processed.
295+ # TYPE vllm:request_generation_tokens histogram
296+ vllm:request_generation_tokens_count{model="vllm_model",version="1"} 1
297+ vllm:request_generation_tokens_sum{model="vllm_model",version="1"} 16
298+ vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="1"} 0
299+ ...
300+ vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
301+ # HELP vllm:request_params_best_of Histogram of the best_of request parameter.
302+ # TYPE vllm:request_params_best_of histogram
303+ vllm:request_params_best_of_count{model="vllm_model",version="1"} 1
304+ vllm:request_params_best_of_sum{model="vllm_model",version="1"} 1
305+ vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="1"} 1
306+ ...
307+ vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="+Inf"} 1
308+ # HELP vllm:request_params_n Histogram of the n request parameter.
309+ # TYPE vllm:request_params_n histogram
310+ vllm:request_params_n_count{model="vllm_model",version="1"} 1
311+ vllm:request_params_n_sum{model="vllm_model",version="1"} 1
312+ vllm:request_params_n_bucket{model="vllm_model",version="1",le="1"} 1
313+ ...
314+ vllm:request_params_n_bucket{model="vllm_model",version="1",le="+Inf"} 1
315+ ```
316+ To enable the vLLM engine to collect metrics, the "disable_log_stats" option must
317+ be set to false or left unset (false is the default) in [model.json](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
318+ ```json
319+ "disable_log_stats": false
320+ ```
321+ *Note:* vLLM metrics are not reported to the Triton metrics server by default
322+ due to potential performance slowdowns. To enable metrics reporting for a vLLM
323+ model, please add the following lines to its config.pbtxt as well.
324+ ```
325+ parameters: {
326+   key: "REPORT_CUSTOM_METRICS"
327+   value: {
328+     string_value: "yes"
329+   }
330+ }
331+ ```
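With both switches enabled, a quick end-to-end check is to send one request and then re-query the metrics endpoint. A sketch, assuming the sample `vllm_model` and the default HTTP (8000) and metrics (8002) ports:

```bash
# Send a single generate request so vLLM has something to report...
curl -X POST localhost:8000/v2/models/vllm_model/generate \
    -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'

# ...then confirm the vllm: counters and histograms are populated.
curl -s localhost:8002/metrics | grep "vllm:"
```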
332+
223333## Referencing the Tutorial
224334
225335You can read further in the