Merged

27 commits
0686a7c
Add first supported metrics
yinggeh Jul 29, 2024
21e2356
Update comments
yinggeh Jul 30, 2024
d95bb2c
Minor update
yinggeh Aug 1, 2024
321faa0
Add metrics test
yinggeh Aug 3, 2024
468539f
Fix copyright
yinggeh Aug 5, 2024
8eba2f0
Remove unused metrics and update comments
yinggeh Aug 6, 2024
6f97f6f
Minor update
yinggeh Aug 6, 2024
bf7669e
Minor updates
yinggeh Aug 6, 2024
e9d0dbb
Minor fix
yinggeh Aug 7, 2024
7d0dc5b
Remove unused module
yinggeh Aug 7, 2024
979dc02
Fix "metrics not supported error" when building with TRITON_ENABLE_ME…
yinggeh Aug 8, 2024
3dd04c5
Fix "metrics not supported error" when building with TRITON_ENABLE_ME…
yinggeh Aug 8, 2024
07f2575
Simply test
yinggeh Aug 8, 2024
2135145
Completely turn off metrics
yinggeh Aug 9, 2024
56aea05
Add vLLM disable_log_stats config test
yinggeh Aug 9, 2024
0dadc8e
Test metrics are enabled by default if disable_log_stats is not set.
yinggeh Aug 9, 2024
8d8fd2a
Update tests based on comments
yinggeh Aug 9, 2024
4f2e217
Remove _log_gauge
yinggeh Aug 9, 2024
d22fd03
Resolve comments
yinggeh Aug 9, 2024
c8bdb6e
Merge branch 'main' of github.com:triton-inference-server/vllm_backen…
yinggeh Aug 9, 2024
8280d26
Update
yinggeh Aug 9, 2024
6fa7ae3
Change temp directory
yinggeh Aug 9, 2024
89ca6f4
Disable metrics report by default. Controlled by parameter "REPORT_ME…
yinggeh Aug 15, 2024
1158fee
Test server option set --allow-metrics=false
yinggeh Aug 15, 2024
a99d38b
Add docs
yinggeh Aug 15, 2024
de8f25b
Minor update
yinggeh Aug 15, 2024
b1333ce
Both args checking
yinggeh Aug 15, 2024
301 changes: 301 additions & 0 deletions src/metrics.py
@@ -0,0 +1,301 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

from typing import Dict, Union

import triton_python_backend_utils as pb_utils
from vllm.engine.metrics import StatLoggerBase as VllmStatLoggerBase
from vllm.engine.metrics import Stats as VllmStats
from vllm.engine.metrics import SupportsMetricsInfo


# begin-metrics-definitions
class TritonMetrics:
def __init__(self, labels):
# System stats
# Scheduler State
self.gauge_scheduler_running_family = pb_utils.MetricFamily(
name="vllm:num_requests_running",
description="Number of requests currently running on GPU.",
kind=pb_utils.MetricFamily.GAUGE,
)
self.gauge_scheduler_waiting_family = pb_utils.MetricFamily(
name="vllm:num_requests_waiting",
description="Number of requests waiting to be processed.",
kind=pb_utils.MetricFamily.GAUGE,
)
self.gauge_scheduler_swapped_family = pb_utils.MetricFamily(
name="vllm:num_requests_swapped",
description="Number of requests swapped to CPU.",
kind=pb_utils.MetricFamily.GAUGE,
)
# KV Cache Usage in %
self.gauge_gpu_cache_usage_family = pb_utils.MetricFamily(
name="vllm:gpu_cache_usage_perc",
description="GPU KV-cache usage. 1 means 100 percent usage.",
kind=pb_utils.MetricFamily.GAUGE,
)
self.gauge_cpu_cache_usage_family = pb_utils.MetricFamily(
name="vllm:cpu_cache_usage_perc",
description="CPU KV-cache usage. 1 means 100 percent usage.",
kind=pb_utils.MetricFamily.GAUGE,
)

# Iteration stats
self.counter_num_preemption_family = pb_utils.MetricFamily(
name="vllm:num_preemptions_total",
description="Cumulative number of preemption from the engine.",
kind=pb_utils.MetricFamily.COUNTER,
)
self.counter_prompt_tokens_family = pb_utils.MetricFamily(
name="vllm:prompt_tokens_total",
description="Number of prefill tokens processed.",
kind=pb_utils.MetricFamily.COUNTER,
)
self.counter_generation_tokens_family = pb_utils.MetricFamily(
name="vllm:generation_tokens_total",
description="Number of generation tokens processed.",
kind=pb_utils.MetricFamily.COUNTER,
)
# self.histogram_time_to_first_token_family = pb_utils.MetricFamily(
# name="vllm:time_to_first_token_seconds",
# description="Histogram of time to first token in seconds.",
# kind=pb_utils.MetricFamily.HISTOGRAM,
# buckets=[
# 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5,
# 0.75, 1.0, 2.5, 5.0, 7.5, 10.0
# ])
# self.histogram_time_per_output_token_family = pb_utils.MetricFamily(
# name="vllm:time_per_output_token_seconds",
# description="Histogram of time per output token in seconds.",
# kind=pb_utils.MetricFamily.HISTOGRAM,
# buckets=[
# 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75,
# 1.0, 2.5
# ])

# Request stats
# Latency
# self.histogram_e2e_time_request_family = pb_utils.MetricFamily(
# name="vllm:e2e_request_latency_seconds",
# description="Histogram of end to end request latency in seconds.",
# kind=pb_utils.MetricFamily.HISTOGRAM,
# buckets=[1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0])
# # Metadata
# self.histogram_num_prompt_tokens_request_family = pb_utils.MetricFamily(
# name="vllm:request_prompt_tokens",
# description="Number of prefill tokens processed.",
# kind=pb_utils.MetricFamily.HISTOGRAM,
# buckets=build_1_2_5_buckets(max_model_len),
# )
# self.histogram_num_generation_tokens_request_family = \
# pb_utils.MetricFamily(
# name="vllm:request_generation_tokens",
# description="Number of generation tokens processed.",
# kind=pb_utils.MetricFamily.HISTOGRAM,
# buckets=build_1_2_5_buckets(max_model_len),
# )
# self.histogram_best_of_request_family = pb_utils.MetricFamily(
# name="vllm:request_params_best_of",
# description="Histogram of the best_of request parameter.",
# kind=pb_utils.MetricFamily.HISTOGRAM,
# buckets=[1, 2, 5, 10, 20],
# )
# self.histogram_n_request_family = pb_utils.MetricFamily(
# name="vllm:request_params_n",
# description="Histogram of the n request parameter.",
# kind=pb_utils.MetricFamily.HISTOGRAM,
# buckets=[1, 2, 5, 10, 20],
# )
# self.counter_request_success_family = pb_utils.MetricFamily(
# name="vllm:request_success_total",
# description="Count of successfully processed requests.",
# kind=pb_utils.MetricFamily.COUNTER)

# Speculative decoding stats
# self.gauge_spec_decode_draft_acceptance_rate_family = pb_utils.MetricFamily(
# name="vllm:spec_decode_draft_acceptance_rate",
# description="Speculative token acceptance rate.",
# kind=pb_utils.MetricFamily.GAUGE)
# self.gauge_spec_decode_efficiency_family = pb_utils.MetricFamily(
# name="vllm:spec_decode_efficiency",
# description="Speculative decoding system efficiency.",
# kind=pb_utils.MetricFamily.GAUGE)
# self.counter_spec_decode_num_accepted_tokens_family = pb_utils.MetricFamily(
# name="vllm:spec_decode_num_accepted_tokens_total",
# description="Number of accepted tokens.",
# kind=pb_utils.MetricFamily.COUNTER)
# self.counter_spec_decode_num_draft_tokens_family = pb_utils.MetricFamily(
# name="vllm:spec_decode_num_draft_tokens_total",
# description="Number of draft tokens.",
# kind=pb_utils.MetricFamily.COUNTER)
# self.counter_spec_decode_num_emitted_tokens_family = pb_utils.MetricFamily(
# name="vllm:spec_decode_num_emitted_tokens_total",
# description="Number of emitted tokens.",
# kind=pb_utils.MetricFamily.COUNTER)

# System stats
# Scheduler State
self.gauge_scheduler_running = self.gauge_scheduler_running_family.Metric(
labels=labels
)
self.gauge_scheduler_waiting = self.gauge_scheduler_waiting_family.Metric(
labels=labels
)
self.gauge_scheduler_swapped = self.gauge_scheduler_swapped_family.Metric(
labels=labels
)
# KV Cache Usage in %
self.gauge_gpu_cache_usage = self.gauge_gpu_cache_usage_family.Metric(
labels=labels
)
self.gauge_cpu_cache_usage = self.gauge_cpu_cache_usage_family.Metric(
labels=labels
)

# Iteration stats
self.counter_num_preemption = self.counter_num_preemption_family.Metric(
labels=labels
)
self.counter_prompt_tokens = self.counter_prompt_tokens_family.Metric(
labels=labels
)
self.counter_generation_tokens = self.counter_generation_tokens_family.Metric(
labels=labels
)
# self.histogram_time_to_first_token = self.histogram_time_to_first_token_family.Metric(
# labels=labels
# )
# self.histogram_time_per_output_token = self.histogram_time_per_output_token_family.Metric(
# labels=labels
# )

# Request stats
# Latency
# self.histogram_e2e_time_request = self.histogram_e2e_time_request_family.Metric(
# labels=labels
# )
# # Metadata
# self.histogram_num_prompt_tokens_request = self.histogram_num_prompt_tokens_request_family.Metric(
# labels=labels
# )
# self.histogram_num_generation_tokens_request = self.histogram_num_generation_tokens_request_family.Metric(
# labels=labels
# )
# self.histogram_best_of_request = self.histogram_best_of_request_family.Metric(
# labels=labels
# )
# self.histogram_n_request = self.histogram_n_request_family.Metric(
# labels=labels
# )
# self.counter_request_success = self.counter_request_success_family.Metric(
# labels=labels
# )

# Speculative decoding stats
# self.gauge_spec_decode_draft_acceptance_rate_ = self.gauge_spec_decode_draft_acceptance_rate_family.Metric(
# labels=labels
# )
# self.gauge_spec_decode_efficiency = self.gauge_spec_decode_efficiency_family.Metric(
# labels=labels
# )
# self.counter_spec_decode_num_accepted_tokens = self.counter_spec_decode_num_accepted_tokens_family.Metric(
# labels=labels
# )
# self.counter_spec_decode_num_draft_tokens = self.counter_spec_decode_num_draft_tokens_family.Metric(
# labels=labels
# )
# self.counter_spec_decode_num_emitted_tokens = self.counter_spec_decode_num_emitted_tokens_family.Metric(
# labels=labels
# )


class VllmStatLogger(VllmStatLoggerBase):
"""StatLogger is used as an adapter between vLLM stats collector and Triton metrics provider."""

# local_interval is not used here; it controls how often vLLM logs stats to stdout.
def __init__(self, labels: Dict, local_interval: float = 0) -> None:
# Tracked stats over current local logging interval.
super().__init__(local_interval)
self.metrics = TritonMetrics(labels=labels)

def info(self, type: str, obj: SupportsMetricsInfo) -> None:
# Info-type metrics are not exposed through Triton by this logger.
raise NotImplementedError

def _log_gauge(self, gauge, data: Union[int, float]) -> None:
# Convenience function for logging to gauge.
gauge.set(data)

def _log_counter(self, counter, data: Union[int, float]) -> None:
# Convenience function for logging to counter.
counter.increment(data)

# def _log_histogram(self, histogram, data: Union[List[int],
# List[float]]) -> None:
# # Convenience function for logging list to histogram.
# for datum in data:
# histogram.labels(**self.labels).observe(datum)

def log(self, stats: VllmStats) -> None:
# Called by the vLLM engine every iteration with the latest stats snapshot;
# the values are forwarded to the Triton metrics defined above.
# self.maybe_update_spec_decode_metrics(stats)

# System state data
self._log_gauge(self.metrics.gauge_scheduler_running, stats.num_running_sys)
self._log_gauge(self.metrics.gauge_scheduler_waiting, stats.num_waiting_sys)
self._log_gauge(self.metrics.gauge_scheduler_swapped, stats.num_swapped_sys)
self._log_gauge(self.metrics.gauge_gpu_cache_usage, stats.gpu_cache_usage_sys)
self._log_gauge(self.metrics.gauge_cpu_cache_usage, stats.cpu_cache_usage_sys)

# Iteration level data
self._log_counter(
self.metrics.counter_num_preemption, stats.num_preemption_iter
)
self._log_counter(
self.metrics.counter_prompt_tokens, stats.num_prompt_tokens_iter
)
self._log_counter(
self.metrics.counter_generation_tokens, stats.num_generation_tokens_iter
)
# self._log_histogram(self.metrics.histogram_time_to_first_token, stats.time_to_first_tokens_iter)
# self._log_histogram(self.metrics.histogram_time_per_output_token, stats.time_per_output_tokens_iter)

# Request level data
# Latency
# self._log_histogram(self.metrics.histogram_e2e_time_request, stats.time_e2e_requests)
# Metadata
# self._log_histogram(self.metrics.histogram_num_prompt_tokens_request, stats.num_prompt_tokens_requests)
# self._log_histogram(self.metrics.histogram_num_generation_tokens_request, stats.num_generation_tokens_requests)
# self._log_histogram(self.metrics.histogram_best_of_request, stats.best_of_requests)
# self._log_histogram(self.metrics.histogram_n_request, stats.n_requests)
# self._log_histogram(self.metrics.counter_request_success, stats.finished_reason_requests)

# Speculative decoding stats
# if self.spec_decode_metrics is not None:
# self._log_gauge(self.metrics.gauge_spec_decode_draft_acceptance_rate, self.spec_decode_metrics.draft_acceptance_rate)
# self._log_gauge(self.metrics.gauge_spec_decode_efficiency, self.spec_decode_metrics.system_efficiency)
# self._log_counter(self.metrics.counter_spec_decode_num_accepted_tokens, self.spec_decode_metrics.accepted_tokens)
# self._log_counter(self.metrics.counter_spec_decode_num_draft_tokens, self.spec_decode_metrics.draft_tokens)
# self._log_counter(self.metrics.counter_spec_decode_num_emitted_tokens, self.spec_decode_metrics.emitted_tokens)
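Once a model is loaded with this backend, the vllm:* families defined above should appear on Triton's Prometheus metrics endpoint alongside the built-in server metrics. A minimal sketch for spot-checking them, assuming a locally running server and Triton's default metrics port 8002 (adjust for your deployment):

```python
# Minimal sketch: scrape Triton's metrics endpoint and print the vLLM series
# defined in src/metrics.py. The host and port are assumptions (8002 is
# Triton's default metrics port); adjust them for your deployment.
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if "vllm:" in line:  # matches HELP/TYPE lines and the samples themselves
        print(line)
```

Expected families from this file include vllm:num_requests_running, vllm:num_requests_waiting, vllm:num_requests_swapped, vllm:gpu_cache_usage_perc, vllm:cpu_cache_usage_perc, vllm:num_preemptions_total, vllm:prompt_tokens_total, and vllm:generation_tokens_total, each carrying the model and version labels set in model.py.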
10 changes: 10 additions & 0 deletions src/model.py
@@ -39,6 +39,8 @@
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid

from metrics import VllmStatLogger

_VLLM_ENGINE_ARGS_FILENAME = "model.json"
_MULTI_LORA_ARGS_FILENAME = "multi_lora.json"

@@ -151,6 +153,14 @@ def init_engine(self):
AsyncEngineArgs(**self.vllm_engine_config)
)

# Create vLLM custom metrics
labels = {
"model": self.args["model_name"],
"version": self.args["model_version"],
}
logger = VllmStatLogger(labels=labels)
self.llm_engine.add_logger("triton", logger)
Review discussion:

Contributor:
What is the cadence at which the logger gets called? CC @kthui, as this will involve round trips with core, similar to your investigation of request cancellation frequency.

Contributor Author:
Can you elaborate on this?

Contributor:
How often will the metrics get updated? Every request, every token, every full response? In other words, how often will the vLLM engine call this attached Triton stats logger?

Reply:
Every iteration.

Contributor (@rmccorm4, Jul 30, 2024):
That will probably affect total throughput significantly if the core round-trip communication interrupts generation at every iteration, based on Jacky and Iman's recent findings. We probably want this feature either way - just calling out that we'll likely need similar optimizations for this feature to the ones @kthui is working on right now. Please work together to align on the best path forward for the metrics feature and parity with vLLM performance.

Contributor Author:
@kthui will run benchmarks.

Contributor:
The current path forward is to allow metrics to be turned off. There is still room to improve in the future, e.g. performing the core round-trip communication on a side branch.

Contributor:
At this point, the performance impact of having metrics (counter and gauge) with the --disable-log-stats flag set is negligible when comparing FastAPI completion with Triton generate_stream: the delta between the two is approximately the same with metrics added and --disable-log-stats set as it was before any metrics functionality was added.


def setup_lora(self):
self.enable_lora = False

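Related to the review discussion above about turning metrics off: vLLM's own stats production is governed by the engine arguments that model.py loads from model.json (_VLLM_ENGINE_ARGS_FILENAME). Below is a minimal sketch of writing such a file; the model name and repository path are hypothetical placeholders, and the additional switches mentioned in the commit history (the truncated "REPORT_ME…" model parameter and --allow-metrics=false) are not shown here.

```python
# Minimal sketch: write a model.json whose engine args leave vLLM stat logging
# enabled, so the attached VllmStatLogger receives stats every iteration.
# The model name and repository path below are hypothetical placeholders.
import json

engine_args = {
    "model": "facebook/opt-125m",
    "disable_log_stats": False,  # setting this to True stops vLLM from producing stats
}

with open("model_repository/vllm_model/1/model.json", "w") as f:
    json.dump(engine_args, f, indent=2)
```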