
Conversation

Pavloveuge

What does the PR do?

Report more counter, histogram, and gauge metrics from vLLM to the Triton metrics server.

Checklist:

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request
  • Related issues are referenced
  • Populated github labels field
  • Added test plan and verified test passes
  • Verified that the PR passes existing CI
  • Verified copyright is correct on all changed files
  • Added succinct git squash message before merging ref.
  • All template sections are filled out

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.

  • feat

Related PRs:
n/a

Where should the reviewer start?
n/a

Test plan:
ci/L0_backend_vllm/metrics_test

CI Pipeline ID:
n/a

Caveats:
n/a

Background
Customers requested additional histogram metrics from vLLM.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
n/a

@Pavloveuge
Author

@yinggeh
Hello, can you check this, please?

@yinggeh
Contributor

yinggeh commented Jun 5, 2025

> @yinggeh Hello, can you check this, please?

@Pavloveuge Thanks for contributing to Triton. I'll keep you posted once there's an update.

@kanlas-net

@yinggeh Any update? These metrics are important for production environments.

@yinggeh yinggeh requested review from yinggeh and oandreeva-nv July 17, 2025 22:56
@yinggeh yinggeh self-assigned this Jul 17, 2025
@yinggeh yinggeh added the enhancement New feature or request label Jul 17, 2025
README.md Outdated
Comment on lines 256 to 269
# Number of requests currently running on GPU.
gauge_scheduler_running
# Number of requests waiting to be processed.
gauge_scheduler_waiting
# Number of requests swapped to CPU.
gauge_scheduler_swapped
# GPU KV-cache usage. 1 means 100 percent usage.
gauge_gpu_cache_usage
# CPU KV-cache usage. 1 means 100 percent usage.
gauge_cpu_cache_usage
# CPU prefix cache block hit rate.
gauge_cpu_prefix_cache_hit_rate
# GPU prefix cache block hit rate.
gauge_gpu_prefix_cache_hit_rate
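For reference, gauges like the ones above are exposed through Triton's Prometheus-format metrics endpoint. Below is a minimal, self-contained sketch of filtering such gauges out of a scraped payload; the sample text and the `vllm:` metric prefix are illustrative assumptions, not taken from this PR.

```python
# Sketch: parse gauge values out of a Prometheus text-format payload,
# such as one scraped from Triton's /metrics endpoint. The sample
# payload and the "vllm:" prefix are assumptions for illustration.

def parse_gauges(payload: str, prefix: str = "vllm:") -> dict:
    """Return {metric_name: value} for metric lines starting with prefix."""
    gauges = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines and blanks
        if not line.startswith(prefix):
            continue
        name, _, value = line.rpartition(" ")
        # Drop any {label="..."} block from the metric name.
        name = name.split("{", 1)[0]
        gauges[name] = float(value)
    return gauges

sample = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model="m"} 2
vllm:gpu_cache_usage_perc{model="m"} 0.25
"""
```

`parse_gauges(sample)` maps each metric name to its numeric value, which is handy for a quick smoke test of which vLLM gauges a deployment actually reports.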
Contributor

Can you move gauges before counters?

Author

Done

Comment on lines 400 to 401
(self.metrics.gauge_gpu_cache_usage, stats.gpu_cache_usage_sys),
(self.metrics.gauge_cpu_cache_usage, stats.cpu_cache_usage_sys),
Contributor

deprecated

Comment on lines 397 to 400
(self.metrics.gauge_num_requests_running, stats.num_running_sys),
(self.metrics.gauge_num_requests_waiting, stats.num_waiting_sys),
(self.metrics.gauge_num_requests_swapped, stats.num_swapped_sys),
(self.metrics.gauge_gpu_cache_usage, stats.gpu_cache_usage_sys),
Contributor

Please update namings to be consistent with https://github.com/vllm-project/vllm/blob/main/vllm/engine/metrics.py

Author

Updated:

gauge_metrics = [
(self.metrics.gauge_num_requests_running, stats.num_running_sys),
(self.metrics.gauge_num_requests_waiting, stats.num_waiting_sys),
(self.metrics.gauge_num_requests_swapped, stats.num_swapped_sys),
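The excerpt above follows a simple reporting pattern: collect `(gauge, stat)` pairs and set each gauge to the latest scheduler value. A hypothetical stand-alone sketch of that loop; the `Gauge` and `Stats` classes here are stand-ins, not the real Triton or vLLM types.

```python
# Stand-in gauge type: stores the last value it was set to.
class Gauge:
    def __init__(self, name):
        self.name = name
        self.value = None

    def set(self, value):
        self.value = value

# Stand-in for vLLM's per-iteration scheduler stats.
class Stats:
    num_running_sys = 3
    num_waiting_sys = 1

gauge_running = Gauge("vllm:num_requests_running")
gauge_waiting = Gauge("vllm:num_requests_waiting")
stats = Stats()

# The same (gauge, value) pair list shape as in the PR excerpt.
gauge_metrics = [
    (gauge_running, stats.num_running_sys),
    (gauge_waiting, stats.num_waiting_sys),
]
for gauge, value in gauge_metrics:
    gauge.set(value)

print(gauge_running.value)  # 3
```

Keeping the pairs in one list means adding a new gauge is a one-line change, which is why this shape shows up repeatedly in the diff.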
Contributor

# Deprecated in 0.8 - KV cache offloading is not used in V1  
# Hidden in 0.9, due to be removed in 0.10

Author

Removed the deprecated metrics, thanks.

@yinggeh
Contributor

yinggeh commented Jul 24, 2025

@Pavloveuge Left comments

@Pavloveuge
Author

@yinggeh Done

@Pavloveuge Pavloveuge force-pushed the report_more_vllm_metric branch 2 times, most recently from 1cb26b7 to 9ad54c2 Compare July 24, 2025 18:31
@Pavloveuge Pavloveuge force-pushed the report_more_vllm_metric branch from 9ad54c2 to 9b96279 Compare July 24, 2025 18:32
@yinggeh
Contributor

yinggeh commented Jul 30, 2025

@Pavloveuge Thanks. There is a concern over the performance degradation with more metrics being reported. While I am getting the data, can you continue your work by applying the change to /opt/tritonserver/backends/vllm/utils/metrics.py in your container?

@Pavloveuge
Author

Yes, I've already applied the change to /opt/tritonserver/backends/vllm/utils/metrics.py in my container. I ran my scenarios and haven't noticed any performance degradation on my end.
