feat: Add histogram metric type #386

yinggeh · 2024-08-07T05:14:36Z

What does the PR do?

Support histogram metric type and add tests.

Checklist

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

feat

Related PRs:

triton-inference-server/vllm_backend#56
triton-inference-server/python_backend#374
triton-inference-server/server#7525

Where should the reviewer start?

n/a

Test plan:

n/a

CI Pipeline ID:
17487728

Caveats:

n/a

Background

Customer requested histogram metrics from vLLM.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

n/a

include/triton/core/tritonserver.h

GuanLuo · 2024-08-07T21:12:35Z

include/triton/core/tritonserver.h

+/// \param metric The metric object to update.
+/// \param value The amount to observe metric's value to.
+/// \return a TRITONSERVER_Error indicating success or failure.
+TRITONSERVER_DECLSPEC struct TRITONSERVER_Error* TRITONSERVER_MetricObserve(


Reuse TRITONSERVER_MetricSet?

@rmccorm4 Any thoughts? Basically, we would need to merge the two functions below into one Metric::Set(double value). It works but may add confusion.

core/src/metric_family.cc

Lines 338 to 404 in fd5c44b

TRITONSERVER_Error*
Metric::Set(double value)
{
  if (metric_ == nullptr) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INTERNAL,
        "Could not set metric value. Metric has been invalidated.");
  }

  switch (kind_) {
    case TRITONSERVER_METRIC_KIND_COUNTER: {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_UNSUPPORTED,
          "TRITONSERVER_METRIC_KIND_COUNTER does not support Set");
    }
    case TRITONSERVER_METRIC_KIND_GAUGE: {
      auto gauge_ptr = reinterpret_cast<prometheus::Gauge*>(metric_);
      gauge_ptr->Set(value);
      break;
    }
    case TRITONSERVER_METRIC_KIND_HISTOGRAM: {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_UNSUPPORTED,
          "TRITONSERVER_METRIC_KIND_HISTOGRAM does not support Set");
    }
    default:
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_UNSUPPORTED,
          "Unsupported TRITONSERVER_MetricKind");
  }

  return nullptr;  // Success
}

TRITONSERVER_Error*
Metric::Observe(double value)
{
  if (metric_ == nullptr) {
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_INTERNAL,
        "Could not set metric value. Metric has been invalidated.");
  }

  switch (kind_) {
    case TRITONSERVER_METRIC_KIND_COUNTER: {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_UNSUPPORTED,
          "TRITONSERVER_METRIC_KIND_COUNTER does not support Observe");
    }
    case TRITONSERVER_METRIC_KIND_GAUGE: {
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_UNSUPPORTED,
          "TRITONSERVER_METRIC_KIND_GAUGE does not support Observe");
    }
    case TRITONSERVER_METRIC_KIND_HISTOGRAM: {
      auto histogram_ptr = reinterpret_cast<prometheus::Histogram*>(metric_);
      histogram_ptr->Observe(value);
      break;
    }
    default:
      return TRITONSERVER_ErrorNew(
          TRITONSERVER_ERROR_UNSUPPORTED,
          "Unsupported TRITONSERVER_MetricKind");
  }

  return nullptr;  // Success
}

I would need to take a closer look, but my gut reaction is that Guan is probably right: we can probably just reuse MetricValue and MetricSet, which would call Collect and Observe internally when kind == KIND_HISTOGRAM, if they are functionally equivalent.

MetricValue may not work if Collect returns multiple values (one per bucket?), but again will need to take a closer look. Let me know if you already know more details on this from your research.

But similar to the new C API for MetricV2, keep in mind how this would work if we added support for Summary metric and wanted to get the values for each quantile, which is basically same as values for each bucket. Ideally the same API would work for both or all types.

MetricValue cannot be reused for histogram.

I would like to revisit this change. The consensus is to keep C API and python_backend API 1:1 matched. I am inclined to add a new C API TRITONSERVER_MetricObserve for histogram instead of reusing TRITONSERVER_MetricSet for three reasons.

1. Both Histogram and Summary types call Observe to record a new value. We can reuse Observe for the Summary type if we add it in the future.

2. Histogram also has an ObserveMultiple API, which may be added in the future. I don't like the idea of Histogram.Set and Histogram.ObserveMultiple coexisting.

3. Setting a histogram to a value, i.e. Histogram.set(val), is semantically wrong and confusing to users familiar with Prometheus APIs. The description of TRITONSERVER_MetricSet would also become verbose, since it would have to describe different behaviors for counter/gauge and histogram/summary.

cc @Tabrizian @rmccorm4 @GuanLuo

FYI
https://github.com/jupp0r/prometheus-cpp/blob/master/core/include/prometheus/histogram.h
https://github.com/jupp0r/prometheus-cpp/blob/master/core/include/prometheus/summary.h
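For readers following the Set-versus-Observe argument, here is a minimal, self-contained sketch of the "separate verbs" design. Everything in it (Kind, SketchMetric, the string-based error return) is a hypothetical stand-in for illustration and is not the Triton or prometheus-cpp API; it only shows how each verb can reject the kinds it does not apply to.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Triton types.
enum class Kind { kCounter, kGauge, kHistogram };

class SketchMetric {
 public:
  explicit SketchMetric(Kind kind) : kind_(kind) {}

  // Set overwrites the current value; in this sketch only a gauge accepts it.
  // Returns an empty string on success, otherwise an error message.
  std::string Set(double value)
  {
    if (kind_ != Kind::kGauge) {
      return "Set is not supported for this metric kind";
    }
    gauge_value_ = value;
    return "";
  }

  // Observe records one sample; in this sketch only a histogram accepts it.
  std::string Observe(double value)
  {
    if (kind_ != Kind::kHistogram) {
      return "Observe is not supported for this metric kind";
    }
    samples_.push_back(value);
    return "";
  }

  double GaugeValue() const { return gauge_value_; }
  std::size_t SampleCount() const { return samples_.size(); }

 private:
  Kind kind_;
  double gauge_value_ = 0.0;
  std::vector<double> samples_;
};

// Usage: each verb succeeds only on the kind it is meant for.
inline void
SketchMetricExample()
{
  SketchMetric gauge(Kind::kGauge);
  assert(gauge.Set(3.0).empty());
  assert(!gauge.Observe(3.0).empty());

  SketchMetric histogram(Kind::kHistogram);
  assert(histogram.Observe(0.5).empty());
  assert(!histogram.Set(0.5).empty());
}
```

With two verbs, the documentation of each function only has to describe one behavior, which is the trade-off argued for in the third reason above.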

Well.. Triton metrics API is not supposed to be mirroring "Prometheus API", Prometheus is one of the "forms" we can exhibit the metrics as. So we should design the API to be using generic terms for statistics, the meaning of gauge/counter/histogram (and summary?) is not affected by the fact that Prometheus or other metrics libraries are used.

Thinking from this mindset, my question is if observe is the generic verb for recording the statistics of histogram. If that is the case, then I am fine to add XXXObserve, otherwise, we should use the proper verb

I think either sample or observe. Voting for Observe for simplicity.

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.

Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
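The histogram description quoted above can be made concrete with a small sketch. This assumes Prometheus-style buckets, where each bucket has an inclusive upper bound and a final implicit +Inf bucket catches everything else. SketchHistogram and its methods are hypothetical names, not part of Triton or prometheus-cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical illustration of "observe": count a sample in one bucket and
// add it to a running sum. Not the Triton or prometheus-cpp implementation.
class SketchHistogram {
 public:
  // upper_bounds must be sorted ascending; a +Inf bucket is added implicitly.
  explicit SketchHistogram(std::vector<double> upper_bounds)
      : upper_bounds_(std::move(upper_bounds)),
        counts_(upper_bounds_.size() + 1, 0)
  {
  }

  void Observe(double value)
  {
    // First bucket whose upper bound is >= value; past-the-end means +Inf.
    const auto it =
        std::lower_bound(upper_bounds_.begin(), upper_bounds_.end(), value);
    counts_[static_cast<std::size_t>(it - upper_bounds_.begin())] += 1;
    sum_ += value;
  }

  // Cumulative counts: entry i is the number of samples <= upper bound i,
  // and the last entry is the total number of samples.
  std::vector<std::uint64_t> CumulativeCounts() const
  {
    std::vector<std::uint64_t> out(counts_.size());
    std::uint64_t running = 0;
    for (std::size_t i = 0; i < counts_.size(); ++i) {
      running += counts_[i];
      out[i] = running;
    }
    return out;
  }

  double Sum() const { return sum_; }

 private:
  std::vector<double> upper_bounds_;
  std::vector<std::uint64_t> counts_;
  double sum_ = 0.0;
};

// Usage with made-up buckets and samples.
inline void
SketchHistogramExample()
{
  SketchHistogram h({0.5, 1.0, 2.5});
  h.Observe(0.25);
  h.Observe(1.0);
  h.Observe(3.0);
  assert(h.CumulativeCounts().back() == 3);
}
```

In this sketch a sample equal to a bucket's upper bound lands in that bucket, so observing 0.25, 1.0 and 3.0 against bounds 0.5, 1.0 and 2.5 gives cumulative counts 1, 2, 2, 3.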

include/triton/core/tritonserver.h

Tabrizian · 2024-08-13T14:58:07Z

include/triton/core/tritonserver.h

-/// Supports metrics of kind TRITONSERVER_METRIC_KIND_GAUGE and returns
+/// Set the current value of metric to value or observe the value to metric.
+/// Supports metrics of kind TRITONSERVER_METRIC_KIND_GAUGE and
+/// TRITONSERVER_METRIC_KIND_HISTOGRAM. Returns


Do we want to explain what it does when the metric kind is TRITONSERVER_METRIC_KIND_HISTOGRAM (i.e. increment the counter for the bucket that the value falls into)?

What does "observe" mean? Can we add more details?

That's why I still think we need a new C API TRITONSERVER_MetricObserve.

I updated the comment. https://github.com/triton-inference-server/core/pull/386/files#diff-ce70b44ec4760c05dc91fc32ae46c5c3d9fa27edb61b960bf299b5960b38fc9fR2770

GuanLuo · 2024-08-16T17:14:23Z

src/metric_family.h

+    buckets_.resize(bucket_count);
+    std::memcpy(buckets_.data(), buckets, sizeof(double) * bucket_count);


Suggested change

-    buckets_.resize(bucket_count);
-    std::memcpy(buckets_.data(), buckets, sizeof(double) * bucket_count);
+    buckets_ = std::vector<double>(buckets, buckets + bucket_count);
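As a quick check that the suggested one-liner copies the same data as the resize-plus-memcpy version, here is a standalone sketch. The helper names CopyWithMemcpy and CopyWithRange are hypothetical and exist only for this comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Original approach: size the vector first, then copy the raw bytes.
inline std::vector<double>
CopyWithMemcpy(const double* buckets, std::size_t bucket_count)
{
  std::vector<double> out;
  out.resize(bucket_count);
  if (bucket_count > 0) {
    std::memcpy(out.data(), buckets, sizeof(double) * bucket_count);
  }
  return out;
}

// Suggested approach: construct the vector from the pointer range.
inline std::vector<double>
CopyWithRange(const double* buckets, std::size_t bucket_count)
{
  return std::vector<double>(buckets, buckets + bucket_count);
}

// Usage with made-up bucket boundaries.
inline void
CopyExample()
{
  const double raw[] = {0.1, 1.0, 2.5, 5.0, 10.0};
  assert(CopyWithMemcpy(raw, 5) == CopyWithRange(raw, 5));
}
```

The range form avoids the manual byte-size arithmetic and skips the value-initialization that resize performs before the copy.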

GuanLuo · 2024-08-16T17:21:45Z

src/test/metrics_api_test.cc

+  std::vector<std::uint64_t> cumulative_counts = {1, 1, 2, 2, 3, 3};
+  ASSERT_EQ(buckets.size() + 1, cumulative_counts.size());


The cumulative_counts values depend on how the histogram's buckets are split, so you should compute cumulative_counts from the buckets and the observed data rather than hard-coding them.
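One way to follow this suggestion is to derive the expected counts in the test itself. The sketch below assumes Prometheus-style buckets with inclusive upper bounds plus a final +Inf bucket. The helper name ExpectedCumulativeCounts and the sample buckets and data are hypothetical; the actual test's values are not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical test helper: for each bucket upper bound, count how many
// observed values are <= that bound, then append the total for +Inf.
inline std::vector<std::uint64_t>
ExpectedCumulativeCounts(
    const std::vector<double>& buckets, const std::vector<double>& data)
{
  std::vector<std::uint64_t> counts;
  counts.reserve(buckets.size() + 1);
  for (const double upper : buckets) {
    counts.push_back(static_cast<std::uint64_t>(std::count_if(
        data.begin(), data.end(),
        [upper](double value) { return value <= upper; })));
  }
  counts.push_back(static_cast<std::uint64_t>(data.size()));  // +Inf bucket
  return counts;
}

// Usage with made-up buckets and observations.
inline void
ExpectedCountsExample()
{
  const std::vector<double> buckets = {0.1, 1.0, 2.5, 5.0, 10.0};
  const std::vector<double> data = {0.05, 1.5, 6.0};
  const auto counts = ExpectedCumulativeCounts(buckets, data);
  assert(counts.size() == buckets.size() + 1);
}
```

Because the expectation is computed from the same buckets and data the test feeds into the metric, changing either one no longer requires updating a hard-coded list by hand.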

Co-authored-by: Yingge He <[email protected]>

yinggeh added 3 commits August 1, 2024 04:58

Add histogram metric type

a8d83d0

Add collect api for metrics

d560f64

Update copyrights

fd5c44b

yinggeh added the enhancement New feature or request label Aug 7, 2024

yinggeh requested review from GuanLuo, krishung5, kthui, oandreeva-nv and rmccorm4 August 7, 2024 05:14

yinggeh self-assigned this Aug 7, 2024

This was referenced Aug 7, 2024

feat: Report histogram metrics to Triton metrics server triton-inference-server/vllm_backend#56

Merged

feat: Add histogram metric type triton-inference-server/python_backend#374

Merged

rmccorm4 reviewed Aug 7, 2024

GuanLuo reviewed Aug 7, 2024

yinggeh force-pushed the yinggeh-DLIS-7113-support-histogram-metric-type branch 2 times, most recently from d0fed63 to fd5c44b Compare August 13, 2024 01:53

Update C API

edb0533

yinggeh force-pushed the yinggeh-DLIS-7113-support-histogram-metric-type branch from 79566e7 to edb0533 Compare August 13, 2024 02:30

Remove TRITONSERVER_MetricCollect API

667858e

yinggeh requested review from GuanLuo, Tabrizian and rmccorm4 August 13, 2024 13:34

Tabrizian reviewed Aug 13, 2024

Test MetricArgs

60f1e63

yinggeh requested a review from Tabrizian August 13, 2024 19:43

yinggeh mentioned this pull request Aug 14, 2024

test: Test histogram metric triton-inference-server/server#7525

Merged

Restore TRITONSERVER_MetricObserve

349daa6

GuanLuo reviewed Aug 16, 2024

yinggeh added 2 commits August 16, 2024 11:21

Fix build error with -DTRITON_ENABLE_METRICS=OFF

c3d478a

Minor update

484c2be

yinggeh requested a review from GuanLuo August 16, 2024 19:32

GuanLuo previously approved these changes Aug 16, 2024

Simply GetCumulativeCounts

d4372af

yinggeh dismissed GuanLuo’s stale review via d4372af August 16, 2024 19:46

yinggeh requested a review from GuanLuo August 16, 2024 19:47

GuanLuo approved these changes Aug 16, 2024

Tabrizian approved these changes Aug 16, 2024

yinggeh merged commit 9598a80 into main Aug 16, 2024

mc-nv pushed a commit that referenced this pull request Aug 19, 2024

feat: Add new histogram metric type (#386)

dffc026

mc-nv added a commit that referenced this pull request Aug 19, 2024

feat: Add new histogram metric type (#386) (#389)

bf27b3a

Co-authored-by: Yingge He <[email protected]>

yinggeh mentioned this pull request Aug 21, 2024

fix: Fix windows build #391

Merged
