[awsemfexporter] Support gauge to CloudWatch histogram conversion in EMF exporter #371
base: aws-cwa-dev
Conversation
Force-pushed from 1a663eb to db2a1ad
exporter/awsemfexporter/config.go
Outdated
```go
// MetricDeclarations is the list of rules to be used to set dimensions for exported metrics.
MetricDeclarations []*MetricDeclaration `mapstructure:"metric_declarations"`

// List of string denoting gaugage metric names required for aggregation into cw histogram
```
nit: Typo gaugage -> Gauge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Force-pushed from 3e5c2d0 to 6ad6507
What is the configuration you used for the test? Which of the metrics are you aggregating?
```go
for val, cnt := range countMap {
	hist.Values = append(hist.Values, val)
	hist.Counts = append(hist.Counts, cnt)
}
```
nit: I'd recommend pre-allocating hist.Values and hist.Counts to reduce the number of allocations. It can help quite a lot with performance.
good point! will revise it
Will pre-allocate hist.Values and hist.Counts in the format below; correct me if I'm wrong:
```go
// Pre-allocate slices to avoid multiple allocations during append
hist.Values = make([]float64, 0, len(countMap))
hist.Counts = make([]float64, 0, len(countMap))
```
```go
if v > hist.Max {
	hist.Max = v
}
countMap[v]++
```
Correct me if I'm wrong, but if this is the core of the aggregation logic, then this isn't really creating a histogram. It's aggregating to a series of value/count pairs for each unique floating point value. Considering the precision of float64, I don't think we'll get much aggregation out of it.
If we wanted to create an actual histogram that aggregates a range of datapoints, we'd need to define buckets where each bucket represents a range of values, store all of the incoming datapoints into those buckets, and then convert the buckets to values/counts.
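For reference, a rough sketch of the range-based bucketing being described; the function name and bucket boundaries here are illustrative only, not part of this PR:

```go
package main

import (
	"fmt"
	"sort"
)

// bucketize sketches range-based bucketing, as opposed to counting exact
// float64 values. Bucket i holds values in (boundaries[i-1], boundaries[i]];
// the final bucket is an overflow bucket for values above all boundaries.
func bucketize(values, boundaries []float64) []uint64 {
	counts := make([]uint64, len(boundaries)+1)
	for _, v := range values {
		// sort.SearchFloat64s returns the index of the first boundary >= v.
		counts[sort.SearchFloat64s(boundaries, v)]++
	}
	return counts
}

func main() {
	fmt.Println(bucketize([]float64{0.2, 0.4, 0.41, 7.3}, []float64{0.5, 1, 5}))
	// Output: [3 0 0 1]
}
```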
You are correct; in fact, we are aggregating to a CloudWatch histogram instead of an OpenTelemetry histogram. A CW histogram is exactly in the format of values and counts, and the CloudWatch backend does the calculation for percentile values, e.g. P90.
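For context, an EMF log can carry a metric value as a distribution of values and counts, roughly like the following (metric name and numbers are illustrative; CloudWatch computes percentiles such as p90 server-side):

```json
"gpu_utilization": {
  "Values": [0.0, 50.0, 100.0],
  "Counts": [10, 45, 5],
  "Min": 0.0,
  "Max": 100.0,
  "Sum": 2750.0,
  "Count": 60
}
```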
Force-pushed from 9406579 to a58c06e
```go
updatedMetadata = replacePatternsIfNeeded(metadata, labels, config, patternReplaceSucceeded)
// For compacted gauge metrics, use ExponentialHistogram type to convert it to values and counts in PutLogEvent
updatedMetadata.metricDataType = pmetric.MetricTypeExponentialHistogram
```
Is this setting needed to force some other logic to run? The data isn't actually a histogram. It's in values/counts, but it's still a gauge.
I thought there was some special logic for converting values and counts in PutLogEvent, but there actually isn't. Changed it back to Gauge.
Force-pushed from a58c06e to 95b4340
Force-pushed from 95b4340 to 6b90657
```go
}

// compactGaugeMetrics converts a collection of gauge data points into a compact representation, e.g. values and counts.
func compactGaugeMetrics(
```
iiuc it's not really compacting gauges but converting to a CWHistogram. The name could be more aligned with what it actually does, like convertGaugesToCWHistogram or something.
Previously I named it buildHistogram, but from Rick's perspective it should be a kind of compaction or deduplication, so I renamed all occurrences of Histogram to Compact. Pasted the comment from Rick for reference:
Correct me if I'm wrong, but if this is the core of the aggregation logic, then this isn't really creating a histogram. It's aggregating to a series of value/count pairs for each unique floating point value. Considering the precision of float64, I don't think we'll get much aggregation out of it.
If we wanted to create an actual histogram that aggregates a range of datapoints, we'd need to define buckets where each bucket represents a range of values, store all of the incoming datapoints into those buckets, and then convert the buckets to values/counts.
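For reference, a rough sketch of what the value/count compaction amounts to; the type and function names are illustrative, pieced together from the diff snippets in this thread:

```go
package main

import "fmt"

// cwHistogram mirrors the values/counts shape CloudWatch accepts; the
// field names here are assumptions based on the diff snippets above.
type cwHistogram struct {
	Values []float64
	Counts []float64
	Min    float64
	Max    float64
	Count  uint64
	Sum    float64
}

// compact deduplicates raw gauge values into value/count pairs.
func compact(values []float64) cwHistogram {
	countMap := make(map[float64]float64, len(values))
	hist := cwHistogram{}
	for i, v := range values {
		if i == 0 || v < hist.Min {
			hist.Min = v
		}
		if i == 0 || v > hist.Max {
			hist.Max = v
		}
		hist.Sum += v
		hist.Count++
		countMap[v]++
	}
	// Pre-allocate to avoid repeated growth during append (per the review nit).
	hist.Values = make([]float64, 0, len(countMap))
	hist.Counts = make([]float64, 0, len(countMap))
	for val, cnt := range countMap {
		hist.Values = append(hist.Values, val)
		hist.Counts = append(hist.Counts, cnt)
	}
	return hist
}

func main() {
	fmt.Printf("%+v\n", compact([]float64{1, 1, 2, 3, 3, 3}))
}
```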
```go
MetricDeclarations []*MetricDeclaration `mapstructure:"metric_declarations"`

// List of string denoting Gauge metric names required for compaction to values and counts
GaugeMetricsToCompact []string `mapstructure:"gauge_metrics_to_compact"`
```
nit: I'd recommend following the pattern of the other fields that start with Metric to imply it's something to do with modifying the metrics.
A suggestion would be MetricAsDistribution.
This exporter treats gauges and sums the same (in fact, further below it converts gauges to sums), so should we just respect both for this too?
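For illustration, a collector config using this field might look like the following sketch; the field name is taken from the mapstructure tag above, while the metric names are hypothetical:

```yaml
exporters:
  awsemf:
    region: us-west-2
    # Gauge metrics listed here are compacted into values/counts
    # before being written to the EMF log (hypothetical metric names).
    gauge_metrics_to_compact:
      - container_gpu_utilization
      - container_gpu_memory_utilization
```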
```go
	value: dp.value,
	unit:  translateUnit(pmd, descriptor),
}
filteredDps := filterAndCalculateDps(dps, pmd.Name(), metadata, config, calculators)
```
nit: Can we retain some of the code comments that were previously here within the helper methods you are creating?
Like the comment around dropping NaN or the one about patterns.
```go
}
filteredDps := filterAndCalculateDps(dps, pmd.Name(), metadata, config, calculators)

if shouldCompactMetrics(pmd, config) {
```
nit: shouldCompactMetrics -> shouldConvertToDistribution
compactGaugeMetrics -> convertToDistribution etc.
```go
}
if v, ok := dp.value.(float64); ok {
	values = append(values, v)
	labels = enrichLabels(dp, config)
```
This is just overwriting it in place over and over for every dp in the loop, so only the last datapoint's labels are what we actually use?
If that's expected behavior, that's fine, but maybe move this line into the chunk above with the timestamp update so that we use the labels corresponding to the datapoint with the most recent timestamp, as sketched below.
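A minimal sketch of that suggestion, assuming the loop already tracks the latest timestamp; the dataPoint type and its fields are simplified stand-ins for the exporter's internals:

```go
package main

import "fmt"

// dataPoint is a simplified stand-in for the exporter's datapoint type
// (hypothetical fields for illustration).
type dataPoint struct {
	value       float64
	timestampMs int64
	labels      map[string]string
}

func main() {
	dps := []dataPoint{
		{1.0, 100, map[string]string{"host": "a"}},
		{2.0, 300, map[string]string{"host": "b"}}, // latest timestamp
		{3.0, 200, map[string]string{"host": "c"}},
	}

	var (
		values      []float64
		timestampMs int64
		labels      map[string]string
	)
	for _, dp := range dps {
		values = append(values, dp.value)
		// Take the labels from the datapoint with the most recent
		// timestamp, rather than from whichever datapoint is last.
		if dp.timestampMs > timestampMs {
			timestampMs = dp.timestampMs
			labels = dp.labels
		}
	}
	fmt.Println(values, timestampMs, labels) // [1 2 3] 300 map[host:b]
}
```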
```go
}

updatedMetadata = replacePatternsIfNeeded(metadata, labels, config, patternReplaceSucceeded)
updatedMetadata.metricDataType = pmetric.MetricTypeGauge
```
Why do we call it a gauge when it's closer to a histogram? Is some subsequent logic relying on this being defined as a gauge?
```go
var timestampMs int64

// Extract float values from data points and find the latest timestamp
for _, dp := range dps {
```
All of this logic relies on the fact that these are all different datapoints on the same metric. But if they are sent as different metrics with one datapoint each, then this won't actually club them together.
In the overall use case we are trying to solve, each Prometheus scrape is going to create separate metrics, right? So I'm confused how this logic is able to combine all of them.
The testing in your Overview does seem to indicate it's working as expected by clubbing all 60 datapoints into a single distribution, but I'm just missing how that's happening.
Can we maybe add a temporary log line here to print the metric name, the dps array size, and the timestamps to see if all 60 are coming into this func at once?
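Something along these lines could confirm it, assuming a zap logger is in scope as elsewhere in the collector; the values in main are stand-ins for what would be available at that point in the exporter:

```go
package main

import "go.uber.org/zap"

func main() {
	logger, _ := zap.NewDevelopment()
	defer logger.Sync()

	// Stand-ins for the values in scope at this point in the exporter
	// (hypothetical for this sketch).
	metricName := "container_gpu_utilization"
	dpsLen := 60
	var timestampMs int64 = 1700000000000

	// Temporary diagnostic: confirm how many datapoints arrive at once.
	logger.Debug("compacting gauge datapoints",
		zap.String("metric", metricName),
		zap.Int("numDataPoints", dpsLen),
		zap.Int64("latestTimestampMs", timestampMs),
	)
}
```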
```go
groupedMetrics := make(map[any]*groupedMetric)
rms := gaugeMetric.ResourceMetrics()
ilms := rms.At(0).ScopeMetrics()
metrics := ilms.At(0).Metrics()
```
Can we add a 2nd metric here with the same name and different datapoints to validate that scenario?
Per my understanding of the current logic, that might log a warning with "Duplicate metric found" and just ignore the values from the 2nd metric.
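A test input for that case could be built along these lines, using the pdata APIs this exporter already depends on; the helper and metric name are hypothetical, and the final assertion belongs in the actual test:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// buildGauge appends a gauge metric with one datapoint to the given scope.
func buildGauge(sm pmetric.ScopeMetrics, name string, v float64) {
	m := sm.Metrics().AppendEmpty()
	m.SetName(name)
	m.SetEmptyGauge().DataPoints().AppendEmpty().SetDoubleValue(v)
}

func main() {
	md := pmetric.NewMetrics()
	sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()

	// Two metrics with the same name but different datapoints, to check
	// whether the second one is merged or dropped with a duplicate warning.
	buildGauge(sm, "container_gpu_utilization", 1.0)
	buildGauge(sm, "container_gpu_utilization", 2.0)

	fmt.Println(sm.Metrics().Len()) // 2
}
```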
Note
This PR is part of enabling high frequency GPU metrics; other related PRs are #370 and #1893.
Description
This PR exposes a string list, GaugeMetricsToCompact, in the config of awsemfexporter; all matching metrics will be converted to a CloudWatch histogram, i.e. the values and counts format. This functionality supports high frequency metrics. For example, if EKS GPU metrics are collected every second, there will be several metrics with the same name in the current batch. Aggregating them into one CloudWatch histogram reduces the quantity of metrics sent to the CloudWatch service and the customer cost for high frequency metrics. The code was also refactored in the meantime to make it cleaner.
Testing