Conversation

@yanhaoluo666 (Collaborator) commented Oct 13, 2025

Note

This PR is part of enabling high-frequency GPU metrics; other related PRs are 370 and 1893.

Description

This PR exposes a string list, GaugeMetricsToCompact, in the config of awsemfexporter; all matching metrics are converted to the CloudWatch histogram format, i.e. values and counts. This functionality supports high-frequency metrics. For example, if EKS GPU metrics are collected every second, the current batch will contain multiple datapoints with the same metric name. Aggregating them into one CloudWatch histogram reduces the number of metrics sent to the CloudWatch service and the customer cost for high-frequency metrics.
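
A hypothetical config sketch for illustration - the option name matches the gauge_metrics_to_compact mapstructure tag shown later in this review, but the surrounding fields and metric names here are illustrative assumptions, not taken from this PR:

exporters:
  awsemfexporter:
    namespace: ContainerInsights
    log_group_name: /aws/containerinsights/{ClusterName}/performance
    # Gauge metrics listed here are compacted into the CloudWatch
    # values/counts format before being written to the EMF log.
    gauge_metrics_to_compact:
      - container_gpu_temperature
      - container_gpu_utilization
      - container_gpu_memory_used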

It also refactors the code in the meantime to make it cleaner.

Testing

  1. Set up a few metrics in the config, then deploy the code changes to a personal cluster and run an ML job.
  2. Check the logs and confirm these metrics are in histogram format:
{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "GpuDevice",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_gpu_temperature",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_power_draw",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_used",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "cpipeline",
    "ContainerName": "main",
    "FullPodName": "gpu-burn-577f5d7468-4j54s",
    "GpuDevice": "nvidia0",
    "InstanceId": "i-0f01fff8faa360227",
    "InstanceType": "g4dn.xlarge",
    "Namespace": "kube-system",
    "NodeName": "ip-192-168-6-219.ec2.internal",
    "PodName": "gpu-burn",
    "Sources": [
        "dcgm",
        "pod",
        "calculated"
    ],
    "Timestamp": "1760375344178",
    "Type": "ContainerGPU",
    "UUID": "GPU-60efa417-4d26-c4ba-9e62-66249559952d",
    "Version": "0",
    "kubernetes": {
        "container_name": "main",
        "containerd": {
            "container_id": "5bfc51b6805d8bdc96e34f262394ae2702cc5d55ad186c660acbef414aa86223"
        },
        "host": "ip-192-168-6-219.ec2.internal",
        "labels": {
            "app": "gpu-burn",
            "pod-template-hash": "577f5d7468"
        },
        "pod_name": "gpu-burn-577f5d7468-4j54s",
        "pod_owners": [
            {
                "owner_kind": "Deployment",
                "owner_name": "gpu-burn"
            }
        ]
    },
    "container_gpu_memory_total": {
        "Values": [
            16006027360
        ],
        "Counts": [
            60
        ],
        "Max": 16006027360,
        "Min": 16006027360,
        "Count": 60,
        "Sum": 982473768960
    },
    "container_gpu_memory_used": {
        "Values": [
            0,
            176060768,
            245366784,
            14254342144,
            253755392,
            111149056,
            207608048,
            251658240
        ],
        "Counts": [
            8,
            1,
            1,
            46,
            1,
            1,
            1,
            1
        ],
        "Max": 14254342144,
        "Min": 0,
        "Count": 60,
        "Sum": 656945446912
    },
    "container_gpu_memory_utilization": {
        "Values": [
            1.185,
            0.9862,
            90.0607,
            1.609,
            0.6948,
            1.3572000000000002,
            1.5559999999999998,
            0
        ],
        "Counts": [
            1,
            1,
            46,
            1,
            1,
            1,
            1,
            8
        ],
        "Max": 90.0607,
        "Min": 0,
        "Count": 60,
        "Sum": 4150.226400000004
    },
    "container_gpu_power_draw": {
        "Values": [
            32.662,
            70.563,
            69.099,
            32.760,
            69.49,
            33.549,
            69.978,
            69.197,
            33.844,
            63.907,
            65.919,
            70.368,
            70.27,
            38.921,
            69.435,
            68.360,
            69.88,
            70.173,
            68.318,
            70.119,
            67.872,
            70.466,
            65.626,
            67.97,
            69.826,
            32.859,
            33.352,
            70.660,
            70.075,
            33.253,
            69.294,
            69.587,
            68.904,
            38.429,
            82.459,
            69.685,
            69.392,
            68.849,
            69.782,
            68.458
        ],
        "Counts": [
            2,
            2,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            3,
            2,
            2,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            2,
            1
        ],
        "Max": 82.459,
        "Min": 32.662,
        "Count": 60,
        "Sum": 3748.8209999999995
    },
    "container_gpu_temperature": {
        "Values": [
            42,
            43,
            44
        ],
        "Counts": [
            12,
            32,
            16
        ],
        "Max": 44,
        "Min": 42,
        "Count": 60,
        "Sum": 2628
    },
    "container_gpu_utilization": {
        "Values": [
            96,
            6,
            8,
            14,
            58,
            0,
            64,
            9,
            89,
            7,
            100
        ],
        "Counts": [
            1,
            1,
            1,
            1,
            1,
            6,
            1,
            1,
            1,
            2,
            44
        ],
        "Max": 100,
        "Min": 0,
        "Count": 60,
        "Sum": 4858
    }
}

// MetricDeclarations is the list of rules to be used to set dimensions for exported metrics.
MetricDeclarations []*MetricDeclaration `mapstructure:"metric_declarations"`

// List of string denoting gaugage metric names required for aggregation into cw histogram

nit: Typo: gaugage -> Gauge

@yanhaoluo666 (Collaborator, Author) replied:
updated

@yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch 5 times, most recently from 3e5c2d0 to 6ad6507, on October 16, 2025 10:59
@yanhaoluo666 changed the title from "[awsemfexporter] Support gaugage to cloudwatch histogram convertion in EMF exporter" to "[awsemfexporter] Support gauge to cloudwatch histogram convertion in EMF exporter" on Oct 16, 2025
@Aakash-Dantre self-requested a review on October 17, 2025 10:46
@Aakash-Dantre previously approved these changes on Oct 17, 2025
@dricross

What is the configuration you used for the test? Which of the metrics are you aggregating?

Comment on lines 315 to 320
	for val, cnt := range countMap {
		hist.Values = append(hist.Values, val)
		hist.Counts = append(hist.Counts, cnt)
	}


nit: I'd recommend pre-allocating hist.Values and hist.Counts to reduce the number of allocations. It can help quite a lot with performance.

@yanhaoluo666 (Collaborator, Author) replied:

good point! will revise it

@yanhaoluo666 (Collaborator, Author) replied:

will pre-allocate hist.Values and hist.Counts as shown below; correct me if I'm wrong:

	// Pre-allocate slices to avoid multiple allocations during append
	hist.Values = make([]float64, 0, len(countMap))
	hist.Counts = make([]float64, 0, len(countMap))

	if v > hist.Max {
		hist.Max = v
	}
	countMap[v]++


Correct me if I'm wrong, but if this is the core of the aggregation logic, then this isn't really creating a histogram. It's aggregating to a series of value/count pairs for each unique floating-point value. Considering the precision of float64, I don't think we'll get much aggregation out of it.

If we wanted to create an actual histogram that aggregates a range of datapoints, we'd need to define buckets where each bucket represents a range of values, store all of the incoming datapoints in those buckets, and then convert the buckets to values/counts.
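
For illustration, a minimal Go sketch of the range-bucketed approach described here; the function name, the sorted bucket boundaries, and the choice of each bucket's upper boundary as its representative value are all assumptions, not part of this PR:

import "sort"

// bucketize aggregates raw datapoints into range buckets and converts the
// non-empty buckets to values/counts pairs. bounds must be sorted ascending;
// values above the last boundary fall into an overflow bucket.
func bucketize(values, bounds []float64) (outValues, outCounts []float64) {
	counts := make([]float64, len(bounds)+1)
	for _, v := range values {
		// Index of the first boundary >= v.
		counts[sort.SearchFloat64s(bounds, v)]++
	}
	for i, c := range counts {
		if c == 0 {
			continue
		}
		// Represent each non-empty bucket by its upper boundary;
		// the overflow bucket reuses the last boundary.
		rep := bounds[len(bounds)-1]
		if i < len(bounds) {
			rep = bounds[i]
		}
		outValues = append(outValues, rep)
		outCounts = append(outCounts, c)
	}
	return outValues, outCounts
}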

@yanhaoluo666 (Collaborator, Author) replied:

You are correct; in fact, we are aggregating into a CloudWatch histogram instead of an OpenTelemetry histogram. A CW histogram is exactly in the format of values and counts, and the CloudWatch backend does the calculation of percentile values, e.g. P90.
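
For reference, a self-contained sketch of the exact-value compaction under discussion, folding in the pre-allocation from the earlier thread; the cwHistogram type and field names are assumptions modeled on the Values/Counts/Max/Min/Count/Sum structure in the EMF output above, not the PR's actual types:

import "math"

type cwHistogram struct {
	Values []float64
	Counts []float64
	Max    float64
	Min    float64
	Count  int
	Sum    float64
}

// compactToCWHistogram collapses raw gauge values into value/count pairs.
// Callers are assumed to pass at least one value.
func compactToCWHistogram(values []float64) cwHistogram {
	hist := cwHistogram{Min: math.Inf(1), Max: math.Inf(-1)}
	countMap := make(map[float64]float64)
	for _, v := range values {
		if v > hist.Max {
			hist.Max = v
		}
		if v < hist.Min {
			hist.Min = v
		}
		hist.Sum += v
		hist.Count++
		countMap[v]++ // one entry per distinct float64 value
	}
	// Pre-allocate slices to avoid multiple allocations during append.
	hist.Values = make([]float64, 0, len(countMap))
	hist.Counts = make([]float64, 0, len(countMap))
	for val, cnt := range countMap {
		hist.Values = append(hist.Values, val)
		hist.Counts = append(hist.Counts, cnt)
	}
	return hist
}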

@yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch 4 times, most recently from 9406579 to a58c06e, on October 20, 2025 11:42

updatedMetadata = replacePatternsIfNeeded(metadata, labels, config, patternReplaceSucceeded)
// For compacted gauge metrics, use ExponentialHistogram type to convert it to values and counts in PutLogEvent
updatedMetadata.metricDataType = pmetric.MetricTypeExponentialHistogram


Is this setting needed to force some other logic to run? The data isn't actually a histogram. It's in values/counts, but it's still a gauge.

@yanhaoluo666 (Collaborator, Author) replied:

I thought there was some special logic for converting values and counts in PutLogEvent, but there actually isn't. Changed it back to Gauge.

@yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch from a58c06e to 95b4340 on October 20, 2025 19:10
@yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch from 95b4340 to 6b90657 on October 21, 2025 12:06
@yanhaoluo666 requested a review from movence on October 21, 2025 13:47
}

// compactGaugeMetrics converts a collection of gauge data points into a compact representation, e.g. values and counts.
func compactGaugeMetrics(

iiuc it's not really compacting gauges but converting them to a CW histogram. The name could be more aligned with what it actually does, like convertGaugesToCWHistogram or something.

@yanhaoluo666 (Collaborator, Author) replied:

Previously I named it buildHistogram, but from Rick's perspective it is a kind of compaction or deduplication, so I renamed all occurrences of Histogram to Compact. Pasted Rick's comment for reference:

Correct me if I'm wrong, but if this is the core of the aggregation logic, then this isn't really creating a histogram. It's aggregating to a series of value/count pairs for each unique floating-point value. Considering the precision of float64, I don't think we'll get much aggregation out of it.

If we wanted to create an actual histogram that aggregates a range of datapoints, we'd need to define buckets where each bucket represents a range of values, store all of the incoming datapoints in those buckets, and then convert the buckets to values/counts.

MetricDeclarations []*MetricDeclaration `mapstructure:"metric_declarations"`

// List of string denoting Gauge metric names required for compaction to values and counts
GaugeMetricsToCompact []string `mapstructure:"gauge_metrics_to_compact"`


nit: I'd recommend following the pattern of the other fields that start with Metric, to imply it's something to do with modifying the metrics. A suggestion would be MetricAsDistribution.

This exporter treats gauges and sums the same - in fact, further below it converts gauges to sums - so should we just respect both for this too?

		value: dp.value,
		unit:  translateUnit(pmd, descriptor),
	}
	filteredDps := filterAndCalculateDps(dps, pmd.Name(), metadata, config, calculators)


nit: Can we retain some of the code comments that were previously here within the helper methods you are creating?

Like the comment around dropping NaN or the one about patterns.

}
filteredDps := filterAndCalculateDps(dps, pmd.Name(), metadata, config, calculators)

if shouldCompactMetrics(pmd, config) {


nit: shouldCompactMetrics -> shouldConvertToDistribution
compactGaugeMetrics -> convertToDistribution etc.

	}
	if v, ok := dp.value.(float64); ok {
		values = append(values, v)
		labels = enrichLabels(dp, config)


This is just overwriting it in place over and over for every dp in the loop, so only the last datapoint's labels are what we actually use?

If that's expected behavior, that's fine, but maybe move this line into the chunk above with the timestamp update, so that we use the labels corresponding to the datapoint with the most recent timestamp?
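
A sketch of the suggested restructure, assuming the datapoint exposes its timestamp via a timestampMs field (a hypothetical name; the PR's actual field may differ):

	// Keep the labels paired with the datapoint that owns the latest
	// timestamp, instead of whichever datapoint is iterated last.
	if dp.timestampMs > timestampMs {
		timestampMs = dp.timestampMs
		labels = enrichLabels(dp, config)
	}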

}

updatedMetadata = replacePatternsIfNeeded(metadata, labels, config, patternReplaceSucceeded)
updatedMetadata.metricDataType = pmetric.MetricTypeGauge


Why do we call it a gauge when it's closer to a histogram? Is some subsequent logic relying on this being defined as a gauge?

var timestampMs int64

// Extract float values from data points and find the latest timestamp
for _, dp := range dps {


All of this logic relies on the fact that these are all different datapoints on the same metric. But if they are sent as different metrics with one datapoint each, then this won't actually club them together.

In the overall use case we are trying to solve, each Prometheus scrape is going to create separate metrics, right? So I'm confused about how this logic is able to combine all of them together.


The testing in your Overview does seem to indicate it's working as expected by clubbing all 60 datapoints into a single distribution - but I'm just missing how that's happening.

Can we maybe add a temporary log line here to print the metric name, the dps array size, and the timestamps, to see if all 60 are coming into this func at once?
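
Something like this, assuming a zap logger is in scope (the field names are illustrative):

	logger.Debug("compacting gauge datapoints",
		zap.String("metric", pmd.Name()),
		zap.Int("datapointCount", len(dps)),
		zap.Int64("latestTimestampMs", timestampMs))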

groupedMetrics := make(map[any]*groupedMetric)
rms := gaugeMetric.ResourceMetrics()
ilms := rms.At(0).ScopeMetrics()
metrics := ilms.At(0).Metrics()


Can we add a second metric here with the same name and different datapoints to validate that scenario?
Per my understanding of the current logic, that might log a "Duplicate metric found" warning and just ignore the values from the second metric.
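
A rough sketch of that test addition using the pmetric API; the metric name and value are placeholders:

	// Append a second metric with the same name but a different datapoint
	// to the same ScopeMetrics, to exercise the duplicate-metric path.
	m2 := metrics.AppendEmpty()
	m2.SetName("container_gpu_temperature")
	dp := m2.SetEmptyGauge().DataPoints().AppendEmpty()
	dp.SetDoubleValue(45)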

@sky333999 dismissed a stale review on October 29, 2025 23:25: "Approved accidentally"
