Conversation

@yanhaoluo666 (Collaborator) commented Oct 13, 2025

Note

This PR is part of enabling high-frequency GPU metrics; other related PRs are 370 and 1893.

Description

This PR exposes a string list, GaugeMetricsToCompact, in the config of awsemfexporter; all matching metrics are converted to the CloudWatch histogram format, i.e. values and counts. This functionality supports high-frequency metrics. For example, if EKS GPU metrics are collected every second, the current batch will contain multiple datapoints with the same metric name. Aggregating them into one CloudWatch histogram reduces the number of metrics sent to the CloudWatch service and the customer cost for high-frequency metrics.
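
A hypothetical config sketch for illustration - the option name matches the gauge_metrics_to_compact mapstructure tag shown later in this review, but the surrounding fields and metric names here are illustrative assumptions, not taken from this PR:

exporters:
  awsemfexporter:
    namespace: ContainerInsights
    log_group_name: /aws/containerinsights/{ClusterName}/performance
    # Gauge metrics listed here are compacted into the CloudWatch
    # values/counts format before being written to the EMF log.
    gauge_metrics_to_compact:
      - container_gpu_temperature
      - container_gpu_utilization
      - container_gpu_memory_used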

It also refactors the code in the meantime to make it cleaner.

Testing

  1. Set up a few metrics in the config, then deploy the code changes to a personal cluster and run an ML job.
  2. Check the logs and confirm these metrics are in histogram format:
{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "GpuDevice",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_gpu_temperature",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_power_draw",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_used",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "cpipeline",
    "ContainerName": "main",
    "FullPodName": "gpu-burn-577f5d7468-4j54s",
    "GpuDevice": "nvidia0",
    "InstanceId": "i-0f01fff8faa360227",
    "InstanceType": "g4dn.xlarge",
    "Namespace": "kube-system",
    "NodeName": "ip-192-168-6-219.ec2.internal",
    "PodName": "gpu-burn",
    "Sources": [
        "dcgm",
        "pod",
        "calculated"
    ],
    "Timestamp": "1760375344178",
    "Type": "ContainerGPU",
    "UUID": "GPU-60efa417-4d26-c4ba-9e62-66249559952d",
    "Version": "0",
    "kubernetes": {
        "container_name": "main",
        "containerd": {
            "container_id": "5bfc51b6805d8bdc96e34f262394ae2702cc5d55ad186c660acbef414aa86223"
        },
        "host": "ip-192-168-6-219.ec2.internal",
        "labels": {
            "app": "gpu-burn",
            "pod-template-hash": "577f5d7468"
        },
        "pod_name": "gpu-burn-577f5d7468-4j54s",
        "pod_owners": [
            {
                "owner_kind": "Deployment",
                "owner_name": "gpu-burn"
            }
        ]
    },
    "container_gpu_memory_total": {
        "Values": [
            16006027360
        ],
        "Counts": [
            60
        ],
        "Max": 16006027360,
        "Min": 16006027360,
        "Count": 60,
        "Sum": 982473768960
    },
    "container_gpu_memory_used": {
        "Values": [
            0,
            176060768,
            245366784,
            14254342144,
            253755392,
            111149056,
            207608048,
            251658240
        ],
        "Counts": [
            8,
            1,
            1,
            46,
            1,
            1,
            1,
            1
        ],
        "Max": 14254342144,
        "Min": 0,
        "Count": 60,
        "Sum": 656945446912
    },
    "container_gpu_memory_utilization": {
        "Values": [
            1.185,
            0.9862,
            90.0607,
            1.609,
            0.6948,
            1.3572000000000002,
            1.5559999999999998,
            0
        ],
        "Counts": [
            1,
            1,
            46,
            1,
            1,
            1,
            1,
            8
        ],
        "Max": 90.0607,
        "Min": 0,
        "Count": 60,
        "Sum": 4150.226400000004
    },
    "container_gpu_power_draw": {
        "Values": [
            32.662,
            70.563,
            69.099,
            32.760,
            69.49,
            33.549,
            69.978,
            69.197,
            33.844,
            63.907,
            65.919,
            70.368,
            70.27,
            38.921,
            69.435,
            68.360,
            69.88,
            70.173,
            68.318,
            70.119,
            67.872,
            70.466,
            65.626,
            67.97,
            69.826,
            32.859,
            33.352,
            70.660,
            70.075,
            33.253,
            69.294,
            69.587,
            68.904,
            38.429,
            82.459,
            69.685,
            69.392,
            68.849,
            69.782,
            68.458
        ],
        "Counts": [
            2,
            2,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            3,
            2,
            2,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            2,
            1
        ],
        "Max": 82.459,
        "Min": 32.662,
        "Count": 60,
        "Sum": 3748.8209999999995
    },
    "container_gpu_temperature": {
        "Values": [
            42,
            43,
            44
        ],
        "Counts": [
            12,
            32,
            16
        ],
        "Max": 44,
        "Min": 42,
        "Count": 60,
        "Sum": 2628
    },
    "container_gpu_utilization": {
        "Values": [
            96,
            6,
            8,
            14,
            58,
            0,
            64,
            9,
            89,
            7,
            100
        ],
        "Counts": [
            1,
            1,
            1,
            1,
            1,
            6,
            1,
            1,
            1,
            2,
            44
        ],
        "Max": 100,
        "Min": 0,
        "Count": 60,
        "Sum": 4858
    }
}

// MetricDeclarations is the list of rules to be used to set dimensions for exported metrics.
MetricDeclarations []*MetricDeclaration `mapstructure:"metric_declarations"`

// List of string denoting gaugage metric names required for aggregation into cw histogram

nit: Typo: gaugage -> Gauge

@yanhaoluo666 (Collaborator, Author) replied:
updated

@yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch 5 times, most recently from 3e5c2d0 to 6ad6507, on October 16, 2025 10:59
@yanhaoluo666 changed the title from "[awsemfexporter] Support gaugage to cloudwatch histogram convertion in EMF exporter" to "[awsemfexporter] Support gauge to cloudwatch histogram convertion in EMF exporter" on Oct 16, 2025
@Aakash-Dantre self-requested a review on October 17, 2025 10:46
@Aakash-Dantre previously approved these changes on Oct 17, 2025
@dricross

What is the configuration you used for the test? Which of the metrics are you aggregating?

Comment on lines 315 to 320
	for val, cnt := range countMap {
		hist.Values = append(hist.Values, val)
		hist.Counts = append(hist.Counts, cnt)
	}


nit: I'd recommend pre-allocating hist.Values and hist.Counts to reduce the number of allocations. It can help quite a lot with performance.

@yanhaoluo666 (Collaborator, Author) replied:

good point! will revise it

@yanhaoluo666 (Collaborator, Author) replied:

will pre-allocate hist.Values and hist.Counts as shown below; correct me if I'm wrong:

	// Pre-allocate slices to avoid multiple allocations during append
	hist.Values = make([]float64, 0, len(countMap))
	hist.Counts = make([]float64, 0, len(countMap))

	if v > hist.Max {
		hist.Max = v
	}
	countMap[v]++


Correct me if I'm wrong, but if this is the core of the aggregation logic, then this isn't really creating a histogram. It's aggregating to a series of value/count pairs for each unique floating-point value. Considering the precision of float64, I don't think we'll get much aggregation out of it.

If we wanted to create an actual histogram that aggregates a range of datapoints, we'd need to define buckets where each bucket represents a range of values, store all of the incoming datapoints in those buckets, and then convert the buckets to values/counts.
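
For illustration, a minimal Go sketch of the range-bucketed approach described here; the function name, the sorted bucket boundaries, and the choice of each bucket's upper boundary as its representative value are all assumptions, not part of this PR:

import "sort"

// bucketize aggregates raw datapoints into range buckets and converts the
// non-empty buckets to values/counts pairs. bounds must be sorted ascending;
// values above the last boundary fall into an overflow bucket.
func bucketize(values, bounds []float64) (outValues, outCounts []float64) {
	counts := make([]float64, len(bounds)+1)
	for _, v := range values {
		// Index of the first boundary >= v.
		counts[sort.SearchFloat64s(bounds, v)]++
	}
	for i, c := range counts {
		if c == 0 {
			continue
		}
		// Represent each non-empty bucket by its upper boundary;
		// the overflow bucket reuses the last boundary.
		rep := bounds[len(bounds)-1]
		if i < len(bounds) {
			rep = bounds[i]
		}
		outValues = append(outValues, rep)
		outCounts = append(outCounts, c)
	}
	return outValues, outCounts
}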

@yanhaoluo666 (Collaborator, Author) replied:

You are correct; in fact, we are aggregating into a CloudWatch histogram instead of an OpenTelemetry histogram. A CW histogram is exactly in the format of values and counts, and the CloudWatch backend does the calculation of percentile values, e.g. P90.
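
For reference, a self-contained sketch of the exact-value compaction under discussion, folding in the pre-allocation from the earlier thread; the cwHistogram type and field names are assumptions modeled on the Values/Counts/Max/Min/Count/Sum structure in the EMF output above, not the PR's actual types:

import "math"

type cwHistogram struct {
	Values []float64
	Counts []float64
	Max    float64
	Min    float64
	Count  int
	Sum    float64
}

// compactToCWHistogram collapses raw gauge values into value/count pairs.
// Callers are assumed to pass at least one value.
func compactToCWHistogram(values []float64) cwHistogram {
	hist := cwHistogram{Min: math.Inf(1), Max: math.Inf(-1)}
	countMap := make(map[float64]float64)
	for _, v := range values {
		if v > hist.Max {
			hist.Max = v
		}
		if v < hist.Min {
			hist.Min = v
		}
		hist.Sum += v
		hist.Count++
		countMap[v]++ // one entry per distinct float64 value
	}
	// Pre-allocate slices to avoid multiple allocations during append.
	hist.Values = make([]float64, 0, len(countMap))
	hist.Counts = make([]float64, 0, len(countMap))
	for val, cnt := range countMap {
		hist.Values = append(hist.Values, val)
		hist.Counts = append(hist.Counts, cnt)
	}
	return hist
}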

@yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch 4 times, most recently from 9406579 to a58c06e, on October 20, 2025 11:42

updatedMetadata = replacePatternsIfNeeded(metadata, labels, config, patternReplaceSucceeded)
// For compacted gauge metrics, use ExponentialHistogram type to convert it to values and counts in PutLogEvent
updatedMetadata.metricDataType = pmetric.MetricTypeExponentialHistogram


Is this setting needed to force some other logic to run? The data isn't actually a histogram. It's in values/counts, but it's still a gauge.

@yanhaoluo666 (Collaborator, Author) replied:

I thought there was some special logic for converting values and counts in PutLogEvent, but there actually isn't. Changed it back to Gauge.

@yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch from a58c06e to 95b4340 on October 20, 2025 19:10
@yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch from 95b4340 to 6b90657 on October 21, 2025 12:06
@yanhaoluo666 requested a review from movence on October 21, 2025 13:47
}

// compactGaugeMetrics converts a collection of gauge data points into a compact representation, e.g. values and counts.
func compactGaugeMetrics(

iiuc it's not really compacting gauges but converting them to a CW histogram. The name could be more aligned with what it actually does, like convertGaugesToCWHistogram or something.

@yanhaoluo666 (Collaborator, Author) replied:

Previously I named it buildHistogram, but from Rick's perspective it is a kind of compaction or deduplication, so I renamed all occurrences of Histogram to Compact. Pasted Rick's comment for reference:

Correct me if I'm wrong, but if this is the core of the aggregation logic, then this isn't really creating a histogram. It's aggregating to a series of value/count pairs for each unique floating-point value. Considering the precision of float64, I don't think we'll get much aggregation out of it.

If we wanted to create an actual histogram that aggregates a range of datapoints, we'd need to define buckets where each bucket represents a range of values, store all of the incoming datapoints in those buckets, and then convert the buckets to values/counts.

MetricDeclarations []*MetricDeclaration `mapstructure:"metric_declarations"`

// List of string denoting Gauge metric names required for compaction to values and counts
GaugeMetricsToCompact []string `mapstructure:"gauge_metrics_to_compact"`


nit: I'd recommend following the pattern of the other fields that start with Metric, to imply it's something to do with modifying the metrics. A suggestion would be MetricAsDistribution.

This exporter treats gauges and sums the same - in fact, further below it converts gauges to sums - so should we just respect both for this too?

		value: dp.value,
		unit:  translateUnit(pmd, descriptor),
	}
	filteredDps := filterAndCalculateDps(dps, pmd.Name(), metadata, config, calculators)


nit: Can we retain some of the code comments that were previously here within the helper methods you are creating?

Like the comment around dropping NaN or the one about patterns.

}
filteredDps := filterAndCalculateDps(dps, pmd.Name(), metadata, config, calculators)

if shouldCompactMetrics(pmd, config) {


nit: shouldCompactMetrics -> shouldConvertToDistribution
compactGaugeMetrics -> convertToDistribution etc.

	}
	if v, ok := dp.value.(float64); ok {
		values = append(values, v)
		labels = enrichLabels(dp, config)


This is just overwriting it in place over and over for every dp in the loop, so only the last datapoint's labels are what we actually use?

If that's expected behavior, that's fine, but maybe move this line into the chunk above with the timestamp update, so that we use the labels corresponding to the datapoint with the most recent timestamp?
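
A sketch of the suggested restructure, assuming the datapoint exposes its timestamp via a timestampMs field (a hypothetical name; the PR's actual field may differ):

	// Keep the labels paired with the datapoint that owns the latest
	// timestamp, instead of whichever datapoint is iterated last.
	if dp.timestampMs > timestampMs {
		timestampMs = dp.timestampMs
		labels = enrichLabels(dp, config)
	}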

}

updatedMetadata = replacePatternsIfNeeded(metadata, labels, config, patternReplaceSucceeded)
updatedMetadata.metricDataType = pmetric.MetricTypeGauge


Why do we call it a gauge when it's closer to a histogram? Is some subsequent logic relying on this being defined as a gauge?

var timestampMs int64

// Extract float values from data points and find the latest timestamp
for _, dp := range dps {


All of this logic relies on the fact that these are all different datapoints on the same metric. But if they are sent as different metrics with one datapoint each, then this won't actually club them together.

In the overall use case we are trying to solve, each Prometheus scrape is going to create separate metrics, right? So I'm confused about how this logic is able to combine all of them together.


The testing in your Overview does seem to indicate it's working as expected by clubbing all 60 datapoints into a single distribution - but I'm just missing how that's happening.

Can we maybe add a temporary log line here to print the metric name, the dps array size, and the timestamps, to see if all 60 are coming into this func at once?
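
Something like this, assuming a zap logger is in scope (the field names are illustrative):

	logger.Debug("compacting gauge datapoints",
		zap.String("metric", pmd.Name()),
		zap.Int("datapointCount", len(dps)),
		zap.Int64("latestTimestampMs", timestampMs))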

groupedMetrics := make(map[any]*groupedMetric)
rms := gaugeMetric.ResourceMetrics()
ilms := rms.At(0).ScopeMetrics()
metrics := ilms.At(0).Metrics()


Can we add a second metric here with the same name and different datapoints to validate that scenario?
Per my understanding of the current logic, that might log a "Duplicate metric found" warning and just ignore the values from the second metric.
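
A rough sketch of that test addition using the pmetric API; the metric name and value are placeholders:

	// Append a second metric with the same name but a different datapoint
	// to the same ScopeMetrics, to exercise the duplicate-metric path.
	m2 := metrics.AppendEmpty()
	m2.SetName("container_gpu_temperature")
	dp := m2.SetEmptyGauge().DataPoints().AppendEmpty()
	dp.SetDoubleValue(45)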

@sky333999 dismissed a stale review on October 29, 2025 23:25: "Approved accidentally"
