Skip to content

Conversation

@yanhaoluo666
Copy link
Collaborator

@yanhaoluo666 yanhaoluo666 commented Oct 13, 2025

Note

This PR is part of enabling high frequency gpu metrics, other related PRs are 370 and 1893.

Description

This PR is exposing a string list - GaugeMetricsToCompact in config of awsemfexporter, all matching metrics will be converted to cloudwatch histogram, e.g. values and counts format. This functionality is to support high frequency metrics. For ex, if eks gpu metrics are collected every second, there will be a couple metrics with the same name in current batch. Aggregate them into one cloudwatch histogram would reduce the metrics quantity to Cloudwatch service and customer cost on high frequency metrics.

Refactor the code in the meantime to make it more clean.

Testing

  1. set up a few metrics in the config, then deploy code changes to personal cluster and run ML job.
  2. check the logs and confirm these metrics are in histogram format.
{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "GpuDevice",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_gpu_temperature",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_power_draw",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_used",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "cpipeline",
    "ContainerName": "main",
    "FullPodName": "gpu-burn-577f5d7468-4j54s",
    "GpuDevice": "nvidia0",
    "InstanceId": "i-0f01fff8faa360227",
    "InstanceType": "g4dn.xlarge",
    "Namespace": "kube-system",
    "NodeName": "ip-192-168-6-219.ec2.internal",
    "PodName": "gpu-burn",
    "Sources": [
        "dcgm",
        "pod",
        "calculated"
    ],
    "Timestamp": "1760375344178",
    "Type": "ContainerGPU",
    "UUID": "GPU-60efa417-4d26-c4ba-9e62-66249559952d",
    "Version": "0",
    "kubernetes": {
        "container_name": "main",
        "containerd": {
            "container_id": "5bfc51b6805d8bdc96e34f262394ae2702cc5d55ad186c660acbef414aa86223"
        },
        "host": "ip-192-168-6-219.ec2.internal",
        "labels": {
            "app": "gpu-burn",
            "pod-template-hash": "577f5d7468"
        },
        "pod_name": "gpu-burn-577f5d7468-4j54s",
        "pod_owners": [
            {
                "owner_kind": "Deployment",
                "owner_name": "gpu-burn"
            }
        ]
    },
    "container_gpu_memory_total": {
        "Values": [
            16006027360
        ],
        "Counts": [
            60
        ],
        "Max": 16006027360,
        "Min": 16006027360,
        "Count": 60,
        "Sum": 982473768960
    },
    "container_gpu_memory_used": {
        "Values": [
            0,
            176060768,
            245366784,
            14254342144,
            253755392,
            111149056,
            207608048,
            251658240
        ],
        "Counts": [
            8,
            1,
            1,
            46,
            1,
            1,
            1,
            1
        ],
        "Max": 14254342144,
        "Min": 0,
        "Count": 60,
        "Sum": 656945446912
    },
    "container_gpu_memory_utilization": {
        "Values": [
            1.185,
            0.9862,
            90.0607,
            1.609,
            0.6948,
            1.3572000000000002,
            1.5559999999999998,
            0
        ],
        "Counts": [
            1,
            1,
            46,
            1,
            1,
            1,
            1,
            8
        ],
        "Max": 90.0607,
        "Min": 0,
        "Count": 60,
        "Sum": 4150.226400000004
    },
    "container_gpu_power_draw": {
        "Values": [
            32.662,
            70.563,
            69.099,
            32.760,
            69.49,
            33.549,
            69.978,
            69.197,
            33.844,
            63.907,
            65.919,
            70.368,
            70.27,
            38.921,
            69.435,
            68.360,
            69.88,
            70.173,
            68.318,
            70.119,
            67.872,
            70.466,
            65.626,
            67.97,
            69.826,
            32.859,
            33.352,
            70.660,
            70.075,
            33.253,
            69.294,
            69.587,
            68.904,
            38.429,
            82.459,
            69.685,
            69.392,
            68.849,
            69.782,
            68.458
        ],
        "Counts": [
            2,
            2,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            3,
            2,
            2,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            2,
            1
        ],
        "Max": 82.459,
        "Min": 32.662,
        "Count": 60,
        "Sum": 3748.8209999999995
    },
    "container_gpu_temperature": {
        "Values": [
            42,
            43,
            44
        ],
        "Counts": [
            12,
            32,
            16
        ],
        "Max": 44,
        "Min": 42,
        "Count": 60,
        "Sum": 2628
    },
    "container_gpu_utilization": {
        "Values": [
            96,
            6,
            8,
            14,
            58,
            0,
            64,
            9,
            89,
            7,
            100
        ],
        "Counts": [
            1,
            1,
            1,
            1,
            1,
            6,
            1,
            1,
            1,
            2,
            44
        ],
        "Max": 100,
        "Min": 0,
        "Count": 60,
        "Sum": 4858
    }
}

// MetricDeclarations is the list of rules to be used to set dimensions for exported metrics.
MetricDeclarations []*MetricDeclaration `mapstructure:"metric_declarations"`

// List of string denoting gaugage metric names required for aggregation into cw histogram
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Type gaugage -> Gauge

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@yanhaoluo666 yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch 5 times, most recently from 3e5c2d0 to 6ad6507 Compare October 16, 2025 10:59
@yanhaoluo666 yanhaoluo666 changed the title [awsemfexporter] Support gaugage to cloudwatch histogram convertion in EMF exporter [awsemfexporter] Support gauge to cloudwatch histogram convertion in EMF exporter Oct 16, 2025
@Aakash-Dantre Aakash-Dantre self-requested a review October 17, 2025 10:46
Aakash-Dantre
Aakash-Dantre previously approved these changes Oct 17, 2025
@dricross
Copy link

What is the configuration you used for the test? Which of the metrics are you aggregating?

Comment on lines 315 to 320
for val, cnt := range countMap {
hist.Values = append(hist.Values, val)
hist.Counts = append(hist.Counts, cnt)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd recommend pre-allocating hist.Values and hist.Counts to reduce the number of allocations. It can help quite a lot with performance.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! will revise it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will pre-allocate hist.Values and Counts in below format, correct me if i was wrong:

	// Pre-allocate slices to avoid multiple allocations during append
	hist.Values = make([]float64, 0, len(countMap))
	hist.Counts = make([]float64, 0, len(countMap))

if v > hist.Max {
hist.Max = v
}
countMap[v]++

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but if this is the core of the aggregation logic, then this isn't really creating a histogram. It's aggregating to a series of value/count pairs for each unique floating point value. Considering the precision of float64, I dont think we'll get much aggregation out of it.

If we wanted to create an actual histogram that aggregates a range of datapoints, we'd need define bucket where each bucket represents a range of values, store all of the incoming datapoints into those buckets, and then convert the buckets to values/counts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, in fact, we are aggregating to cloudwatch histogram instead of opentelemetry histogram. For cw histogram, it's exactly in the formact of values and counts, and cloudwatch backend would do the calcluation for percentile values e.g. P90.

@yanhaoluo666 yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch 4 times, most recently from 9406579 to a58c06e Compare October 20, 2025 11:42

updatedMetadata = replacePatternsIfNeeded(metadata, labels, config, patternReplaceSucceeded)
// For compacted gauge metrics, use ExponentialHistogram type to convert it to values and counts in PutLogEvent
updatedMetadata.metricDataType = pmetric.MetricTypeExponentialHistogram

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this setting needed to force some other logic to run? The data isn't actually a histogram. It's in values/counts but its still a gauge.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there was some special logic for converting values and counts in PutLogEvent, but it is actually not. Make it back to Gauge.

@yanhaoluo666 yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch from a58c06e to 95b4340 Compare October 20, 2025 19:10
@yanhaoluo666 yanhaoluo666 force-pushed the feature/aws-emf-exporter-aggregation branch from 95b4340 to 6b90657 Compare October 21, 2025 12:06
@yanhaoluo666 yanhaoluo666 requested a review from movence October 21, 2025 13:47
}

// compactGaugeMetrics converts a collection of gauge data points into a compact representation, e.g. values and counts.
func compactGaugeMetrics(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc it's not really compacting gauges but converting to CWHistogram. The name could be more aligned with what it actually does like convertGaugesToCWHistogram or something

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously I named it as buildHistogram, but from Rick's perspective, it should be a kind of compaction or deduplication, so renamed all occurrence of Histogram to Compact. Pasted the comment from Rick for reference:

Correct me if I'm wrong, but if this is the core of the aggregation logic, then this isn't really creating a histogram. It's aggregating to a series of value/count pairs for each unique floating point value. Considering the precision of float64, I dont think we'll get much aggregation out of it.

If we wanted to create an actual histogram that aggregates a range of datapoints, we'd need define bucket where each bucket represents a range of values, store all of the incoming datapoints into those buckets, and then convert the buckets to values/counts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants