
Conversation


@yanhaoluo666 yanhaoluo666 commented Oct 13, 2025

Note

This PR depends on PR 370 and PR 371; the go.mod file will be updated and the branch rebased once they are merged.

Description of the issue

Currently, GPU metrics are collected at a one-minute interval, which works well for most machine learning (ML) training jobs. However, for ML inference, where execution times can be as short as 2-3 seconds, this interval is insufficient.

Description of changes

This PR lets customers customize the GPU metrics collection interval by introducing a new configuration field. The changes are listed below:

  1. Introduce a new field, accelerated_compute_gpu_metrics_collection_interval, that lets customers specify the metrics collection interval; the default value is 60 (see the configuration sketch below).
  2. If a customer sets it to a value less than 60, the following changes take effect:
    2.1 the batch period of the batch processor changes from 5s to 60s;
    2.2 the groupbyattrs processor is added to the awscontainerinsights pipeline to compact metrics from the same resource;
    2.3 the GPU sampling frequency uses the configured value in the awscontainerinsights receiver (PR 370);
    2.4 all GPU metrics are compressed and converted to the CloudWatch histogram type in the emf exporter (PR 371).

We also tried providing keys to the groupbyattrs processor so that it compacts only GPU metrics, but that yields hardly any improvement for CPU and memory.
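
For reference, below is a minimal sketch of how the new field might be set in the agent configuration. This is an assumption modeled on the existing accelerated_compute_metrics option: the placement under logs > metrics_collected > kubernetes and the neighboring keys are not taken from this PR, and the value is assumed to be in seconds (so 1 would request per-second sampling).

{
    "logs": {
        "metrics_collected": {
            "kubernetes": {
                "enhanced_container_insights": true,
                "accelerated_compute_metrics": true,
                "accelerated_compute_gpu_metrics_collection_interval": 1
            }
        }
    }
}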

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Deployed this PR along with PR 370 and PR 371 to a personal EKS cluster.
  2. Spun up an ML job, then checked CloudWatch logs and metrics and confirmed:
    2.1 GPU metrics were sampled every second, i.e. there were 60 datapoints in each PutLogEvents call;
    2.2 GPU metrics were in CloudWatch histogram format.
  • Log sample
{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "GpuDevice",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_gpu_temperature",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_power_draw",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_used",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "cpipeline",
    "ContainerName": "main",
    "FullPodName": "gpu-burn-577f5d7468-4j54s",
    "GpuDevice": "nvidia0",
    "InstanceId": "i-0f01fff8faa360227",
    "InstanceType": "g4dn.xlarge",
    "Namespace": "kube-system",
    "NodeName": "ip-192-168-6-219.ec2.internal",
    "PodName": "gpu-burn",
    "Sources": [
        "dcgm",
        "pod",
        "calculated"
    ],
    "Timestamp": "1760375344178",
    "Type": "ContainerGPU",
    "UUID": "GPU-60efa417-4d26-c4ba-9e62-66249559952d",
    "Version": "0",
    "kubernetes": {
        "container_name": "main",
        "containerd": {
            "container_id": "5bfc51b6805d8bdc96e34f262394ae2702cc5d55ad186c660acbef414aa86223"
        },
        "host": "ip-192-168-6-219.ec2.internal",
        "labels": {
            "app": "gpu-burn",
            "pod-template-hash": "577f5d7468"
        },
        "pod_name": "gpu-burn-577f5d7468-4j54s",
        "pod_owners": [
            {
                "owner_kind": "Deployment",
                "owner_name": "gpu-burn"
            }
        ]
    },
    "container_gpu_memory_total": {
        "Values": [
            16006027360
        ],
        "Counts": [
            60
        ],
        "Max": 16006027360,
        "Min": 16006027360,
        "Count": 60,
        "Sum": 982473768960
    },
    "container_gpu_memory_used": {
        "Values": [
            0,
            176060768,
            245366784,
            14254342144,
            253755392,
            111149056,
            207608048,
            251658240
        ],
        "Counts": [
            8,
            1,
            1,
            46,
            1,
            1,
            1,
            1
        ],
        "Max": 14254342144,
        "Min": 0,
        "Count": 60,
        "Sum": 656945446912
    },
    "container_gpu_memory_utilization": {
        "Values": [
            1.185,
            0.9862,
            90.0607,
            1.609,
            0.6948,
            1.3572000000000002,
            1.5559999999999998,
            0
        ],
        "Counts": [
            1,
            1,
            46,
            1,
            1,
            1,
            1,
            8
        ],
        "Max": 90.0607,
        "Min": 0,
        "Count": 60,
        "Sum": 4150.226400000004
    },
    "container_gpu_power_draw": {
        "Values": [
            32.662,
            70.563,
            69.099,
            32.760,
            69.49,
            33.549,
            69.978,
            69.197,
            33.844,
            63.907,
            65.919,
            70.368,
            70.27,
            38.921,
            69.435,
            68.360,
            69.88,
            70.173,
            68.318,
            70.119,
            67.872,
            70.466,
            65.626,
            67.97,
            69.826,
            32.859,
            33.352,
            70.660,
            70.075,
            33.253,
            69.294,
            69.587,
            68.904,
            38.429,
            82.459,
            69.685,
            69.392,
            68.849,
            69.782,
            68.458
        ],
        "Counts": [
            2,
            2,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            3,
            2,
            2,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            2,
            1
        ],
        "Max": 82.459,
        "Min": 32.662,
        "Count": 60,
        "Sum": 3748.8209999999995
    },
    "container_gpu_temperature": {
        "Values": [
            42,
            43,
            44
        ],
        "Counts": [
            12,
            32,
            16
        ],
        "Max": 44,
        "Min": 42,
        "Count": 60,
        "Sum": 2628
    },
    "container_gpu_utilization": {
        "Values": [
            96,
            6,
            8,
            14,
            58,
            0,
            64,
            9,
            89,
            7,
            100
        ],
        "Counts": [
            1,
            1,
            1,
            1,
            1,
            6,
            1,
            1,
            1,
            2,
            44
        ],
        "Max": 100,
        "Min": 0,
        "Count": 60,
        "Sum": 4858
    }
}
  • Metrics graph (image attachment, not reproduced here)
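
For context on the log sample above: each GPU metric is emitted as a value/count distribution that summarizes the 60 per-second samples collected during the minute (the Values, Counts, Max, Min, Count, and Sum fields). Below is a minimal Go sketch of how raw samples could be folded into that shape; it only illustrates the format and is not the emf exporter's actual implementation from PR 371.

package main

import "fmt"

// distribution mirrors the histogram fields seen in the EMF log sample.
type distribution struct {
	Values []float64
	Counts []float64
	Max    float64
	Min    float64
	Count  float64
	Sum    float64
}

// buildDistribution folds per-second gauge samples into a value/count histogram.
// It assumes at least one sample is present.
func buildDistribution(samples []float64) distribution {
	d := distribution{Min: samples[0], Max: samples[0]}
	index := map[float64]int{} // value -> position in Values/Counts
	for _, v := range samples {
		if i, ok := index[v]; ok {
			d.Counts[i]++
		} else {
			index[v] = len(d.Values)
			d.Values = append(d.Values, v)
			d.Counts = append(d.Counts, 1)
		}
		if v > d.Max {
			d.Max = v
		}
		if v < d.Min {
			d.Min = v
		}
		d.Count++
		d.Sum += v
	}
	return d
}

func main() {
	// 60 one-second readings collapse into a handful of value/count pairs.
	samples := make([]float64, 0, 60)
	for i := 0; i < 60; i++ {
		samples = append(samples, float64(42+i%3))
	}
	fmt.Printf("%+v\n", buildDistribution(samples))
}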

Requirements

Before committing your code, please complete the following steps.

  1. Run make fmt and make fmt-sh. - done
  2. Run make lint. - done

Integration Tests

To run integration tests against this PR, add the ready for testing label.


if awscontainerinsight.AcceleratedComputeMetricsEnabled(conf) && enhancedContainerInsightsEnabled && awscontainerinsight.IsHighFrequencyGPUMetricsEnabled(conf) {
metricsToHistogram = append(metricsToHistogram, []string{
"container_gpu_utilization",
Contributor:

There are more metrics here that need to be added.
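
For illustration, a sketch of how the list might be extended to cover all six GPU metrics that appear in the log sample in the PR description; only the metric names come from this PR's test output, and the surrounding code is abbreviated from the diff context above.

if awscontainerinsight.AcceleratedComputeMetricsEnabled(conf) && enhancedContainerInsightsEnabled && awscontainerinsight.IsHighFrequencyGPUMetricsEnabled(conf) {
	// Convert every high-frequency GPU gauge into a CloudWatch histogram.
	metricsToHistogram = append(metricsToHistogram, []string{
		"container_gpu_utilization",
		"container_gpu_memory_utilization",
		"container_gpu_memory_used",
		"container_gpu_memory_total",
		"container_gpu_power_draw",
		"container_gpu_temperature",
	}...)
}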

@yanhaoluo666 yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch 4 times, most recently from a40f2b7 to 69ba416 Compare October 16, 2025 13:51
return metricDeclarations
}

func getGaugeMetricsToHistogram(conf *confmap.Conf) []string {
Contributor:

Nice

}

func IsHighFrequencyGPUMetricsEnabled(conf *confmap.Conf) bool {
return AcceleratedComputeMetricsEnabled(conf) &&
@spanaik spanaik (Contributor) commented Oct 16, 2025:

[nit] micro-optimization: Swap the EnhancedCI check before the AcceleratedCompute check.
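
A sketch of the suggested ordering; the function body is truncated in the diff context above, so the EnhancedContainerInsightsEnabled helper, the interval accessor, and the default constant are hypothetical names, not confirmed by this PR.

func IsHighFrequencyGPUMetricsEnabled(conf *confmap.Conf) bool {
	// Check the enhanced Container Insights flag first, as suggested, so the
	// remaining checks are short-circuited when it is disabled.
	// getGPUMetricsCollectionInterval and defaultGPUMetricsCollectionInterval
	// are hypothetical names standing in for the truncated original body.
	return EnhancedContainerInsightsEnabled(conf) &&
		AcceleratedComputeMetricsEnabled(conf) &&
		getGPUMetricsCollectionInterval(conf) < defaultGPUMetricsCollectionInterval
}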

@yanhaoluo666 yanhaoluo666 added the ready for testing Indicates this PR is ready for integration tests to run label Oct 17, 2025
@yanhaoluo666 yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch from 6c0f9d7 to acbbe17 Compare October 20, 2025 11:39
highFrequencyGPUMetricsEnabled := t.pipelineName == ciPipelineName && awscontainerinsight.IsHighFrequencyGPUMetricsEnabled(conf)
batchprocessorTelemetryKey := common.LogsKey
// Use 60s batch period for batch processor if high-frequency GPU metrics are enabled, otherwise use 5s
if highFrequencyGPUMetricsEnabled {
Contributor:

I'm unclear what this really does. Can you elaborate why?

Author:

Currently the batch period of the batchprocessor is 5s, and it is changed to 60s when high-frequency GPU metrics are enabled. That's all it does: common.LogsKey and common.MetricsKey denote the 5s and 60s batch periods respectively.
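
A minimal sketch of that selection, based on the diff context above; the assignment inside the if block is inferred from this explanation rather than copied from the diff.

// Default to the 5s batch period used for the logs pipeline.
batchprocessorTelemetryKey := common.LogsKey
// Switch to the 60s batch period when high-frequency GPU metrics are enabled,
// so a full minute of 1s GPU samples is batched together before export.
if highFrequencyGPUMetricsEnabled {
	batchprocessorTelemetryKey = common.MetricsKey
}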

@yanhaoluo666 yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch 6 times, most recently from 1e69ee2 to 31fa2a1 Compare October 23, 2025 16:42
@yanhaoluo666 yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch from 31fa2a1 to b9ed82e Compare October 23, 2025 16:47
