
Conversation


@yanhaoluo666 yanhaoluo666 commented Oct 13, 2025

Note

This PR depends on PR 370 and PR 371; the go.mod file will be updated and the branch rebased once they are merged.

Description of the issue

Currently, GPU metrics are collected at a one-minute interval, which works well for most machine learning (ML) training jobs. However, for ML inference, where execution times can be as short as 2-3 seconds, this interval is insufficient.

Description of changes

This PR lets customers customize the GPU metrics collection interval by introducing a new configuration field. The changes are listed below:

  1. Introduce a new field, accelerated_compute_gpu_metrics_collection_interval, that lets customers specify the metrics collection interval; the default value is 60 (see the configuration sketch below).
  2. If a customer sets it to a value less than 60, the following changes take effect:
    2.1 the batch period of the batch processor changes from 5s to 60s;
    2.2 the groupbyattrs processor is added to the awscontainerinsights pipeline to compact metrics from the same resource;
    2.3 the GPU sampling frequency uses the configured value in the awscontainerinsights receiver (PR 370);
    2.4 all GPU metrics are compressed and converted to the CloudWatch histogram type in the emf exporter (PR 371).

We also tried providing keys to the groupbyattrs processor so that it compacts only GPU metrics, but that yields hardly any improvement for CPU and memory.
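
For reference, below is a minimal sketch of how the new field might be set in the agent configuration. This is an assumption modeled on the existing accelerated_compute_metrics option: the placement under logs > metrics_collected > kubernetes and the neighboring keys are not taken from this PR, and the value is assumed to be in seconds (so 1 would request per-second sampling).

{
    "logs": {
        "metrics_collected": {
            "kubernetes": {
                "enhanced_container_insights": true,
                "accelerated_compute_metrics": true,
                "accelerated_compute_gpu_metrics_collection_interval": 1
            }
        }
    }
}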

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Deployed this PR along with PR 370 and PR 371 to a personal EKS cluster.
  2. Spun up an ML job, then checked CloudWatch logs and metrics and confirmed:
    2.1 GPU metrics were sampled every second, i.e. there were 60 datapoints in each PutLogEvents call;
    2.2 GPU metrics were in CloudWatch histogram format.
  • Log sample
{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "GpuDevice",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_gpu_temperature",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_power_draw",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_used",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "cpipeline",
    "ContainerName": "main",
    "FullPodName": "gpu-burn-577f5d7468-4j54s",
    "GpuDevice": "nvidia0",
    "InstanceId": "i-0f01fff8faa360227",
    "InstanceType": "g4dn.xlarge",
    "Namespace": "kube-system",
    "NodeName": "ip-192-168-6-219.ec2.internal",
    "PodName": "gpu-burn",
    "Sources": [
        "dcgm",
        "pod",
        "calculated"
    ],
    "Timestamp": "1760375344178",
    "Type": "ContainerGPU",
    "UUID": "GPU-60efa417-4d26-c4ba-9e62-66249559952d",
    "Version": "0",
    "kubernetes": {
        "container_name": "main",
        "containerd": {
            "container_id": "5bfc51b6805d8bdc96e34f262394ae2702cc5d55ad186c660acbef414aa86223"
        },
        "host": "ip-192-168-6-219.ec2.internal",
        "labels": {
            "app": "gpu-burn",
            "pod-template-hash": "577f5d7468"
        },
        "pod_name": "gpu-burn-577f5d7468-4j54s",
        "pod_owners": [
            {
                "owner_kind": "Deployment",
                "owner_name": "gpu-burn"
            }
        ]
    },
    "container_gpu_memory_total": {
        "Values": [
            16006027360
        ],
        "Counts": [
            60
        ],
        "Max": 16006027360,
        "Min": 16006027360,
        "Count": 60,
        "Sum": 982473768960
    },
    "container_gpu_memory_used": {
        "Values": [
            0,
            176060768,
            245366784,
            14254342144,
            253755392,
            111149056,
            207608048,
            251658240
        ],
        "Counts": [
            8,
            1,
            1,
            46,
            1,
            1,
            1,
            1
        ],
        "Max": 14254342144,
        "Min": 0,
        "Count": 60,
        "Sum": 656945446912
    },
    "container_gpu_memory_utilization": {
        "Values": [
            1.185,
            0.9862,
            90.0607,
            1.609,
            0.6948,
            1.3572000000000002,
            1.5559999999999998,
            0
        ],
        "Counts": [
            1,
            1,
            46,
            1,
            1,
            1,
            1,
            8
        ],
        "Max": 90.0607,
        "Min": 0,
        "Count": 60,
        "Sum": 4150.226400000004
    },
    "container_gpu_power_draw": {
        "Values": [
            32.662,
            70.563,
            69.099,
            32.760,
            69.49,
            33.549,
            69.978,
            69.197,
            33.844,
            63.907,
            65.919,
            70.368,
            70.27,
            38.921,
            69.435,
            68.360,
            69.88,
            70.173,
            68.318,
            70.119,
            67.872,
            70.466,
            65.626,
            67.97,
            69.826,
            32.859,
            33.352,
            70.660,
            70.075,
            33.253,
            69.294,
            69.587,
            68.904,
            38.429,
            82.459,
            69.685,
            69.392,
            68.849,
            69.782,
            68.458
        ],
        "Counts": [
            2,
            2,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            3,
            2,
            2,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            2,
            1
        ],
        "Max": 82.459,
        "Min": 32.662,
        "Count": 60,
        "Sum": 3748.8209999999995
    },
    "container_gpu_temperature": {
        "Values": [
            42,
            43,
            44
        ],
        "Counts": [
            12,
            32,
            16
        ],
        "Max": 44,
        "Min": 42,
        "Count": 60,
        "Sum": 2628
    },
    "container_gpu_utilization": {
        "Values": [
            96,
            6,
            8,
            14,
            58,
            0,
            64,
            9,
            89,
            7,
            100
        ],
        "Counts": [
            1,
            1,
            1,
            1,
            1,
            6,
            1,
            1,
            1,
            2,
            44
        ],
        "Max": 100,
        "Min": 0,
        "Count": 60,
        "Sum": 4858
    }
}
  • Metrics graph (image attachment, not reproduced here)
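
For context on the log sample above: each GPU metric is emitted as a value/count distribution that summarizes the 60 per-second samples collected during the minute (the Values, Counts, Max, Min, Count, and Sum fields). Below is a minimal Go sketch of how raw samples could be folded into that shape; it only illustrates the format and is not the emf exporter's actual implementation from PR 371.

package main

import "fmt"

// distribution mirrors the histogram fields seen in the EMF log sample.
type distribution struct {
	Values []float64
	Counts []float64
	Max    float64
	Min    float64
	Count  float64
	Sum    float64
}

// buildDistribution folds per-second gauge samples into a value/count histogram.
// It assumes at least one sample is present.
func buildDistribution(samples []float64) distribution {
	d := distribution{Min: samples[0], Max: samples[0]}
	index := map[float64]int{} // value -> position in Values/Counts
	for _, v := range samples {
		if i, ok := index[v]; ok {
			d.Counts[i]++
		} else {
			index[v] = len(d.Values)
			d.Values = append(d.Values, v)
			d.Counts = append(d.Counts, 1)
		}
		if v > d.Max {
			d.Max = v
		}
		if v < d.Min {
			d.Min = v
		}
		d.Count++
		d.Sum += v
	}
	return d
}

func main() {
	// 60 one-second readings collapse into a handful of value/count pairs.
	samples := make([]float64, 0, 60)
	for i := 0; i < 60; i++ {
		samples = append(samples, float64(42+i%3))
	}
	fmt.Printf("%+v\n", buildDistribution(samples))
}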

Requirements

Before committing your code, please complete the following steps.

  1. Run make fmt and make fmt-sh. - done
  2. Run make lint. - done

Integration Tests

To run integration tests against this PR, add the ready for testing label.


if awscontainerinsight.AcceleratedComputeMetricsEnabled(conf) && enhancedContainerInsightsEnabled && awscontainerinsight.IsHighFrequencyGPUMetricsEnabled(conf) {
metricsToHistogram = append(metricsToHistogram, []string{
"container_gpu_utilization",
Contributor:

There are more metrics here that need to be added.
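
For illustration, a sketch of how the list might be extended to cover all six GPU metrics that appear in the log sample in the PR description; only the metric names come from this PR's test output, and the surrounding code is abbreviated from the diff context above.

if awscontainerinsight.AcceleratedComputeMetricsEnabled(conf) && enhancedContainerInsightsEnabled && awscontainerinsight.IsHighFrequencyGPUMetricsEnabled(conf) {
	// Convert every high-frequency GPU gauge into a CloudWatch histogram.
	metricsToHistogram = append(metricsToHistogram, []string{
		"container_gpu_utilization",
		"container_gpu_memory_utilization",
		"container_gpu_memory_used",
		"container_gpu_memory_total",
		"container_gpu_power_draw",
		"container_gpu_temperature",
	}...)
}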

@yanhaoluo666 yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch 4 times, most recently from a40f2b7 to 69ba416 Compare October 16, 2025 13:51
return metricDeclarations
}

func getGaugeMetricsToHistogram(conf *confmap.Conf) []string {
Contributor:

Nice

}

func IsHighFrequencyGPUMetricsEnabled(conf *confmap.Conf) bool {
return AcceleratedComputeMetricsEnabled(conf) &&
@spanaik spanaik (Contributor) commented Oct 16, 2025:

[nit] micro-optimization: Swap the EnhancedCI check before the AcceleratedCompute check.
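
A sketch of the suggested ordering; the function body is truncated in the diff context above, so the EnhancedContainerInsightsEnabled helper, the interval accessor, and the default constant are hypothetical names, not confirmed by this PR.

func IsHighFrequencyGPUMetricsEnabled(conf *confmap.Conf) bool {
	// Check the enhanced Container Insights flag first, as suggested, so the
	// remaining checks are short-circuited when it is disabled.
	// getGPUMetricsCollectionInterval and defaultGPUMetricsCollectionInterval
	// are hypothetical names standing in for the truncated original body.
	return EnhancedContainerInsightsEnabled(conf) &&
		AcceleratedComputeMetricsEnabled(conf) &&
		getGPUMetricsCollectionInterval(conf) < defaultGPUMetricsCollectionInterval
}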

@yanhaoluo666 yanhaoluo666 added the ready for testing Indicates this PR is ready for integration tests to run label Oct 17, 2025
@yanhaoluo666 yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch from 6c0f9d7 to acbbe17 Compare October 20, 2025 11:39
highFrequencyGPUMetricsEnabled := t.pipelineName == ciPipelineName && awscontainerinsight.IsHighFrequencyGPUMetricsEnabled(conf)
batchprocessorTelemetryKey := common.LogsKey
// Use 60s batch period for batch processor if high-frequency GPU metrics are enabled, otherwise use 5s
if highFrequencyGPUMetricsEnabled {
Contributor:

I'm unclear what this really does. Can you elaborate why?

Author:

Currently the batch period of the batchprocessor is 5s, and it is changed to 60s when high-frequency GPU metrics are enabled. That's all it does: common.LogsKey and common.MetricsKey denote the 5s and 60s batch periods respectively.
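
A minimal sketch of that selection, based on the diff context above; the assignment inside the if block is inferred from this explanation rather than copied from the diff.

// Default to the 5s batch period used for the logs pipeline.
batchprocessorTelemetryKey := common.LogsKey
// Switch to the 60s batch period when high-frequency GPU metrics are enabled,
// so a full minute of 1s GPU samples is batched together before export.
if highFrequencyGPUMetricsEnabled {
	batchprocessorTelemetryKey = common.MetricsKey
}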

@yanhaoluo666 yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch 6 times, most recently from 1e69ee2 to 31fa2a1 Compare October 23, 2025 16:42
@yanhaoluo666 yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch from 31fa2a1 to b9ed82e Compare October 23, 2025 16:47
