Enable opt-in for high frequency GPU metrics #1893
	UDP = "udp"
	TCP = "tcp"
Updated from Udp to UDP since make lint failed: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/18476654488/job/52642571583
if awscontainerinsight.AcceleratedComputeMetricsEnabled(conf) && enhancedContainerInsightsEnabled && awscontainerinsight.IsHighFrequencyGPUMetricsEnabled(conf) {
	metricsToHistogram = append(metricsToHistogram, []string{
		"container_gpu_utilization",
There are more metrics here that need to be added.
	return metricDeclarations
}

func getGaugeMetricsToHistogram(conf *confmap.Conf) []string {
Nice
}

func IsHighFrequencyGPUMetricsEnabled(conf *confmap.Conf) bool {
	return AcceleratedComputeMetricsEnabled(conf) &&
[nit] micro-optimization: Move the EnhancedCI check ahead of the AcceleratedCompute check.
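The reordering suggested in this nit relies on Go's short-circuit evaluation of `&&`: when the left operand is false, the right operand is never evaluated. The following is only a sketch under stated assumptions. The `settings` struct, the two check functions, and `bothEnabled` are hypothetical stand-ins for the real `confmap.Conf` helpers, and the call counter exists purely to show which checks run.

```go
package main

import "fmt"

// settings is a hypothetical stand-in for the agent configuration;
// the real code reads these flags from *confmap.Conf.
type settings struct {
	enhancedCI         bool
	acceleratedCompute bool
}

// calls records how many times each check was evaluated.
var calls = map[string]int{}

// enhancedCIEnabled is a hypothetical check for enhanced Container Insights.
func enhancedCIEnabled(s settings) bool {
	calls["enhancedCI"]++
	return s.enhancedCI
}

// acceleratedComputeEnabled is a hypothetical check for accelerated compute metrics.
func acceleratedComputeEnabled(s settings) bool {
	calls["acceleratedCompute"]++
	return s.acceleratedCompute
}

// bothEnabled evaluates the enhanced CI check first, so the second
// check is skipped whenever enhanced CI is disabled.
func bothEnabled(s settings) bool {
	return enhancedCIEnabled(s) && acceleratedComputeEnabled(s)
}

func main() {
	s := settings{enhancedCI: false, acceleratedCompute: true}
	fmt.Println("enabled:", bothEnabled(s))
	fmt.Println("enhancedCI checks:", calls["enhancedCI"])
	fmt.Println("acceleratedCompute checks:", calls["acceleratedCompute"])
}
```

Whether the swap saves anything measurable depends on how expensive each real check is, which this sketch does not model.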
	highFrequencyGPUMetricsEnabled := t.pipelineName == ciPipelineName && awscontainerinsight.IsHighFrequencyGPUMetricsEnabled(conf)
	batchprocessorTelemetryKey := common.LogsKey
	// Use 60s batch period for batch processor if high-frequency GPU metrics are enabled, otherwise use 5s
	if highFrequencyGPUMetricsEnabled {
I'm unclear what this really does. Can you elaborate on why it is needed?
Currently the batch period of the batch processor is 5s; it changes to 60s when high-frequency GPU metrics are enabled. That is all there is to it: common.LogsKey and common.MetricsKey denote 5s and 60s respectively.
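The selection described in this reply can be sketched as follows. This is a minimal illustration, not the agent's translator code: the constants `logsKey`, `metricsKey`, and `ciPipelineName` and their string values are hypothetical stand-ins, and the 5s/60s mapping is taken only from the reply above.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for common.LogsKey, common.MetricsKey and the
// Container Insights pipeline name; the real string values may differ.
const (
	logsKey        = "logs"
	metricsKey     = "metrics"
	ciPipelineName = "containerinsights"
)

// batchPeriods mirrors the mapping stated in the reply:
// the logs key denotes 5s and the metrics key denotes 60s.
var batchPeriods = map[string]time.Duration{
	logsKey:    5 * time.Second,
	metricsKey: 60 * time.Second,
}

// batchTelemetryKey picks the metrics key only for the Container Insights
// pipeline with high-frequency GPU metrics enabled; otherwise the logs key.
func batchTelemetryKey(pipelineName string, highFreqGPUEnabled bool) string {
	if pipelineName == ciPipelineName && highFreqGPUEnabled {
		return metricsKey
	}
	return logsKey
}

// batchPeriodFor resolves the batch period for a pipeline.
func batchPeriodFor(pipelineName string, highFreqGPUEnabled bool) time.Duration {
	return batchPeriods[batchTelemetryKey(pipelineName, highFreqGPUEnabled)]
}

func main() {
	fmt.Println(batchPeriodFor(ciPipelineName, false)) // prints 5s
	fmt.Println(batchPeriodFor(ciPipelineName, true))  // prints 1m0s
}
```

With one-second sampling, a 60s batch period means each flush carries roughly a minute of samples rather than five seconds' worth.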
Note
This PR depends on PR370 and PR371; I will update the go.mod file and rebase once they are merged.
Description of the issue
Currently, GPU metrics are collected at a one-minute interval, which works well for most machine learning (ML) training jobs. However, for ML inference, where execution times can be as short as 2-3 seconds, this interval is insufficient.
Description of changes
This PR lets customers customize the GPU metrics collection interval by introducing a new configuration field. The changes are listed below:
accelerated_compute_gpu_metrics_collection_interval lets the customer set the metrics collection interval; the default value is 60.
2.1 the batch period of the batch processor changes from 5s to 60s;
2.2 a groupbyattrs processor is added to the awscontainerinsights pipeline to compact metrics from the same resource;
2.3 the GPU sampling frequency uses the configured value in the awscontainerinsights receiver (PR 370);
2.4 all GPU metrics are compressed and converted to the CloudWatch histogram type in the EMF exporter (PR 371).
We also tried providing keys to the groupbyattrs processor so that it compacts only GPU metrics, but this brought hardly any improvement in CPU and memory usage.
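To make the compression step concrete, here is a rough sketch of how a minute of one-second gauge samples could be folded into a value/count summary. It is an illustration only: the `histogram` type, its field names, and `toHistogram` are assumptions made for this example and do not reflect the EMF exporter's actual schema or the code in PR 371.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is an illustrative value/count summary. The field names are
// assumptions for this sketch, not the EMF exporter's real schema.
type histogram struct {
	Values []float64 // distinct sample values, ascending
	Counts []int     // occurrences of each value in Values
	Min    float64
	Max    float64
	Sum    float64
	Count  int
}

// toHistogram folds raw gauge samples into distinct values with counts.
func toHistogram(samples []float64) histogram {
	h := histogram{}
	if len(samples) == 0 {
		return h
	}
	freq := make(map[float64]int)
	h.Min, h.Max = samples[0], samples[0]
	for _, v := range samples {
		freq[v]++
		h.Sum += v
		h.Count++
		if v < h.Min {
			h.Min = v
		}
		if v > h.Max {
			h.Max = v
		}
	}
	for v := range freq {
		h.Values = append(h.Values, v)
	}
	sort.Float64s(h.Values)
	for _, v := range h.Values {
		h.Counts = append(h.Counts, freq[v])
	}
	return h
}

func main() {
	// 60 one-second utilization samples: 40 idle, 15 at 50%, 5 at 100%.
	samples := make([]float64, 0, 60)
	for i := 0; i < 60; i++ {
		switch {
		case i < 40:
			samples = append(samples, 0)
		case i < 55:
			samples = append(samples, 50)
		default:
			samples = append(samples, 100)
		}
	}
	h := toHistogram(samples)
	fmt.Println(h.Values, h.Counts, h.Count)
}
```

In this sketch 60 samples collapse to three value/count pairs, which is the kind of saving that makes per-second sampling affordable when utilization values repeat.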
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
2.1 GPU metrics were sampled every second, i.e. there were 60 datapoints in each PutLogEvent call;
2.2 GPU metrics were in CloudWatch histogram format.
Requirements
Before committing your code, please do the following steps.
make fmt and make fmt-sh. - done
make lint. - done
Integration Tests
To run integration tests against this PR, add the ready for testing label.