feat: reimplement history data export #1642
Merged: cristiangreco merged 7 commits into prometheus-community:master from woehrl01:historical-data on Sep 22, 2025
Commits
6401f0f feat: reimplement history data export (woehrl01)
e089bca fix: use struct directly instead of referenes to reduce gc preassure (woehrl01)
ce858fb improve comments and add example (woehrl01)
1d3536c rename to exportAllDataPoints and added test (woehrl01)
96e5285 correctly handle nil value and timestamp combinations (woehrl01)
52f7b4e Code review feedback (kgeckhart)
88404b3 Merge pull request #1 from kgeckhart/historical-data (woehrl01)
@@ -0,0 +1,29 @@
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/SQS
      regions:
        - us-east-1
      period: 60
      length: 300
      addCloudwatchTimestamp: true
      exportAllDataPoints: true
      metrics:
        - name: NumberOfMessagesSent
          statistics: [Sum]
        - name: NumberOfMessagesReceived
          statistics: [Sum]
        - name: NumberOfMessagesDeleted
          statistics: [Sum]
        - name: ApproximateAgeOfOldestMessage
          statistics: [Average]
        - name: NumberOfEmptyReceives
          statistics: [Sum]
        - name: SentMessageSize
          statistics: [Average]
        - name: ApproximateNumberOfMessagesNotVisible
          statistics: [Sum]
        - name: ApproximateNumberOfMessagesDelayed
          statistics: [Sum]
        - name: ApproximateNumberOfMessagesVisible
          statistics: [Sum]
I feel this new behaviour should be used only if `exportAllDataPoints` is enabled. If not, we should keep the previous behaviour and avoid copying around the additional values/timestamps from the CW response.

This is required because this function is not aware of any configuration. In my opinion, this is the cleanest implementation. See the comments from @kgeckhart in the linked PR.

I think Kyle's comment was mainly around allowing the client to return a slice of Datapoint + Timestamp, but he can confirm.
My concern is around useless memory allocations when `exportAllDataPoints` is not enabled. I think the option can be made available to the AWS client, e.g. via `CloudwatchData` or `GetMetricDataProcessingParams`.

I'm pretty sure there is no significant useless memory allocation. All the values are already in memory because they are part of the API response, so we just allocate a slightly bigger slice. Because we use a slice of structs and not a slice of references to structs, there is a single memory allocation for the underlying array; everything else is just copying values over. So this just keeps the already returned data in memory for a few function calls longer, to unify the struct across v1 and v2.
It feels like an unnecessary addition of complexity to pass those parameters around and add additional tests to validate that behaviour.

Following up on my earlier comment: even if we assume returning a slice of 5 structs increases memory usage, the overhead is still reasonable, around 128 MB for 1 million metrics. That comes from storing 4 additional data points per metric, each at 32 bytes (a `time.Time` value + a `*float64` pointer). It's just a larger backing array and a few more pointers to scan during GC. This only becomes relevant if we're allocating and retaining millions of these slices, which we're not.
And if we were dealing with that many, querying 1 million CloudWatch metrics, the real issue wouldn't be memory. Even with 5 data points per request, that's 200K metric fetches, costing ~$2 per query. At that scale, CloudWatch costs and API limits are a far bigger concern than a ~100 MB memory difference.

We run YACE in a stateless multi-tenant fashion, so querying a very large number of metrics in a short period of time happens often. I was concerned about the potential memory overhead of this change and had shared that with Cristian.
This wasn't the area I was concerned about initially, as we keep our length + period setup so that it produces only a single datapoint as much as possible. I do agree with Cristian: if we stick to mapping only a single datapoint when the setting is disabled, it will ensure there's minimal overhead for those who upgrade to latest without using the feature.
I was primarily concerned about the overhead of switching from a single datapoint to a slice for our use case. I don't think we should do anything about it now. I wrote a `OneOrMany[T]` that benchmarks nicely vs a single-entry slice and would PR it separately if needed.

@kgeckhart I just blindly merged your suggested changes yesterday to get this PR off my plate. Still, I had a look today at what you changed around your memory concerns. Looking at the current changes, I don't see any change that would reduce memory usage at all. Despite your suggestion, it still creates a slice every time, and it always creates it at full size for all data points. There is now just added complexity around passing a flag along to stop the loop early, which you could argue reduces CPU overhead, but that is likely negligible at the scale where the CloudWatch costs would explode.
Before we move into premature optimisations, have you actually measured and proved that this is an issue? (See my comments above for why I doubt that, based on the numbers.) You mention that you are running this at large scale. Maybe you can run the initial PR side by side for 1-5% of your workload (depending on the scale) and share some real-world memory impact?

@woehrl01 thanks for taking a look at it and providing feedback.
This was an oversight on my part, as I was hastily making the change before I needed to go to an event yesterday 😅. My intention was to make sure to allocate exactly what is necessary and nothing more, which I can do with another minor change.
IMO the added complexity is rather minimal and easily testable, which is why I went through with the change.
When Cristian brought it up I didn't quite get it at first, because we do run it at scale, but we do it in such a way that we should only ever get one data point back. We do this intentionally for performance reasons: why ask for data you won't use? I don't know if it was Cristian's intent, but my realization was that introducing this feature should not have a noticeable negative impact on the larger community just by upgrading. This exporter is embedded into places like https://grafana.com/docs/alloy/latest/reference/components/prometheus/prometheus.exporter.cloudwatch/ which I know is used by customers at a scale large enough to incur some rather hefty CloudWatch costs.
Going from a single data point to a slice presents some memory increase that is unlikely to be noticeable for most. We don't know how many people are setting up their configs with a length that is larger than their period (length > period = more than 1 data point) and to what degree they are doing it (2x, 3x, 10x?). This is part of the complexity of building the configs: CloudWatch has some incredibly odd behaviors for different metrics, so the configs get cargo-culted in a way that is not optimal but works.
Could this guard be unnecessary? Yes, but the complexity of adding it feels acceptable as a means of providing a stable experience to existing users.
Since I messed up the DCO and need another change, we will have to figure out what to do with this PR. But first it would probably be good to have @cristiangreco take a look to make sure there's nothing further.

@kgeckhart @cristiangreco what's the plan with this PR?
Don't get me wrong, I understand we're all busy and this is OSS after all, so no pressure here... ;) I would just highly appreciate a short indication from you of whether and when you plan to move ahead with this PR, so that we can make plans on our end as well.
From what I see, this PR seems to be waiting on some action from your side, and there is nothing left that the community could help with? If that is wrong and there is still something to do, please let me know; I'm happy to contribute as well.