-
Notifications
You must be signed in to change notification settings - Fork 4.1k
roachtest: add Datadog integration for test log ingestion #158528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: add Datadog integration for test log ingestion #158528
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
690dece to
890f148
Compare
Potential Bug(s) DetectedThe three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Next Steps: Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
|
DetailsCurrent approach is one shot. The log parsing is done serially while the uploading is handled by 10 goroutines. Uploading all of an entire nightly run's Log Event Data ModelingBest practice is to keep log event tags limited to low to medium high cardinality fields for search performance. Higher cardinality fields will be assigned as log attributes. More information: https://docs.datadoghq.com/getting_started/tagging/ All associated log attribute and tag information is stored in the following struct Datadog API KeyPreviously, the API Key is being passed in as a CLI Flag. The below build configuration template will pass the key along as an env var so we can remove the CLI flag from roachtest which seems like the more canonical approach to me. This is one of the new env vars being passed to the Docker container. Current Limitations & ConsiderationsCurrently mixedversion logs uses a logger that prefixes all messages. My current regex doesn't account for this. The regex could be expanded to support these prefixes, or buy having logs being sent by Any multiline comments i.e. Next StepsNext Step is to extend this to other log files and roachprod by modifying or extending the roachprod logger package. Introduce capturing log batches in
With these limitations and considerations in mind, I'd still like to start ingesting logs to see on a broader sense if this is helpful for triage or for looking at trends. |
366f3f1 to
a309f35
Compare
|
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Make sure to adjust the time picker, it defaults to 15 minutes |
5fee6e7 to
bce23c1
Compare
bce23c1 to
06e1a1f
Compare
7e4b616 to
7e72cbc
Compare
7e72cbc to
379fb6d
Compare
golgeek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Few comments here and there.
| if tcBuildPropertyFile != "" { | ||
| file, err := os.Open(tcBuildPropertyFile) | ||
| if err != nil { | ||
| l.Printf("failed to open teamcity build properties file: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected to keep processing if you cannot open the file?
If so, I'd add a comment mentioning that reading the file to extract tags from it is optional or something.
| defer resp.Body.Close() | ||
| // Datadog returns 202 Accepted on success | ||
| if resp.StatusCode < 200 || resp.StatusCode >= 300 { | ||
| return errors.Newf("unexpected status code %d from Datadog API", resp.StatusCode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider a retry mechanism for temporary errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The datadog api client gives us retry + exponential backoff which is why I opted to use it instead of hitting the endpoint directly
pkg/cmd/roachtest/datadog/datadog.go
Outdated
| const numWorkers = 10 | ||
| const batchSize = 1000 // Datadog max batch size | ||
|
|
||
| g := ctxgroup.WithContext(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I followed the code path correctly, the context is derived (multiple times) from the test_runner context, which means that if it gets cancelled in the test runner, it will get cancelled here.
I'm wondering if this is what we actually want or if we shouldn't keep uploading logs even if the context gets cancelled by the parent. Maybe we should have a dedicated timeout here to ensure we're not blocking forever?
I don't have the full picture, so maybe you thought about it and it's OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it gets cancelled in the test runner, it will get cancelled here
Did not think about that. We're doing this in the post test step, so not sure what scenario would cause the context to be cancelled (besides a TC timeout perhaps?), but going to assume there's some scenario in which the parent context gets cancelled. I think I'm ok with everything getting cancelled if the test runner sends the cancel
Maybe we should have a dedicated timeout here to ensure we're not blocking forever?
I'll add this, thanks for the catch. Added to MaybeUploadTestLog which is this functions caller
pkg/cmd/roachtest/datadog/datadog.go
Outdated
| // Start worker goroutines to upload batches concurrently | ||
| for i := 0; i < numWorkers; i++ { | ||
| g.GoCtx(func(ctx context.Context) error { | ||
| for batch := range batches { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, since you don't check for context cancellation here, you will iterate over all batches whether the context is cancelled/timeouts or not.
You could switch to this if you want to honor:
switch {
case batch, ok := <-batches:
if !ok { return nil }
...
case <-ctx.Done():
ctx.Err(), depending on desired behavior
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for the catch, added, will run a small smoke
0e7b826 to
de7cb80
Compare
de7cb80 to
b56f8b5
Compare
|
Ripped out unnecessary "test.log" variable I initially added to minimize potential backport conflicts, would be nice to have a single place with all the consts used by roachtest, but can do that in a separate pr |
c197bba to
774f739
Compare
Previously, test.log artifacts were only available on TeamCity's UI. This change adds a datadog package that uploads test.log files to Datadog during test cleanup on master and release branches. The implementation scans the log file serially and uses a worker pool to upload log entries in batches of 1000 using the Datadog API client. Each log entry is tagged with test metadata (test name, owner, cloud, platform, version) and includes attributes for higher cardinality fields (cluster name, build number, result, duration). This enables full-text search across roachtest runs and improves observability for test triage and failure analysis. Epic: None Release note: None
774f739 to
562cdfa
Compare
|
Switched to task instead of ctxgroup (linter) |
Potential Bug(s) DetectedThe three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation. Next Steps: Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary. After you review the findings, please tag the issue as follows:
|
|
Also, I'd argue that the Claude bot's comment is overly defensive but not incorrect, not going to make the suggested change |
|
tftr :) |
158528: roachtest: add Datadog integration for test log ingestion r=golgeek a=williamchoe3 Previously, we didn't have an easy way to do full text search across CI test runs across different TC Build Configurations and branches. To do that, you would have to download the artifacts for what you wanted to search for. This change adds a datadog package that uploads test.log files to Datadog during test cleanup on master and release branches. The implementation scans the log file serially and uses a worker pool to upload log entries in batches of 1000 using the Datadog API client. Each log entry is tagged with test metadata (test name, owner, cloud, platform, version) and includes attributes for higher cardinality fields (cluster name, build number, result, duration). See comments for more details. The entry point for datadog upload in `roachtest` will be during the post step test in `test_runner.go.` Added a new roachtest flag `datadog-always-upload` for e2e testing on a non release branch. Modified TC build scripts to pass new env vars and the teamcity build properties file. 160526: sql: fix TestUnsplitRanges to work with external test tenants r=rafiss a=rafiss Previously, TestUnsplitRanges was skipped in external test tenant mode because it scanned meta ranges directly and performed AdminSplit/ AdminUnsplit operations that external tenants cannot do. This commit fixes the test by: 1. Using the system layer's DB for meta range operations (scanning meta ranges, checking sticky bits, splitting ranges) since tenants cannot access meta ranges directly. 2. Using the application layer's DB for table data operations which tenants can access. 3. Granting the CanAdminUnsplit capability to the external tenant so the GC job can unsplit ranges after dropping tables/indexes. Resolves: #142388 Epic: CRDB-48944 Release note: None 160549: build: remove unused publish script for no-telemetry release r=celiala a=rail Remove the unused script, because we no longer use it. Epic: none Release note: none Co-authored-by: William Choe <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Rail Aliiev <[email protected]>
|
Build failed (retrying...): |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 562cdfa to blathers/backport-release-24.3-158528: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch release-24.3 failed. See errors above. error creating merge commit from 562cdfa to blathers/backport-release-25.2-158528: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch release-25.2 failed. See errors above. error creating merge commit from 562cdfa to blathers/backport-release-25.3-158528: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch release-25.3 failed. See errors above. error creating merge commit from 562cdfa to blathers/backport-release-25.4-158528: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch release-25.4 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |

Previously, we didn't have an easy way to do full text search across CI test runs across different TC Build Configurations and branches. To do that, you would have to download the artifacts for what you wanted to search for.
This change adds a datadog package that uploads test.log files to Datadog during test cleanup on master and release branches. The implementation scans the log file serially and uses a worker pool to upload log entries in batches of 1000 using the Datadog API client. Each log entry is tagged with test metadata (test name, owner, cloud, platform, version) and includes attributes for higher cardinality fields (cluster name, build number, result, duration). See comments for more details.
The entry point for datadog upload in
roachtestwill be during the post step test intest_runner.go.Added a new roachtest flagdatadog-always-uploadfor e2e testing on a non release branch. Modified TC build scripts to pass new env vars and the teamcity build properties file.