@williamchoe3
Contributor

@williamchoe3 williamchoe3 commented Dec 1, 2025

Previously, we didn't have an easy way to do full text search across CI test runs across different TC Build Configurations and branches. To do that, you would have to download the artifacts for what you wanted to search for.

This change adds a datadog package that uploads test.log files to Datadog during test cleanup on master and release branches. The implementation scans the log file serially and uses a worker pool to upload log entries in batches of 1000 using the Datadog API client. Each log entry is tagged with test metadata (test name, owner, cloud, platform, version) and includes attributes for higher cardinality fields (cluster name, build number, result, duration). See comments for more details.

The entry point for the Datadog upload in roachtest is the post-test step in test_runner.go. This change also adds a new roachtest flag, datadog-always-upload, for e2e testing on a non-release branch, and modifies the TC build scripts to pass new env vars and the TeamCity build properties file.

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Dec 1, 2025
@williamchoe3 williamchoe3 added the O-AI-Review-Real-Issue-Found AI reviewer found real issue label Dec 2, 2025
@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch from 690dece to 890f148 Compare December 4, 2025 16:43
@github-actions

github-actions bot commented Dec 4, 2025

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@williamchoe3
Contributor Author

williamchoe3 commented Dec 4, 2025

Details

The current approach is one-shot. The log parsing is done serially while the uploading is handled by 10 goroutines. Uploading all of a nightly run's test.log files serially took ~10 minutes. Given roachtest's parallelism, the actual impact should be a fraction of those 10 minutes.

Log Event Data Modeling

Best practice is to keep log event tags limited to fields of low to medium-high cardinality for search performance. Higher-cardinality fields are assigned as log attributes. More information: https://docs.datadoghq.com/getting_started/tagging/

All associated log attribute and tag information is stored in the following struct:

// LogMetadata contains the test metadata that will be associated with each log
// entry in Datadog. Fields in LogMetadata will be passed along as either tags
// or log entry attributes.
// Typically, low-cardinality fields are added as tags,
// high-cardinality fields are added as attributes.
type LogMetadata struct {
	TestName        string
	Result          string // PASS or FAIL
	Duration        string // Duration of the test in seconds
	Owner           string
	Cloud           string
	Platform        string // e.g., linux-amd64, linux-arm64
	Version         string // branch name e.g., master, release-25.1
	Cluster         string // roachprod cluster e.g. teamcity-20801641-1764720129-01-n4cpu4
	TCHost          string // Teamcity Agent Name e.g., gce-agent-nightlies-roachtest-20240520-no-preempt-43
	TCBuildConfName string // TeamCity Build Configuration name e.g. Roachtest Nightly - GCE (Bazel)
	TCBuildNumber   string // TeamCity Build Configuration execution instance
	LogName         string // e.g. test.log
	Tags            map[string]string
}
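
To make the tag vs. attribute split concrete, here is a small sketch of how such a struct could be turned into a comma-separated key:value tag string plus an attribute map. The trimmed-down `logMetadata` type, the `tags` and `attributes` helpers, and all key names (`test_name`, `tc_build_number`, and so on) are assumptions for illustration; they are not taken from the PR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// logMetadata is a trimmed-down, illustrative copy of the struct above.
type logMetadata struct {
	TestName, Owner, Cloud, Platform, Version string // lower cardinality: tags
	Cluster, TCBuildNumber, Result, Duration  string // higher cardinality: attributes
}

// tags renders the lower-cardinality fields as a sorted, comma-separated list
// of key:value pairs. Empty fields are skipped. Key names are hypothetical.
func tags(m logMetadata) string {
	kv := map[string]string{
		"test_name": m.TestName,
		"owner":     m.Owner,
		"cloud":     m.Cloud,
		"platform":  m.Platform,
		"version":   m.Version,
	}
	pairs := make([]string, 0, len(kv))
	for k, v := range kv {
		if v == "" {
			continue
		}
		pairs = append(pairs, k+":"+v)
	}
	sort.Strings(pairs) // map iteration order is random; sort for stable output
	return strings.Join(pairs, ",")
}

// attributes returns the higher-cardinality fields as a map that could be
// attached to each log entry. Key names are hypothetical.
func attributes(m logMetadata) map[string]string {
	return map[string]string{
		"cluster":         m.Cluster,
		"tc_build_number": m.TCBuildNumber,
		"result":          m.Result,
		"duration":        m.Duration,
	}
}

func main() {
	m := logMetadata{
		TestName: "acceptance/build-analyze",
		Cloud:    "gce",
		Platform: "linux-amd64",
		Version:  "master",
		Cluster:  "teamcity-20801641-1764720129-01-n4cpu4",
		Result:   "PASS",
	}
	fmt.Println(tags(m))
	fmt.Println(attributes(m)["cluster"])
}
```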

Datadog API Key

Previously, the API key was passed in as a CLI flag. The build configuration template below passes the key along as an env var so we can remove the CLI flag from roachtest, which seems like the more canonical approach to me. This is one of the new env vars being passed to the Docker container.
https://teamcity.cockroachdb.com/admin/editBuildParams.html?id=template:Cockroach_Nightlies_RoachtestNightlyDatadogTemplate#

Current Limitations & Considerations

Currently, mixedversion logs use a logger that prefixes all messages. My current regex doesn't account for this. The regex could be expanded to support these prefixes, or the logs could be sent by roachprod/logger itself; more information below.

[mixed-version-test/16_run-test-features] 2025/11/24 07:19:04 versionupgrade.go:145: "ObjectAccess": OK
[mixed-version-test/16_run-test-features] 2025/11/24 07:19:04 versionupgrade.go:138: running feature test "JSONB"
[mixed-version-test/16_run-test-features] 2025/11/24 07:19:04 helper.go:420: running SQL

Multiline log messages, e.g. those written by logSQL(), are also not captured as a single log entry: https://github.com/cockroachdb/cockroach/blob/6708f9d2585c40fb5c20836dae85cb87a448031d/pkg/cmd/roachtest/roachtestutil/mixedversion/helper.go

Node:      4 (v25.3.5)
Tenant:    system
Statement: SET CLUSTER SETTING cluster.preserve_downgrade_option = $1
Arguments: [25.3]

Next Steps

The next step is to extend this to other log files and roachprod by modifying or extending the roachprod logger package.

Introduce capturing log batches in github.com/cockroachdb/cockroach/pkg/roachprod/logger directly. At a high level, this would keep a copy of log messages in memory up to a certain batch_size and then upload them to Datadog when that batch size is hit.

  • Can add a tag to differentiate log file sources, e.g. test.log vs test-teardown.log.
  • Multiline logs could be captured as a single log entry with a new logger method that creates a single log entry object for Datadog but still writes multiple lines to the log file for readability.
  • Look into agents. I don't think this is the right approach for us, especially since we need to tag and filter and I'm not sure how that would work, but I'll look into it. For a continuously running service like roachprod centralized, this sounds better.

With these limitations and considerations in mind, I'd still like to start ingesting logs to see, in a broader sense, whether this is helpful for triage or for looking at trends.

@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch 2 times, most recently from 366f3f1 to a309f35 Compare December 9, 2025 19:23
@blathers-crl

blathers-crl bot commented Dec 9, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@williamchoe3
Contributor Author

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/20837415?hideTestsFromDependencies=false&hideProblemsFromDependencies=false&expandBuildDeploymentsSection=false
From the new datadog.log:

datadog: 2025/12/10 01:32:33 datadog.go:373: parsed log file /artifacts/acceptance/build-analyze/run_1/test.log, skipped 0 lines
datadog: 2025/12/10 01:32:33 datadog.go:307: uploaded batch of 15 entries (total so far: 15)
datadog: 2025/12/10 01:32:33 datadog.go:393: successfully uploaded 15 log entries from /artifacts/acceptance/build-analyze/run_1/test.log
datadog: 2025/12/10 01:32:33 datadog.go:394: failed to upload 0 log entries from /artifacts/acceptance/build-analyze/run_1/test.log

Make sure to adjust the time picker; it defaults to 15 minutes.
https://us5.datadoghq.com/logs?query=service%3Aroachtest&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice&messageDisplay=inline&refresh_mode=sliding&storage=flex_tier&stream_sort=desc&viz=stream&from_ts=1765320196540&to_ts=1765406596540&live=true

@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch from 5fee6e7 to bce23c1 Compare December 10, 2025 22:58
@williamchoe3 williamchoe3 changed the title from "roachtest: datadog integration for test.log" to "roachtest: run ingests test log to datadog" Dec 10, 2025
@williamchoe3 williamchoe3 changed the title from "roachtest: run ingests test log to datadog" to "roachtest: add Datadog integration for test log ingestion" Dec 10, 2025
@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch from bce23c1 to 06e1a1f Compare December 10, 2025 23:14
@williamchoe3
Contributor Author

williamchoe3 commented Dec 10, 2025

image

Unfortunately, I ripped out the Datadog API flag logic I had previously before noticing the incorrect attributes, and it doesn't look like my TC Build Configuration DD_API_KEY changes are taking effect. I will debug the duration and the missing cluster.os after DD_API_KEY starts getting passed. The build_configuration was using the wrong prefix; I will confirm that as well.

@williamchoe3 williamchoe3 marked this pull request as ready for review December 10, 2025 23:16
@williamchoe3 williamchoe3 requested review from a team as code owners December 10, 2025 23:16
@williamchoe3 williamchoe3 requested review from herkolategan and shailendra-patel and removed request for a team December 10, 2025 23:16
@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch 3 times, most recently from 7e4b616 to 7e72cbc Compare December 11, 2025 16:12
@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch from 7e72cbc to 379fb6d Compare December 18, 2025 21:35
Contributor

@golgeek golgeek left a comment

Looks great! Few comments here and there.

if tcBuildPropertyFile != "" {
	file, err := os.Open(tcBuildPropertyFile)
	if err != nil {
		l.Printf("failed to open teamcity build properties file: %s", err)
Contributor

Is it expected to keep processing if you cannot open the file?
If so, I'd add a comment mentioning that reading the file to extract tags from it is optional or something.

defer resp.Body.Close()
// Datadog returns 202 Accepted on success
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
	return errors.Newf("unexpected status code %d from Datadog API", resp.StatusCode)
Contributor

Should we consider a retry mechanism for temporary errors?

Contributor Author

The Datadog API client gives us retries with exponential backoff, which is why I opted to use it instead of hitting the endpoint directly.

const numWorkers = 10
const batchSize = 1000 // Datadog max batch size

g := ctxgroup.WithContext(ctx)
Contributor

If I followed the code path correctly, the context is derived (multiple times) from the test_runner context, which means that if it gets cancelled in the test runner, it will get cancelled here.

I'm wondering if this is what we actually want or if we shouldn't keep uploading logs even if the context gets cancelled by the parent. Maybe we should have a dedicated timeout here to ensure we're not blocking forever?

I don't have the full picture, so maybe you thought about it and it's OK.

Contributor Author

@williamchoe3 williamchoe3 Dec 19, 2025

> if it gets cancelled in the test runner, it will get cancelled here

I did not think about that. We're doing this in the post-test step, so I'm not sure what scenario would cause the context to be cancelled (besides a TC timeout, perhaps?), but I'm going to assume there is some scenario in which the parent context gets cancelled. I think I'm OK with everything getting cancelled if the test runner sends the cancel.

> Maybe we should have a dedicated timeout here to ensure we're not blocking forever?

I'll add this, thanks for the catch. Added to MaybeUploadTestLog, which is this function's caller.

// Start worker goroutines to upload batches concurrently
for i := 0; i < numWorkers; i++ {
	g.GoCtx(func(ctx context.Context) error {
		for batch := range batches {
Contributor

That being said, since you don't check for context cancellation here, you will iterate over all batches whether or not the context is cancelled or times out.

You could switch to this if you want to honor cancellation:

for {
	select {
	case batch, ok := <-batches:
		if !ok {
			return nil
		}
		...
	case <-ctx.Done():
		return ctx.Err() // or return nil, depending on desired behavior
	}
}

Contributor Author

Thx for the catch, added, will run a small smoke

@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch from 0e7b826 to de7cb80 Compare December 19, 2025 21:32
@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch from de7cb80 to b56f8b5 Compare January 5, 2026 16:45
@williamchoe3
Contributor Author

williamchoe3 commented Jan 5, 2026

Ripped out unnecessary "test.log" variable I initially added to minimize potential backport conflicts, would be nice to have a single place with all the consts used by roachtest, but can do that in a separate pr

@williamchoe3 williamchoe3 added the backport-all Flags PRs that need to be backported to all supported release branches label Jan 5, 2026
@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch 2 times, most recently from c197bba to 774f739 Compare January 5, 2026 20:32
Previously, test.log artifacts were only available on TeamCity's UI.

This change adds a datadog package that uploads test.log files to Datadog
during test cleanup on master and release branches. The implementation scans
the log file serially and uses a worker pool to upload log entries in batches
of 1000 using the Datadog API client. Each log entry is tagged with test
metadata (test name, owner, cloud, platform, version) and includes attributes
for higher cardinality fields (cluster name, build number, result, duration).

This enables full-text search across roachtest runs and improves observability
for test triage and failure analysis.

Epic: None
Release note: None
@williamchoe3 williamchoe3 force-pushed the wchoe/roachtest-test-log-datadog branch from 774f739 to 562cdfa Compare January 6, 2026 15:58
@github-actions

github-actions bot commented Jan 6, 2026

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@williamchoe3
Contributor Author

Also, I'd argue that the Claude bot's comment is overly defensive but not incorrect, not going to make the suggested change

@williamchoe3
Contributor Author

tftr :)
bors r=golgeek

craig bot pushed a commit that referenced this pull request Jan 6, 2026
158528: roachtest: add Datadog integration for test log ingestion r=golgeek a=williamchoe3

160526: sql: fix TestUnsplitRanges to work with external test tenants r=rafiss a=rafiss

Previously, TestUnsplitRanges was skipped in external test tenant mode because it scanned meta ranges directly and performed AdminSplit/ AdminUnsplit operations that external tenants cannot do.

This commit fixes the test by:
1. Using the system layer's DB for meta range operations (scanning meta ranges, checking sticky bits, splitting ranges) since tenants cannot access meta ranges directly.
2. Using the application layer's DB for table data operations which tenants can access.
3. Granting the CanAdminUnsplit capability to the external tenant so the GC job can unsplit ranges after dropping tables/indexes.

Resolves: #142388
Epic: CRDB-48944

Release note: None

160549: build: remove unused publish script for no-telemetry release r=celiala a=rail

Remove the unused script, because we no longer use it.

Epic: none
Release note: none

Co-authored-by: William Choe
Co-authored-by: Rafi Shamim
Co-authored-by: Rail Aliiev
@craig
Contributor

craig bot commented Jan 6, 2026

Build failed (retrying...):

@craig craig bot merged commit 4785105 into cockroachdb:master Jan 6, 2026
25 of 26 checks passed
@blathers-crl

blathers-crl bot commented Jan 6, 2026

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 562cdfa to blathers/backport-release-24.3-158528: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-24.3 failed. See errors above.


error creating merge commit from 562cdfa to blathers/backport-release-25.2-158528: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.2 failed. See errors above.


error creating merge commit from 562cdfa to blathers/backport-release-25.3-158528: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.3 failed. See errors above.


error creating merge commit from 562cdfa to blathers/backport-release-25.4-158528: POST https://api.github.com/repos/williamchoe3/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch release-25.4 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels

  • backport-all: Flags PRs that need to be backported to all supported release branches
  • backport-failed
  • o-AI-Review-Potential-Issue-Detected: AI reviewer found potential issue. Never assign manually; auto-applied by GH action only.
  • O-AI-Review-Real-Issue-Found: AI reviewer found real issue
  • target-release-26.2.0

3 participants