
Parallelize microbenchmarks and run them more times#5313

Open
igoragoli wants to merge 17 commits into master from augusto/2544.flakiness

Conversation

@igoragoli (Contributor) commented Feb 4, 2026

⚠️ The PR description was modified to describe new changes.

  • TODO: Resolve all TODOs before merging to use production branches/images instead of testing ones.

What does this PR do?

  • Runs benchmarks 6 times (configurable via REPETITIONS in benchmarks/execution.yml) to reduce inter-run variability.
    • I had initially tested with 10 repetitions, which would push the CI job from 30 minutes (the current duration, without repetitions) to 50 minutes. I brought it down to 6 repetitions so the CI job stays at roughly 30 minutes.
    • If flakiness is still high, we can bump the number of repetitions or adjust the threshold in the analysis step over at benchmarking-platform, on the dd-trace-rb branch.
  • To prevent extremely long CI jobs, runs benchmarks in parallel with CPU isolation.
    • Since we have 24 available CPUs, the 13 benchmarks were split into two arbitrary groups. Each group is defined in benchmarks/execution.yml.
    • Every benchmark runs with 2 CPUs (configurable via CPUS_PER_BENCHMARK in the benchmarks/execution.yml job).
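The parallel execution with CPU isolation described above can be sketched in shell. This is an illustration only: `cpus_for_index` is a hypothetical stand-in for the helper in `run_all.sh`, and the CPU values are examples.

```shell
#!/usr/bin/env bash
# Sketch: give each benchmark its own slice of CPUs so parallel runs
# don't compete for cores. Illustrative names, not the real run_all.sh.
set -euo pipefail

# Pick the slice of CPUs for the benchmark at position `idx`, given a
# comma-separated CPU list and a per-benchmark CPU count.
cpus_for_index() {
  local cpu_list="$1" idx="$2" per="$3"
  local -a cpus
  IFS=',' read -r -a cpus <<< "$cpu_list"
  local start=$((idx * per))
  local slice="${cpus[start]}"
  for ((i = start + 1; i < start + per; i++)); do
    slice+=",${cpus[i]}"
  done
  echo "$slice"
}

# With CPUs 24..29 and 2 CPUs per benchmark, benchmark 1 gets CPUs 26,27.
# The real script would then pin the process with something like:
#   taskset -c "$slice" bundle exec ruby "$benchmark" &
cpus_for_index "24,25,26,27,28,29" 1 2   # prints: 26,27
```

In practice the slice is handed to `taskset -c`, which restricts the forked benchmark process to exactly those cores.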

Motivation:

https://datadoghq.atlassian.net/browse/APMSP-2544

Change log entry

None.

Additional Notes:

How to test the change?

Execution and reporting

Reducing flakiness
The effect of multiple repetitions and CPU isolation on result variability was tested and reported in this document: https://datadoghq.atlassian.net/wiki/x/egJ3cAE

25 out of ~45 scenarios were flaky before the fixes; 0 are flaky after them.

Those tests used 10 repetitions. While this PR introduces only 6, it should already bring flakiness down considerably.

github-actions bot commented Feb 4, 2026

Thank you for updating Change log entry section 👏

Visited at: 2026-02-20 09:48:24 UTC

datadog-official bot commented Feb 4, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 95.11% (+0.00%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 091684c | Docs | Datadog PR Page

@igoragoli changed the title from "[DO NOT MERGE] test: mitigations for benchmark stability" to "Reduce microbenchmark flakiness" on Feb 6, 2026
@igoragoli marked this pull request as ready for review on February 6, 2026 10:51
@igoragoli requested a review from a team as a code owner on February 6, 2026 10:51
pr-commenter bot commented Feb 6, 2026

Benchmarks

Benchmark execution time: 2026-02-23 08:27:15

Comparing candidate commit 091684c in PR branch augusto/2544.flakiness with baseline commit 1f516ae in branch master.

Found 0 performance improvements and 1 performance regression! Performance is the same for 45 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
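The rule above boils down to a single interval check: flag a change only when the whole CI clears the threshold band. Here is a sketch of that rule as described (my own illustration, not the benchmarking platform's code; all values are percentages):

```shell
# A change is significant only when the whole CI [lower, upper] lies
# entirely above +threshold or entirely below -threshold.
# Sketch of the rule described above, not the platform's actual code.
significant() {
  local lower="$1" upper="$2" threshold="$3"
  awk -v lo="$lower" -v hi="$upper" -v t="$threshold" \
    'BEGIN { exit !(lo > t || hi < -t) }'
}

significant 1.3 3.1 1.0 && echo "significant"        # CI entirely above +1%
significant -0.6 1.2 1.0 || echo "not significant"   # CI straddles 0%
```

Note that `lo > t` alone implies the whole interval sits above the threshold, since the upper bound is never below the lower bound.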

scenario:profiling - intern mixed existing and new

  • 🟥 throughput [-0.627op/s; -0.399op/s] or [-2.074%; -1.321%]

Comment on lines 172 to 174
  matrix:
    - BENCHMARKS: "profiling_allocation.rb profiling_gc.rb profiling_hold_resume_interruptions.rb profiling_http_transport.rb profiling_memory_sample_serialize.rb profiling_sample_loop_v2.rb profiling_sample_serialize.rb profiling_sample_gvl.rb profiling_string_storage_intern.rb"
    - BENCHMARKS: "error_tracking_simple.rb tracing_trace.rb di_instrument.rb library_gem_loading.rb"
Member

Could we keep this in the benchmarks folder?

In particular, it seems very easy to miss that any new benchmarks must always be added here, whereas before it was clear "there's a list of benchmarks in the script, and the script is in the same folder as the benchmarks, and the README documents this".

Contributor Author

Hey Ivo! Thanks for the review.

Some alternatives:

  1. Defining those lists of benchmarks in something like benchmarks/execution.yml and importing them in benchmarks.yml.
  2. Modifying the README to say that you must add new benchmarks to .gitlab-ci.yml, instead of to run_all.sh.

I prefer alternative 2, since adding new benchmarks with the introduced fixes requires knowing that you should split them across different jobs, to make sure there are enough CPUs for all of them.

I added some changes for alternative 2. I'm happy to revert them if you think other alternatives would be better.

Alternative 1's benchmarks/execution.yml could be something like this:

  .groups:
    - &profiling >-
        profiling_allocation.rb
        profiling_gc.rb
        profiling_hold_resume_interruptions.rb
        profiling_http_transport.rb
        profiling_memory_sample_serialize.rb
        profiling_sample_loop_v2.rb
        profiling_sample_serialize.rb
        profiling_sample_gvl.rb
        profiling_string_storage_intern.rb

    - &other >-
        error_tracking_simple.rb
        tracing_trace.rb
        di_instrument.rb
        library_gem_loading.rb

  .execution:
    variables:
      REPETITIONS: "10"
      CPU_AFFINITY: "24-47"
      CPUS_PER_BENCHMARK: "2"
    parallel:
      matrix:
        - BENCHMARKS:
            - *profiling
            - *other

And then in benchmarks.yml:

  include:
    - local: 'benchmarks/execution.yml'

  microbenchmarks:
    extends: .execution
    script:
      - ./benchmarks/run_all.sh

Member

I like the alternative suggestion, but then we need to signal in the docs how we do it 👍🏼

Contributor Author

Just to be sure - alternative 1, right? @Strech

Member

I kinda like benchmarks/execution.yml, but yeah I think if it's clear in the README what to update option 2 is probably fine as well

Contributor Author

I kept the benchmarks/execution.yml. Turned out to be neater. 406f32f

@ivoanjo (Member) commented Feb 6, 2026

I think the PR seems in good shape? My only question is -- I see the new jobs coming up in GitLab -- is there a way to check that the way we're exposing the files is still correct?

E.g. how do we test that we haven't broken our benchmark reporting code?

  for file in "${benchmark_array[@]}"; do
    local cpus
    cpus=$(get_cpus_for_benchmark "$cpu_ids_str" "$idx" "$cpus_per_benchmark")
    taskset -c "$cpus" bundle exec ruby "$SCRIPT_DIR/$file" &
Member

In my testing taskset does not seem to actually work, was this manually tested to function as expected?

Member

And if any of these executions fail, how does the process exit status get tracked and reported? Doesn't look like this is done at all?

Contributor Author

In my testing taskset does not seem to actually work, was this manually tested to function as expected?

Yes. I tested it with some dummy CI jobs.

And if any of these executions fail, how does the process exit status get tracked and reported? Doesn't look like this is done at all?

You're right, this is not done at all! If a benchmark fails, we won't know. I'm working on this tracking.

Contributor Author

This was fixed. The test checkboxes in the PR description cover this and link to results.
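For reference, a common shell pattern for tracking the exit status of parallel background jobs (a sketch of the general technique, not necessarily the code this PR landed) is to record each background PID and `wait` on them individually:

```shell
#!/usr/bin/env bash
# Run each command in the background, then wait on every PID and fail
# if any of them failed. Illustrative only, not the PR's actual code.
run_in_parallel() {
  local -a pids=()
  local cmd
  for cmd in "$@"; do
    bash -c "$cmd" &
    pids+=("$!")
  done
  local failed=0 pid
  for pid in "${pids[@]}"; do
    wait "$pid" || failed=1   # wait reports each job's exit status
  done
  return "$failed"
}

run_in_parallel "exit 0" "exit 0" && echo "all passed"
run_in_parallel "exit 0" "exit 1" || echo "at least one failed"
```

`wait "$pid"` (as opposed to a bare `wait`) returns the exit status of that specific job, which is what lets the wrapper propagate a benchmark failure to the CI job.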

@igoragoli (Contributor Author)

Hey @ivoanjo and @p-datadog, thank you for the reviews!

I'll answer Oleg on the conversation thread. To answer Ivo:

E.g. how do we test that we haven't broken our benchmark reporting code?

Great point, and while reports on the BP UI are working as expected, PR comments from one microbenchmarking job will overwrite the other. Results have to be combined somehow.

@igoragoli igoragoli force-pushed the augusto/2544.flakiness branch from 4fe6f54 to c7a672e Compare February 19, 2026 13:11
@igoragoli igoragoli changed the title Reduce microbenchmark flakiness Parallelize microbenchmarks and run them more times Feb 20, 2026
@igoragoli (Contributor Author)

Hi! For visibility, I re-requested reviews from everyone who had reviewed, since you each pointed out different fixes.

@ivoanjo (Member) left a comment

👍 LGTM

We're in the middle of a release so merging to master is blocked, it should be unblocked later today/by Monday.

At this point I don't see any reason not to give this a try; if some extra adjustment is needed, we'll do it as a follow-up PR.

Comment on lines 36 to +38
    # Reinstall a recent version of the trace to help Docker cache dependencies.
    # Bump this version periodically.
  - RUN gem install datadog -v 1.20.0
  + RUN gem install datadog -v 2.28.0
Member

Lol this was broken before and nobody noticed! There's no datadog 1.20.0, we renamed the library. Thanks for catching it.

@igoragoli (Contributor Author)

#5313 (comment) is awesome. We're comparing benchmarks from this branch against master, and since there were no code changes that should impact performance, we would indeed expect to see 0 improvements/regressions.

This corroborates the Ruby microbenchmark stability experiments, which showed that 10 repetitions removed all flakiness; 6 repetitions seem to be sufficient.


4 participants