
Parallelize microbenchmarks and run them more times#5313

Open
igoragoli wants to merge 17 commits into master from augusto/2544.flakiness

Conversation

@igoragoli (Contributor) commented Feb 4, 2026

⚠️ The PR description was modified to describe new changes.

  • TODO: Resolve all TODOs before merging to use production branches/images instead of testing ones.

What does this PR do?

  • Runs benchmarks 6 times (configurable via REPETITIONS in benchmarks/execution.yml) to reduce inter-run variability.
    • I had initially tested with 10 repetitions, which would push the CI job from 30 minutes (the current duration, without repetitions) to 50 minutes. I brought it down to 6 repetitions so the CI job stays at roughly 30 minutes.
    • If flakiness is still high, we can bump the number of repetitions or adjust the threshold in the analysis step over at benchmarking-platform, on the dd-trace-rb branch.
  • To prevent extremely long CI jobs, runs benchmarks in parallel with CPU isolation.
    • Since we have 24 available CPUs, the 13 benchmarks were split into two arbitrary groups. Each group is defined in benchmarks/execution.yml.
    • Every benchmark runs with 2 CPUs (configurable via CPUS_PER_BENCHMARK in the benchmarks/execution.yml job).
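The parallel execution with CPU isolation described above can be sketched in shell. This is an illustration only: `cpus_for_index` is a hypothetical stand-in for the helper in `run_all.sh`, and the CPU values are examples.

```shell
#!/usr/bin/env bash
# Sketch: give each benchmark its own slice of CPUs so parallel runs
# don't compete for cores. Illustrative names, not the real run_all.sh.
set -euo pipefail

# Pick the slice of CPUs for the benchmark at position `idx`, given a
# comma-separated CPU list and a per-benchmark CPU count.
cpus_for_index() {
  local cpu_list="$1" idx="$2" per="$3"
  local -a cpus
  IFS=',' read -r -a cpus <<< "$cpu_list"
  local start=$((idx * per))
  local slice="${cpus[start]}"
  for ((i = start + 1; i < start + per; i++)); do
    slice+=",${cpus[i]}"
  done
  echo "$slice"
}

# With CPUs 24..29 and 2 CPUs per benchmark, benchmark 1 gets CPUs 26,27.
# The real script would then pin the process with something like:
#   taskset -c "$slice" bundle exec ruby "$benchmark" &
cpus_for_index "24,25,26,27,28,29" 1 2   # prints: 26,27
```

In practice the slice is handed to `taskset -c`, which restricts the forked benchmark process to exactly those cores.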

Motivation:

https://datadoghq.atlassian.net/browse/APMSP-2544

Change log entry

None.

Additional Notes:

How to test the change?

Execution and reporting

Reducing flakiness
The effect of multiple repetitions and CPU isolation on result variability was tested and reported in this document: https://datadoghq.atlassian.net/wiki/x/egJ3cAE

25 out of ~45 scenarios were flaky before the fixes; 0 are flaky after them.

Those tests used 10 repetitions. While this PR introduces only 6, it should already bring flakiness down considerably.

github-actions bot commented Feb 4, 2026

Thank you for updating Change log entry section 👏

Visited at: 2026-02-20 09:48:24 UTC

datadog-official bot commented Feb 4, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 95.11% (+0.00%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 091684c | Docs | Datadog PR Page

@igoragoli changed the title from "[DO NOT MERGE] test: mitigations for benchmark stability" to "Reduce microbenchmark flakiness" on Feb 6, 2026
@igoragoli marked this pull request as ready for review on February 6, 2026 10:51
@igoragoli requested a review from a team as a code owner on February 6, 2026 10:51
pr-commenter bot commented Feb 6, 2026

Benchmarks

Benchmark execution time: 2026-02-23 08:27:15

Comparing candidate commit 091684c in PR branch augusto/2544.flakiness with baseline commit 1f516ae in branch master.

Found 0 performance improvements and 1 performance regression! Performance is the same for 45 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
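The rule above boils down to a single interval check: flag a change only when the whole CI clears the threshold band. Here is a sketch of that rule as described (my own illustration, not the benchmarking platform's code; all values are percentages):

```shell
# A change is significant only when the whole CI [lower, upper] lies
# entirely above +threshold or entirely below -threshold.
# Sketch of the rule described above, not the platform's actual code.
significant() {
  local lower="$1" upper="$2" threshold="$3"
  awk -v lo="$lower" -v hi="$upper" -v t="$threshold" \
    'BEGIN { exit !(lo > t || hi < -t) }'
}

significant 1.3 3.1 1.0 && echo "significant"        # CI entirely above +1%
significant -0.6 1.2 1.0 || echo "not significant"   # CI straddles 0%
```

Note that `lo > t` alone implies the whole interval sits above the threshold, since the upper bound is never below the lower bound.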

scenario:profiling - intern mixed existing and new

  • 🟥 throughput [-0.627op/s; -0.399op/s] or [-2.074%; -1.321%]

Comment on lines 172 to 174
  matrix:
    - BENCHMARKS: "profiling_allocation.rb profiling_gc.rb profiling_hold_resume_interruptions.rb profiling_http_transport.rb profiling_memory_sample_serialize.rb profiling_sample_loop_v2.rb profiling_sample_serialize.rb profiling_sample_gvl.rb profiling_string_storage_intern.rb"
    - BENCHMARKS: "error_tracking_simple.rb tracing_trace.rb di_instrument.rb library_gem_loading.rb"
Member

Could we keep this in the benchmarks folder?

In particular, it seems very easy to miss that any new benchmarks must always be added here, whereas before it was clear "there's a list of benchmarks in the script, and the script is in the same folder as the benchmarks, and the README documents this".

Contributor Author

Hey Ivo! Thanks for the review.

Some alternatives:

  1. Defining those lists of benchmarks in something like benchmarks/execution.yml and importing them in benchmarks.yml.
  2. Modifying the README to say that you must add new benchmarks to .gitlab-ci.yml, instead of to run_all.sh.

I prefer alternative 2, since adding new benchmarks with the introduced fixes requires knowing that you should split them across different jobs, to make sure there are enough CPUs for all of them.

I added some changes for alternative 2. I'm happy to revert them if you think other alternatives would be better.

Alternative 1's benchmarks/execution.yml could be something like this:

  .groups:
    - &profiling >-
        profiling_allocation.rb
        profiling_gc.rb
        profiling_hold_resume_interruptions.rb
        profiling_http_transport.rb
        profiling_memory_sample_serialize.rb
        profiling_sample_loop_v2.rb
        profiling_sample_serialize.rb
        profiling_sample_gvl.rb
        profiling_string_storage_intern.rb

    - &other >-
        error_tracking_simple.rb
        tracing_trace.rb
        di_instrument.rb
        library_gem_loading.rb

  .execution:
    variables:
      REPETITIONS: "10"
      CPU_AFFINITY: "24-47"
      CPUS_PER_BENCHMARK: "2"
    parallel:
      matrix:
        - BENCHMARKS:
            - *profiling
            - *other

And then in benchmarks.yml:

  include:
    - local: 'benchmarks/execution.yml'

  microbenchmarks:
    extends: .execution
    script:
      - ./benchmarks/run_all.sh

Member

I like the alternative suggestion, but then we need to signal in the docs how we do it 👍🏼

Contributor Author

Just to be sure - alternative 1, right? @Strech

Member

I kinda like benchmarks/execution.yml, but yeah I think if it's clear in the README what to update option 2 is probably fine as well

Contributor Author

I kept the benchmarks/execution.yml. Turned out to be neater. 406f32f

@ivoanjo (Member) commented Feb 6, 2026

I think the PR seems in good shape? My only question is -- I see the new jobs coming up in GitLab -- is there a way to check that the way we're exposing the files is still correct?

E.g. how do we test that we haven't broken our benchmark reporting code?

  for file in "${benchmark_array[@]}"; do
    local cpus
    cpus=$(get_cpus_for_benchmark "$cpu_ids_str" "$idx" "$cpus_per_benchmark")
    taskset -c "$cpus" bundle exec ruby "$SCRIPT_DIR/$file" &
Member

In my testing taskset does not seem to actually work, was this manually tested to function as expected?

Member

And if any of these executions fail, how does the process exit status get tracked and reported? Doesn't look like this is done at all?

Contributor Author

In my testing taskset does not seem to actually work, was this manually tested to function as expected?

Yes. I tested it with some dummy CI jobs.

And if any of these executions fail, how does the process exit status get tracked and reported? Doesn't look like this is done at all?

You're right, this is not done at all! If a benchmark fails, we won't know. I'm working on this tracking.

Contributor Author

This was fixed. The test checkboxes in the PR description cover this and link to results.
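For reference, a common shell pattern for tracking the exit status of parallel background jobs (a sketch of the general technique, not necessarily the code this PR landed) is to record each background PID and `wait` on them individually:

```shell
#!/usr/bin/env bash
# Run each command in the background, then wait on every PID and fail
# if any of them failed. Illustrative only, not the PR's actual code.
run_in_parallel() {
  local -a pids=()
  local cmd
  for cmd in "$@"; do
    bash -c "$cmd" &
    pids+=("$!")
  done
  local failed=0 pid
  for pid in "${pids[@]}"; do
    wait "$pid" || failed=1   # wait reports each job's exit status
  done
  return "$failed"
}

run_in_parallel "exit 0" "exit 0" && echo "all passed"
run_in_parallel "exit 0" "exit 1" || echo "at least one failed"
```

`wait "$pid"` (as opposed to a bare `wait`) returns the exit status of that specific job, which is what lets the wrapper propagate a benchmark failure to the CI job.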

@igoragoli (Contributor Author)

Hey @ivoanjo and @p-datadog, thank you for the reviews!

I'll answer Oleg on the conversation thread. To answer Ivo:

E.g. how do we test that we haven't broken our benchmark reporting code?

Great point, and while reports on the BP UI are working as expected, PR comments from one microbenchmarking job will overwrite the other. Results have to be combined somehow.

@igoragoli igoragoli force-pushed the augusto/2544.flakiness branch from 4fe6f54 to c7a672e Compare February 19, 2026 13:11
@igoragoli igoragoli changed the title Reduce microbenchmark flakiness Parallelize microbenchmarks and run them more times Feb 20, 2026
@igoragoli (Contributor Author)

Hi! For visibility, I re-requested reviews from everyone who had reviewed, since you each pointed out different fixes.

@ivoanjo (Member) left a comment

👍 LGTM

We're in the middle of a release so merging to master is blocked, it should be unblocked later today/by Monday.

At this point I don't see any reason not to give this a try; if some extra adjustment is needed, we'll do it as a follow-up PR.

Comment on lines 36 to +38
    # Reinstall a recent version of the trace to help Docker cache dependencies.
    # Bump this version periodically.
  - RUN gem install datadog -v 1.20.0
  + RUN gem install datadog -v 2.28.0
Member

Lol this was broken before and nobody noticed! There's no datadog 1.20.0, we renamed the library. Thanks for catching it.

@igoragoli (Contributor Author)

#5313 (comment) is awesome. We're comparing benchmarks from this branch against master, and since there were no code changes that should impact performance, we would indeed expect to see 0 improvements/regressions.

This corroborates the Ruby microbenchmark stability experiments, which showed that 10 repetitions removed all flakiness; 6 repetitions seem to be sufficient.


4 participants