Fix nightly CI regressions, flaky tests, and Windows ccache from #14478 by xingbowang · Pull Request #14508 · facebook/rocksdb

xingbowang · 2026-03-25T13:39:57Z

Summary

Fix build failures, flaky tests, and Windows ccache issues exposed by PR #14478.

1. Release build (NDEBUG) compile error

Job: build-linux-release-with-folly

The SyncPoint cleanup listener in testharness.cc referenced SyncPoint::GetInstance() unconditionally, but SyncPoint is only declared in debug builds. Wrapped with #ifndef NDEBUG.

2. Folly-lite link error

Job: build-linux-cmake-with-folly-lite

The USE_FOLLY_LITE cmake path unconditionally linked -lglog, which fails with lld (added in #14478) when glog isn't installed. Use find_library() to link only when available.

3. Flaky tests — leaked `Env::Default()` thread pool state

Sharded execution runs multiple tests per process. Several tests in db_compaction_test.cc modified the global Env::Default() BOTTOM thread pool without resetting. With a leaked BOTTOM pool, subsequent tests' last-level compactions get forwarded to the bottom pool, freeing the LOW thread to pick additional compactions unexpectedly.

Fix: add TearDown() override to DBCompactionTest that captures default thread pool sizes in the constructor and restores them after every test. This is more robust than per-test cleanup because:

It runs even when a test fails (gtest calls TearDown after assertion failures)
It catches all current and future leakers without per-test maintenance

4. Windows ccache — 0.26% hit rate → should be ~99%

The Windows nightly build takes 57 minutes because ccache has near-zero hit rate. Two issues:

Cache key: The hendrikmuhs/ccache-action used a timestamp-based key, so each nightly run created a unique key and never found the previous run's cache (No cache found.). Fixed by using a stable key ccache-windows-<workflow> with prefix-based restore.
Compiler check: Missing compiler_check=content setting, so MSVC path/version changes between runners invalidated all cache entries. Added compiler_check=content (same fix as macOS in Accelerate build and test infrastructure #14478).

Test Plan

Release and debug builds compile cleanly
70 consecutive shuffled runs of db_compaction_test (367 tests each) pass — 50 on full cores + 20 on 4 cores
Format check passes
Windows ccache fix requires CI run to verify (first run populates cache, second run should see high hit rate)

github-actions · 2026-03-25T13:41:49Z

✅ clang-tidy: No findings on changed lines

Completed in 273.8s.

github-actions · 2026-03-25T14:55:16Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit ae9db09

Confirmed: ReleaseCompactionDuringManifestWrite sets LOW=3 and BOTTOM=3 but never resets them. The TearDown fix catches this.

Now I have all the context needed. Here's the review:

Summary

A well-targeted fix addressing four distinct CI regressions. The changes are correct, low-risk, and well-scoped. One suggestion regarding the conditional glog linking deserves attention.

Issues Found

🟡 Suggestion: Conditional glog linking may mask real link failures

File: CMakeLists.txt:1140-1143
Perspective: Correctness & Cross-Component

The find_library(GLOG_LIBRARY glog) approach silently skips glog if not found. However, the folly-lite sources (especially SafeAssert.cpp, Exception.cpp) may reference glog symbols transitively. On the build-linux-cmake-with-folly-lite CI job, glog is passed via -DCMAKE_CXX_FLAGS=-DGLOG_USE_GLOG_EXPORT, suggesting it IS expected to be present. If glog is genuinely absent in the lld environment, this fix works. But if a future environment has glog headers but not the library, you'd get confusing compile-succeeds-link-fails errors from folly internals.

Suggested improvement: Add a status message so failures are diagnosable:

find_library(GLOG_LIBRARY glog)
if(GLOG_LIBRARY)
  list(APPEND THIRDPARTY_LIBS glog)
else()
  message(STATUS "glog not found — USE_FOLLY_LITE will build without glog")
endif()

🟡 Suggestion: TearDown resets LOW and HIGH but tests may also modify them

File: db/db_compaction_test.cc:125-133
Perspective: Correctness & Edge Cases

The TearDown correctly resets BOTTOM=0, LOW=1, HIGH=1 to match the values DBTestBase sets in its constructor (db/db_test_util.cc:87-88). However, at least one test (ReleaseCompactionDuringManifestWrite, line 11368) sets LOW=3 without cleanup. The TearDown catches this — good.

One subtlety: DBTestBase sets threads on env_ (the SpecialEnv wrapper), while this TearDown sets them on Env::Default() directly. Since SpecialEnv inherits from EnvWrapper and delegates SetBackgroundThreads to the target (include/rocksdb/env.h:1754-1755), these are equivalent when env_ wraps Env::Default(). This is correct but could be fragile if a test ever uses a non-default base env. Consider adding a comment noting this equivalence.

🟢 Nitpick: `#ifndef NDEBUG` guard also excludes the include

File: test_util/testharness.cc:16-18
Perspective: Code Quality

Wrapping #include "test_util/sync_point.h" inside #ifndef NDEBUG is functional but unconventional — the header itself already handles NDEBUG internally by making macros no-ops. The include is guarded here because the code references SyncPoint::GetInstance() directly (not via macros), and the class definition doesn't exist in release builds (test_util/sync_point.h:55-60). This is correct and necessary.

Cross-Component Analysis

Context	Assessment
Release builds (NDEBUG)	Fixed — SyncPoint guard correctly prevents compile error
CMake with lld	Fixed — conditional glog avoids link failure when absent
Sharded test execution	Fixed — TearDown prevents thread pool state leakage across tests
Windows CI caching	Fixed — stable key + content-based compiler check should dramatically improve hit rate
Make-based builds	Unaffected — `folly.mk` still unconditionally links glog, which is fine since `make` uses `ld` not `lld`
Non-DBCompactionTest suites	Unaffected — TearDown is scoped to `DBCompactionTest` only

Positive Observations

The TearDown approach is strictly better than per-test cleanup: it's defensive against both current and future test leakers, and it runs even after assertion failures.
The PR description is exceptionally thorough with clear root-cause analysis for each fix.
The 70-run shuffled validation (50 full + 20 on 4 cores) is strong evidence the flaky test fix is robust.
The ccache fix (stable key + compiler_check=content) follows the pattern already established for macOS in the setup-ccache action, maintaining cross-platform consistency.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

Fix two build failures introduced by the accelerate-build PR (facebook#14478): 1. **Release build (NDEBUG) compile error**: The SyncPoint cleanup listener in testharness.cc referenced SyncPoint::GetInstance() which is only declared in debug builds. Guard the listener with #ifndef NDEBUG since it's only needed for debug test sharding anyway. 2. **Folly-lite link error**: The USE_FOLLY_LITE cmake path unconditionally linked -lglog, but glog may not be installed as a system library. Use find_library() to only link glog when available. Co-authored-by: Navi <navi@navibot.dev>

meta-codesync · 2026-03-26T22:53:53Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98380757.

meta-cla bot added the CLA Signed label Mar 25, 2026

xingbowang force-pushed the navi/fix-nightly-regressions branch from d2c8f32 to ce80912 Compare March 25, 2026 14:09

xingbowang force-pushed the navi/fix-nightly-regressions branch from ce80912 to b01a041 Compare March 25, 2026 17:29

xingbowang changed the title ~~Fix nightly CI regressions from #14478~~ Fix nightly CI regressions and flaky tests from #14478 Mar 25, 2026

xingbowang force-pushed the navi/fix-nightly-regressions branch from b01a041 to 8fd634a Compare March 25, 2026 17:45

xingbowang force-pushed the navi/fix-nightly-regressions branch from 8fd634a to ae9db09 Compare March 26, 2026 01:18

xingbowang changed the title ~~Fix nightly CI regressions and flaky tests from #14478~~ Fix nightly CI regressions, flaky tests, and Windows ccache from #14478 Mar 26, 2026

xingbowang force-pushed the navi/fix-nightly-regressions branch from 1a1d04c to ae9db09 Compare March 26, 2026 02:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix nightly CI regressions, flaky tests, and Windows ccache from #14478#14508

Fix nightly CI regressions, flaky tests, and Windows ccache from #14478#14508
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:navi/fix-nightly-regressions

xingbowang commented Mar 25, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

meta-codesync bot commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xingbowang commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Release build (NDEBUG) compile error

2. Folly-lite link error

3. Flaky tests — leaked Env::Default() thread pool state

4. Windows ccache — 0.26% hit rate → should be ~99%

Test Plan

Uh oh!

github-actions bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ clang-tidy: No findings on changed lines

Uh oh!

github-actions bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Claude Code Review

Summary

Issues Found

🟡 Suggestion: Conditional glog linking may mask real link failures

🟡 Suggestion: TearDown resets LOW and HIGH but tests may also modify them

🟢 Nitpick: #ifndef NDEBUG guard also excludes the include

Cross-Component Analysis

Positive Observations

Uh oh!

meta-codesync bot commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xingbowang commented Mar 25, 2026 •

edited

Loading

3. Flaky tests — leaked `Env::Default()` thread pool state

github-actions bot commented Mar 25, 2026 •

edited

Loading

github-actions bot commented Mar 25, 2026 •

edited

Loading

🟢 Nitpick: `#ifndef NDEBUG` guard also excludes the include