Skip to content

Fix nightly CI regressions, flaky tests, and Windows ccache from #14478#14508

Closed
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:navi/fix-nightly-regressions
Closed

Fix nightly CI regressions, flaky tests, and Windows ccache from #14478#14508
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:navi/fix-nightly-regressions

Conversation

@xingbowang
Copy link
Copy Markdown
Contributor

@xingbowang xingbowang commented Mar 25, 2026

Summary

Fix build failures, flaky tests, and Windows ccache issues exposed by PR #14478.

1. Release build (NDEBUG) compile error

Job: build-linux-release-with-folly

The SyncPoint cleanup listener in testharness.cc referenced SyncPoint::GetInstance() unconditionally, but SyncPoint is only declared in debug builds. Wrapped with #ifndef NDEBUG.

2. Folly-lite link error

Job: build-linux-cmake-with-folly-lite

The USE_FOLLY_LITE cmake path unconditionally linked -lglog, which fails with lld (added in #14478) when glog isn't installed. Use find_library() to link only when available.

3. Flaky tests — leaked Env::Default() thread pool state

Sharded execution runs multiple tests per process. Several tests in db_compaction_test.cc modified the global Env::Default() BOTTOM thread pool without resetting. With a leaked BOTTOM pool, subsequent tests' last-level compactions get forwarded to the bottom pool, freeing the LOW thread to pick additional compactions unexpectedly.

Fix: add TearDown() override to DBCompactionTest that captures default thread pool sizes in the constructor and restores them after every test. This is more robust than per-test cleanup because:

  • It runs even when a test fails (gtest calls TearDown after assertion failures)
  • It catches all current and future leakers without per-test maintenance

4. Windows ccache — 0.26% hit rate → should be ~99%

The Windows nightly build takes 57 minutes because ccache has near-zero hit rate. Two issues:

  • Cache key: The hendrikmuhs/ccache-action used a timestamp-based key, so each nightly run created a unique key and never found the previous run's cache (No cache found.). Fixed by using a stable key ccache-windows-<workflow> with prefix-based restore.
  • Compiler check: Missing compiler_check=content setting, so MSVC path/version changes between runners invalidated all cache entries. Added compiler_check=content (same fix as macOS in Accelerate build and test infrastructure #14478).

Test Plan

  • Release and debug builds compile cleanly
  • 70 consecutive shuffled runs of db_compaction_test (367 tests each) pass — 50 on full cores + 20 on 4 cores
  • Format check passes
  • Windows ccache fix requires CI run to verify (first run populates cache, second run should see high hit rate)

@meta-cla meta-cla bot added the CLA Signed label Mar 25, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 25, 2026

✅ clang-tidy: No findings on changed lines

Completed in 273.8s.

@xingbowang xingbowang force-pushed the navi/fix-nightly-regressions branch from d2c8f32 to ce80912 Compare March 25, 2026 14:09
@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 25, 2026

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit ae9db09


Confirmed: ReleaseCompactionDuringManifestWrite sets LOW=3 and BOTTOM=3 but never resets them. The TearDown fix catches this.

Now I have all the context needed. Here's the review:


Summary

A well-targeted fix addressing four distinct CI regressions. The changes are correct, low-risk, and well-scoped. One suggestion regarding the conditional glog linking deserves attention.

Issues Found

🟡 Suggestion: Conditional glog linking may mask real link failures

File: CMakeLists.txt:1140-1143
Perspective: Correctness & Cross-Component

The find_library(GLOG_LIBRARY glog) approach silently skips glog if not found. However, the folly-lite sources (especially SafeAssert.cpp, Exception.cpp) may reference glog symbols transitively. On the build-linux-cmake-with-folly-lite CI job, glog is passed via -DCMAKE_CXX_FLAGS=-DGLOG_USE_GLOG_EXPORT, suggesting it IS expected to be present. If glog is genuinely absent in the lld environment, this fix works. But if a future environment has glog headers but not the library, you'd get confusing compile-succeeds-link-fails errors from folly internals.

Suggested improvement: Add a status message so failures are diagnosable:

find_library(GLOG_LIBRARY glog)
if(GLOG_LIBRARY)
  list(APPEND THIRDPARTY_LIBS glog)
else()
  message(STATUS "glog not found — USE_FOLLY_LITE will build without glog")
endif()

🟡 Suggestion: TearDown resets LOW and HIGH but tests may also modify them

File: db/db_compaction_test.cc:125-133
Perspective: Correctness & Edge Cases

The TearDown correctly resets BOTTOM=0, LOW=1, HIGH=1 to match the values DBTestBase sets in its constructor (db/db_test_util.cc:87-88). However, at least one test (ReleaseCompactionDuringManifestWrite, line 11368) sets LOW=3 without cleanup. The TearDown catches this — good.

One subtlety: DBTestBase sets threads on env_ (the SpecialEnv wrapper), while this TearDown sets them on Env::Default() directly. Since SpecialEnv inherits from EnvWrapper and delegates SetBackgroundThreads to the target (include/rocksdb/env.h:1754-1755), these are equivalent when env_ wraps Env::Default(). This is correct but could be fragile if a test ever uses a non-default base env. Consider adding a comment noting this equivalence.

🟢 Nitpick: #ifndef NDEBUG guard also excludes the include

File: test_util/testharness.cc:16-18
Perspective: Code Quality

Wrapping #include "test_util/sync_point.h" inside #ifndef NDEBUG is functional but unconventional — the header itself already handles NDEBUG internally by making macros no-ops. The include is guarded here because the code references SyncPoint::GetInstance() directly (not via macros), and the class definition doesn't exist in release builds (test_util/sync_point.h:55-60). This is correct and necessary.

Cross-Component Analysis

Context Assessment
Release builds (NDEBUG) Fixed — SyncPoint guard correctly prevents compile error
CMake with lld Fixed — conditional glog avoids link failure when absent
Sharded test execution Fixed — TearDown prevents thread pool state leakage across tests
Windows CI caching Fixed — stable key + content-based compiler check should dramatically improve hit rate
Make-based builds Unaffectedfolly.mk still unconditionally links glog, which is fine since make uses ld not lld
Non-DBCompactionTest suites Unaffected — TearDown is scoped to DBCompactionTest only

Positive Observations

  • The TearDown approach is strictly better than per-test cleanup: it's defensive against both current and future test leakers, and it runs even after assertion failures.
  • The PR description is exceptionally thorough with clear root-cause analysis for each fix.
  • The 70-run shuffled validation (50 full + 20 on 4 cores) is strong evidence the flaky test fix is robust.
  • The ccache fix (stable key + compiler_check=content) follows the pattern already established for macOS in the setup-ccache action, maintaining cross-platform consistency.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@xingbowang xingbowang force-pushed the navi/fix-nightly-regressions branch from ce80912 to b01a041 Compare March 25, 2026 17:29
@xingbowang xingbowang changed the title Fix nightly CI regressions from #14478 Fix nightly CI regressions and flaky tests from #14478 Mar 25, 2026
@xingbowang xingbowang force-pushed the navi/fix-nightly-regressions branch from b01a041 to 8fd634a Compare March 25, 2026 17:45
Fix two build failures introduced by the accelerate-build PR (facebook#14478):

1. **Release build (NDEBUG) compile error**: The SyncPoint cleanup
   listener in testharness.cc referenced SyncPoint::GetInstance() which
   is only declared in debug builds. Guard the listener with #ifndef
   NDEBUG since it's only needed for debug test sharding anyway.

2. **Folly-lite link error**: The USE_FOLLY_LITE cmake path
   unconditionally linked -lglog, but glog may not be installed as a
   system library. Use find_library() to only link glog when available.

Co-authored-by: Navi <navi@navibot.dev>
@xingbowang xingbowang force-pushed the navi/fix-nightly-regressions branch from 8fd634a to ae9db09 Compare March 26, 2026 01:18
@xingbowang xingbowang changed the title Fix nightly CI regressions and flaky tests from #14478 Fix nightly CI regressions, flaky tests, and Windows ccache from #14478 Mar 26, 2026
@xingbowang xingbowang force-pushed the navi/fix-nightly-regressions branch from 1a1d04c to ae9db09 Compare March 26, 2026 02:31
@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 26, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D98380757.

@meta-codesync
Copy link
Copy Markdown

meta-codesync bot commented Mar 27, 2026

@xingbowang merged this pull request in 926e2b5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant