build: Optimize CI with test splitting and 32-core runner #16691
kgpai wants to merge 1 commit into facebookincubator:main
Conversation
✅ Deploy Preview for meta-velox canceled.
Force-pushed 32eab85 to 98fe7f6
@czentgr @pratikpugalia @majetideepak PTAL. This should reduce CI time by 20-30 minutes at a cost of roughly 75¢ per run.
Force-pushed 98fe7f6 to fdfbe6d
@kgpai Thanks, this looks good. There are a lot more executables, so there is some tradeoff in generating them versus a single binary, but overall it is certainly better to reduce the runtime through more parallelism. We can probably experiment later and increase MAX_LINK_JOBS to 16; on the release build without extra symbols we should be OK with that, and we might be able to increase it further still.
Force-pushed 158730b to 24bc35f
Split 5 monolithic test targets into 139 individual test binaries for better ctest parallelism, and upgrade both ubuntu-debug and adapters jobs to 32-core runners with tuned build settings.

Test target splits (CMakeLists.txt changes only):
- velox_exec_test -> 76 individual tests
- velox_exec_util_test -> 9 individual tests
- velox_aggregates_test -> 43 individual tests
- velox_cache_test -> 5 individual tests
- velox_serializer_test -> 5 individual tests

CI workflow changes (ubuntu-debug + adapters):
- Runner: 16-core/8-core -> 32-core (128 GB RAM)
- NUM_THREADS: 16/8 -> 32
- MAX_HIGH_MEM_JOBS: 4 -> 12 (128 GB / 6.8 GB per heavy TU)
- MAX_LINK_JOBS: 2/4 -> 12 (shared links use ~1 GB each)
- ctest: -j 8 -> -j 24 --timeout 900
- ulimit -n 65536 to prevent "Too many open files"
- Adapters: add -DVELOX_BUILD_SHARED=ON for faster linking

The test splits add ~135 new link targets. On the previous 8-core runner (32 GB RAM), these would OOM during concurrent compilation of heavy TUs (up to 6.8 GB each) plus parallel linking.

Validated on a 48-core test machine (throttled to 32 threads):
- Build: 26m 34s (cold, no ccache)
- Test: 17m 45s (631 tests, 625 passed)
- Total: 44m 28s vs estimated ~70 min current CI
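The per-file split described above can be sketched with a foreach loop; this is an illustrative sketch only, and the sample source file names and the `VELOX_EXEC_TEST_DEPS` variable are placeholders based on the snippets quoted later in the review, not the exact upstream CMakeLists.txt:

```cmake
# Hypothetical sketch of the foreach-based test split.
# Each .cpp becomes its own executable and ctest entry, so ctest -j can
# schedule them in parallel instead of running one monolithic binary.
set(EXEC_TEST_SOURCES AggregationTest.cpp HashJoinTest.cpp OrderByTest.cpp)

foreach(TEST_SOURCE ${EXEC_TEST_SOURCES})
  get_filename_component(TEST_NAME ${TEST_SOURCE} NAME_WE)
  set(TEST_TARGET "velox_exec_${TEST_NAME}")
  add_executable(${TEST_TARGET} ${TEST_SOURCE} Main.cpp)
  add_test(
    NAME ${TEST_TARGET}
    COMMAND ${TEST_TARGET}
    WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
  target_link_libraries(${TEST_TARGET} PRIVATE ${VELOX_EXEC_TEST_DEPS})
endforeach()
```

Compiling the shared `Main.cpp` into each binary avoids a separate static library at the cost of recompiling one small TU per target.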
@claude /pr-review

Claude finished @pratikpugalia's task in 3m 30s

Summary
This PR splits 5 monolithic test targets into ~139 individual binaries for improved ctest parallelism.

Issues Found
🟡 Suggestion: Missing documentation for the adapters job changes
| Setting | Before | After |
|---|---|---|
| Runner | `8-core-ubuntu-22.04` | `32-core-ubuntu` |
| `NUM_THREADS` | 8 | 32 |
| `MAX_HIGH_MEM_JOBS` | 4 | 12 |
| `MAX_LINK_JOBS` | 4 | 12 |
| `ctest -j` | 8 | 24 |
| `VELOX_BUILD_SHARED` | not set | `ON` |
This is a 4x runner upgrade (not 2x like ubuntu-debug) and changes the OS label from ubuntu-22.04 to generic ubuntu. The cost implications for the adapters job are not documented. Consider updating the description with a table for adapters changes as well, since this is a larger jump than ubuntu-debug.
🟡 Suggestion: velox_exec_test timeout regression risk
The old velox_exec_test had a custom TIMEOUT 6000 (100 minutes) via set_tests_properties, with a comment referencing issue #13879. This has been replaced by a global --timeout 900 (15 minutes). While individual split tests should run faster, the PR description notes SpillerTest needs 12+ minutes. If SpillerTest approaches that limit, the 900s timeout could be tight. Consider:
- Verifying `velox_exec_SpillerTest` completes well within 900s, or
- Adding a per-test timeout override for it if needed
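If the override turns out to be needed, a minimal sketch follows; the target name comes from the PR, while the 1800-second value is an arbitrary example, not a measured requirement:

```cmake
# Give the long-running spiller test its own budget; all other split
# tests keep the global `ctest --timeout 900` from the workflow.
set_tests_properties(velox_exec_SpillerTest PROPERTIES TIMEOUT 1800)
```

A `TIMEOUT` property set this way takes precedence over the command-line `--timeout` for that one test.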
🟡 Suggestion: VELOX_GFLAGS_URL bypasses SHA256 verification
The new VELOX_GFLAGS_URL: file:///velox/deps-sources/gflags-v2.2.2.tar.gz (line 296) causes velox_resolve_dependency_url to clear the SHA256 checksum (see CMake/ResolveDependency.cmake:132). This is acceptable for CI using a pre-bundled file in the container image, but worth noting — the gflags tarball will not be integrity-verified during the build. Adding VELOX_GFLAGS_SHA256 with the known checksum would restore verification.
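A hedged sketch of restoring verification; in CI these would normally be workflow environment variables, shown here as CMake cache settings for illustration, and the checksum placeholder must be replaced with the actual published gflags v2.2.2 tarball SHA256 (not reproduced here):

```cmake
# Overriding the URL alone disables checksum verification in
# velox_resolve_dependency_url; supplying the SHA256 alongside restores it.
set(VELOX_GFLAGS_URL "file:///velox/deps-sources/gflags-v2.2.2.tar.gz"
    CACHE STRING "")
set(VELOX_GFLAGS_SHA256 "<known-gflags-v2.2.2-sha256>" CACHE STRING "")
```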
🟢 Nitpick: OS label specificity
The adapters runner changed from 8-core-ubuntu-22.04 (pinned OS version) to 32-core-ubuntu (latest). This means the OS version could change across runs. This may be intentional, but differs from the original explicit pinning.
Positive Observations
- The foreach-based CMake pattern is clean, maintainable, and easy to extend when new test files are added.
- Correctly handling conditional tests (`SpatialJoinTest`, `GeometryAggregateTest`) outside the loop with their extra dependencies.
- Including `Main.cpp` as a source file in each split binary (for exec/aggregates tests) is the right approach, avoiding a separate static library.
- The `PrestoQueryRunnerIntermediateTypeTransformTestBase.cpp` base class inclusion in the util test loop is well-handled.
- The `ulimit -n 65536` addition is a good defensive measure for running many parallel test processes.
- No test target name collisions between the `velox_exec_*` (main) and `velox_exec_*` (util) namespaces: verified that all util tests have `PrestoQueryRunner*`-prefixed names.
CI is currently in progress: Linux Build using GCC
```cmake
set(TEST_TARGET "velox_exec_${TEST_NAME}")
add_executable(${TEST_TARGET} ${TEST_SOURCE} Main.cpp)
add_test(NAME ${TEST_TARGET} COMMAND ${TEST_TARGET} WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries(${TEST_TARGET} ${VELOX_EXEC_TEST_DEPS})
```
The PRIVATE keyword is missing here. Other split targets (e.g., velox_cache_* in velox/common/caching/tests/CMakeLists.txt:51) correctly use PRIVATE. For test executables, PRIVATE is best practice to prevent unintended transitive dependency leakage.
Suggested change:

```diff
-target_link_libraries(${TEST_TARGET} ${VELOX_EXEC_TEST_DEPS})
+target_link_libraries(${TEST_TARGET} PRIVATE ${VELOX_EXEC_TEST_DEPS})
```
```cmake
set(TEST_TARGET "velox_aggregates_${TEST_NAME}")
add_executable(${TEST_TARGET} ${TEST_SOURCE} Main.cpp)
add_test(NAME ${TEST_TARGET} COMMAND ${TEST_TARGET} WORKING_DIRECTORY .)
target_link_libraries(${TEST_TARGET} ${VELOX_AGGREGATES_TEST_DEPS})
```
Same as exec tests: missing PRIVATE keyword for consistency and correctness.
Suggested change:

```diff
-target_link_libraries(${TEST_TARGET} ${VELOX_AGGREGATES_TEST_DEPS})
+target_link_libraries(${TEST_TARGET} PRIVATE ${VELOX_AGGREGATES_TEST_DEPS})
```
```cmake
set(TEST_TARGET "velox_serializer_${TEST_NAME}")
add_executable(${TEST_TARGET} ${TEST_SOURCE})
add_test(${TEST_TARGET} ${TEST_TARGET})
target_link_libraries(${TEST_TARGET} ${VELOX_SERIALIZER_TEST_DEPS})
```
Missing PRIVATE keyword, for consistency with velox_cache_* tests and CMake best practices.
Suggested change:

```diff
-target_link_libraries(${TEST_TARGET} ${VELOX_SERIALIZER_TEST_DEPS})
+target_link_libraries(${TEST_TARGET} PRIVATE ${VELOX_SERIALIZER_TEST_DEPS})
```
```cmake
  Main.cpp
)
add_test(NAME ${TEST_TARGET} COMMAND ${TEST_TARGET})
target_link_libraries(${TEST_TARGET} ${VELOX_EXEC_UTIL_TEST_DEPS})
```
Missing PRIVATE keyword here as well.
Suggested change:

```diff
-target_link_libraries(${TEST_TARGET} ${VELOX_EXEC_UTIL_TEST_DEPS})
+target_link_libraries(${TEST_TARGET} PRIVATE ${VELOX_EXEC_UTIL_TEST_DEPS})
```
```cmake
  COMMAND velox_exec_SpatialJoinTest
  WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
)
target_link_libraries(velox_exec_SpatialJoinTest ${VELOX_EXEC_TEST_DEPS})
```
PRIVATE keyword missing here as well, for consistency:
Suggested change:

```diff
-target_link_libraries(velox_exec_SpatialJoinTest ${VELOX_EXEC_TEST_DEPS})
+target_link_libraries(velox_exec_SpatialJoinTest PRIVATE ${VELOX_EXEC_TEST_DEPS})
```
Force-pushed 24bc35f to 5e00f5b
Summary
- Split 5 monolithic test targets into ~139 individual test binaries for better `ctest` parallelism
- Upgrade the `ubuntu-debug` job from `16-core-ubuntu` to `32-core-ubuntu` with tuned parallelism settings
- Add `-DVELOX_BUILD_SHARED=ON` to the adapters job for faster shared linking
- Add `ulimit -n 65536` and `--timeout 900` to prevent test infra failures

Test target splits (CMakeLists.txt only, no source changes)
- `velox_exec_test` -> 76 individual tests
- `velox_exec_util_test` -> 9 individual tests
- `velox_aggregates_test` -> 43 individual tests
- `velox_cache_test` -> 5 individual tests
- `velox_serializer_test` -> 5 individual tests

CI tuning changes (`ubuntu-debug`)

| Setting | Before | After |
|---|---|---|
| Runner | `16-core-ubuntu` | `32-core-ubuntu` |
| `NUM_THREADS` | 16 | 32 |
| `MAX_HIGH_MEM_JOBS` | 4 | 12 |
| `MAX_LINK_JOBS` | 2 | 12 |
| `ctest -j` | 8 | 24 |

Validated results (cold build, no ccache, 32-thread simulation)
- Build: 26m 34s
- Test: 17m 45s (631 tests, 625 passed)
- Total: 44m 28s vs estimated ~70 min current CI
With ccache (typical CI run), expected total is 25-35 min vs ~55 min current.
Cost per run increases from $2.94 to ~$3.69 (+26%), justified by 25-35 min savings per PR iteration.
Test plan