Testing infrastructure support #4678
-
The main bottleneck for testing is CPU capability / compile time, I think? So if Nautilus has a beefier CPU than Tartarus, I think we are onto something promising.

It might make sense to divide the tests into two groups and run one group on Tartarus and one on Nautilus. I would propose fixing which tests run on which system, rather than sometimes running on one or the other. That way, if there is a system-specific failure, we have something specific to debug. What do you think about that reasoning?

Another very important need is to get away from using the Caltech cluster for CI, which is flaky. We need multiple GPUs for that. Tartarus is almost totally dedicated to testing now, but we are also using it for ClimaOcean tests and benchmarks, so there are not enough GPUs to also support multi-GPU tests.
-
Yeah, I agree that CPU performance is the main factor affecting test/build times, given how compilation-heavy testing is. I benchmarked six of the longer-running test groups on Nautilus vs. Tartarus (specifically https://buildkite.com/clima/oceananigans/builds/24823) and I'm seeing 2.7-3.3x speedups on the CPU test groups and 1.8-2.2x speedups on the GPU test groups.
I agree with this reasoning! Just to clarify: is the idea to improve CI robustness by always running the same test group on the same server, so that failures are easier to attribute? If test group A always runs on server A and suddenly fails, that strongly suggests something is wrong with test group A. But if test group A sometimes runs on server A and sometimes on server B and it suddenly fails, we're less sure that the test group is at fault? I think deciding which test groups run on each server also gives us more control to load balance and set up CI pipelines with predictable runtimes. Did you have a split in mind? Like CPU vs. GPU, or shorter test groups on Tartarus and longer test groups on Nautilus?
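For concreteness, here is a rough sketch of what that pinning could look like in the Buildkite pipeline, assuming each host's agents are tagged with their own queue (the queue names, labels, and TEST_GROUP values below are placeholders, not the actual pipeline config):

```yaml
# Hypothetical sketch: pin each test group to a fixed host via agent queues.
# Queue names and TEST_GROUP values are placeholders, not the real pipeline.
steps:
  - label: "long CPU test group (always on Nautilus)"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    env:
      TEST_GROUP: "example_cpu_group"
    agents:
      queue: "Oceananigans-nautilus"

  - label: "GPU regression tests (always on Tartarus)"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    env:
      TEST_GROUP: "example_gpu_group"
    agents:
      queue: "Oceananigans-tartarus"
```

With fixed queues like this, a failure in a given group always points at the same machine, and we can rebalance by moving steps between queues.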
Ah, that is unfortunate about the Caltech cluster :( We can only dedicate one GPU on Nautilus to testing, but if enough Oceananigans.jl GPU test groups moved to Nautilus, would that free up enough GPU resources on Tartarus to run multi-GPU tests there? I see there are some multi-GPU regression tests, but do the distributed tests mostly use the CPU to compile and only touch the GPU lightly? Or do the GPUs actually work hard during testing? Or maybe GPU memory is the bottleneck?
-
@atdepth has recently set up a server called Nautilus with a dedicated GPU for testing. I know the testing infrastructure/capacity on Oceananigans.jl's Buildkite has been in need of some expansion for a long time, so I'm wondering if Oceananigans.jl would benefit from some additional testing capacity.
We'd also love to get the total test/build time under 20 minutes, which would then allow for adding more tests and expanding existing ones. I feel like there are plenty of validation scripts that would make for excellent integration tests (and I'd love to revive #1223).
To start, we can assign 32 agents/runners in parallel and see where we're at. If I remember correctly, Tartarus has 2x 12-core Intel Xeon Silver 4214 CPUs, so we might cut down test/build times just by using newer CPUs. Then, by splitting up some of the longer-running test groups, I think we can reduce the total test/build time to under 20 minutes.
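As a sketch of the splitting idea (the group names below are made up; the real split would follow the test groups defined in runtests.jl), one long-running group could become two steps that run concurrently on separate runners:

```yaml
# Hypothetical sketch: split one long-running group into two steps that can
# run concurrently on separate agents. Group names are placeholders.
steps:
  - label: "solvers tests (part 1)"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    env:
      TEST_GROUP: "solvers_part_1"

  - label: "solvers tests (part 2)"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    env:
      TEST_GROUP: "solvers_part_2"
```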
Caching the Julia depot would also cut the initialization step from ~10 minutes down to ~1 minute, allowing all the test steps (where the parallelization happens) to start much sooner.
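A minimal sketch of what depot caching could look like at the pipeline level, assuming the agents can share a persistent directory on the host (the path below is a placeholder):

```yaml
# Hypothetical sketch: point the Julia depot at a persistent host path so
# packages and precompile caches survive between builds. Path is a placeholder.
env:
  JULIA_DEPOT_PATH: "/data/buildkite/julia-depot"
```

Depending on how many agents share it, a per-agent depot might be safer to avoid concurrent Pkg operations colliding.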
Some long-running steps might need more thought on how to cut them down; I'm mostly thinking of the docs step.
To get started, I believe all we would need is a Buildkite agent token (or API key?) from the Oceananigans/CliMA account, and I can open a PR to update the Buildkite config(s).
Is Tartarus + GPUs still operational? If so, I'm wondering if there is a way we can split up the GPU tests on Buildkite so that they sometimes run on Tartarus and sometimes on Nautilus. But if not, we can start by just running all tests on Nautilus.
@tomchor and I can help in case anything goes wrong on Nautilus. I'm hoping it's just the occasional Docker container restart.