Testing infrastructure support #4678
-
The main bottleneck for testing is CPU capability / compile time, I think? So if Nautilus has a beefier CPU than Tartarus, I think we are onto something promising.

It might make sense to divide the tests into two groups and run one group on Tartarus and one on Nautilus. I would propose fixing which tests run on which system, rather than sometimes running on one or the other. That way, if there is a system-specific failure, we have something specific to debug. What do you think about that reasoning?

Another very important need is to get away from using the Caltech cluster for CI, which is flaky. We need multiple GPUs for that. Tartarus is almost totally dedicated to testing now, but we are also using it for ClimaOcean tests and benchmarks, so there are not enough GPUs to also support multi-GPU tests.
-
Yeah, I agree that CPU performance is the main factor affecting test/build times, given how compilation-heavy testing is. I benchmarked six of the longer-running test groups on Nautilus vs. Tartarus (specifically https://buildkite.com/clima/oceananigans/builds/24823) and I'm seeing 2.7-3.3x speedups on the CPU test groups and 1.8-2.2x speedups on the GPU test groups.
I agree with this reasoning! Just to clarify: is the idea to improve CI robustness by always running the same test group on the same server, so that failures are easier to attribute? If test group A always runs on server A and suddenly fails, that strongly suggests something is wrong with test group A. But if test group A sometimes runs on server A and sometimes on server B and it suddenly fails, we're less sure that the test group is at fault? I think deciding which test groups run on each server also gives us more control to load balance and set up CI pipelines with predictable runtimes. Did you have a split in mind? Like CPU vs. GPU, or shorter test groups on Tartarus and longer test groups on Nautilus?
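For concreteness, here is a rough sketch of what that pinning could look like in the Buildkite pipeline, assuming each host's agents are tagged with their own queue (the queue names, labels, and TEST_GROUP values below are placeholders, not the actual pipeline config):

```yaml
# Hypothetical sketch: pin each test group to a fixed host via agent queues.
# Queue names and TEST_GROUP values are placeholders, not the real pipeline.
steps:
  - label: "long CPU test group (always on Nautilus)"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    env:
      TEST_GROUP: "example_cpu_group"
    agents:
      queue: "Oceananigans-nautilus"

  - label: "GPU regression tests (always on Tartarus)"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    env:
      TEST_GROUP: "example_gpu_group"
    agents:
      queue: "Oceananigans-tartarus"
```

With fixed queues like this, a failure in a given group always points at the same machine, and we can rebalance by moving steps between queues.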
Ah, that is unfortunate about the Caltech cluster :( We can only dedicate one GPU on Nautilus to testing, but if enough Oceananigans.jl GPU test groups moved to Nautilus, would that free up enough GPU resources on Tartarus to run multi-GPU tests there? I see there are some multi-GPU regression tests, but do the distributed tests mostly use the CPU to compile and only touch the GPU lightly? Or do the GPUs actually work hard during testing? Or maybe GPU memory is the bottleneck?
-
@atdepth has recently set up a server called Nautilus with a dedicated GPU for testing. I know the testing infrastructure/capacity on Oceananigans.jl's Buildkite has been in need of some expansion for a long time, so I'm wondering if Oceananigans.jl would benefit from some additional testing capacity.
We'd also love to get the total test/build time under 20 minutes, which would then allow for adding more tests and expanding existing ones. I feel like there are plenty of validation scripts that would make for excellent integration tests (and I'd love to revive #1223).
To start, we can assign 32 agents/runners in parallel and see where we're at. If I remember correctly, Tartarus has 2x 12-core Intel Xeon Silver 4214 CPUs, so we might cut down test/build times just by using newer CPUs. Then, by splitting up some of the longer-running test groups, I think we can reduce the total test/build time to under 20 minutes.
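As a sketch of the splitting idea (the group names below are made up; the real split would follow the test groups defined in runtests.jl), one long-running group could become two steps that run concurrently on separate runners:

```yaml
# Hypothetical sketch: split one long-running group into two steps that can
# run concurrently on separate agents. Group names are placeholders.
steps:
  - label: "solvers tests (part 1)"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    env:
      TEST_GROUP: "solvers_part_1"

  - label: "solvers tests (part 2)"
    command: "julia --project -e 'using Pkg; Pkg.test()'"
    env:
      TEST_GROUP: "solvers_part_2"
```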
Caching the Julia depot would also cut the initialization step from ~10 minutes down to ~1 minute, allowing all the test steps (where the parallelization happens) to start much sooner.
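A minimal sketch of what depot caching could look like at the pipeline level, assuming the agents can share a persistent directory on the host (the path below is a placeholder):

```yaml
# Hypothetical sketch: point the Julia depot at a persistent host path so
# packages and precompile caches survive between builds. Path is a placeholder.
env:
  JULIA_DEPOT_PATH: "/data/buildkite/julia-depot"
```

Depending on how many agents share it, a per-agent depot might be safer to avoid concurrent Pkg operations colliding.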
Some long-running steps might need more thought on how to cut them down; I'm mostly thinking of the docs step.
To get started, I believe all we would need is a Buildkite agent token (or API key?) from the Oceananigans/CliMA account, and I can open a PR to update the Buildkite config(s).
Is Tartarus + GPUs still operational? If so, I'm wondering if there is a way we can split up the GPU tests on Buildkite so that they sometimes run on Tartarus and sometimes on Nautilus. But if not, we can start by just running all tests on Nautilus.
@tomchor and I can help in case anything goes wrong on Nautilus. I'm hoping it's just the occasional Docker container restart.