@Alexey-Rivkin
Contributor

What?

Downgrade CUDA 12.9 to 12.8 for Jenkins jobs.

Why?

The downgrade is due to a bug in CUDA 12.9.

Depends on #996

Replace bash [[ ]] with POSIX [ ] in container detection.
Scripts using #!/bin/sh failed on the [[ syntax, causing NPROC to
default to 256 CPUs instead of the memory-based limit, which led to OOM.

Signed-off-by: Alexey Rivkin <[email protected]>
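
The failure mode described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual CI script: the function names `in_container` and `jobs_for_memory`, the marker files checked, and the per-job memory budget are all hypothetical. When `sh` is a strict POSIX shell such as dash, `[[` is not a keyword, so the test fails with a "not found" error and the `if` silently takes its fallback branch; the `[ ]` form behaves the same in any POSIX `sh`.

```shell
#!/bin/sh
# Hypothetical sketch: names, marker files, and budgets are assumptions,
# not taken from the real build scripts.

# Return 0 if we appear to be running inside a container.
in_container() {
    # POSIX test syntax. The bash-only form
    #   if [[ -f /.dockerenv ]]; then ...
    # is not understood by dash and would always fall through here.
    if [ -f /.dockerenv ] || [ -f /run/.containerenv ]; then
        return 0
    fi
    return 1
}

# Print a parallel job count derived from available memory.
#   $1 = memory in KiB
#   $2 = CPU count (upper bound)
#   $3 = assumed memory budget per job in KiB
jobs_for_memory() {
    mem_kib=$1
    cpus=$2
    per_job_kib=$3
    n=$((mem_kib / per_job_kib))
    # Always allow at least one job.
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    # Never exceed the number of CPUs.
    if [ "$n" -gt "$cpus" ]; then
        n=$cpus
    fi
    echo "$n"
}
```

For example, `jobs_for_memory 16777216 256 2097152` (16 GiB of memory, 256 CPUs, an assumed 2 GiB per job) prints `8` rather than 256, which is the kind of cap that keeps a parallel build from exhausting memory.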
The GPUNETIO plugin does not currently work with CUDA 13.0
because DOCA 3.1 still links against CUDA 12 libraries.

Signed-off-by: Alexey Rivkin <[email protected]>
GPUNetIO is only tested on CUDA 12, so adding both CUDA versions
allows testing one configuration with GPUNETIO and one without.

Signed-off-by: Alexey Rivkin <[email protected]>
This should resolve provider hangs during AWS tests

Signed-off-by: Alexey Rivkin <[email protected]>
libfabric hangs and tests fail on timeout
when CUDA 13 images are used

Signed-off-by: Alexey Rivkin <[email protected]>
Some tests (e.g. gpunetio) only run on specific CUDA versions.
Adding both CUDA 12 and 13 improves coverage.

Signed-off-by: Alexey Rivkin <[email protected]>
Downgrading the CUDA version to 12.8 due to a bug in CUDA 12.9

Signed-off-by: Alexey Rivkin <[email protected]>
@github-actions

👋 Hi Alexey-Rivkin! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@dpressle
Contributor

/build
