
docker: parameterize CUDA version in build matrix, remove cuda-new.Dockerfile#20920

Merged
CISC merged 1 commit into ggml-org:master from M1DNYT3:cuda-bump, Apr 3, 2026

Conversation

@M1DNYT3 (Contributor) commented Mar 23, 2026

Overview

ggml/src/ggml-cuda/CMakeLists.txt gates Blackwell cubin compilation on CUDAToolkit_VERSION >= 12.8:

```cmake
if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "12.8")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 120a-real)
endif()
```

The official image ships with CUDA_VERSION=12.4.0, so the gate never triggers and RTX 50xx and RTX PRO 6000 cards fall back to PTX JIT.

Validated across 4 architectures (100-record batch, 19 parallel slots per card): https://github.com/M1DNYT3/self-hosted-llm-inference/blob/main/iterations/10-blackwell-root-cause/summary.md

12.9.1 is the latest 12.x release. Pascal (sm_60/62) and Maxwell (sm_52) are not dropped until 13.0, so no currently supported GPU loses coverage.

Test image: docker pull m1dnyt3/llama-cpp:server-cuda-12.9

Full logs + dmon traces: https://github.com/M1DNYT3/self-hosted-llm-inference/tree/main/iterations/10-blackwell-root-cause/logs

Addresses #17822, #18865.

Additional information

If you need additional data, I can run benchmarks on demand for any specific GPU. The harness is built around Vast.ai, so any card available on the platform can be tested - inference cost per benchmark run is negligible.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude assisted with case study documentation and log analysis. Benchmarks, Docker build, source code investigation, and the one-line fix are my own work.

@M1DNYT3 M1DNYT3 requested a review from ngxson as a code owner March 23, 2026 21:36
@github-actions github-actions bot added the devops improvements to build systems and github actions label Mar 23, 2026
@ggml-gh-bot (bot) commented Mar 23, 2026

Hi @M1DNYT3, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@CISC (Member) commented Mar 23, 2026

Just use the cuda-13 image and/or cuda-new.Dockerfile for Blackwell.

@M1DNYT3 (Contributor, Author) commented Mar 23, 2026

> Just use the cuda-13 image and/or cuda-new.Dockerfile for Blackwell.

CUDA 13 drops support for Pascal and Maxwell. While both are quite old, 12.9.1 is the last 12.x release, which enables Blackwell without dropping anything. There is no need to split the stack into two images either; mixed fleets with old and new hardware can stay on a single image until they have fully transitioned to 13.x and can drop 12.x.

@M1DNYT3 (Contributor, Author) commented Apr 1, 2026

@CISC

Noticed a merge conflict popped up, and I see #21207 was just merged, bumping the CUDA floor to 12.8 - so the concern about forward compatibility seems to be acknowledged after all. Would appreciate a fresh look at my proposal given that context, especially since the review bar appears more flexible than I was led to believe.

@CISC (Member) commented Apr 1, 2026

I am considering bumping both 12.x and 13.x to latest, and removing cuda-new.Dockerfile in the process as it is no longer needed, not today though. :)

@M1DNYT3 (Contributor, Author) commented Apr 1, 2026

> I am considering bumping both 12.x and 13.x to latest, and removing cuda-new.Dockerfile in the process as it is no longer needed, not today though. :)

@CISC
Appreciated. That said, noting that #21207 was merged without a similar "not today" deferral, my PR predates it and addresses the same compatibility concern. Given the bump to 12.8 already landed, merging this as an incremental fix in the interim seems low-risk and consistent with that precedent. Happy to rebase if needed.

@CISC (Member) commented Apr 1, 2026

> I am considering bumping both 12.x and 13.x to latest, and removing cuda-new.Dockerfile in the process as it is no longer needed, not today though. :)
>
> @CISC Appreciated. That said, noting that #21207 was merged without a similar "not today" deferral, my PR predates it and addresses the same compatibility concern. Given the bump to 12.8 already landed, merging this as an incremental fix in the interim seems low-risk and consistent with that precedent. Happy to rebase if needed.

It did a whole lot more, most specifically updating the base image and tools so that arm64 builds could be done; the bump was a prerequisite for that, not the other way around.

@M1DNYT3 (Contributor, Author) commented Apr 1, 2026

> It did a whole lot more, most specifically updating the base image and tools so that arm64 builds could be done, the bump was a pre-requisite for that, not the other way around.

Makes sense on the ARM64 context. Given that 12.8 landed as a prerequisite for that work, bumping to 12.9.1 seems like an even smaller ask now that the baseline has already moved - it's the latest stable 12.x, picking up all fixes since 12.8, while avoiding 13.x, which still has open issues like #21255. If 12.9.1 becomes a prerequisite for something else tomorrow, the PR is already there. I have no objection to #21207 or its purpose - my question is simply why #20920 shouldn't be merged on its own merits.

@M1DNYT3 (Contributor, Author) commented Apr 2, 2026

> It did a whole lot more, most specifically updating the base image and tools so that arm64 builds could be done, the bump was a pre-requisite for that, not the other way around.
>
> Makes sense on the ARM64 context. Given that 12.8 landed as a prerequisite for that work, bumping to 12.9.1 seems like an even smaller ask now that the baseline has already moved - it's the latest stable 12.x, picking up all fixes since 12.8, while avoiding 13.x, which still has open issues like #21255. If 12.9.1 becomes a prerequisite for something else tomorrow, the PR is already there. I have no objection to #21207 or its purpose - my question is simply why #20920 shouldn't be merged on its own merits.

@CISC Following up on the above - is my PR worth a second look on its own merits?

@CISC (Member) commented Apr 2, 2026

> @CISC Following up on the above - is my PR worth a second look on its own merits?

With some modifications, sure. :)

First of all, leave cuda.Dockerfile as-is and remove cuda-new.Dockerfile; CUDA_VERSION is an argument, so simply update docker.yml with the appropriate cuda_version in the build matrix.
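The design described here - a single cuda.Dockerfile whose CUDA_VERSION is a build argument selected per-entry in the CI matrix - can be sketched roughly as follows. This is an illustrative sketch, not the exact docker.yml in llama.cpp; the field names and the `.devops/` paths are assumptions:

```yaml
# Hypothetical sketch of the build-matrix approach; the real workflow
# may structure its matrix and name its fields differently.
jobs:
  docker:
    strategy:
      matrix:
        config:
          # One cuda.Dockerfile serves both lines; only the version differs.
          - { tag: "cuda12", dockerfile: ".devops/cuda.Dockerfile", cuda_version: "12.9.1" }
          - { tag: "cuda13", dockerfile: ".devops/cuda.Dockerfile", cuda_version: "13.1.1" }
    steps:
      - uses: docker/build-push-action@v6
        with:
          file: ${{ matrix.config.dockerfile }}
          build-args: |
            CUDA_VERSION=${{ matrix.config.cuda_version }}
```

The point of the pattern is that bumping a CUDA version becomes a one-line matrix change rather than a new Dockerfile.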

…ckerfile

Co-authored-by: CISC <CISC@users.noreply.github.com>
@M1DNYT3 M1DNYT3 requested a review from a team as a code owner April 2, 2026 10:03
@M1DNYT3 M1DNYT3 changed the title from "Bump CUDA base image 12.4.0 -> 12.9.1 (enables sm_120a-real for Blackwell)" to "docker: parameterize CUDA version in build matrix, remove cuda-new.Dockerfile" Apr 2, 2026
@M1DNYT3 (Contributor, Author) commented Apr 2, 2026

> @CISC Following up on the above - is my PR worth a second look on its own merits?
>
> With some modifications, sure. :)
>
> First of all, leave cuda.Dockerfile as-is and remove cuda-new.Dockerfile, CUDA_VERSION is an argument, so simply update docker.yml with the appropriate cuda_version in the build matrix.

@CISC Done + marked you as a Co-Author since you suggested the design. Could you please review the changes?

  • cuda13 builds for both arches point at 13.1.1, as they did in cuda-new.Dockerfile.
  • cuda12 builds for both arches point at 12.9.1, as I proposed for the original cuda.Dockerfile.
  • cuda-new.Dockerfile was removed; cuda13 builds now point at cuda.Dockerfile.
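For reference, the parameterization this relies on - cuda.Dockerfile taking the CUDA version as a build argument - follows a standard Dockerfile pattern. This is a sketch; the default value and the base-image tag shown here are illustrative, not necessarily what the repo uses:

```dockerfile
# Sketch of the parameterized base-image pattern. An ARG declared before
# FROM can be interpolated into the base-image tag, so the CI matrix can
# select the toolkit version without touching the Dockerfile itself.
ARG CUDA_VERSION=12.9.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS build
# ... build llama.cpp against the toolkit provided by the base image ...
```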

@CISC (Member) left a comment

Perfect, thank you, can you run it on your fork to verify it builds (disable s390x to be able to complete)?

@M1DNYT3 (Contributor, Author) commented Apr 2, 2026

> Perfect, thank you, can you run it on your fork to verify it builds (disable s390x to be able to complete)?

The run is in progress; I'll post an update once it completes. I made a separate branch specifically for the build test, with the s390x line deleted from the yml, so it won't impact the PR.

https://github.com/M1DNYT3/llama.cpp/actions/runs/23900519302

@M1DNYT3 (Contributor, Author) commented Apr 2, 2026

> Perfect, thank you, can you run it on your fork to verify it builds (disable s390x to be able to complete)?
>
> Runs for now, will notify additionally once completed. Made a separate branch specifically for build test with s390x line deleted from the yml, so it won't impact the PR.
>
> https://github.com/M1DNYT3/llama.cpp/actions/runs/23900519302

@CISC, the builds are done, all green, 12/12. You can check them at the URL I provided above.
However, the PR still needs one more approval - any guidance on how to proceed here?

@CISC CISC requested a review from a team April 2, 2026 14:45
@taronaeo taronaeo added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label Apr 3, 2026
@CISC CISC merged commit 277ff5f into ggml-org:master Apr 3, 2026
2 checks passed
@M1DNYT3 (Contributor, Author) commented Apr 3, 2026

@taronaeo @CISC Thank you for reviewing and helping with improvements! :)

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Apr 6, 2026
Notable upstream changes:
- ggml-org#21038: rotate activations for better KV cache quantization
- ggml-org#21074: generic NVFP4 MMQ kernel
- ggml-org#21271: fix FA kernel selection logic
- ggml-org#21238: fix mmvq mmid kernel selection
- ggml-org#20998: FA support for head dim 512
- ggml-org#20920: docker cuda12 bump to 12.9.1

Conflicts resolved:
- fattn.cu: took upstream (adds head_dim 512 exclusion)
- mmq.cu/mmq.cuh: kept both TQ3 types + upstream additions
- llama-graph.cpp: kept both turbo WHT + upstream v_rot
- docker.yml: took upstream
- tests/CMakeLists.txt: took upstream
@M1DNYT3 M1DNYT3 deleted the cuda-bump branch April 6, 2026 16:31