
docker: parameterize CUDA version in build matrix, remove cuda-new.Dockerfile#20920

Merged
CISC merged 1 commit into ggml-org:master from M1DNYT3:cuda-bump, Apr 3, 2026

Conversation

@M1DNYT3 (Contributor) commented Mar 23, 2026

Overview

ggml/src/ggml-cuda/CMakeLists.txt gates Blackwell cubin compilation on CUDAToolkit_VERSION >= 12.8:

```cmake
if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "12.8")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 120a-real)
endif()
```

The official image ships with CUDA_VERSION=12.4.0, so the gate never triggers and RTX 50xx and RTX PRO 6000 cards fall back to PTX JIT.

Validated across 4 architectures (100-record batch, 19 parallel slots per card): https://github.com/M1DNYT3/self-hosted-llm-inference/blob/main/iterations/10-blackwell-root-cause/summary.md

12.9.1 is the latest 12.x release. Pascal (sm_60/62) and Maxwell (sm_52) are not dropped until 13.0, so no currently supported GPU loses coverage.

Test image: docker pull m1dnyt3/llama-cpp:server-cuda-12.9

Full logs + dmon traces: https://github.com/M1DNYT3/self-hosted-llm-inference/tree/main/iterations/10-blackwell-root-cause/logs

Addresses #17822, #18865.

Additional information

If you need additional data, I can run benchmarks on demand for any specific GPU. The harness is built around Vast.ai, so any card available on the platform can be tested - inference cost per benchmark run is negligible.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Claude assisted with case study documentation and log analysis. Benchmarks, Docker build, source code investigation, and the one-line fix are my own work.

@M1DNYT3 M1DNYT3 requested a review from ngxson as a code owner March 23, 2026 21:36
@github-actions github-actions bot added the devops improvements to build systems and github actions label Mar 23, 2026
@ggml-gh-bot (bot) commented Mar 23, 2026

Hi @M1DNYT3, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@CISC (Member) commented Mar 23, 2026

Just use the cuda-13 image and/or cuda-new.Dockerfile for Blackwell.

@M1DNYT3 (Contributor, Author) commented Mar 23, 2026

> Just use the cuda-13 image and/or cuda-new.Dockerfile for Blackwell.

CUDA 13 drops support for Pascal and Maxwell. While both are quite old, 12.9.1 is the last 12.x release, which enables Blackwell without dropping anything. There is no need to split the stack into two images either; mixed fleets with old and new hardware can stay on a single image until they have fully transitioned to 13.x and can drop 12.x.

@M1DNYT3 (Contributor, Author) commented Apr 1, 2026

@CISC

Noticed a merge conflict popped up, and I see #21207 was just merged, bumping the CUDA floor to 12.8 - so the concern about forward compatibility seems to be acknowledged after all. Would appreciate a fresh look at my proposal given that context, especially since the review bar appears more flexible than I was led to believe.

@CISC (Member) commented Apr 1, 2026

I am considering bumping both 12.x and 13.x to latest, and removing cuda-new.Dockerfile in the process as it is no longer needed, not today though. :)

@M1DNYT3 (Contributor, Author) commented Apr 1, 2026

> I am considering bumping both 12.x and 13.x to latest, and removing cuda-new.Dockerfile in the process as it is no longer needed, not today though. :)

@CISC
Appreciated. That said, noting that #21207 was merged without a similar "not today" deferral, my PR predates it and addresses the same compatibility concern. Given the bump to 12.8 already landed, merging this as an incremental fix in the interim seems low-risk and consistent with that precedent. Happy to rebase if needed.

@CISC (Member) commented Apr 1, 2026

> I am considering bumping both 12.x and 13.x to latest, and removing cuda-new.Dockerfile in the process as it is no longer needed, not today though. :)
>
> @CISC Appreciated. That said, noting that #21207 was merged without a similar "not today" deferral, my PR predates it and addresses the same compatibility concern. Given the bump to 12.8 already landed, merging this as an incremental fix in the interim seems low-risk and consistent with that precedent. Happy to rebase if needed.

It did a whole lot more, most specifically updating the base image and tools so that arm64 builds could be done; the bump was a prerequisite for that, not the other way around.

@M1DNYT3 (Contributor, Author) commented Apr 1, 2026

> It did a whole lot more, most specifically updating the base image and tools so that arm64 builds could be done, the bump was a pre-requisite for that, not the other way around.

Makes sense on the ARM64 context. Given that 12.8 landed as a prerequisite for that work, bumping to 12.9.1 seems like an even smaller ask now that the baseline has already moved - it's the latest stable 12.x, picking up all fixes since 12.8, while avoiding 13.x, which still has open issues like #21255. If 12.9.1 becomes a prerequisite for something else tomorrow, the PR is already there. I have no objection to #21207 or its purpose - my question is simply why #20920 shouldn't be merged on its own merits.

@M1DNYT3 (Contributor, Author) commented Apr 2, 2026

> It did a whole lot more, most specifically updating the base image and tools so that arm64 builds could be done, the bump was a pre-requisite for that, not the other way around.
>
> Makes sense on the ARM64 context. Given that 12.8 landed as a prerequisite for that work, bumping to 12.9.1 seems like an even smaller ask now that the baseline has already moved - it's the latest stable 12.x, picking up all fixes since 12.8, while avoiding 13.x, which still has open issues like #21255. If 12.9.1 becomes a prerequisite for something else tomorrow, the PR is already there. I have no objection to #21207 or its purpose - my question is simply why #20920 shouldn't be merged on its own merits.

@CISC Following up on the above - is my PR worth a second look on its own merits?

@CISC (Member) commented Apr 2, 2026

> @CISC Following up on the above - is my PR worth a second look on its own merits?

With some modifications, sure. :)

First of all, leave cuda.Dockerfile as-is and remove cuda-new.Dockerfile; CUDA_VERSION is an argument, so simply update docker.yml with the appropriate cuda_version in the build matrix.
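The design described here - a single cuda.Dockerfile whose CUDA_VERSION is a build argument selected per-entry in the CI matrix - can be sketched roughly as follows. This is an illustrative sketch, not the exact docker.yml in llama.cpp; the field names and the `.devops/` paths are assumptions:

```yaml
# Hypothetical sketch of the build-matrix approach; the real workflow
# may structure its matrix and name its fields differently.
jobs:
  docker:
    strategy:
      matrix:
        config:
          # One cuda.Dockerfile serves both lines; only the version differs.
          - { tag: "cuda12", dockerfile: ".devops/cuda.Dockerfile", cuda_version: "12.9.1" }
          - { tag: "cuda13", dockerfile: ".devops/cuda.Dockerfile", cuda_version: "13.1.1" }
    steps:
      - uses: docker/build-push-action@v6
        with:
          file: ${{ matrix.config.dockerfile }}
          build-args: |
            CUDA_VERSION=${{ matrix.config.cuda_version }}
```

The point of the pattern is that bumping a CUDA version becomes a one-line matrix change rather than a new Dockerfile.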

…ckerfile

Co-authored-by: CISC <CISC@users.noreply.github.com>
@M1DNYT3 M1DNYT3 requested a review from a team as a code owner April 2, 2026 10:03
@M1DNYT3 M1DNYT3 changed the title from "Bump CUDA base image 12.4.0 -> 12.9.1 (enables sm_120a-real for Blackwell)" to "docker: parameterize CUDA version in build matrix, remove cuda-new.Dockerfile" Apr 2, 2026
@M1DNYT3 (Contributor, Author) commented Apr 2, 2026

> @CISC Following up on the above - is my PR worth a second look on its own merits?
>
> With some modifications, sure. :)
>
> First of all, leave cuda.Dockerfile as-is and remove cuda-new.Dockerfile, CUDA_VERSION is an argument, so simply update docker.yml with the appropriate cuda_version in the build matrix.

@CISC Done + marked you as a Co-Author since you suggested the design. Could you please review the changes?

  • cuda13 builds for both arches point at 13.1.1, as they did in cuda-new.Dockerfile.
  • cuda12 builds for both arches point at 12.9.1, as I proposed for the original cuda.Dockerfile.
  • cuda-new.Dockerfile was removed; cuda13 builds now point at cuda.Dockerfile.
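For reference, the parameterization this relies on - cuda.Dockerfile taking the CUDA version as a build argument - follows a standard Dockerfile pattern. This is a sketch; the default value and the base-image tag shown here are illustrative, not necessarily what the repo uses:

```dockerfile
# Sketch of the parameterized base-image pattern. An ARG declared before
# FROM can be interpolated into the base-image tag, so the CI matrix can
# select the toolkit version without touching the Dockerfile itself.
ARG CUDA_VERSION=12.9.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS build
# ... build llama.cpp against the toolkit provided by the base image ...
```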

@CISC (Member) left a comment

Perfect, thank you, can you run it on your fork to verify it builds (disable s390x to be able to complete)?

@M1DNYT3 (Contributor, Author) commented Apr 2, 2026

> Perfect, thank you, can you run it on your fork to verify it builds (disable s390x to be able to complete)?

The run is in progress; I'll post an update once it completes. I made a separate branch specifically for the build test, with the s390x line deleted from the yml, so it won't impact the PR.

https://github.com/M1DNYT3/llama.cpp/actions/runs/23900519302

@M1DNYT3 (Contributor, Author) commented Apr 2, 2026

> Perfect, thank you, can you run it on your fork to verify it builds (disable s390x to be able to complete)?
>
> Runs for now, will notify additionally once completed. Made a separate branch specifically for build test with s390x line deleted from the yml, so it won't impact the PR.
>
> https://github.com/M1DNYT3/llama.cpp/actions/runs/23900519302

@CISC, the builds are done, all green, 12/12. You can check them at the URL I provided above.
However, the PR still needs one more approval - any guidance on how to proceed here?

@CISC CISC requested a review from a team April 2, 2026 14:45
@taronaeo taronaeo added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label Apr 3, 2026
@CISC CISC merged commit 277ff5f into ggml-org:master Apr 3, 2026
2 checks passed
@M1DNYT3 (Contributor, Author) commented Apr 3, 2026

@taronaeo @CISC Thank you for reviewing and helping with improvements! :)

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Apr 6, 2026
Notable upstream changes:
- ggml-org#21038: rotate activations for better KV cache quantization
- ggml-org#21074: generic NVFP4 MMQ kernel
- ggml-org#21271: fix FA kernel selection logic
- ggml-org#21238: fix mmvq mmid kernel selection
- ggml-org#20998: FA support for head dim 512
- ggml-org#20920: docker cuda12 bump to 12.9.1

Conflicts resolved:
- fattn.cu: took upstream (adds head_dim 512 exclusion)
- mmq.cu/mmq.cuh: kept both TQ3 types + upstream additions
- llama-graph.cpp: kept both turbo WHT + upstream v_rot
- docker.yml: took upstream
- tests/CMakeLists.txt: took upstream
@M1DNYT3 M1DNYT3 deleted the cuda-bump branch April 6, 2026 16:31