64 bit CUDA copy routines via GGML_CUDA_ALLOW_LARGE_TENSORS #15298

createthis · 2025-08-13T18:24:53Z

Disclaimer: I couldn't code my way out of a wet paper bag in C++. This is 100% vibe coded AI slop. Upstream issue is #15049

What?

This PR adds GGML_CUDA_ALLOW_LARGE_TENSORS. When enabled, it allows 64 bit sizes in the CUDA copy routines.

Q. What is the difference between INT_MAX and SIZE_MAX / 4? How much larger a tensor will this accommodate?

A. The difference between INT_MAX and SIZE_MAX / 4 is enormous (assuming a 64-bit size_t):

INT_MAX: 2,147,483,647 bytes ≈ 2.00 GiB
SIZE_MAX / 4: 4,611,686,018,427,387,903 bytes ≈ 4,294,967,296 GiB = 4 EiB

The new limit is roughly 2.1 billion times larger than the old one.
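
The arithmetic behind the two limits can be checked with a short sketch. It assumes a 32-bit int and a 64-bit size_t (UINT64_MAX stands in for SIZE_MAX so the numbers do not depend on the build platform). The names kOldLimit, kNewLimit, bytes_to_gib and limit_ratio are made up for illustration and do not appear in ggml.

```cpp
#include <climits>
#include <cstdint>

// Old limit: the copy path asserted ggml_nbytes(src0) <= INT_MAX.
// Assumption: int is 32 bits wide, as on all common CUDA host platforms.
constexpr std::uint64_t kOldLimit = static_cast<std::uint64_t>(INT_MAX);

// New limit from the PR description, assuming a 64-bit size_t.
constexpr std::uint64_t kNewLimit = UINT64_MAX / 4;

// These match the byte counts quoted in the description.
static_assert(kOldLimit == 2147483647ull, "expects a 32-bit int");
static_assert(kNewLimit == 4611686018427387903ull, "2^62 - 1");

// Convert a byte count to GiB (2^30 bytes per GiB).
constexpr double bytes_to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / static_cast<double>(1ull << 30);
}

// How many times larger the new limit is than the old one.
// (2^62 - 1) factors as (2^31 - 1) * (2^31 + 1), so the division is exact.
constexpr std::uint64_t limit_ratio() {
    return kNewLimit / kOldLimit;
}
```

Calling bytes_to_gib(kOldLimit) gives a value just under 2 GiB, and bytes_to_gib(kNewLimit) gives about 2^32 GiB, which is 4 EiB rather than a petabyte-scale figure.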

How?

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_CUDA_ALLOW_LARGE_TENSORS=ON
cmake --build build --config Release

Then:

./build/bin/llama-server \
    --model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
    --alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 400000 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    --seed 3407 \
    --prio 3 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05 \
    --min-p 0.0 \
    --log-colors on \
    --flash-attn on \
    --host 0.0.0.0 \
    --jinja \
    --port 11434

Why?

Cards with a lot of VRAM, like the RTX PRO 6000 Blackwell, may let us keep longer contexts on the GPU than the INT_MAX byte limit in the copy routines allows.

Results

This model starts out with 20-22 tok/s generation at 0 context, so that's pretty terrible performance. Still, when you absolutely, positively, MUST read a huge number of tokens, this may be a potential solution.

JohannesGaessler

For the copy operations register pressure is not an issue. It should be fine to just use 64 bit integers for everything without the need for an extra compile option.

bitbottrap · 2025-08-31T18:33:53Z

This works for me. I'm not familiar with CUDA but from the comments it sounds like the #ifdef fencing isn't required?

createthis · 2025-08-31T18:40:22Z

> This works for me. I'm not familiar with CUDA but from the comments it sounds like the #ifdef fencing isn't required?

I just saw @JohannesGaessler's comment a couple days ago. I'm currently focused on trying to get another PR pushed through, but I'll circle back around to implement the suggested change shortly.

createthis · 2025-09-10T01:02:47Z

@JohannesGaessler I removed the compile option. I also ran this with LongBench for a few hours (15/502 tests) just to ensure it was working: https://github.com/createthis/LongBench/pull/1/files

LongBench tested it out to 400k context.

JohannesGaessler

I think you misunderstood what I meant: you should be using 64 bit values for the kernel arguments and add a loop that allows the kernel to iterate over an essentially arbitrarily large amount of data. Launching multiple CUDA kernels in chunks is a fundamentally bad solution.
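
The pattern described here is commonly called a grid-stride loop. Below is a minimal sketch of its indexing, emulated on the CPU in plain C++ so it runs without a GPU. It is not the code from cpy.cu: copy_thread, copy_launch and copy_example are hypothetical names. A real kernel would derive `first` and `stride` from blockIdx, blockDim, threadIdx and gridDim, widening them to 64 bits before multiplying.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using std::int64_t;

// Work done by one emulated CUDA thread. In a real kernel:
//   first  = (int64_t) blockIdx.x * blockDim.x + threadIdx.x
//   stride = (int64_t) gridDim.x  * blockDim.x
// Every index is int64_t, so ne may exceed INT_MAX.
void copy_thread(const float * src, float * dst, int64_t ne, int64_t first, int64_t stride) {
    for (int64_t i = first; i < ne; i += stride) {
        dst[i] = src[i];
    }
}

// Emulates a single launch of num_blocks x block_size threads.
// The grid size stays fixed; the loop in copy_thread covers any ne,
// so no second launch is needed for large inputs.
void copy_launch(const float * src, float * dst, int64_t ne, int64_t num_blocks, int64_t block_size) {
    assert(num_blocks > 0 && block_size > 0);
    const int64_t stride = num_blocks * block_size;
    for (int64_t b = 0; b < num_blocks; ++b) {
        for (int64_t t = 0; t < block_size; ++t) {
            copy_thread(src, dst, ne, b * block_size + t, stride);
        }
    }
}

// Example: 1000 elements copied by a 2 x 32 "grid" in one launch.
std::vector<float> copy_example() {
    std::vector<float> src(1000), dst(1000, -1.0f);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<float>(i);
    }
    copy_launch(src.data(), dst.data(), static_cast<int64_t>(src.size()), 2, 32);
    return dst;
}
```

Because each thread advances by the total thread count, the number of elements is decoupled from the launch configuration. That is the contrast with splitting the copy into several chunked kernel launches.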

JohannesGaessler · 2025-09-10T16:16:39Z

ggml/src/ggml-cuda/cpy.cu

-        (cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, cdst_indirect, graph_cpynode_index++);
+    const char * cx, char * cdst, const int64_t ne,
+    const int64_t ne00, const int64_t ne01, const int64_t ne02, const int64_t nb00, const int64_t nb01, const int64_t nb02,
+    const int64_t nb03, const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13, cudaStream_t stream, char ** cdst_indirect, int & graph_cpynode_index) {


Suggested change

-    const int64_t nb03, const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13, cudaStream_t stream, char ** cdst_indirect, int & graph_cpynode_index) {
+    const int64_t nb03, const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t nb10, const int64_t nb11,
+    const int64_t nb12, const int64_t nb13, cudaStream_t stream, char ** cdst_indirect, int & graph_cpynode_index) {

createthis added 4 commits August 13, 2025 09:21

Add compile-time flag GGML_CUDA_ALLOW_LARGE_TENSORS to bypass INT_MAX check in ggml_cuda_cpy (73ef5b9)

R1-0528's attempt to implement this. I doubt this code works. User beware. (d3ea7d2)

New assertions for GGML_CUDA_ALLOW_LARGE_TENSORS upper bounds, coded by Qwen3-Coder-480B-A35B-Instruct-1M-GGUF (39fbbb8)

Add compile option GGML_CUDA_ALLOW_LARGE_TENSORS and define macro for CUDA large tensor support. This change by gpt-oss-120b-mxfp4. (e40e6a6)

github-actions bot added the "Nvidia GPU" and "ggml" labels on Aug 13, 2025

This was referenced Aug 13, 2025:

Eval bug: Qwen3-Coder-480B-A35B-Instruct-1M-GGUF GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed #15049 (Open)
Fix 131k context ggml assert createthis/llama.cpp#3 (Closed)

JohannesGaessler reviewed Aug 19, 2025

createthis added 4 commits September 9, 2025 14:35

Merge branch 'master' into fix_131k_context_GGML_ASSERT (68b2df9)

Remove trailing whitespace. (804b0da)

Missed a few whitespace issues. (de92af7)

Remove GGML_CUDA_ALLOW_LARGE_TENSORS compile option and just make it the default per JohannesGaessler's request. (d1afcd8)

JohannesGaessler reviewed Sep 10, 2025

3 participants
