64 bit CUDA copy routines via GGML_CUDA_ALLOW_LARGE_TENSORS #15298
… check in ggml_cuda_cpy
…by Qwen3-Coder-480B-A35B-Instruct-1M-GGUF
… CUDA large tensor support This change by gpt-oss-120b-mxfp4.
For the copy operations register pressure is not an issue. It should be fine to just use 64 bit integers for everything without the need for an extra compile option.
This works for me. I'm not familiar with CUDA but from the comments it sounds like the #ifdef fencing isn't required?
I just saw @JohannesGaessler's comment a couple days ago. I'm currently focused on trying to get another PR pushed through, but I'll circle back around to implement the suggested change shortly.
@JohannesGaessler I removed the compile option. I also ran this with LongBench for a few hours (15/502 tests) just to ensure it was working: https://github.com/createthis/LongBench/pull/1/files
I think you misunderstood what I meant: you should be using 64 bit values for the kernel arguments and add a loop that allows the kernel to iterate over an essentially arbitrarily large amount of data. Launching multiple CUDA kernels in chunks is a fundamentally bad solution.
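One common way to do what this comment describes (64 bit kernel arguments plus a loop inside the kernel) is a grid-stride loop. The following is a minimal host-side C++ emulation of that pattern, not the code from this PR: `LaunchDims`, `copy_thread`, `copy_grid_stride`, and `copy_all` are hypothetical names, and the nested loops stand in for the indices that a real CUDA kernel would read from `blockIdx`, `threadIdx`, `blockDim`, and `gridDim`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the CUDA launch geometry (gridDim.x, blockDim.x).
struct LaunchDims {
    int64_t num_blocks;
    int64_t block_size;
};

// Work done by one emulated thread: start at its global index and advance by
// the total thread count, so a fixed-size launch covers any 64 bit element count.
void copy_thread(const float * src, float * dst, int64_t ne,
                 LaunchDims dims, int64_t block_idx, int64_t thread_idx) {
    const int64_t start  = block_idx * dims.block_size + thread_idx;
    const int64_t stride = dims.num_blocks * dims.block_size;
    for (int64_t i = start; i < ne; i += stride) {
        dst[i] = src[i];
    }
}

// Emulates ONE kernel launch by running every thread sequentially on the host.
void copy_grid_stride(const float * src, float * dst, int64_t ne, LaunchDims dims) {
    assert(dims.num_blocks > 0 && dims.block_size > 0);
    for (int64_t b = 0; b < dims.num_blocks; ++b) {
        for (int64_t t = 0; t < dims.block_size; ++t) {
            copy_thread(src, dst, ne, dims, b, t);
        }
    }
}

// Convenience wrapper for trying the sketch out on a std::vector.
std::vector<float> copy_all(const std::vector<float> & src, LaunchDims dims) {
    std::vector<float> dst(src.size(), 0.0f);
    copy_grid_stride(src.data(), dst.data(), static_cast<int64_t>(src.size()), dims);
    return dst;
}
```

With `LaunchDims{3, 32}` there are only 96 emulated threads, yet `copy_all` still copies a 1000-element vector completely, because each thread keeps striding until its index passes `ne`. The launch size is independent of the data size, which is why no chunked relaunching is needed.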
(cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, cdst_indirect, graph_cpynode_index++);
const char * cx, char * cdst, const int64_t ne,
const int64_t ne00, const int64_t ne01, const int64_t ne02, const int64_t nb00, const int64_t nb01, const int64_t nb02,
const int64_t nb03, const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13, cudaStream_t stream, char ** cdst_indirect, int & graph_cpynode_index) {
Suggested change:
const int64_t nb03, const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t nb10, const int64_t nb11, const int64_t nb12, const int64_t nb13, cudaStream_t stream, char ** cdst_indirect, int & graph_cpynode_index) {
becomes:
const int64_t nb03, const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t nb10, const int64_t nb11,
const int64_t nb12, const int64_t nb13, cudaStream_t stream, char ** cdst_indirect, int & graph_cpynode_index) {
Disclaimer: I couldn't code my way out of a wet paper bag in C++. This is 100% vibe coded AI slop. Upstream issue is #15049
What?
This PR adds GGML_CUDA_ALLOW_LARGE_TENSORS. When enabled, it allows 64 bit sizes in the CUDA copy routines.
Q. What is the difference between INT_MAX and SIZE_MAX / 4? How much larger of a tensor will this accommodate?
A. The difference between INT_MAX and SIZE_MAX / 4 is enormous (assuming a 64 bit size_t):
INT_MAX: 2,147,483,647 bytes ≈ 2.00 GiB
SIZE_MAX / 4: 4,611,686,018,427,387,903 bytes ≈ 4,294,967,296 GiB = 4 EiB (about 4.6 EB)
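The arithmetic behind these two limits can be checked with a few lines of C++. This is only a sketch for verifying the numbers, assuming a 64 bit `size_t`; the names `INT_MAX_BYTES`, `SIZE_MAX_DIV4_BYTES`, `to_gib`, and `limit_ratio` are hypothetical and do not appear in ggml.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Assumption: a 64 bit size_t, as on typical x86-64 and AArch64 builds.
static_assert(sizeof(size_t) == 8, "this sketch assumes a 64 bit size_t");

// The old and new upper bounds on a copy, in bytes.
constexpr uint64_t INT_MAX_BYTES       = static_cast<uint64_t>(INT_MAX);
constexpr uint64_t SIZE_MAX_DIV4_BYTES = static_cast<uint64_t>(SIZE_MAX) / 4;

// Hypothetical helper: byte count rounded to the nearest GiB (2^30 bytes).
// Not overflow-safe for inputs within 2^29 of UINT64_MAX.
constexpr uint64_t to_gib(uint64_t bytes) {
    return (bytes + (uint64_t{1} << 29)) >> 30;
}

// Hypothetical helper: how many times larger the new limit is than the old one.
constexpr uint64_t limit_ratio() {
    return SIZE_MAX_DIV4_BYTES / INT_MAX_BYTES;
}
```

Here `to_gib(INT_MAX_BYTES)` evaluates to 2 and `to_gib(SIZE_MAX_DIV4_BYTES)` to 4,294,967,296, which is 4 EiB (1 EiB = 2^30 GiB). `limit_ratio()` is 2,147,483,649, so the new bound is roughly two billion times the old one.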
How?
Then:
./build/bin/llama-server \
    --model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
    --alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 400000 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    --seed 3407 \
    --prio 3 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05 \
    --min-p 0.0 \
    --log-colors on \
    --flash-attn on \
    --host 0.0.0.0 \
    --jinja \
    --port 11434
Why?
Cards with a lot of VRAM, like the Blackwell 6000 Pro, may let us use larger in-GPU context lengths than INT_MAX allows.
Results
This model starts out with 20-22 tok/s generation at 0 context, so that's pretty terrible performance. Still, when you absolutely, positively, MUST read a huge number of tokens, this may be a potential solution.