Disable pipeline parallel for tensor override or allocation failed #879
Conversation
When I was testing with a two-GPU setup, enabling pipeline parallel led to an extra 5400 MiB of VRAM usage in CUDA0_host, which is not present in mainline. Disabling pipeline parallel gets rid of the 5400 MiB. This size is about 4 times the context size. I also see some errors in debug output when doing inference, which are likewise not present in mainline.
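For readers following along, here is a back-of-the-envelope reading of the "about 4 times the context size" observation. This is my assumption, modeled on how mainline's scheduler behaves, not confirmed against this fork's code: with `-DGGML_SCHED_MAX_COPIES=4` the scheduler keeps one copy of each split's inputs per in-flight micro-batch, so the pinned host buffer scales roughly linearly with the copy count. The per-copy size below is a hypothetical figure chosen to match the report:

```c
#include <stdio.h>

/* Illustrative arithmetic only: with GGML_SCHED_MAX_COPIES = 4 the
 * scheduler keeps one copy of the split inputs per in-flight
 * micro-batch, so the pinned host buffer grows with the copy count.
 * per_copy_mib is a hypothetical value (~context size), not measured. */
int main(void) {
    const int    n_copies     = 4;       /* GGML_SCHED_MAX_COPIES */
    const double per_copy_mib = 1350.0;  /* hypothetical, ~context size */
    printf("extra host buffer: ~%.0f MiB\n", n_copies * per_copy_mib);
    return 0;
}
```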
The above issue happens when I use DeepSeek-V2-Lite-Chat.IQ4_XS.gguf. Since most people use tensor overrides for DeepSeek, it won't surface after this PR is merged. Qwen2.5-7B does not have the VRAM issue, but I get

I just tested with

The absence of these messages is due to mainline changes in the allocator (
All this PR does is enable or disable pipeline parallel. I tested with it disabled, and

My cmake command is

Server command:

I will do some digging into the RAM usage issue.
For testing purposes, I synced ggml-alloc.c from mainline, but I still see 5.4 GB in RAM. Then I tried different mla settings. With
@firecoperana Thank you for investigating this! I think, if syncing with mainline's

Concerning the gibberish with Qwen2.5: which model are you using? I tested with a Qwen2.5-7B vision model that I had lying around, assuming the text portion of it is the same as pure-text Qwen2.5.
Force-pushed from d851b05 to 2ac4b42
I was using Qwen2.5-7B.i1-Q4_K_S.gguf, but I tested Qwen2.5-VL-7B-Instruct-Q8_0.gguf and it also produces gibberish. I also see this with GLM-4.5-Air using CUDA. Maybe I need to play around with my build command.
GLM-4.5-Air, being so popular, is one of the models I test with. It works fine for me, both CPU-only and hybrid GPU/CPU. I cannot do CUDA-only with my setup. Perhaps pull the latest main branch and make a fresh build? If the failure persists, please post full logs including the command line.
After I did a fresh build, it works now. Thanks! |
This PR attempts to solve an abnormal VRAM issue when people use -DGGML_SCHED_MAX_COPIES=4 with tensor overrides. This value is linked to pipeline parallelism: when GGML_SCHED_MAX_COPIES > 1, the code enables pipeline parallel. Disabling pipeline parallel makes VRAM usage return to normal.
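For context, a minimal sketch of the gating logic, assuming it follows mainline llama.cpp; the stub names below are illustrative, not this fork's exact code:

```c
#include <stdbool.h>
#include <stdio.h>

/* Conditions that gate pipeline parallelism (assumption, modeled on
 * mainline llama.cpp; names here are illustrative stubs). */
enum split_mode { SPLIT_MODE_NONE, SPLIT_MODE_LAYER, SPLIT_MODE_ROW };

static bool use_pipeline_parallel(int sched_max_copies, int device_count,
                                  int n_gpu_layers, int n_layer,
                                  enum split_mode mode, bool offload_kqv) {
    return sched_max_copies > 1  &&      /* -DGGML_SCHED_MAX_COPIES=4  */
           device_count     > 1  &&      /* at least two GPUs          */
           n_gpu_layers > n_layer &&     /* model fully offloaded      */
           mode == SPLIT_MODE_LAYER &&   /* layer split, not row split */
           offload_kqv;                  /* KQV offloaded as well      */
}

int main(void) {
    /* e.g. a two-GPU, fully offloaded 28-layer model, default build flag: */
    printf("pipeline parallel: %d\n",
           use_pipeline_parallel(4, 2, 29, 28, SPLIT_MODE_LAYER, true));
    return 0;
}
```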
It will also disable pipeline parallel if allocation of the compute buffer fails, and reallocate with pipeline parallel disabled.
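A self-contained sketch of that fallback path, with hypothetical stub types standing in for ggml_backend_sched_new/_reserve rather than the PR's actual code:

```c
#include <stdbool.h>
#include <stdio.h>

/* Sketch: try to reserve compute buffers with pipeline parallelism on;
 * if the allocation fails, rebuild the scheduler with it off and retry.
 * The types and failure model here are illustrative stubs. */
typedef struct { bool parallel; } sched_t;

static sched_t sched_new(bool parallel) { sched_t s = { parallel }; return s; }

static bool sched_reserve(sched_t *s) {
    /* Pretend the parallel path needs n_copies times the memory and fails. */
    return !s->parallel;
}

int main(void) {
    bool pipeline_parallel = true;  /* GGML_SCHED_MAX_COPIES > 1 */
    sched_t sched = sched_new(pipeline_parallel);
    if (!sched_reserve(&sched)) {
        fprintf(stderr, "compute buffer allocation failed, "
                        "retrying with pipeline parallel disabled\n");
        sched = sched_new(false);
        if (!sched_reserve(&sched)) {
            fprintf(stderr, "allocation failed even without pipeline parallel\n");
            return 1;
        }
    }
    printf("scheduler ready (pipeline parallel: %s)\n",
           sched.parallel ? "on" : "off");
    return 0;
}
```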