Zero inference is too slow #5418
Unanswered
garyyang85 asked this question in Q&A
Replies: 1 comment 4 replies
@garyyang85, zero-inference is expected to be slower before of streaming weights over the slower PCIe link. Here are a couple of things to do.
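As a rough illustration of why this is slow: a 13B fp16 model is about 26 GB of weights, and at PCIe 3.0 bandwidth (~16 GB/s) streaming them takes on the order of 1.5–2 s per forward pass just for the transfers. The streaming behaviour is governed by the ZeRO stage-3 section of the DeepSpeed config; the sketch below uses illustrative values, not tuned recommendations.

```python
# Sketch of a ZeRO-Inference (ZeRO stage-3 + parameter offload) config.
# The numeric values are illustrative assumptions; tune them for your
# model and hardware.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",     # weights live in host RAM and stream over PCIe
            "pin_memory": True,  # pinned host buffers speed up host-to-device copies
        },
        # Larger prefetch / live-parameter budgets trade GPU memory for fewer,
        # bigger transfers (relevant here, since only ~8 GB of 32 GB is used).
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
    },
    "train_micro_batch_size_per_gpu": 1,
}
```

Batching several prompts per forward pass also helps, since the same weight transfer is then amortized over more tokens.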
I am using DeepSpeed ZeRO-Inference for inference. The model is 13B float16, running on a single V100 32G GPU. With plain inference, when the input is more than about 2000 tokens (the model should support up to 4096), it reports "CUDA out of memory". So I found the ZeRO-Inference solution in DeepSpeed here. But the inference speed is too slow, and GPU memory usage is only about 8 GB. Is there a way to speed up the inference and use more of the GPU? Thanks.
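For reference, the usual way to run ZeRO-Inference with a Hugging Face checkpoint is to wrap the model with `deepspeed.initialize` and a stage-3 config like the sketch above. A minimal sketch, where the model name and generation lengths are placeholders:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the actual 13B model being used.
model_name = "your-13b-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# ds_config is a ZeRO stage-3 + offload config like the sketch above.
# Launch with: deepspeed --num_gpus 1 this_script.py
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```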