Gunicorn Workers Not Using GPU in Parallel #2985
vibhas-singh started this conversation in General
Replies: 3 comments
- I moved it to a discussion since it's more likely an OS issue than directly related to gunicorn.
- I'm having a similar issue. @vibhas-singh did you resolve this issue?
- Any updates on this @vibhas-singh or @Irtiza17? I am also facing a similar issue. Any help would be greatly appreciated.
I am trying to deploy a PyTorch image classification model wrapped in Flask on `g4dn.xlarge` (4 vCPU, 16 GB RAM, T4 GPU with 16 GB memory) instances on AWS. To select the optimal number of workers I performed some experiments:
Experiment 1:
Concurrent Requests: 1
Total Time To Process 15 Requests By A Client: 15.87s (`model.forward` takes 14.98s)

Experiment 2:
Concurrent Requests: 2 (2 clients sending requests in parallel)
Total Time To Process 15 Requests By A Client: 29.35s (`model.forward` takes 28.34s, 2x of a single request, every other step taking a similar time)

Experiment 3:
Concurrent Requests: 3 (3 clients sending requests in parallel)
Total Time To Process 15 Requests By A Client: 43.82s (`model.forward` takes 41.81s, 3x of a single request, every other step taking a similar time)

Using 3x workers lets me process 3 requests in parallel, but the overall processing time of those requests also becomes 3x, so there is no improvement in real terms.
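For reference, the worker setup these experiments imply could be expressed in a `gunicorn.conf.py` along the following lines. This is only a sketch: the bind address, timeout, and the `app:app` module path used to launch it are illustrative assumptions, not the actual deployment settings.

```python
# gunicorn.conf.py -- hypothetical configuration matching the 3-worker experiment.
# Each sync worker is a separate OS process with its own CUDA context, so all
# three workers load their own copy of the model and contend for the single T4 GPU.
bind = "0.0.0.0:8000"   # assumed listen address
workers = 3             # one worker per concurrent client in Experiment 3
worker_class = "sync"   # Gunicorn's default worker type
timeout = 120           # generous timeout so slow GPU inference doesn't get the worker killed
```

Launched with something like `gunicorn -c gunicorn.conf.py app:app`.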
I initially thought CPU or I/O processing was the bottleneck in the app, but after intensively logging the time taken at each step I found that the bottleneck is the GPU processing (`model.forward` starts taking 2x-3x as long). By checking the process IDs of the workers for each request I can also confirm that all the workers receive requests in parallel, but they are not able to perform the GPU processing in parallel at the same time.
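A minimal sketch of the kind of per-step instrumentation described above, assuming a Flask endpoint that times `model.forward` and logs the worker PID. The model (ResNet-50), the `/predict` route, and the `image` form field are illustrative choices, not details of the actual app.

```python
import os
import time

import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import models, transforms

app = Flask(__name__)

# Each Gunicorn worker process loads its own copy of the model onto the (shared) GPU
# when this module is imported.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).to(device).eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

@app.route("/predict", methods=["POST"])
def predict():
    # Assumed request format: an image uploaded under the form field "image".
    img = Image.open(request.files["image"].stream).convert("RGB")
    batch = preprocess(img).unsqueeze(0).to(device)

    start = time.perf_counter()
    with torch.no_grad():
        logits = model(batch)  # the model.forward call whose time is being measured
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure the GPU work is done before stopping the clock
    forward_time = time.perf_counter() - start

    # The PID identifies which Gunicorn worker handled the request.
    app.logger.info("pid=%d forward_time=%.3fs", os.getpid(), forward_time)
    return jsonify({"class_id": int(logits.argmax(dim=1)), "forward_time": forward_time})
```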
Any guidance on what the bottleneck could be here would be very helpful.
Also, is there a recommended worker type for this kind of GPU-dependent processing?