Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Make all rpc-servers load tensors into device (GPU) memory in parallel when the tensors are available in their respective local caches.
Motivation
As of b6140, tensors are loaded sequentially from disk into GPU memory. This is a simple and performant approach when all GPUs are connected to a single machine: load performance is bound by disk I/O, and parallelizing could even decrease performance.
The scenario changes if some of the GPUs are connected via rpc-servers.
If the rpc-server machine(s) have no cache yet, meaning all tensors must be transmitted over the network, loading is bound by the network bandwidth of the client machine (the one sending out the tensors). Parallelizing would again be useless or detrimental.
However, if tensors are already cached on the rpc-server machines' disks, each rpc-server could immediately start loading them into its GPUs. This would greatly decrease model load time in multi-rpc-server scenarios, since the servers would effectively be loading tensors in parallel.
Possible Implementation
For each rpc-server in use, immediately start loading any tensors that are present in the server's local cache (see the sketch at the end of this section).
Since we can now run just one rpc-server per physical machine (thanks to @rgerganov in PR #16276), there is no longer a risk of parallel access to the same hard drive (which can be slower on some hardware).
PR #13106 seems related to this feature request.
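Below is a minimal client-side sketch of the idea, not a proposal for the actual implementation. The names `rpc_server`, `has_local_cache()`, and `load_tensor_from_cache()` are hypothetical placeholders, not existing llama.cpp APIs; the tensor names and endpoints are likewise illustrative.

```cpp
#include <cstdio>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a connected rpc-server; all members are
// placeholders, not existing llama.cpp APIs.
struct rpc_server {
    std::string endpoint;  // e.g. "192.168.1.10:50052"

    // Placeholder: ask the server whether the tensor is in its local cache.
    bool has_local_cache(const std::string & /*tensor_name*/) const { return true; }

    // Placeholder: tell the server to load the tensor from its local disk
    // cache straight into GPU memory, with no network transfer.
    void load_tensor_from_cache(const std::string & tensor_name) const {
        std::printf("%s: loading %s from local cache\n",
                    endpoint.c_str(), tensor_name.c_str());
    }
};

// Start one async job per rpc-server; each job loads every tensor that hits
// that server's local cache. Tensors that miss would then be streamed over
// the network sequentially, as today.
static void load_cached_tensors_parallel(const std::vector<rpc_server> & servers,
                                         const std::vector<std::string> & tensor_names) {
    std::vector<std::future<void>> jobs;
    for (const auto & srv : servers) {
        jobs.push_back(std::async(std::launch::async, [&srv, &tensor_names] {
            for (const auto & name : tensor_names) {
                if (srv.has_local_cache(name)) {
                    srv.load_tensor_from_cache(name);  // disk -> GPU, no network
                }
            }
        }));
    }
    for (auto & j : jobs) {
        j.get();  // wait for all cache-side loads to finish
    }
    // Network path for cache misses would follow here, bounded by the
    // client's upload bandwidth (unchanged from current behavior).
}

int main() {
    std::vector<rpc_server> servers = { {"192.168.1.10:50052"}, {"192.168.1.11:50052"} };
    std::vector<std::string> tensors = { "blk.0.attn_q.weight", "blk.0.attn_k.weight" };
    load_cached_tensors_parallel(servers, tensors);
}
```

The key point is that each server's disk-to-GPU copy runs in its own job, so the load time for the cache-hit portion approaches that of the slowest server rather than the sum across all servers.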