Description
Note: This issue was copied from ggml-org#16434
Original Author: @nguha
Original Issue Number: ggml-org#16434
Created: 2025-10-05T15:49:49Z
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Make all rpc-servers load tensors into device (GPU) memory in parallel when the tensors are available in their respective local caches.
Motivation
As of b6140, tensors are loaded sequentially from disk into GPU memory. This is a simple and performant approach when all GPUs are connected to a single machine: load performance is bound by disk I/O, and parallelizing could even decrease it.
The scenario changes if some of the GPUs are connected via rpc-servers.
If there is no cache yet on the rpc-server machine(s), meaning that all tensors must be transmitted over the network, the load is bound by the network bandwidth of the client machine (the one sending out the tensors). Parallelizing would again be useless or detrimental.
However, if tensors are already cached on the rpc-server machine(s)' disk(s), each rpc-server could immediately start loading them into its GPUs. This would greatly reduce model load time in multi-rpc-server scenarios, as the servers would effectively be loading tensors in parallel.
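For a rough sense of the potential gain (purely illustrative numbers, not measurements): with three rpc-servers, each holding ~20 GB of cached tensors on a disk that reads at ~2 GB/s, loading the servers one after another takes roughly 3 × 10 s = 30 s, while letting them load concurrently takes roughly max(10 s, 10 s, 10 s) = 10 s. In general, the cached portion of the load drops from the sum of the per-server times to the maximum of them.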
Possible Implementation
For each rpc-server in use, immediately start loading tensors if they are present in the server's local cache (a rough sketch is included below).
As we can now run just one rpc-server per physical machine (thanks to @rgerganov on PR ggml-org#16276), there is no longer a risk of parallel access to the hard drive (which can be slow on some hardware).
PR ggml-org#13106 seems related to this feature request.
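To make the idea concrete, here is a minimal C++ sketch of how the client-side load loop could be restructured. All names in it (rpc_server_handle, cache_has_tensor, load_tensor_from_cache, send_tensor_over_network) are hypothetical placeholders, not existing llama.cpp APIs; the point is only the two-phase structure: cache hits are loaded in parallel, one worker per rpc-server, and cache misses are transferred sequentially afterwards.

```cpp
// Rough sketch of the proposed scheme with a hypothetical API: the types and
// helpers below are placeholders for illustration and do not exist in
// llama.cpp.
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

struct tensor_desc {
    std::string name; // tensor identifier, e.g. "blk.0.attn_q.weight"
};

struct rpc_server_handle {
    std::string endpoint;             // e.g. "192.168.1.10:50052"
    std::vector<tensor_desc> tensors; // tensors assigned to this server
};

// Stubbed helpers; real implementations would live in the RPC backend.
static bool cache_has_tensor(const rpc_server_handle &, const tensor_desc &) {
    return true; // stub: pretend every tensor is cached
}
static void load_tensor_from_cache(const rpc_server_handle & srv, const tensor_desc & t) {
    std::printf("%s: loading %s from local cache\n", srv.endpoint.c_str(), t.name.c_str());
}
static void send_tensor_over_network(const rpc_server_handle & srv, const tensor_desc & t) {
    std::printf("%s: sending %s over the network\n", srv.endpoint.c_str(), t.name.c_str());
}

static void load_model(std::vector<rpc_server_handle> & servers) {
    // Phase 1: one worker per rpc-server, so cached tensors are loaded into
    // each server's GPU(s) in parallel instead of one tensor at a time.
    std::vector<std::thread> workers;
    for (auto & srv : servers) {
        workers.emplace_back([&srv] {
            for (const auto & t : srv.tensors) {
                if (cache_has_tensor(srv, t)) {
                    load_tensor_from_cache(srv, t);
                }
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }

    // Phase 2: cache misses stay sequential, since they are bound by the
    // client's outgoing network bandwidth anyway.
    for (const auto & srv : servers) {
        for (const auto & t : srv.tensors) {
            if (!cache_has_tensor(srv, t)) {
                send_tensor_over_network(srv, t);
            }
        }
    }
}

int main() {
    std::vector<rpc_server_handle> servers = {
        {"192.168.1.10:50052", {{"blk.0.attn_q.weight"}, {"blk.0.attn_k.weight"}}},
        {"192.168.1.11:50052", {{"blk.1.attn_q.weight"}, {"blk.1.attn_k.weight"}}},
    };
    load_model(servers);
    return 0;
}
```

The two-phase split is just one possible design: keeping the network transfers sequential preserves the current behavior for cache misses, while the parallel phase only touches tensors that never leave the rpc-server machine.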