Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Make all rpc-servers load tensors into device (GPU) memory in parallel when the tensors are available in their respective local caches.
Motivation
As of b6140, tensors are loaded sequentially from disk into GPU memory. This is a simple and performant approach when all GPUs are connected to a single machine: load performance is bound by disk I/O, and parallelizing could even decrease performance.
The scenario changes if some of the GPUs are connected via rpc-servers.
If the rpc-server machine(s) have no cache yet, meaning all tensors must be transmitted over the network, loading is bound by the network bandwidth of the client machine (the one sending out the tensors). Parallelizing would again be useless or detrimental.
However, if tensors are already cached on the rpc-server machines' disks, each rpc-server could immediately start loading them into its GPUs. This would greatly decrease model load time in multi-rpc-server scenarios, since the servers would effectively be loading tensors in parallel.
Possible Implementation
For each rpc-server in use, immediately start loading any tensors that are present in the server's local cache (see the sketch at the end of this section).
Since we can now run just one rpc-server per physical machine (thanks to @rgerganov in PR #16276), there is no longer a risk of parallel access to the same hard drive (which can be slower on some hardware).
PR #13106 seems related to this feature request.
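Below is a minimal client-side sketch of the idea, not a proposal for the actual implementation. The names `rpc_server`, `has_local_cache()`, and `load_tensor_from_cache()` are hypothetical placeholders, not existing llama.cpp APIs; the tensor names and endpoints are likewise illustrative.

```cpp
#include <cstdio>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a connected rpc-server; all members are
// placeholders, not existing llama.cpp APIs.
struct rpc_server {
    std::string endpoint;  // e.g. "192.168.1.10:50052"

    // Placeholder: ask the server whether the tensor is in its local cache.
    bool has_local_cache(const std::string & /*tensor_name*/) const { return true; }

    // Placeholder: tell the server to load the tensor from its local disk
    // cache straight into GPU memory, with no network transfer.
    void load_tensor_from_cache(const std::string & tensor_name) const {
        std::printf("%s: loading %s from local cache\n",
                    endpoint.c_str(), tensor_name.c_str());
    }
};

// Start one async job per rpc-server; each job loads every tensor that hits
// that server's local cache. Tensors that miss would then be streamed over
// the network sequentially, as today.
static void load_cached_tensors_parallel(const std::vector<rpc_server> & servers,
                                         const std::vector<std::string> & tensor_names) {
    std::vector<std::future<void>> jobs;
    for (const auto & srv : servers) {
        jobs.push_back(std::async(std::launch::async, [&srv, &tensor_names] {
            for (const auto & name : tensor_names) {
                if (srv.has_local_cache(name)) {
                    srv.load_tensor_from_cache(name);  // disk -> GPU, no network
                }
            }
        }));
    }
    for (auto & j : jobs) {
        j.get();  // wait for all cache-side loads to finish
    }
    // Network path for cache misses would follow here, bounded by the
    // client's upload bandwidth (unchanged from current behavior).
}

int main() {
    std::vector<rpc_server> servers = { {"192.168.1.10:50052"}, {"192.168.1.11:50052"} };
    std::vector<std::string> tensors = { "blk.0.attn_q.weight", "blk.0.attn_k.weight" };
    load_cached_tensors_parallel(servers, tensors);
}
```

The key point is that each server's disk-to-GPU copy runs in its own job, so the load time for the cache-hit portion approaches that of the slowest server rather than the sum across all servers.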