
Conversation

saood06
Collaborator

saood06 commented Feb 8, 2025

I grabbed all of the changes needed for llama.cpp/pull/11047, which were ggml-org/llama.cpp#9912 and ggml-org/llama.cpp#9040.

This compiles, but has not been tested yet.

@ikawrakow
Owner

I never use RPC and have never looked into the RPC code, so I'll have to rely on you for self-review and testing.

@saood06
Collaborator Author

saood06 commented Feb 10, 2025

@jukofyork

> I strongly suspect something funky is going on

There is, see this comment: #180 (comment)

This fork has much faster PP (prompt processing) speeds and has DeepSeek MLA support behind a flag (-mla); this PR should allow RPC to work, and I'm working on porting the option to override model tensor buffers.

This is something I've done for a while on my Windows builds, because on Windows long is not 8 bytes. On Linux this changes nothing, as both types are 8 bytes there.
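
As a minimal illustration (assuming the change in question swaps long for a fixed-width 64-bit type such as int64_t; this snippet is only for context, not part of the diff):

#include <cstdint>
#include <cstdio>

int main() {
    // Windows uses the LLP64 data model, so long is 4 bytes there,
    // while Linux (LP64) makes it 8 bytes. Fixed-width types such as
    // int64_t are 8 bytes on both, which is why the change is a no-op
    // on Linux but matters for Windows builds.
    std::printf("sizeof(long)    = %zu\n", sizeof(long));
    std::printf("sizeof(int64_t) = %zu\n", sizeof(int64_t));
    return 0;
}
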
@saood06
Collaborator Author

saood06 commented Feb 27, 2025

This has now been tested, and it does not currently work. I'm not sure why, as the errors I'm getting don't seem to have been encountered by anyone on llama.cpp.


rpc_msg_get_alloc_size_rsp response;
bool status = send_rpc_cmd(sock, RPC_CMD_GET_ALLOC_SIZE, &request, sizeof(request), &response, sizeof(response));
GGML_ASSERT(status);
Collaborator Author

The RPC client crashes here (this assert fires), which happens because the RPC server hits an issue on its end.
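
For context, here is a minimal sketch of the client-side call that wraps the lines above, modeled on the upstream llama.cpp RPC client (the wrapper name and exact types are assumptions for this port). The point is that GGML_ASSERT(status) is the only error handling on this path, so any server-side failure surfaces as a client crash:

// Client side: ask the RPC server how many bytes this tensor needs in its
// destination buffer. rpc_client_get_alloc_size() is a hypothetical wrapper;
// the helpers it calls (serialize_tensor, send_rpc_cmd) are the ones used in
// the snippet above.
static size_t rpc_client_get_alloc_size(const std::shared_ptr<socket_t> & sock,
                                        const ggml_tensor * tensor) {
    rpc_msg_get_alloc_size_req request;
    request.tensor = serialize_tensor(tensor); // flatten tensor metadata for the wire

    rpc_msg_get_alloc_size_rsp response;
    bool status = send_rpc_cmd(sock, RPC_CMD_GET_ALLOC_SIZE,
                               &request, sizeof(request),
                               &response, sizeof(response));
    // send_rpc_cmd() returns false if the server closes the connection or
    // answers with a failure status -- that is what trips this assert.
    GGML_ASSERT(status);

    return response.alloc_size;
}
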

ggml_tensor * tensor = deserialize_tensor(ctx, &request.tensor);

if (tensor == nullptr) {
GGML_PRINT_DEBUG("Null tensor pointer passed to server get_alloc_size function.\n");
Collaborator Author

I'm fairly certain this is where the RPC server is crashing, although it doesn't print the message since I never ran with GGML_DEBUG enabled.
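
For reference, a sketch of what the corresponding server-side handler likely looks like, modeled on the upstream llama.cpp RPC server (member and helper names are assumptions for this port). If deserialize_tensor() rejects the incoming tensor, the handler returns false, the request fails, and the client-side GGML_ASSERT(status) above fires with nothing printed unless GGML_DEBUG is enabled:

// Server side: compute the allocation size for a tensor sent by the client.
// Sketch only; exact structure in this port may differ.
bool rpc_server::get_alloc_size(const rpc_msg_get_alloc_size_req & request,
                                rpc_msg_get_alloc_size_rsp & response) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    // deserialize_tensor() validates the type and shape coming off the wire;
    // a malformed or unsupported tensor yields nullptr.
    ggml_tensor * tensor = deserialize_tensor(ctx, &request.tensor);
    if (tensor == nullptr) {
        // Only printed in GGML_DEBUG builds, which is why no message showed up.
        GGML_PRINT_DEBUG("Null tensor pointer passed to server get_alloc_size function.\n");
        ggml_free(ctx);
        return false; // reported back to the client as a failed RPC
    }

    // Use the tensor's own buffer type if it has one, else the backend default.
    ggml_backend_buffer_type_t buft = tensor->buffer != nullptr
        ? tensor->buffer->buft
        : ggml_backend_get_default_buffer_type(backend);
    response.alloc_size = ggml_backend_buft_get_alloc_size(buft, tensor);

    ggml_free(ctx);
    return true;
}
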

@ubergarm
Contributor

@saood06

I just came across another llama.cpp fork called prima.cpp which claims to have improved support for multi-device distributed inferencing.

I haven't tried it, just saw it on reddit today. Might be worth a shot given your GPU is in a different system than your big RAM box.

@saood06
Collaborator Author

saood06 commented Apr 12, 2025

> @saood06
>
> I just came across another llama.cpp fork called prima.cpp which claims to have improved support for multi-device distributed inferencing.
>
> I haven't tried it, just saw it on reddit today. Might be worth a shot given your GPU is in a different system than your big RAM box.

Thanks for the link, it is interesting. I think it would work for dense models, but not as well for MoE, because as far as I can tell it doesn't handle -ot (this commit looks relevant). I'd also need Windows support, which is on the roadmap (though I might try building it on my machine to see what the issue is and whether I can fix it), since the GPU machine has to run Windows (my big RAM box runs Clear Linux, and I have other servers that run FreeBSD and Proxmox).

saood06 mentioned this pull request Jun 1, 2025
@saood06
Collaborator Author

saood06 commented Jun 15, 2025

Closed as superseded by #480 / #506

saood06 closed this Jun 15, 2025
