Releases: z80maniac/llama.cpp
Releases · z80maniac/llama.cpp
b1663
CUDA: Faster Mixtral prompt processing (#4538) * CUDA: make MoE tensors contiguous for batch size>1 * Update ggml-cuda.cu Co-authored-by: slaren <[email protected]> --------- Co-authored-by: slaren <[email protected]>
b1644
llama : sanity checks for access to logits (#4274) Co-authored-by: Georgi Gerganov <[email protected]>
b1643
server : add optional API Key Authentication example (#4441) * Add API key authentication for enhanced server-client security * server : to snake_case --------- Co-authored-by: Georgi Gerganov <[email protected]>
b1620
sync : ggml (new ops, tests, backend, etc.) (#4359) * sync : ggml (part 1) * sync : ggml (part 2, CUDA) * sync : ggml (part 3, Metal) * ggml : build fixes ggml-ci * cuda : restore lost changes * cuda : restore lost changes (StableLM rope) * cmake : enable separable compilation for CUDA ggml-ci * ggml-cuda : remove device side dequantize * Revert "cmake : enable separable compilation for CUDA" This reverts commit 09e35d04b1c4ca67f9685690160b35bc885a89ac. * cuda : remove assert for rope * tests : add test-backend-ops * ggml : fix bug in ggml_concat * ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()` * ci : try to fix macOS * ggml-backend : remove backend self-registration * ci : disable Metal for macOS cmake build ggml-ci * metal : fix "supports family" call * metal : fix assert * metal : print resource path ggml-ci --------- Co-authored-by: slaren <[email protected]>