Skip to content

Latest commit

 

History

History
54 lines (39 loc) · 1.41 KB

File metadata and controls

54 lines (39 loc) · 1.41 KB

XSched Integration for llama.cpp

This demonstrates how XSched can be integrated into llama.cpp to enable priority-based scheduling between multiple inference requests.

Basic Idea

We modify the backend of llama.cpp ggml-backend to create an XQueue for a CUDA stream and add an API to set its priority. We also modify the server example to a one-model multiple-instance manner and set the priority of the XQueue of each instance. Then, we use local scheduler and highest priority first policy to schedule these XQueues.

Usage

Apply Integration Patch

# commit id: 73e53dc834c0a2336cd104473af6897197b96277
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout 73e53dc834c0a2336cd104473af6897197b96277
git apply <xsched_dir>/integration/llama.cpp/llamacpp-xsched-73e53dc.patch

or, use our forked repo

git clone https://github.com/XpuOS/llama.cpp.git -b xsched

Build XSched

cd <xsched_dir>
make cuda

Build llama.cpp Server

cd <llama.cpp_dir>
cmake -B build -DGGML_CUDA=on -DCMAKE_PREFIX_PATH=<xsched_dir>/output/lib
cmake --build build -- -j$(nproc)

Run Server

cd <llama.cpp_dir>
export LD_LIBRARY_PATH=<xsched_dir>/output/lib:$LD_LIBRARY_PATH
export XSCHED_POLICY=HPF
./build/bin/llama-server -m <model_path> -ngl 9999 -c 4096 -np 2

Example

See llama.cpp example for concrete details.