XSched Integration for llama.cpp

This example demonstrates how XSched can be integrated into llama.cpp to enable priority-based scheduling across multiple inference requests.

Basic Idea

We modify llama.cpp's backend layer (ggml-backend) to create an XQueue for the CUDA stream and add an API for setting that XQueue's priority. We also change the server example to run multiple instances of a single model and assign a priority to each instance's XQueue. These XQueues are then scheduled by the local scheduler under the highest-priority-first (HPF) policy.
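
To make the policy concrete, here is a toy illustration of highest-priority-first selection. It is not XSched code: the queue names are hypothetical, and it assumes that a larger number means a higher priority. Among the queues that currently have work, the one with the highest priority is picked to run.

```shell
#!/bin/sh
# Toy highest-priority-first selection (illustration only, not XSched code).
# Input: one "queue_name priority" pair per line, for queues that have work.
# Assumption: a larger number means a higher priority.
pick_hpf() {
    # Sort by the priority column, descending, and keep the first queue name.
    sort -s -k2,2nr | head -n 1 | cut -d' ' -f1
}

# Two hypothetical instance queues; the one with priority 5 is selected.
printf '%s\n' 'instance0 1' 'instance1 5' | pick_hpf
```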

Usage

Apply Integration Patch

# commit id: 73e53dc834c0a2336cd104473af6897197b96277
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout 73e53dc834c0a2336cd104473af6897197b96277
git apply <xsched_dir>/integration/llama.cpp/llamacpp-xsched-73e53dc.patch

Alternatively, clone the xsched branch of our forked repository:

git clone https://github.com/XpuOS/llama.cpp.git -b xsched
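
Because the patch targets one specific commit, it can help to confirm that it applies cleanly before modifying the working tree. A minimal sketch using `git apply --check`, which validates a patch without applying it; the function name is hypothetical, and the patch path should be absolute because `git -C` changes directory first:

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the patch would apply cleanly to the
# working tree in repo_dir, non-zero otherwise. Nothing is modified.
patch_applies_cleanly() {
    repo_dir=$1
    patch_file=$2   # use an absolute path; git -C changes directory first
    git -C "$repo_dir" apply --check "$patch_file"
}
```

For example, `patch_applies_cleanly <llama.cpp_dir> <xsched_dir>/integration/llama.cpp/llamacpp-xsched-73e53dc.patch && echo "patch OK"` prints `patch OK` only when the checkout matches what the patch expects.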

Build XSched

cd <xsched_dir>
make cuda

Build llama.cpp Server

cd <llama.cpp_dir>
cmake -B build -DGGML_CUDA=on -DCMAKE_PREFIX_PATH=<xsched_dir>/output/lib
cmake --build build -- -j$(nproc)

Run Server

cd <llama.cpp_dir>
export LD_LIBRARY_PATH=<xsched_dir>/output/lib:$LD_LIBRARY_PATH
export XSCHED_POLICY=HPF
./build/bin/llama-server -m <model_path> -ngl 9999 -c 4096 -np 2
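
The effect of scheduling only becomes visible when at least two requests are in flight at once, which is what `-np 2` allows. A minimal sketch for sending two concurrent requests, assuming the server listens on llama-server's default address `127.0.0.1:8080` and using its `/completion` endpoint. The helper names and prompts are hypothetical, and how each request maps to a priority is determined by the patch, not by this script:

```shell
#!/bin/sh
# Assumed server address: llama-server's default host and port.
SERVER_URL=${SERVER_URL:-http://127.0.0.1:8080}

# Build a JSON body for /completion (hypothetical helper).
# The prompt must not contain characters that need JSON escaping.
build_payload() {
    printf '{"prompt": "%s", "n_predict": %d}' "$1" "$2"
}

# Send one completion request and print the raw JSON response.
send_request() {
    curl -s "$SERVER_URL/completion" \
        -H 'Content-Type: application/json' \
        -d "$(build_payload "$1" "$2")"
}

# Issue two requests concurrently and wait for both to finish.
run_concurrent() {
    send_request 'Explain GPU scheduling in one paragraph.' 128 &
    send_request 'Write a haiku about queues.' 128 &
    wait
}
```

Call `run_concurrent` once the server has finished loading the model. Comparing the completion times of the two requests gives a rough view of how they were scheduled.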

Example

See the llama.cpp example for concrete details.
