Merged
Changes from all commits (80 commits)
c7499c5
examples : do not use common library in simple example (#9803)
slaren Oct 10, 2024
cf8e0a3
musa: add docker image support (#9685)
yeahdongcn Oct 10, 2024
0e9f760
rpc : add backend registry / device interfaces (#9812)
slaren Oct 10, 2024
7eee341
common : use common_ prefix for common library functions (#9805)
slaren Oct 10, 2024
9677640
ggml : move more prints to the ggml log system (#9839)
slaren Oct 11, 2024
943d20b
musa : update doc (#9856)
yeahdongcn Oct 12, 2024
11ac980
llama : improve infill support and special token detection (#9798)
ggerganov Oct 12, 2024
95c76e8
server : remove legacy system_prompt feature (#9857)
ggerganov Oct 12, 2024
1bde94d
server : remove self-extend features (#9860)
ggerganov Oct 12, 2024
edc2656
server : add option to time limit the generation phase (#9865)
ggerganov Oct 12, 2024
92be9f1
flake.lock: Update (#9870)
ggerganov Oct 13, 2024
c7181bd
server : reuse cached context chunks (#9866)
ggerganov Oct 13, 2024
d4c19c0
server : accept extra_context for the infill endpoint (#9874)
ggerganov Oct 13, 2024
13dca2a
Vectorize load instructions in dmmv f16 CUDA kernel (#9816)
agray3 Oct 14, 2024
a89f75e
server : handle "logprobs" field with false value (#9871)
VoidIsVoid Oct 14, 2024
4c42f93
readme : update bindings list (#9889)
srgtuszy Oct 15, 2024
dcdd535
server : update preact (#9895)
ggerganov Oct 15, 2024
fbc98b7
sampling : add XTC sampler (#9742)
MaggotHATE Oct 15, 2024
223c25a
server : improve infill context reuse (#9894)
ggerganov Oct 15, 2024
755a9b2
llama : add infill sampler (#9896)
ggerganov Oct 15, 2024
becfd38
[CANN] Fix cann compilation error (#9891)
leo-pony Oct 16, 2024
cd60b88
ggml-alloc : remove buffer_id from leaf_alloc (ggml/987)
danbev Oct 9, 2024
0e41b30
sync : ggml
ggerganov Oct 16, 2024
1f66b69
server : fix the disappearance of the end of the text (#9867)
z80maniac Oct 16, 2024
10433e8
llama : add tensor name for "result_norm" (#9907)
MollySophia Oct 16, 2024
66c2c93
grammar : fix JSON Schema for string regex with top-level alt. (#9903)
jemc Oct 16, 2024
dbf18e4
llava : fix typo in error message [no ci] (#9884)
danbev Oct 16, 2024
9e04102
llama : suppress conversion from 'size_t' to 'int' (#9046)
danbev Oct 16, 2024
73afe68
fix: use `vm_allocate` to allocate CPU backend buffer on macOS (#9875)
giladgd Oct 16, 2024
2194200
fix: allocating CPU buffer with size `0` (#9917)
giladgd Oct 16, 2024
f010b77
vulkan : add backend registry / device interfaces (#9721)
slaren Oct 17, 2024
3752217
readme : update bindings list (#9918)
ShenghaiWang Oct 17, 2024
99bd4ac
llama : infill sampling handle very long tokens (#9924)
ggerganov Oct 17, 2024
9f45fc1
llama : change warning to debug log
ggerganov Oct 17, 2024
17bb928
readme : remove --memory-f32 references (#9925)
ggerganov Oct 17, 2024
6f55bcc
llama : rename batch_all to batch (#8881)
danbev Oct 17, 2024
8901755
server : add n_indent parameter for line indentation requirement (#9929)
ggerganov Oct 18, 2024
60ce97c
add amx kernel for gemm (#8998)
mingfeima Oct 18, 2024
87421a2
[SYCL] Add SYCL Backend registry, device and Event Interfaces (#9705)
OuadiElfarouki Oct 18, 2024
afd9909
rpc : backend refactoring (#9912)
rgerganov Oct 18, 2024
cda0e4b
llama : remove all_pos_0, all_pos_1, all_seq_id from llama_batch (#9745)
ngxson Oct 18, 2024
7cab208
readme : update infra list (#9942)
icppWorld Oct 20, 2024
45f0976
readme : update bindings list (#9951)
lcarrere Oct 20, 2024
1db8c84
fix mul_mat_vec_q and *_vec_q error (#9939)
NeoZhangJianyu Oct 21, 2024
bc21975
speculative : fix handling of some input params (#9963)
ggerganov Oct 21, 2024
55e4778
llama : default sampling changes + greedy update (#9897)
ggerganov Oct 21, 2024
d5ebd79
rpc : pack only RPC structs (#9959)
rgerganov Oct 21, 2024
f594bc8
ggml : add asserts for type conversion in fattn kernels (#9971)
ggerganov Oct 21, 2024
dbd5f2f
llama.vim : plugin for Neovim (#9787)
ggerganov Oct 21, 2024
94008cc
arg : fix attention non-causal arg value hint (#9985)
danbev Oct 21, 2024
994cfb1
readme : update UI list (#9972)
a-ghorbani Oct 21, 2024
e01c67a
llama.vim : move info to the right of screen [no ci] (#9787)
ggerganov Oct 21, 2024
e94a138
llama.vim : fix info text display [no ci] (#9787)
ggerganov Oct 21, 2024
674804a
arg : fix typo in embeddings argument help [no ci] (#9994)
danbev Oct 22, 2024
6b84473
[CANN] Adapt to dynamically loadable backends mechanism (#9970)
leo-pony Oct 22, 2024
4ff7fe1
llama : add chat template for RWKV-World + fix EOT (#9968)
MollySophia Oct 22, 2024
c421ac0
lora : warn user if new token is added in the adapter (#9948)
ngxson Oct 22, 2024
11d4705
Rwkv chat template fix (#10001)
MollySophia Oct 22, 2024
19d900a
llama : rename batch to ubatch (#9950)
danbev Oct 22, 2024
c8c07d6
llama : fix empty batch causing llama_batch_allocr to crash (#9966)
ngxson Oct 22, 2024
873279b
flake.lock: Update
github-actions[bot] Oct 20, 2024
4c9388f
metal : add POOL2D and fix IM2COL (#9943)
junhee-yoo Oct 23, 2024
ac113a0
llama.vim : add classic vim support (#9995)
m18coppola Oct 23, 2024
c19af0a
ggml : remove redundant set of contexts used field (ggml/978)
danbev Oct 16, 2024
80273a3
CUDA: fix 1D im2col, add tests (ggml/993)
JohannesGaessler Oct 18, 2024
2d3aba9
llama.vim : bump generation time limit to 3s [no ci]
ggerganov Oct 23, 2024
190a37d
sync : ggml
ggerganov Oct 23, 2024
0a1c750
server : samplers accept the prompt correctly (#10019)
wwoodsTM Oct 23, 2024
c39665f
CUDA: fix MMQ for non-contiguous src0, add tests (#10021)
JohannesGaessler Oct 24, 2024
167a515
CUDA: fix insufficient buffer clearing for MMQ (#10032)
JohannesGaessler Oct 24, 2024
40f2555
ci : fix cmake flags for SYCL
ggerganov Oct 24, 2024
958367b
server : refactor slot input data, move tokenizer to HTTP thread (#10…
ngxson Oct 24, 2024
bc5ba00
server : check that the prompt fits in the slot's context (#10030)
ggerganov Oct 25, 2024
2f8bd2b
llamafile : extend sgemm.cpp support for Q5_0 models (#10010)
Srihari-mcw Oct 25, 2024
d80fb71
llama: string_split fix (#10022)
Xarbirus Oct 25, 2024
ff252ea
llama : add DRY sampler (#9702)
wwoodsTM Oct 25, 2024
6687503
metal : support permuted matrix multiplications (#10033)
ggerganov Oct 25, 2024
9e4a256
scripts : fix amx sync [no ci]
ggerganov Oct 26, 2024
8c60a8a
increase cuda_cpy block size (ggml/996)
bssrdf Oct 23, 2024
cc2983d
sync : ggml
ggerganov Oct 26, 2024
26 changes: 26 additions & 0 deletions .devops/full-musa.Dockerfile
@@ -0,0 +1,26 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc3.1.0
# Target the MUSA build image
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

RUN apt-get update && \
apt-get install -y build-essential cmake python3 python3-pip git libcurl4-openssl-dev libgomp1

COPY requirements.txt requirements.txt
COPY requirements requirements

RUN pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt

WORKDIR /app

COPY . .

RUN cmake -B build -DGGML_MUSA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release -j$(nproc) && \
cp build/bin/* .

ENTRYPOINT ["/app/.devops/tools.sh"]
30 changes: 30 additions & 0 deletions .devops/llama-cli-musa.Dockerfile
@@ -0,0 +1,30 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc3.1.0
# Target the MUSA build image
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the MUSA runtime image
ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

RUN apt-get update && \
apt-get install -y build-essential git cmake

WORKDIR /app

COPY . .

RUN cmake -B build -DGGML_MUSA=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release --target llama-cli -j$(nproc)

FROM ${BASE_MUSA_RUN_CONTAINER} AS runtime

RUN apt-get update && \
apt-get install -y libgomp1

COPY --from=build /app/build/ggml/src/libggml.so /libggml.so
COPY --from=build /app/build/src/libllama.so /libllama.so
COPY --from=build /app/build/bin/llama-cli /llama-cli

ENTRYPOINT [ "/llama-cli" ]
35 changes: 35 additions & 0 deletions .devops/llama-server-musa.Dockerfile
@@ -0,0 +1,35 @@
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG MUSA_VERSION=rc3.1.0
# Target the MUSA build image
ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the MUSA runtime image
ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

RUN apt-get update && \
apt-get install -y build-essential git cmake libcurl4-openssl-dev

WORKDIR /app

COPY . .

RUN cmake -B build -DGGML_MUSA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release --target llama-server -j$(nproc)

FROM ${BASE_MUSA_RUN_CONTAINER} AS runtime

RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev libgomp1 curl

COPY --from=build /app/build/ggml/src/libggml.so /libggml.so
COPY --from=build /app/build/src/libllama.so /libllama.so
COPY --from=build /app/build/bin/llama-server /llama-server

# Must be set to 0.0.0.0 so it can listen to requests from host machine
ENV LLAMA_ARG_HOST=0.0.0.0

HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

ENTRYPOINT [ "/llama-server" ]
3 changes: 3 additions & 0 deletions .github/workflows/docker.yml
@@ -43,6 +43,9 @@ jobs:
- { tag: "light-cuda", dockerfile: ".devops/llama-cli-cuda.Dockerfile", platforms: "linux/amd64" }
- { tag: "server-cuda", dockerfile: ".devops/llama-server-cuda.Dockerfile", platforms: "linux/amd64" }
- { tag: "full-cuda", dockerfile: ".devops/full-cuda.Dockerfile", platforms: "linux/amd64" }
- { tag: "light-musa", dockerfile: ".devops/llama-cli-musa.Dockerfile", platforms: "linux/amd64" }
- { tag: "server-musa", dockerfile: ".devops/llama-server-musa.Dockerfile", platforms: "linux/amd64" }
- { tag: "full-musa", dockerfile: ".devops/full-musa.Dockerfile", platforms: "linux/amd64" }
# Note: the rocm images are failing due to a compiler error and are disabled until this is fixed to allow the workflow to complete
#- { tag: "light-rocm", dockerfile: ".devops/llama-cli-rocm.Dockerfile", platforms: "linux/amd64,linux/arm64" }
#- { tag: "server-rocm", dockerfile: ".devops/llama-server-rocm.Dockerfile", platforms: "linux/amd64,linux/arm64" }
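Assuming the new matrix entries are published the same way as the existing CUDA tags, the resulting images would be pullable under the same registry path, for example:

# illustrative only: the tag names come from the matrix above; the registry path is assumed to match the existing images
docker pull ghcr.io/ggerganov/llama.cpp:server-musa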
4 changes: 4 additions & 0 deletions CMakeLists.txt
@@ -88,6 +88,10 @@ if (NOT DEFINED GGML_LLAMAFILE)
set(GGML_LLAMAFILE_DEFAULT ON)
endif()

if (NOT DEFINED GGML_AMX)
set(GGML_AMX ON)
endif()

if (NOT DEFINED GGML_CUDA_GRAPHS)
set(GGML_CUDA_GRAPHS_DEFAULT ON)
endif()
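With GGML_AMX now defaulting to ON when left undefined, a configure-time opt-out is just an explicit cache entry; a minimal sketch:

cmake -B build -DGGML_AMX=OFF        # override the new ON default
cmake --build build --config Release -j$(nproc)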
24 changes: 19 additions & 5 deletions Makefile
@@ -93,11 +93,6 @@ GGML_METAL := 1
DEPRECATE_WARNING := 1
endif

ifdef LLAMA_OPENMP
GGML_OPENMP := 1
DEPRECATE_WARNING := 1
endif

ifdef LLAMA_RPC
GGML_RPC := 1
DEPRECATE_WARNING := 1
@@ -584,6 +579,11 @@ ifndef GGML_NO_LLAMAFILE
OBJ_GGML += ggml/src/llamafile/sgemm.o
endif

ifndef GGML_NO_AMX
MK_CPPFLAGS += -DGGML_USE_AMX
OBJ_GGML += ggml/src/ggml-amx.o ggml/src/ggml-amx/mmq.o
endif

ifdef GGML_RPC
MK_CPPFLAGS += -DGGML_USE_RPC
OBJ_GGML += ggml/src/ggml-rpc.o
@@ -1087,6 +1087,19 @@ ggml/src/llamafile/sgemm.o: \
$(CXX) $(CXXFLAGS) -c $< -o $@
endif # GGML_NO_LLAMAFILE

ifndef GGML_NO_AMX
ggml/src/ggml-amx.o: \
ggml/src/ggml-amx.cpp \
ggml/include/ggml-amx.h
$(CXX) $(CXXFLAGS) -c $< -o $@

ggml/src/ggml-amx/mmq.o: \
ggml/src/ggml-amx/mmq.cpp \
ggml/src/ggml-amx/mmq.h \
ggml/include/ggml.h
$(CXX) $(CXXFLAGS) -c $< -o $@
endif

ifdef GGML_RPC
ggml/src/ggml-rpc.o: \
ggml/src/ggml-rpc.cpp \
@@ -1238,6 +1251,7 @@ clean:
rm -vrf ggml/src/ggml-metal-embed.metal
rm -vrf ggml/src/ggml-cuda/*.o
rm -vrf ggml/src/ggml-cuda/template-instances/*.o
rm -vrf ggml/src/ggml-amx/*.o
rm -rvf $(BUILD_TARGETS)
rm -rvf $(TEST_TARGETS)
rm -f vulkan-shaders-gen ggml/src/ggml-vulkan-shaders.hpp ggml/src/ggml-vulkan-shaders.cpp
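The Makefile side mirrors the existing GGML_NO_LLAMAFILE opt-out, so the new objects can be excluded the same way; a sketch, assuming llama-cli as the build target:

# skip the AMX objects and the -DGGML_USE_AMX define
make GGML_NO_AMX=1 llama-cli -j$(nproc)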
12 changes: 9 additions & 3 deletions README.md
@@ -29,9 +29,9 @@ variety of hardware - locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

@@ -93,6 +93,7 @@ Typically finetunes of the base models below are supported as well.
- [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)

(instructions for supporting more models: [HOWTO-add-model.md](./docs/development/HOWTO-add-model.md))

@@ -122,6 +123,7 @@ Typically finetunes of the base models below are supported as well.
- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
- C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
@@ -130,6 +132,8 @@ Typically finetunes of the base models below are supported as well.
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggerganov/llama.cpp/pull/6326)
- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
- Swift [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
- Swift [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)

**UI:**

@@ -170,6 +174,7 @@ Unless otherwise noted these projects are open-source with permissive licensing:
- [LARS - The LLM & Advanced Referencing Solution](https://github.com/abgulati/LARS) (AGPL)
- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
- [PocketPal AI - An iOS and Android App](https://github.com/a-ghorbani/pocketpal-ai) (MIT)

*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*

@@ -185,6 +190,7 @@ Unless otherwise noted these projects are open-source with permissive licensing:

- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly

**Games:**
- [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.
@@ -413,7 +419,7 @@ Please refer to [Build llama.cpp locally](./docs/build.md)
| [BLAS](./docs/build.md#blas-build) | All |
| [BLIS](./docs/backend/BLIS.md) | All |
| [SYCL](./docs/backend/SYCL.md) | Intel and Nvidia GPU |
| [MUSA](./docs/build.md#musa) | Moore Threads GPU |
| [MUSA](./docs/build.md#musa) | Moore Threads MTT GPU |
| [CUDA](./docs/build.md#cuda) | Nvidia GPU |
| [hipBLAS](./docs/build.md#hipblas) | AMD GPU |
| [Vulkan](./docs/build.md#vulkan) | GPU |
2 changes: 1 addition & 1 deletion ci/run.sh
@@ -53,7 +53,7 @@ if [ ! -z ${GG_BUILD_SYCL} ]; then
exit 1
fi

CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_SYCL=1 DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON"
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_SYCL=1 -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON"
fi

if [ ! -z ${GG_BUILD_VULKAN} ]; then