Merged
Changes from all commits
46 commits
8551c44
context : always use non-causal attention for encoder graphs (#12447)
ggerganov Mar 18, 2025
99aa304
llama : add support for EXAONE tied word embeddings (#12451)
ngxson Mar 18, 2025
c6af216
speculative : fix seg fault in certain cases (#12454)
ggerganov Mar 18, 2025
29fff30
llama : support converting Mistral Small text-only (#12450)
ngxson Mar 18, 2025
bb115d2
musa: override warp_size of musa device to 32 (#12445)
yeahdongcn Mar 18, 2025
75422e8
graph : normalize Q, K, V shapes + sync cross attention (#12449)
ggerganov Mar 18, 2025
d84635b
opencl: improve profiling (#12442)
lhez Mar 18, 2025
c446b2e
vulkan: Submit once enough matmul work has been recorded (#12406)
jeffbolznv Mar 19, 2025
a686171
convert : Support chat_template.json (#12460)
CISC Mar 19, 2025
108e53c
llama : add support for GPT2, Bloom and CodeShell tied word embedding…
CISC Mar 19, 2025
0fd8487
Fix visionOS build and add CI (#12415)
guusw Mar 19, 2025
a9b5928
vulkan: optimize iq1 coopmat2 dequant functions (#12427)
jeffbolznv Mar 19, 2025
517b5dd
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
gaugarg-nv Mar 19, 2025
568013d
context : clear sets containing encoder output sequence ids before st…
fairydreaming Mar 19, 2025
732b5fb
convert : avoid calls to tokenizer.added_tokens_decoder (#12473)
bartowski1182 Mar 20, 2025
3d82dbc
ggml : block interleaving support for Q4_K quantization for x86 AVX2 …
Srihari-mcw Mar 20, 2025
dbb3a47
llama : make Qwen2MoE QKV bias optional (#12477)
CISC Mar 20, 2025
e046430
webui : Prevent rerendering on textarea input (#12299)
woof-dog Mar 20, 2025
9ffcc9e
sycl: cleanup oneDNN related code (#12097)
sgeor255 Mar 21, 2025
1aa87ee
[SYCL] Fix build on Windows when ccache enabled (#9954) (#9976)
MakeDecisionWorth Mar 21, 2025
ea1518e
llama-tts : avoid crashes related to bad model file paths (#12482)
marcoStocchi Mar 21, 2025
960e726
chore : cleanup llama_model_loader::TENSOR_ usage (#12492)
CISC Mar 21, 2025
af04481
model : do not repack if a GPU device is present (#12498)
ggerganov Mar 21, 2025
30c42ef
vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (#12472)
netrunnereve Mar 21, 2025
4375415
Vulkan: RTE rounding for cpy to quant (#12480)
stduhpf Mar 21, 2025
eddfb43
vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
jeffbolznv Mar 22, 2025
fac63a3
musa: refine compute capability (#12493)
yeahdongcn Mar 22, 2025
ba932df
ggml : fix quantized cpy op (#12310)
ggerganov Mar 22, 2025
fbdfefe
llama : gemma3 : use output tensor if it exists in model weight (#12506)
ngxson Mar 22, 2025
18b663d
install : add macports (#12518)
IOOI-SqAR Mar 23, 2025
77f9c6b
server : Add verbose output to OAI compatible chat endpoint. (#12246)
mglambda Mar 23, 2025
9b169a4
vulkan: fix mul_mat_vec failure in backend tests (#12529)
jeffbolznv Mar 24, 2025
c54f6b7
mmap : skip resource limit checks on AIX (#12541)
mehendarkarprajwal Mar 24, 2025
7ea7503
CUDA: Fix clang warnings (#12540)
yeahdongcn Mar 24, 2025
00d5380
llama-vocab : add SuperBPE pre-tokenizer (#12532)
compilade Mar 24, 2025
3361e2d
docs: update: improve the Fedora CUDA guide (#12536)
teihome Mar 24, 2025
48d7021
CI: fix SYCL build (#12546)
qnixsynapse Mar 24, 2025
2b65ae3
opencl: simplify kernel embedding logic in cmakefile (#12503)
lhez Mar 24, 2025
c95fa36
ci: [SYCL] ggml-ci Use main GPU and enable sysman (#12547)
qnixsynapse Mar 24, 2025
2d77d88
context : fix worst-case reserve outputs (#12545)
ggerganov Mar 25, 2025
3cd3a39
ci: [MUSA] add CI and update doc (#12562)
yeahdongcn Mar 25, 2025
36ee06d
docs : add build instructions for KleidiAI (#12563)
eddnjjn Mar 25, 2025
e2f5601
SYCL: disable Q4_0 reorder optimization (#12560)
qnixsynapse Mar 25, 2025
053b3f9
ggml-cpu : update KleidiAI to v1.5.0 (#12568)
eddnjjn Mar 25, 2025
ef19c71
run: de-duplicate fmt and format functions and optimize (#11596)
ericcurtin Mar 25, 2025
53af4db
convert: fix Mistral3/Gemma3 model hparams init (#12571)
CISC Mar 25, 2025
29 changes: 29 additions & 0 deletions .github/workflows/build.yml
@@ -676,6 +676,35 @@ jobs:
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO

macOS-latest-cmake-visionos:
runs-on: macos-latest

steps:
- name: Clone
id: checkout
uses: actions/checkout@v4

- name: Dependencies
id: depends
continue-on-error: true
run: |
brew update

- name: Build
id: cmake_build
run: |
sysctl -a
cmake -B build -G Xcode \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=ON \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-DCMAKE_SYSTEM_NAME=visionOS \
-DCMAKE_OSX_DEPLOYMENT_TARGET=1.0 \
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO

macOS-latest-swift:
runs-on: macos-latest

8 changes: 4 additions & 4 deletions build-xcframework.sh
@@ -432,8 +432,8 @@ cmake -B build-visionos -G Xcode \
-DCMAKE_SYSTEM_NAME=visionOS \
-DCMAKE_OSX_SYSROOT=xros \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=xros \
-DCMAKE_C_FLAGS="-D_XOPEN_SOURCE=700 -Du_int=unsigned\ int -Du_char=unsigned\ char -Du_short=unsigned\ short ${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="-D_XOPEN_SOURCE=700 -Du_int=unsigned\ int -Du_char=unsigned\ char -Du_short=unsigned\ short ${COMMON_CXX_FLAGS}" \
-DCMAKE_C_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_CXX_FLAGS}" \
-S .
cmake --build build-visionos --config Release -- -quiet

@@ -445,8 +445,8 @@ cmake -B build-visionos-sim -G Xcode \
-DCMAKE_SYSTEM_NAME=visionOS \
-DCMAKE_OSX_SYSROOT=xrsimulator \
-DCMAKE_XCODE_ATTRIBUTE_SUPPORTED_PLATFORMS=xrsimulator \
-DCMAKE_C_FLAGS="-D_XOPEN_SOURCE=700 -Du_int=unsigned\ int -Du_char=unsigned\ char -Du_short=unsigned\ short ${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="-D_XOPEN_SOURCE=700 -Du_int=unsigned\ int -Du_char=unsigned\ char -Du_short=unsigned\ short ${COMMON_CXX_FLAGS}" \
-DCMAKE_C_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_C_FLAGS}" \
-DCMAKE_CXX_FLAGS="-D_XOPEN_SOURCE=700 ${COMMON_CXX_FLAGS}" \
-S .
cmake --build build-visionos-sim --config Release -- -quiet

39 changes: 39 additions & 0 deletions ci/README.md
@@ -26,4 +26,43 @@ GG_BUILD_CUDA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# with SYCL support
source /opt/intel/oneapi/setvars.sh
GG_BUILD_SYCL=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt

# with MUSA support
GG_BUILD_MUSA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
```

## Running MUSA CI in a Docker Container

Assuming `$PWD` is the root of the `llama.cpp` repository, follow these steps to set up and run MUSA CI in a Docker container:

### 1. Create a local directory to store cached models, configuration files and venv:

```bash
mkdir -p $HOME/llama.cpp/ci-cache
```

### 2. Create a local directory to store CI run results:

```bash
mkdir -p $HOME/llama.cpp/ci-results
```

### 3. Start a Docker container and run the CI:

```bash
docker run --privileged -it \
-v $HOME/llama.cpp/ci-cache:/ci-cache \
-v $HOME/llama.cpp/ci-results:/ci-results \
-v $PWD:/ws -w /ws \
mthreads/musa:rc3.1.1-devel-ubuntu22.04
```

Inside the container, execute the following commands:

```bash
apt update -y && apt install -y cmake git python3.10-venv wget
git config --global --add safe.directory /ws
GG_BUILD_MUSA=1 bash ./ci/run.sh /ci-results /ci-cache
```

This setup ensures that the CI runs within an isolated Docker environment while maintaining cached files and results across runs.
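
The ci/run.sh change below defaults `MUSA_ARCH` to 21 (qy1, the MTT S80). As a minimal sketch that is not part of the PR itself, the same container run could target a different compute capability by overriding that variable; the value 22 is only an illustrative placeholder for a qy2-class device:

```bash
# Hypothetical variation of the in-container commands above:
# override MUSA_ARCH (default 21 / qy1) before launching the CI script,
# which forwards it to -DMUSA_ARCHITECTURES in ci/run.sh.
apt update -y && apt install -y cmake git python3.10-venv wget
git config --global --add safe.directory /ws
MUSA_ARCH=22 GG_BUILD_MUSA=1 bash ./ci/run.sh /ci-results /ci-cache
```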
30 changes: 24 additions & 6 deletions ci/run.sh
@@ -16,6 +16,9 @@
# # with VULKAN support
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
# # with MUSA support
# GG_BUILD_MUSA=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#

if [ -z "$2" ]; then
echo "usage: $0 <output-dir> <mnt-dir>"
@@ -52,13 +55,22 @@ if [ ! -z ${GG_BUILD_SYCL} ]; then
echo "source /opt/intel/oneapi/setvars.sh"
exit 1
fi

# Use only main GPU
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
# Enable sysman for correct memory reporting
export ZES_ENABLE_SYSMAN=1
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_SYCL=1 -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON"
fi

if [ ! -z ${GG_BUILD_VULKAN} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_VULKAN=1"
fi

if [ ! -z ${GG_BUILD_MUSA} ]; then
# Use qy1 by default (MTT S80)
MUSA_ARCH=${MUSA_ARCH:-21}
CMAKE_EXTRA="-DGGML_MUSA=ON -DMUSA_ARCHITECTURES=${MUSA_ARCH}"
fi
## helpers

# download a file if it does not exist or if it is outdated
@@ -808,7 +820,7 @@ export LLAMA_LOG_PREFIX=1
export LLAMA_LOG_TIMESTAMPS=1

if [ -z ${GG_BUILD_LOW_PERF} ]; then
# Create symlink: ./llama.cpp/models-mnt -> $MNT/models/models-mnt
# Create symlink: ./llama.cpp/models-mnt -> $MNT/models
rm -rf ${SRC}/models-mnt
mnt_models=${MNT}/models
mkdir -p ${mnt_models}
@@ -826,16 +838,20 @@ if [ -z ${GG_BUILD_LOW_PERF} ]; then
fi

ret=0

test $ret -eq 0 && gg_run ctest_debug
if [ -z ${GG_BUILD_SYCL} ]; then
# SYCL build breaks with debug build flags
test $ret -eq 0 && gg_run ctest_debug
fi
test $ret -eq 0 && gg_run ctest_release

if [ -z ${GG_BUILD_LOW_PERF} ]; then
test $ret -eq 0 && gg_run embd_bge_small
test $ret -eq 0 && gg_run rerank_tiny

if [ -z ${GG_BUILD_CLOUD} ] || [ ${GG_BUILD_EXTRA_TESTS_0} ]; then
test $ret -eq 0 && gg_run test_scripts_debug
if [ -z ${GG_BUILD_SYCL} ]; then
test $ret -eq 0 && gg_run test_scripts_debug
fi
test $ret -eq 0 && gg_run test_scripts_release
fi

@@ -846,7 +862,9 @@ if [ -z ${GG_BUILD_LOW_PERF} ]; then
test $ret -eq 0 && gg_run pythia_2_8b
#test $ret -eq 0 && gg_run open_llama_7b_v2
fi
test $ret -eq 0 && gg_run ctest_with_model_debug
if [ -z ${GG_BUILD_SYCL} ]; then
test $ret -eq 0 && gg_run ctest_with_model_debug
fi
test $ret -eq 0 && gg_run ctest_with_model_release
fi
fi
69 changes: 44 additions & 25 deletions convert_hf_to_gguf.py
@@ -180,7 +180,8 @@ def get_tensors(self) -> Iterator[tuple[str, Tensor]]:
extra = sorted(tensor_names_from_parts.difference(self.tensor_names))
missing_files = sorted(set(weight_map[n] for n in missing if n in weight_map))
if len(extra) == 0 and len(missing_files) > 0:
raise ValueError(f"Missing or incomplete model files: {missing_files}")
raise ValueError(f"Missing or incomplete model files: {missing_files}\n"
f"Missing tensors: {missing}")
else:
raise ValueError("Mismatch between weight map and model parts for tensor names:\n"
f"Missing tensors: {missing}\n"
@@ -528,6 +529,8 @@ def get_vocab_base(self) -> tuple[list[str], list[int], str]:
reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()}
added_vocab = tokenizer.get_added_vocab()

added_tokens_decoder = tokenizer.added_tokens_decoder

for i in range(vocab_size):
if i not in reverse_vocab:
tokens.append(f"[PAD{i}]")
@@ -537,13 +540,13 @@ def get_vocab_base(self) -> tuple[list[str], list[int], str]:
if token in added_vocab:
# The tokenizer in llama.cpp assumes the CONTROL and USER_DEFINED tokens are pre-normalized.
# To avoid unexpected issues - we make sure to normalize non-normalized tokens
if not tokenizer.added_tokens_decoder[i].normalized:
if not added_tokens_decoder[i].normalized:
previous_token = token
token = tokenizer.decode(tokenizer.encode(token, add_special_tokens=False))
if previous_token != token:
logger.info(f"{repr(previous_token)} is encoded and decoded back to {repr(token)} using AutoTokenizer")

if tokenizer.added_tokens_decoder[i].special or self.does_token_look_special(token):
if added_tokens_decoder[i].special or self.does_token_look_special(token):
toktypes.append(gguf.TokenType.CONTROL)
else:
# NOTE: this was added for Gemma.
@@ -702,6 +705,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
if chkhsh == "ccc2ef013c104be7bae2965776d611e1d7a8a2a9c547dd93a682c9a9fc80352e":
# ref: https://huggingface.co/Xenova/gpt-4o
res = "gpt-4o"
if chkhsh == "7dec86086fcc38b66b7bc1575a160ae21cf705be7718b9d5598190d7c12db76f":
# ref: https://huggingface.co/UW/OLMo2-8B-SuperBPE-t180k
res = "superbpe"

if res is None:
logger.warning("\n")
@@ -1099,13 +1105,6 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter

tensors.append((self.map_tensor_name(name), data_torch))

if name == "word_embeddings.weight":
assert self.tensor_names is not None

# TODO: tie them at runtime, don't duplicate in the model file
if all(s not in self.tensor_names for s in ("lm_head.weight", "output.weight")):
tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))

return tensors


@@ -1747,6 +1746,25 @@ def prepare_tensors(self):
raise ValueError(f"Unprocessed experts: {experts}")


@Model.register("Mistral3ForConditionalGeneration")
class Mistral3Model(LlamaModel):
model_arch = gguf.MODEL_ARCH.LLAMA

# we need to merge the text_config into the root level of hparams
def __init__(self, *args, **kwargs):
hparams = kwargs["hparams"] if "hparams" in kwargs else Model.load_hparams(args[0])
if "text_config" in hparams:
hparams = {**hparams, **hparams["text_config"]}
kwargs["hparams"] = hparams
super().__init__(*args, **kwargs)

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):
name = name.replace("language_model.", "")
if "multi_modal_projector" in name or "vision_tower" in name:
return []
return super().modify_tensors(data_torch, name, bid)


@Model.register("DeciLMForCausalLM")
class DeciModel(Model):
model_arch = gguf.MODEL_ARCH.DECI
@@ -2404,10 +2422,6 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter

tensors.append((new_name, data_torch))

# note: GPT2 output is tied to (same as) wte in original model
if new_name == self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD):
tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))

return tensors


@@ -2737,21 +2751,26 @@ def set_gguf_parameters(self):
self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.LINEAR)
self.gguf_writer.add_rope_scaling_factor(1.0)

_has_tok_embd = False

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused

new_name = self.map_tensor_name(name)

tensors: list[tuple[str, Tensor]] = [(new_name, data_torch)]
output_name = self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT)
tok_embd_name = self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD)

if new_name == self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD):
assert self.tensor_names is not None
new_name = self.map_tensor_name(name)

if all(s not in self.tensor_names for s in ("lm_head.weight", "output.weight")):
# copy tok_embd.weight to output.weight
tensors.append((self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), data_torch))
# assuming token_embd.weight is seen before output.weight
if not self._has_tok_embd and new_name == self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT):
# even though the tensor file(s) does not contain the word embeddings they are still in the weight map
if self.tensor_names and "transformer.wte.weight" in self.tensor_names:
logger.debug(f"{tok_embd_name} not found before {output_name}, assuming they are tied")
self.tensor_names.remove("transformer.wte.weight")
elif new_name == tok_embd_name:
self._has_tok_embd = True

return tensors
return [(new_name, data_torch)]


@Model.register("InternLM2ForCausalLM")
@@ -3366,7 +3385,7 @@ class Gemma3Model(Model):

# we need to merge the text_config into the root level of hparams
def __init__(self, *args, **kwargs):
hparams = Model.load_hparams(kwargs["dir_model"])
hparams = kwargs["hparams"] if "hparams" in kwargs else Model.load_hparams(args[0])
if "text_config" in hparams:
hparams = {**hparams, **hparams["text_config"]}
kwargs["hparams"] = hparams
@@ -5339,7 +5358,7 @@ def main() -> None:
logger.error(f"Model {model_architecture} is not supported")
sys.exit(1)

model_instance = model_class(dir_model=dir_model, ftype=output_type, fname_out=fname_out,
model_instance = model_class(dir_model, output_type, fname_out,
is_big_endian=args.bigendian, use_temp_file=args.use_temp_file,
eager=args.no_lazy,
metadata_override=args.metadata, model_name=args.model_name,
1 change: 1 addition & 0 deletions convert_hf_to_gguf_update.py
@@ -110,6 +110,7 @@ class TOKENIZER_TYPE(IntEnum):
{"name": "deepseek-v3", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/DeepSeek-V3"},
{"name": "deepseek-r1-qwen", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"},
{"name": "gpt-4o", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Xenova/gpt-4o", },
{"name": "superbpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/UW/OLMo2-8B-SuperBPE-t180k", },
]

