Commit 1ace6d7

Merge branch 'ggml-org:master' into master
2 parents: cbde69e + 37b9f0d

File tree: 117 files changed (+11053, -7185 lines)

.github/workflows/build.yml

Lines changed: 3 additions & 2 deletions
@@ -1766,16 +1766,17 @@ jobs:
     if: ${{ github.event_name != 'pull_request' || contains(github.event.pull_request.labels.*.name, 'Ascend NPU') }}
     defaults:
       run:
-        shell: bash -el {0}
-    runs-on: ubuntu-24.04-arm
+        shell: bash -el {0}
     strategy:
       matrix:
+        arch: [x86, aarch64]
         cann:
          - '8.1.RC1.alpha001-910b-openeuler22.03-py3.10'
         device:
          - 'ascend910b3'
         build:
          - 'Release'
+    runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
     container: ascendai/cann:${{ matrix.cann }}
     steps:
       - name: Checkout

Makefile

Lines changed: 0 additions & 4 deletions
@@ -780,10 +780,6 @@ ifdef GGML_HIP

     MK_CPPFLAGS += -DGGML_USE_HIP -DGGML_USE_CUDA

-ifdef GGML_HIP_UMA
-    MK_CPPFLAGS += -DGGML_HIP_UMA
-endif # GGML_HIP_UMA
-
     MK_LDFLAGS += -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib
     MK_LDFLAGS += -L$(ROCM_PATH)/lib64 -Wl,-rpath=$(ROCM_PATH)/lib64
     MK_LDFLAGS += -lhipblas -lamdhip64 -lrocblas

README.md

Lines changed: 4 additions & 1 deletion
@@ -97,6 +97,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 - [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
 - [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
 - [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat)
+- [x] [GLM-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)
 - [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
 - [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
 - [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
@@ -259,7 +260,9 @@ The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](htt
 - [Trending](https://huggingface.co/models?library=gguf&sort=trending)
 - [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)

-You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from Hugging Face by using this CLI argument: `-hf <user>/<model>[:quant]`
+You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, such as [ModelScope](https://modelscope.cn/), by using this CLI argument: `-hf <user>/<model>[:quant]`.
+
+By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable `MODEL_ENDPOINT`. For example, you may opt to downloading model checkpoints from ModelScope or other model sharing communities by setting the environment variable, e.g. `MODEL_ENDPOINT=https://www.modelscope.cn/`.

 After downloading a model, use the CLI tools to run it locally - see below.
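For reference, the endpoint selection described in the new README text is centralized in the `get_model_endpoint()` helper added to `common/common.cpp` later in this commit. A minimal Python sketch of the precedence it applies (illustrative only; the real implementation is the C++ shown below):

```python
import os

def resolve_model_endpoint() -> str:
    """Illustrative mirror of llama.cpp's get_model_endpoint(): MODEL_ENDPOINT wins,
    HF_ENDPOINT is still honoured for backward compatibility, Hugging Face is the default."""
    endpoint = (os.environ.get("MODEL_ENDPOINT")
                or os.environ.get("HF_ENDPOINT")
                or "https://huggingface.co/")
    if not endpoint.endswith("/"):   # normalise so URLs can be concatenated directly
        endpoint += "/"
    return endpoint

# e.g. with MODEL_ENDPOINT=https://www.modelscope.cn/ a `-hf <user>/<model>` download
# resolves against https://www.modelscope.cn/<user>/<model>/resolve/main/<file>
print(resolve_model_endpoint())
```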

common/arg.cpp

Lines changed: 8 additions & 9 deletions
@@ -228,12 +228,13 @@ static bool common_download_file_single(const std::string & url, const std::stri
     curl_easy_setopt(curl.get(), CURLOPT_URL, url.c_str());
     curl_easy_setopt(curl.get(), CURLOPT_FOLLOWLOCATION, 1L);

+    http_headers.ptr = curl_slist_append(http_headers.ptr, "User-Agent: llama-cpp");
     // Check if hf-token or bearer-token was specified
     if (!bearer_token.empty()) {
         std::string auth_header = "Authorization: Bearer " + bearer_token;
         http_headers.ptr = curl_slist_append(http_headers.ptr, auth_header.c_str());
-        curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers.ptr);
     }
+    curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers.ptr);

 #if defined(_WIN32)
     // CURLSSLOPT_NATIVE_CA tells libcurl to use standard certificate store of
@@ -544,7 +545,10 @@ static struct common_hf_file_res common_get_hf_file(const std::string & hf_repo_
     curl_ptr curl(curl_easy_init(), &curl_easy_cleanup);
     curl_slist_ptr http_headers;
     std::string res_str;
-    std::string url = "https://huggingface.co/v2/" + hf_repo + "/manifests/" + tag;
+
+    std::string model_endpoint = get_model_endpoint();
+
+    std::string url = model_endpoint + "v2/" + hf_repo + "/manifests/" + tag;
     curl_easy_setopt(curl.get(), CURLOPT_URL, url.c_str());
     curl_easy_setopt(curl.get(), CURLOPT_NOPROGRESS, 1L);
     typedef size_t(*CURLOPT_WRITEFUNCTION_PTR)(void * ptr, size_t size, size_t nmemb, void * data);
@@ -659,13 +663,8 @@ static void common_params_handle_model(
         }
     }

-    std::string hf_endpoint = "https://huggingface.co/";
-    const char * hf_endpoint_env = getenv("HF_ENDPOINT");
-    if (hf_endpoint_env) {
-        hf_endpoint = hf_endpoint_env;
-        if (hf_endpoint.back() != '/') hf_endpoint += '/';
-    }
-    model.url = hf_endpoint + model.hf_repo + "/resolve/main/" + model.hf_file;
+    std::string model_endpoint = get_model_endpoint();
+    model.url = model_endpoint + model.hf_repo + "/resolve/main/" + model.hf_file;
     // make sure model path is present (for caching purposes)
     if (model.path.empty()) {
         // this is to avoid different repo having same file name, or same file name in different subdirs
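The net effect of the `common_download_file_single` / `common_get_hf_file` changes: every request now carries a `User-Agent: llama-cpp` header, the `Authorization: Bearer …` header is attached only when a token is configured, and the manifest URL is built from the resolved endpoint as `<endpoint>v2/<repo>/manifests/<tag>`. A hedged Python sketch of an equivalent request (the actual code uses libcurl in C++; the repo and tag values below are placeholders):

```python
import urllib.request

def fetch_manifest(endpoint: str, repo: str, tag: str, bearer_token: str = "") -> bytes:
    """Illustrative stand-in for common_get_hf_file(): GET <endpoint>v2/<repo>/manifests/<tag>."""
    url = f"{endpoint}v2/{repo}/manifests/{tag}"
    headers = {"User-Agent": "llama-cpp"}                  # now always attached
    if bearer_token:                                       # Authorization only when a token is set
        headers["Authorization"] = f"Bearer {bearer_token}"
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:              # follows redirects, like CURLOPT_FOLLOWLOCATION
        return resp.read()

# Placeholder usage (repo and tag are hypothetical):
# manifest = fetch_manifest("https://huggingface.co/", "some-user/some-model-GGUF", "latest")
```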

common/chat.cpp

Lines changed: 1 addition & 1 deletion
@@ -1622,7 +1622,7 @@ static common_chat_params common_chat_templates_apply_jinja(
     }

     // Hermes 2/3 Pro, Qwen 2.5 Instruct (w/ tools)
-    if (src.find("<tool_call>") != std::string::npos && params.json_schema.is_null()) {
+    if (src.find("<tool_call>") != std::string::npos && params.json_schema.is_null() && params.tools.is_array() && params.json_schema.is_null()) {
         return common_chat_params_init_hermes_2_pro(tmpl, params);
     }


common/common.cpp

Lines changed: 17 additions & 2 deletions
@@ -830,7 +830,7 @@ std::string fs_get_cache_directory() {
     if (getenv("LLAMA_CACHE")) {
         cache_directory = std::getenv("LLAMA_CACHE");
     } else {
-#ifdef __linux__
+#if defined(__linux__) || defined(__FreeBSD__) || defined(_AIX)
         if (std::getenv("XDG_CACHE_HOME")) {
             cache_directory = std::getenv("XDG_CACHE_HOME");
         } else {
@@ -840,7 +840,9 @@ std::string fs_get_cache_directory() {
         cache_directory = std::getenv("HOME") + std::string("/Library/Caches/");
 #elif defined(_WIN32)
         cache_directory = std::getenv("LOCALAPPDATA");
-#endif // __linux__
+#else
+#  error Unknown architecture
+#endif
         cache_directory = ensure_trailing_slash(cache_directory);
         cache_directory += "llama.cpp";
     }
@@ -1027,6 +1029,19 @@ struct common_init_result common_init_from_params(common_params & params) {
     return iparams;
 }

+std::string get_model_endpoint() {
+    const char * model_endpoint_env = getenv("MODEL_ENDPOINT");
+    // We still respect the use of environment-variable "HF_ENDPOINT" for backward-compatibility.
+    const char * hf_endpoint_env = getenv("HF_ENDPOINT");
+    const char * endpoint_env = model_endpoint_env ? model_endpoint_env : hf_endpoint_env;
+    std::string model_endpoint = "https://huggingface.co/";
+    if (endpoint_env) {
+        model_endpoint = endpoint_env;
+        if (model_endpoint.back() != '/') model_endpoint += '/';
+    }
+    return model_endpoint;
+}
+
 void common_set_adapter_lora(struct llama_context * ctx, std::vector<common_adapter_lora_info> & lora) {
     llama_clear_adapter_lora(ctx);
     for (auto & la : lora) {
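Two things change in `common/common.cpp`: `fs_get_cache_directory()` now treats FreeBSD and AIX like Linux (XDG paths) and fails the build on unknown platforms instead of silently producing an empty path, and the new `get_model_endpoint()` helper is what `common/arg.cpp` calls above. As a rough illustration of the cache-directory branching, a small Python sketch (assumptions: platform names as reported by `sys.platform`; not llama.cpp's actual code, which resolves this at compile time):

```python
import os
import sys

def llama_cache_directory() -> str:
    """Illustrative mirror of fs_get_cache_directory() after this commit."""
    if os.environ.get("LLAMA_CACHE"):
        return os.path.join(os.environ["LLAMA_CACHE"], "")
    if sys.platform.startswith(("linux", "freebsd", "aix")):
        # XDG_CACHE_HOME takes priority, otherwise ~/.cache
        base = os.environ.get("XDG_CACHE_HOME") or os.path.join(os.environ["HOME"], ".cache")
    elif sys.platform == "darwin":
        base = os.path.join(os.environ["HOME"], "Library", "Caches")
    elif sys.platform == "win32":
        base = os.environ["LOCALAPPDATA"]
    else:
        # the C++ version now turns this case into '#error Unknown architecture' at compile time
        raise RuntimeError("unknown platform")
    return os.path.join(base, "llama.cpp", "")

print(llama_cache_directory())
```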

common/common.h

Lines changed: 2 additions & 0 deletions
@@ -543,6 +543,8 @@ struct ggml_threadpool_params ggml_threadpool_params_from_cpu_params(const cpu_p
 // clear LoRA adapters from context, then apply new list of adapters
 void common_set_adapter_lora(struct llama_context * ctx, std::vector<common_adapter_lora_info> & lora);

+std::string get_model_endpoint();
+
 //
 // Batch utils
 //

convert_hf_to_gguf.py

Lines changed: 50 additions & 3 deletions
@@ -735,6 +735,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "d353350c764d8c3b39c763113960e4fb4919bea5fbf208a0e3b22e8469dc7406":
             # ref: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
             res = "llama4"
+        if chkhsh == "a1336059768a55c99a734006ffb02203cd450fed003e9a71886c88acf24fdbc2":
+            # ref: https://huggingface.co/THUDM/glm-4-9b-hf
+            res = "glm4"

         if res is None:
             logger.warning("\n")
@@ -4419,6 +4422,10 @@ def set_vocab(self):
         self._set_vocab_gpt2()

     def set_gguf_parameters(self):
+
+        # note: deepseek2 using MLA converts into MQA (ie: GQA with 1 group)
+        self.hparams["num_key_value_heads"] = 1
+
         super().set_gguf_parameters()
         hparams = self.hparams

@@ -4427,8 +4434,13 @@ def set_gguf_parameters(self):
         if "q_lora_rank" in hparams and hparams["q_lora_rank"] is not None:
             self.gguf_writer.add_q_lora_rank(hparams["q_lora_rank"])
         self.gguf_writer.add_kv_lora_rank(hparams["kv_lora_rank"])
-        self.gguf_writer.add_key_length(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
-        self.gguf_writer.add_value_length(hparams["v_head_dim"])
+
+        # note: deepseek2 using MLA converts into MQA with larger heads, then decompresses to MHA
+        self.gguf_writer.add_key_length(hparams["kv_lora_rank"] + hparams["qk_rope_head_dim"])
+        self.gguf_writer.add_value_length(hparams["kv_lora_rank"])
+        self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
+        self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
+
         self.gguf_writer.add_expert_feed_forward_length(hparams["moe_intermediate_size"])
         self.gguf_writer.add_expert_count(hparams["n_routed_experts"])
         self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"])
@@ -4497,6 +4509,26 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
         else:
             return []

+        # note: MLA with the absorption optimization, needs these two split and k_b_proj transposed
+        if name.endswith("kv_b_proj.weight"):
+            name_kb = name.replace("kv_b_proj", "k_b_proj")
+            name_vb = name.replace("kv_b_proj", "v_b_proj")
+
+            n_head_kv = self.hparams["num_key_value_heads"]
+            v_head_dim = self.hparams["v_head_dim"]
+            qk_nope_head_dim = self.hparams["qk_nope_head_dim"]
+
+            assert data_torch.shape[0] == n_head_kv * (v_head_dim + qk_nope_head_dim)
+
+            kv_b = data_torch.view(n_head_kv, v_head_dim + qk_nope_head_dim, data_torch.shape[-1])
+            k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
+            k_b = k_b.transpose(1, 2)
+
+            return [
+                (self.map_tensor_name(name_kb), k_b),
+                (self.map_tensor_name(name_vb), v_b)
+            ]
+
         return [(self.map_tensor_name(name), data_torch)]

     def prepare_tensors(self):
@@ -4897,6 +4929,22 @@ def prepare_tensors(self):
             self.gguf_writer.add_max_alibi_bias(self.max_alibi_bias)


+@Model.register("Glm4ForCausalLM")
+class Glm4Model(Model):
+    model_arch = gguf.MODEL_ARCH.GLM4
+
+    def set_vocab(self):
+        self._set_vocab_gpt2()
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+        if self.hparams.get("rope_scaling") is not None and "factor" in self.hparams["rope_scaling"]:
+            if self.hparams["rope_scaling"].get("type") == "yarn":
+                self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
+                self.gguf_writer.add_rope_scaling_factor(self.hparams["rope_scaling"]["factor"])
+                self.gguf_writer.add_rope_scaling_orig_ctx_len(self.hparams["rope_scaling"]["original_max_position_embeddings"])
+
+
 @Model.register("GlmForCausalLM", "ChatGLMModel", "ChatGLMForConditionalGeneration")
 class ChatGLMModel(Model):
     model_arch = gguf.MODEL_ARCH.CHATGLM
@@ -5588,7 +5636,6 @@ def main() -> None:
     with torch.inference_mode():
         output_type = ftype_map[args.outtype]
         model_architecture = hparams["architectures"][0]
-
         try:
             model_class = Model.from_model_architecture(model_architecture)
         except NotImplementedError:
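The `kv_b_proj` handling in `modify_tensors` is the core of the MLA conversion: the fused KV up-projection is viewed per KV head, split into the no-RoPE key block and the value block, and the key block is transposed so it can be used in the "absorbed" attention form. A self-contained sketch of that reshape with made-up dimensions (the sizes below are illustrative placeholders, not read from any real checkpoint):

```python
import torch

# Illustrative MLA dimensions (placeholders, not from a real config.json)
n_head_kv        = 16    # KV heads in the original checkpoint
qk_nope_head_dim = 128   # no-RoPE part of each key head
v_head_dim       = 128   # value head size
kv_lora_rank     = 512   # width of the compressed KV latent

# kv_b_proj maps the kv_lora_rank latent to per-head (nope-key + value) channels
kv_b_proj = torch.randn(n_head_kv * (qk_nope_head_dim + v_head_dim), kv_lora_rank)

# Same steps as the converter: view per head, split, transpose the key part
kv_b = kv_b_proj.view(n_head_kv, v_head_dim + qk_nope_head_dim, kv_b_proj.shape[-1])
k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
k_b = k_b.transpose(1, 2)      # stored transposed for the absorption trick

print(k_b.shape)  # torch.Size([16, 512, 128])  -> becomes the k_b_proj tensor
print(v_b.shape)  # torch.Size([16, 128, 512])  -> becomes the v_b_proj tensor
```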

convert_hf_to_gguf_update.py

Lines changed: 1 addition & 0 deletions
@@ -114,6 +114,7 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "trillion", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/trillionlabs/Trillion-7B-preview", },
     {"name": "bailingmoe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/inclusionAI/Ling-lite", },
     {"name": "llama4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct", },
+    {"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", },
 ]


docs/build.md

Lines changed: 4 additions & 2 deletions
@@ -259,8 +259,6 @@ You can download it from your Linux distro's package manager or from here: [ROCm
     cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release \
         && cmake --build build --config Release -- -j 16
     ```
-On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting `-DGGML_HIP_UMA=ON`.
-However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).

 To enhance flash attention performance on RDNA3+ or CDNA architectures, you can utilize the rocWMMA library by enabling the `-DGGML_HIP_ROCWMMA_FATTN=ON` option. This requires rocWMMA headers to be installed on the build system.

@@ -296,6 +294,10 @@ You can download it from your Linux distro's package manager or from here: [ROCm
 The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
 If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.

+### Unified Memory
+
+On Linux it is possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`. However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
+
 ## Vulkan

 **Windows**
