
Commit f0f07bd

Merge branch 'master' into quantize

2 parents: 04c07b3 + 00681df

115 files changed: +5871 additions, −3304 deletions


.github/workflows/build.yml

Lines changed: 28 additions & 3 deletions

```diff
@@ -1063,21 +1063,46 @@ jobs:
       run: |
         git clone https://github.com/rocm/rocwmma --branch rocm-6.2.4 --depth 1

-    - name: Install
+    - name: Cache ROCm Installation
+      id: cache-rocm
+      uses: actions/cache@v4
+      with:
+        path: C:\Program Files\AMD\ROCm
+        key: rocm-6.1-${{ runner.os }}-v1
+        restore-keys: |
+          rocm-6.1-${{ runner.os }}-
+
+    - name: Install ROCm
+      if: steps.cache-rocm.outputs.cache-hit != 'true'
       id: depends
       run: |
         $ErrorActionPreference = "Stop"
         write-host "Downloading AMD HIP SDK Installer"
         Invoke-WebRequest -Uri "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-WinSvr2022-For-HIP.exe" -OutFile "${env:RUNNER_TEMP}\rocm-install.exe"
         write-host "Installing AMD HIP SDK"
         $proc = Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -PassThru
-        $proc.WaitForExit(600000)
+        $completed = $proc.WaitForExit(600000)
+        if (-not $completed) {
+          Write-Error "ROCm installation timed out after 10 minutes. Killing the process"
+          $proc.Kill()
+          exit 1
+        }
+        if ($proc.ExitCode -ne 0) {
+          Write-Error "ROCm installation failed with exit code $($proc.ExitCode)"
+          exit 1
+        }
         write-host "Completed AMD HIP SDK installation"

     - name: Verify ROCm
       id: verify
       run: |
-        & 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' --version
+        # Find and test ROCm installation
+        $clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
+        if (-not $clangPath) {
+          Write-Error "ROCm installation not found"
+          exit 1
+        }
+        & $clangPath.FullName --version

     - name: Install ccache
       uses: ggml-org/[email protected]
```

.github/workflows/release.yml

Lines changed: 28 additions & 3 deletions

```diff
@@ -544,27 +544,52 @@ jobs:
       run: |
         git clone https://github.com/rocm/rocwmma --branch rocm-6.2.4 --depth 1

+    - name: Cache ROCm Installation
+      id: cache-rocm
+      uses: actions/cache@v4
+      with:
+        path: C:\Program Files\AMD\ROCm
+        key: rocm-6.1-${{ runner.os }}-v1
+        restore-keys: |
+          rocm-6.1-${{ runner.os }}-
+
     - name: ccache
       uses: ggml-org/[email protected]
       with:
         key: windows-latest-cmake-hip-${{ matrix.name }}-x64
         evict-old-files: 1d

-    - name: Install
+    - name: Install ROCm
+      if: steps.cache-rocm.outputs.cache-hit != 'true'
       id: depends
       run: |
         $ErrorActionPreference = "Stop"
         write-host "Downloading AMD HIP SDK Installer"
         Invoke-WebRequest -Uri "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-WinSvr2022-For-HIP.exe" -OutFile "${env:RUNNER_TEMP}\rocm-install.exe"
         write-host "Installing AMD HIP SDK"
         $proc = Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -PassThru
-        $proc.WaitForExit(600000)
+        $completed = $proc.WaitForExit(600000)
+        if (-not $completed) {
+          Write-Error "ROCm installation timed out after 10 minutes. Killing the process"
+          $proc.Kill()
+          exit 1
+        }
+        if ($proc.ExitCode -ne 0) {
+          Write-Error "ROCm installation failed with exit code $($proc.ExitCode)"
+          exit 1
+        }
         write-host "Completed AMD HIP SDK installation"

     - name: Verify ROCm
       id: verify
       run: |
-        & 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' --version
+        # Find and test ROCm installation
+        $clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
+        if (-not $clangPath) {
+          Write-Error "ROCm installation not found"
+          exit 1
+        }
+        & $clangPath.FullName --version

     - name: Build
       id: cmake_build
```
CONTRIBUTING.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -16,6 +16,9 @@
 - Use the following format for the squashed commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : fix typo in utils.py (#1234)`
 - Optionally pick a `<module>` from here: https://github.com/ggml-org/llama.cpp/wiki/Modules
 - Consider adding yourself to [CODEOWNERS](CODEOWNERS)
+- Let authors, who are also collaborators, merge their own PRs
+- When merging a PR by a contributor, make sure you have a good understanding of the changes
+- Be mindful of maintenance: most of the work going into a feature happens after the PR is merged. If the PR author is not committed to contribute long-term, someone else needs to take responsibility (you)

 # Coding guidelines
```

common/chat.cpp

Lines changed: 139 additions & 0 deletions

```diff
@@ -631,6 +631,7 @@ const char * common_chat_format_name(common_chat_format format) {
         case COMMON_CHAT_FORMAT_FIREFUNCTION_V2: return "FireFunction v2";
         case COMMON_CHAT_FORMAT_FUNCTIONARY_V3_2: return "Functionary v3.2";
         case COMMON_CHAT_FORMAT_FUNCTIONARY_V3_1_LLAMA_3_1: return "Functionary v3.1 Llama 3.1";
+        case COMMON_CHAT_FORMAT_DEEPSEEK_V3_1: return "DeepSeek V3.1";
         case COMMON_CHAT_FORMAT_HERMES_2_PRO: return "Hermes 2 Pro";
         case COMMON_CHAT_FORMAT_COMMAND_R7B: return "Command R7B";
         case COMMON_CHAT_FORMAT_GRANITE: return "Granite";
@@ -698,11 +699,13 @@ static void parse_json_tool_calls(
     size_t from = std::string::npos;
     auto first = true;
     while (true) {
+        auto start_pos = builder.pos();
         auto res = function_regex_start_only && first
             ? builder.try_consume_regex(*function_regex_start_only)
             : function_regex
                 ? builder.try_find_regex(*function_regex, from)
                 : std::nullopt;
+
         if (res) {
             std::string name;
             if (get_function_name) {
@@ -737,6 +740,8 @@ static void parse_json_tool_calls(
                 return;
             }
             throw common_chat_msg_partial_exception("incomplete tool call");
+        } else {
+            builder.move_to(start_pos);
         }
         break;
     }
@@ -1388,6 +1393,71 @@ static common_chat_params common_chat_params_init_deepseek_r1(const common_chat_
     }
     return data;
 }
+
+static common_chat_params common_chat_params_init_deepseek_v3_1(const common_chat_template & tmpl, const struct templates_params & inputs) {
+    common_chat_params data;
+
+    // Pass thinking context for DeepSeek V3.1 template
+    json additional_context = {
+        {"thinking", inputs.enable_thinking},
+    };
+
+    auto prompt = apply(tmpl, inputs,
+        /* messages_override= */ inputs.messages,
+        /* tools_override= */ std::nullopt,
+        additional_context);
+    data.prompt = prompt;
+    data.format = COMMON_CHAT_FORMAT_DEEPSEEK_V3_1;
+    if (string_ends_with(data.prompt, "<think>")) {
+        if (!inputs.enable_thinking) {
+            data.prompt += "</think>";
+        } else {
+            data.thinking_forced_open = true;
+        }
+    }
+    if (inputs.tools.is_array() && !inputs.tools.empty()) {
+        data.grammar_lazy = inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_REQUIRED && inputs.json_schema.is_null();
+        data.grammar = build_grammar([&](const common_grammar_builder & builder) {
+            std::vector<std::string> tool_rules;
+            foreach_function(inputs.tools, [&](const json & tool) {
+                const auto & function = tool.at("function");
+                std::string name = function.at("name");
+                auto parameters = function.at("parameters");
+                builder.resolve_refs(parameters);
+                tool_rules.push_back(builder.add_rule(name + "-call",
+                    "( \"<|tool▁call▁begin|>\" )? \"" + name + "<|tool▁sep|>"
+                    "\" " + builder.add_schema(name + "-args", parameters) + " "
+                    "\"<|tool▁call▁end|>\""));
+            });
+            // Distill Qwen 7B & 32B models seem confused re/ syntax of their tool call opening tag,
+            // so we accept common variants (then it's all constrained)
+            builder.add_rule("root",
+                std::string(data.thinking_forced_open ? "( \"</think>\" space )? " : "") +
+                "( \"<|tool▁calls▁begin|>\" | \"<|tool_calls_begin|>\" | \"<|tool calls begin|>\" | \"<|tool\\\\_calls\\\\_begin|>\" | \"<|tool▁calls|>\" ) "
+                "(" + string_join(tool_rules, " | ") + ")" + (inputs.parallel_tool_calls ? "*" : "") + " "
+                "\"<|tool▁calls▁end|>\""
+                " space");
+            data.grammar_triggers.push_back({
+                COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL,
+                // If thinking_forced_open, then we capture the </think> tag in the grammar,
+                // (important for required tool choice) and in the trigger's first capture (decides what is sent to the grammar)
+                std::string(data.thinking_forced_open ? "[\\s\\S]*?(</think>\\s*)" : "(?:<think>[\\s\\S]*?</think>\\s*)?") +
+                "(<|tool▁calls▁begin|>|<|tool_calls_begin|>|<|tool calls begin|>|<|tool\\\\_calls\\\\_begin|>|<|tool▁calls|>)[\\s\\S]*"
+            });
+            data.preserved_tokens = {
+                "<think>",
+                "</think>",
+                "<|tool▁calls▁begin|>",
+                "<|tool▁call▁begin|>",
+                "<|tool▁sep|>",
+                "<|tool▁call▁end|>",
+                "<|tool▁calls▁end|>",
+            };
+        });
+    }
+    return data;
+}
+
 static void common_chat_parse_deepseek_r1(common_chat_msg_parser & builder) {
     builder.try_parse_reasoning("<think>", "</think>");
     if (!builder.syntax().parse_tool_calls) {
@@ -1409,6 +1479,66 @@ static void common_chat_parse_deepseek_r1(common_chat_msg_parser & builder) {
         tool_calls_end);
 }
 
+static void common_chat_parse_deepseek_v3_1_content(common_chat_msg_parser & builder) {
+    static const common_regex function_regex("(?:<|tool▁call▁begin|>)?([^\\n<]+)(?:<|tool▁sep|>)");
+
+    static const common_regex close_regex("(?:[\\s]*)?<|tool▁call▁end|>");
+    static const common_regex tool_calls_begin("(?:<|tool▁calls▁begin|>|<|tool_calls_begin|>|<|tool calls begin|>|<|tool\\\\_calls\\\\_begin|>|<|tool▁calls|>)");
+    static const common_regex tool_calls_end("<|tool▁calls▁end|>");
+
+    if (!builder.syntax().parse_tool_calls) {
+        LOG_DBG("%s: not parse_tool_calls\n", __func__);
+        builder.add_content(builder.consume_rest());
+        return;
+    }
+
+    LOG_DBG("%s: parse_tool_calls\n", __func__);
+
+    parse_json_tool_calls(
+        builder,
+        /* block_open= */ tool_calls_begin,
+        /* function_regex_start_only= */ std::nullopt,
+        function_regex,
+        close_regex,
+        tool_calls_end);
+}
+
+static void common_chat_parse_deepseek_v3_1(common_chat_msg_parser & builder) {
+    // DeepSeek V3.1 outputs reasoning content between "<think>" and "</think>" tags, followed by regular content
+    // First try to parse using the standard reasoning parsing method
+    LOG_DBG("%s: thinking_forced_open: %s\n", __func__, std::to_string(builder.syntax().thinking_forced_open).c_str());
+
+    auto start_pos = builder.pos();
+    auto found_end_think = builder.try_find_literal("</think>");
+    builder.move_to(start_pos);
+
+    if (builder.syntax().thinking_forced_open && !builder.is_partial() && !found_end_think) {
+        LOG_DBG("%s: no end_think, not partial, adding content\n", __func__);
+        common_chat_parse_deepseek_v3_1_content(builder);
+    } else if (builder.try_parse_reasoning("<think>", "</think>")) {
+        // If reasoning was parsed successfully, the remaining content is regular content
+        LOG_DBG("%s: parsed reasoning, adding content\n", __func__);
+        // </think><|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>NAME\n```json\nJSON\n```<|tool▁call▁end|><|tool▁calls▁end|>
+        common_chat_parse_deepseek_v3_1_content(builder);
+    } else {
+        if (builder.syntax().reasoning_format == COMMON_REASONING_FORMAT_NONE) {
+            LOG_DBG("%s: reasoning_format none, adding content\n", __func__);
+            common_chat_parse_deepseek_v3_1_content(builder);
+            return;
+        }
+        // If no reasoning tags found, check if we should treat everything as reasoning
+        if (builder.syntax().thinking_forced_open) {
+            // If thinking is forced open but no tags found, treat everything as reasoning
+            LOG_DBG("%s: thinking_forced_open, adding reasoning content\n", __func__);
+            builder.add_reasoning_content(builder.consume_rest());
+        } else {
+            LOG_DBG("%s: no thinking_forced_open, adding content\n", __func__);
+            // <|tool▁call▁begin|>NAME<|tool▁sep|>JSON<|tool▁call▁end|>
+            common_chat_parse_deepseek_v3_1_content(builder);
+        }
+    }
+}
+
 static common_chat_params common_chat_params_init_gpt_oss(const common_chat_template & tmpl, const struct templates_params & inputs) {
     common_chat_params data;
     auto prompt = apply(tmpl, inputs);
@@ -2365,6 +2495,12 @@ static common_chat_params common_chat_templates_apply_jinja(
         }
     }
 
+    // DeepSeek V3.1: detect based on specific patterns in the template
+    if (src.find("message['prefix'] is defined and message['prefix'] and thinking") != std::string::npos &&
+        params.json_schema.is_null()) {
+        return common_chat_params_init_deepseek_v3_1(tmpl, params);
+    }
+
     // DeepSeek R1: use handler in all cases except json schema (thinking / tools).
     if (src.find("<|tool▁calls▁begin|>") != std::string::npos && params.json_schema.is_null()) {
         return common_chat_params_init_deepseek_r1(tmpl, params);
@@ -2537,6 +2673,9 @@ static void common_chat_parse(common_chat_msg_parser & builder) {
         case COMMON_CHAT_FORMAT_DEEPSEEK_R1:
            common_chat_parse_deepseek_r1(builder);
            break;
+        case COMMON_CHAT_FORMAT_DEEPSEEK_V3_1:
+            common_chat_parse_deepseek_v3_1(builder);
+            break;
         case COMMON_CHAT_FORMAT_FUNCTIONARY_V3_2:
             common_chat_parse_functionary_v3_2(builder);
             break;
```
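The branching in `common_chat_parse_deepseek_v3_1` is easier to see with the streaming and tool-call machinery stripped away. A minimal Python sketch for a complete (non-partial) message — an illustration using a hypothetical helper, not the C++ builder API:

```python
def split_reasoning(text, thinking_forced_open):
    """Sketch of the reasoning/content split for a complete DeepSeek V3.1
    message (ignores partial/streamed output and tool-call parsing).
    Returns (reasoning, content)."""
    close = "</think>"
    if thinking_forced_open:
        # Prompt already ended with "<think>", so the model output starts
        # mid-reasoning without an opening tag.
        end = text.find(close)
        if end == -1:
            # Forced-open thinking but no closing tag in a complete
            # message: the parser falls through to plain content.
            return "", text
        return text[:end], text[end + len(close):]
    if text.startswith("<think>"):
        end = text.find(close)
        if end != -1:
            return text[len("<think>"):end], text[end + len(close):]
    # No reasoning tags at all: plain content.
    return "", text
```

The real parser additionally routes the remaining content through `common_chat_parse_deepseek_v3_1_content` to pick up tool-call blocks.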

common/chat.h

Lines changed: 1 addition & 0 deletions

```diff
@@ -107,6 +107,7 @@ enum common_chat_format {
     COMMON_CHAT_FORMAT_FIREFUNCTION_V2,
     COMMON_CHAT_FORMAT_FUNCTIONARY_V3_2,
     COMMON_CHAT_FORMAT_FUNCTIONARY_V3_1_LLAMA_3_1,
+    COMMON_CHAT_FORMAT_DEEPSEEK_V3_1,
     COMMON_CHAT_FORMAT_HERMES_2_PRO,
     COMMON_CHAT_FORMAT_COMMAND_R7B,
     COMMON_CHAT_FORMAT_GRANITE,
```

common/json-schema-to-grammar.cpp

Lines changed: 21 additions & 1 deletion

```diff
@@ -843,9 +843,10 @@ class SchemaConverter {
                 _build_object_rule(
                     properties, required, name,
                     schema.contains("additionalProperties") ? schema["additionalProperties"] : json()));
-        } else if ((schema_type.is_null() || schema_type == "object") && schema.contains("allOf")) {
+        } else if ((schema_type.is_null() || schema_type == "object" || schema_type == "string") && schema.contains("allOf")) {
             std::unordered_set<std::string> required;
             std::vector<std::pair<std::string, json>> properties;
+            std::map<std::string, size_t> enum_values;
             std::string hybrid_name = name;
             std::function<void(const json &, bool)> add_component = [&](const json & comp_schema, bool is_required) {
                 if (comp_schema.contains("$ref")) {
@@ -857,6 +858,14 @@ class SchemaConverter {
                             required.insert(prop.key());
                         }
                     }
+                } else if (comp_schema.contains("enum")) {
+                    for (const auto & v : comp_schema["enum"]) {
+                        const auto rule = _generate_constant_rule(v);
+                        if (enum_values.find(rule) == enum_values.end()) {
+                            enum_values[rule] = 0;
+                        }
+                        enum_values[rule] += 1;
+                    }
                 } else {
                     // todo warning
                 }
@@ -870,6 +879,17 @@ class SchemaConverter {
                     add_component(t, true);
                 }
             }
+            if (!enum_values.empty()) {
+                std::vector<std::string> enum_intersection;
+                for (const auto & p : enum_values) {
+                    if (p.second == schema["allOf"].size()) {
+                        enum_intersection.push_back(p.first);
+                    }
+                }
+                if (!enum_intersection.empty()) {
+                    return _add_rule(rule_name, "(" + string_join(enum_intersection, " | ") + ") space");
+                }
+            }
             return _add_rule(rule_name, _build_object_rule(properties, required, hybrid_name, json()));
         } else if ((schema_type.is_null() || schema_type == "array") && (schema.contains("items") || schema.contains("prefixItems"))) {
             json items = schema.contains("items") ? schema["items"] : schema["prefixItems"];
```
convert_hf_to_gguf.py

Lines changed: 16 additions & 0 deletions

```diff
@@ -5128,6 +5128,20 @@ class EmbeddingGemma(Gemma3Model):
 
     def set_gguf_parameters(self):
         super().set_gguf_parameters()
+
+        # Override the sliding window size as it gets adjusted by the Gemma3TextConfig
+        # constructor. We want to use the value from the original model's config.json.
+        # ref: https://github.com/huggingface/transformers/pull/40700
+        with open(self.dir_model / "config.json", "r", encoding="utf-8") as f:
+            config = json.load(f)
+        orig_sliding_window = config.get("sliding_window")
+        if orig_sliding_window is None:
+            raise ValueError("sliding_window not found in model config - this is required for the model")
+
+        logger.info(f"Using original sliding_window from config: {orig_sliding_window} "
+                    f"instead of {self.hparams['sliding_window']}")
+        self.gguf_writer.add_sliding_window(orig_sliding_window)
 
         self._try_set_pooling_type()
 
@@ -6687,6 +6701,8 @@ def set_gguf_parameters(self):
         self.gguf_writer.add_embedding_length(self.hparams["d_model"])
         self.gguf_writer.add_feed_forward_length(self.hparams["d_ff"])
         self.gguf_writer.add_block_count(self.hparams["num_layers"])
+        if (dec_n_layer := self.hparams.get("num_decoder_layers")) is not None:
+            self.gguf_writer.add_decoder_block_count(dec_n_layer)
         self.gguf_writer.add_head_count(self.hparams["num_heads"])
         self.gguf_writer.add_key_length(self.hparams["d_kv"])
         self.gguf_writer.add_value_length(self.hparams["d_kv"])
```
docs/backend/CANN.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -314,3 +314,7 @@ Converting the matmul weight format from ND to NZ to improve performance. Enable
 ### GGML_CANN_ACL_GRAPH
 
 Operators are executed using ACL graph execution, rather than in op-by-op (eager) mode. Enabled by default.
+
+### GGML_CANN_GRAPH_CACHE_CAPACITY
+
+Maximum number of compiled CANN graphs kept in the LRU cache, default is 12. When the number of cached graphs exceeds this capacity, the least recently used graph will be evicted.
```
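The eviction policy described in the new `GGML_CANN_GRAPH_CACHE_CAPACITY` setting is a standard capacity-bounded LRU cache. A minimal Python sketch of the behaviour (illustrative only — the CANN backend implements this in C++):

```python
from collections import OrderedDict

class GraphLRUCache:
    """Capacity-bounded LRU: lookups refresh recency, inserts beyond
    capacity evict the least recently used entry."""
    def __init__(self, capacity=12):
        self.capacity = capacity
        self._graphs = OrderedDict()

    def get(self, key):
        if key not in self._graphs:
            return None
        self._graphs.move_to_end(key)  # mark as most recently used
        return self._graphs[key]

    def put(self, key, graph):
        if key in self._graphs:
            self._graphs.move_to_end(key)
        self._graphs[key] = graph
        if len(self._graphs) > self.capacity:
            self._graphs.popitem(last=False)  # evict least recently used
```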
