### `CONTRIBUTING.md` (2 additions, 0 deletions)

```diff
@@ -1,10 +1,12 @@
 # Pull requests (for contributors)
 
+- llama.cpp uses the ggml tensor library for model evaluation. If you are unfamiliar with ggml, consider taking a look at the [examples in the ggml repository](https://github.com/ggml-org/ggml/tree/master/examples/). [simple](https://github.com/ggml-org/ggml/tree/master/examples/simple) shows the bare minimum for using ggml. [gpt-2](https://github.com/ggml-org/ggml/tree/master/examples/gpt-2) has minimal implementations for language model inference using GPT-2. [mnist](https://github.com/ggml-org/ggml/tree/master/examples/mnist) demonstrates how to train and evaluate a simple image classifier
 - Test your changes:
   - Execute [the full CI locally on your machine](ci/README.md) before publishing
   - Verify that the perplexity and the performance are not affected negatively by your changes (use `llama-perplexity` and `llama-bench`)
   - If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
   - If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
+- Create separate PRs for each feature or fix. Avoid combining unrelated changes in a single PR
 - Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
 - If your PR becomes stale, don't hesitate to ping the maintainers in the comments
```
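For readers new to ggml, the following is a minimal sketch of the workflow the new bullet points at (context, tensors, graph, compute). It follows the spirit of the linked `simple` example; exact function names and signatures vary between ggml versions, so treat it as an illustration rather than canonical API usage.

```c
// Minimal ggml sketch: add two float vectors on the CPU.
// Assumes a ggml version exposing the classic context/graph API.
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // all tensors and graph metadata live in one arena allocated up front
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    for (int i = 0; i < 4; i++) {
        ggml_set_f32_1d(a, i, (float) i); // a = [0, 1, 2, 3]
        ggml_set_f32_1d(b, i, 10.0f);     // b = [10, 10, 10, 10]
    }

    // building `c` only records the op; nothing is computed yet
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    for (int i = 0; i < 4; i++) {
        printf("c[%d] = %.1f\n", i, ggml_get_f32_1d(c, i));
    }

    ggml_free(ctx);
    return 0;
}
```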
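The testing bullets above map onto concrete commands. A hypothetical pre-PR checklist, assuming a CMake build and placeholder model and dataset paths:

```sh
# build in Release mode
cmake -B build && cmake --build build --config Release

# perplexity should not regress versus master
./build/bin/llama-perplexity -m models/llama-2-7b.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw

# token throughput should not regress versus master
./build/bin/llama-bench -m models/llama-2-7b.Q4_0.gguf

# after ggml changes: cross-check operator results across the available backends
./build/bin/test-backend-ops
```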
### `docs/backend/SYCL.md` (14 additions, 2 deletions)

```diff
@@ -42,6 +42,16 @@ The following release is verified with good quality:
 
 ## News
 
+- 2025.2
+  - Optimize MUL_MAT Q4_0 on Intel GPU for all dGPUs and built-in GPUs since MTL. This increases LLM performance (llama-2-7b.Q4_0.gguf) by 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC).
+
+    | GPU      | Base tokens/s | Increased tokens/s | Percent |
+    |----------|---------------|--------------------|---------|
+    | PVC 1550 | 39            | 73                 | +87%    |
+    | Flex 170 | 39            | 50                 | +28%    |
+    | Arc770   | 42            | 55                 | +30%    |
+    | MTL      | 13            | 16                 | +23%    |
+    | ARL-H    | 14            | 17                 | +21%    |
+
 - 2024.11
   - Use syclcompat to improve the performance on some platforms. This requires oneAPI 2025.0 or newer.
```
A later hunk in the same file adds one row to the runtime environment-variable table:

```diff
 | GGML_SYCL_DEBUG       | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
+| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimized features based on Intel GPU type, to compare the performance increase |
 | ZES_ENABLE_SYSMAN     | 0 (default) or 1 | Support getting free GPU memory via sycl::aspect::ext_intel_free_memory.<br>Recommended when --split-mode = layer |
```
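The new `GGML_SYCL_DISABLE_OPT` toggle is intended for A/B comparison. A sketch of how one might measure the gain, with placeholder binary and model paths:

```sh
# run the same benchmark with and without the Intel-GPU-specific optimizations
GGML_SYCL_DISABLE_OPT=0 ./build/bin/llama-bench -m models/llama-2-7b.Q4_0.gguf  # optimized (default)
GGML_SYCL_DISABLE_OPT=1 ./build/bin/llama-bench -m models/llama-2-7b.Q4_0.gguf  # unoptimized baseline
```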
A further hunk, whose file header is cut off in this excerpt, covers GPU device selection and unified memory:

```diff
 The environment variable [`MUSA_VISIBLE_DEVICES`](https://docs.mthreads.com/musa-sdk/musa-sdk-doc-online/programming_guide/Z%E9%99%84%E5%BD%95/) can be used to specify which GPU(s) will be used.
 
 The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted.
```
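Hedged usage sketches for these two variables (binary name and model path are placeholders):

```sh
# pin inference to the first MUSA device
MUSA_VISIBLE_DEVICES=0 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Hello"

# on CUDA under Linux: spill to system RAM instead of failing when VRAM runs out
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Hello"
```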