sync : ggml #2561

ggerganov · 2024-11-15T06:44:16Z

This sync reflects the state before the ggml-org/llama.cpp#10256 changes. After this sync, I will start another one that will propagate the ggml-org/llama.cpp#10256 updates.

… MobileVLM model. (llama/9763)

* ggml: Add POOL2D OP for GPU ACC to the Vulkan.
  - The MobileVLM model now supports inference acceleration through GPU by utilizing the Vulkan backend.
  - A GGML_OP_POOL_2D shader has been added. (Pooling)
  - The encoding performance of the CLIP model improved from 2.8s on the CPU to 0.7s on the GPU.
  Signed-off-by: Changyeon Kim <[email protected]>
* [fix] Correct the incorrect order of the parameters. fix casting to int.
  Signed-off-by: Changyeon Kim <[email protected]>
---------
Signed-off-by: Changyeon Kim <[email protected]>

* ggml : RISC-V vector gemv for q4_0_8x8
* ggml : Added WIP rvv q4_0_8x8 gemm
* ggml : Added initial implementation of rvv gemm
* ggml : optimize gemm to avoid register spillover
* ggml : Fix GCC rvv load alignment issue
* ggml : Format gemm rvv code
* ggml : Fix a typo in RVV q4_0_8_8 GEMM

* q6_k instruction reordering attempt
* better subtract method
* should be theoretically faster small improvement with shuffle lut, likely because all loads are already done at that stage
* optimize bit fiddling
* handle -32 offset separately. bsums exists for a reason!
* use shift
* Update ggml-quants.c
* have to update ci macos version to 13 as 12 doesn't work now. 13 is still x86

* metal : add quantized FA (vec) support ggml-ci
* metal : add quantized FA (non-vec) support
* metal : fix support check ggml-ci
* metal : clean-up
* metal : clean-up (cont)
* metal : fix shared memory calc + reduce smem + comments
* metal : float-correctness
* metal : minor [no ci]

* ggml : add initial BF16 support ggml-ci
* metal : add mul_mat_id BF16 support ggml-ci
* metal : check for bfloat support on the Metal device ggml-ci
* metal : better var names [no ci]
* metal : do not build bfloat kernels when not supported ggml-ci
* metal : try to fix BF16 support check ggml-ci
* metal : this should correctly check bfloat support

…eleration (llama/10133)

* rwkv6: rename to wkv6
* rwkv6: support avx2 avx512 armv8 armv9
* rwkv6: update cuda file name
* rwkv6: rename params
* wkv on sycl
* sycl: add some ops
* sycl: Enhance OP support judgment
* wkv6: drop armv9 and transfer to GGML style ggml-ci
* sync : ggml
* update the function to use appropriate types
* fix define error
* Update ggml/src/ggml-cpu.c
* add appropriate asserts
* move element-wise functions outside
* put the declaration outside the loop
* rewrite to be more inline with the common pattern for distributing threads
* use recommended way GGML_TENSOR_LOCALS
---------
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: Plamen Minev <[email protected]>
Co-authored-by: Yuri Khrustalev <[email protected]>
Co-authored-by: Meng, Hengyu <[email protected]>

* ggml : add ggml_flash_attn_ext_get_prec
* metal : use F16 precision in FA kernels ggml-ci
* metal : minor clean-up
* metal : compile-guard bf16 FA kernels ggml-ci
* build : remove obsolete compile flag [no ci]
* metal : prevent int overflows [no ci]
* cuda : disable BF16 FA ggml-ci
* metal : fix BF16 requirement for FA kernels ggml-ci
* make : clean-up [no ci]

…a/10156)

This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the FP32 datatype. It results in a consistent 90% improvement in input processing time, and a 20% to 80% improvement in output processing time, across various batch sizes. The patch is tested with the Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf models on an IBM POWER10 machine.

Signed-off-by: Amrita H S <[email protected]>

* tests: Fix memory bandwidth calculation for perf tests
  Add a flops calculation for flash attention.
  Add one GGML_OP_CPY perf test.
* vulkan: Optimize contiguous copies
  Add a variant of the copy shader for when the tensors are contiguous. Avoid the complex addressing calculations, and do four elements per invocation to hide some other overhead.
  Apply similar changes to the scale shader, since scale is always contiguous.
  Add a "progress bar" for shader compiles.

* Fixes broken build for the SYCL CUDA backend caused by non-explicit gemm call in outprod (merged in with RWKV6 in Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration #10133)
* Marks permuted MUL_MAT as unsupported to be able to run test-backend-ops
* Fixes asserts in norm to fix debug builds.

ggerganov and others added 30 commits November 15, 2024 08:37

scripts : update sync

62eeaaf

metal : fix minor string leaks (ggml/1004)

a53ac6f

cmake : make it possible linking ggml as external lib (ggml/1003)

787b66f

musa: workaround for Guilty Lockup in cleaning src0 (llama/10042)

85c678c

Signed-off-by: Xiaodong Ye <[email protected]>

llama : refactor model loader with backend registry (llama/10026)

a0ea7d4

ggml : fix memory leaks when loading invalid gguf files (llama/10094)

4cbca54

* ggml : fix gguf string leak when reading kv pairs fails
* ggml : avoid crashing with GGML_ABORT when the KV has an invalid type
* ggml : avoid crashing on failed memory allocations when loading a gguf file

kompute: add backend registry / device interfaces (llama/10045)

d378f19

Get in line with the other backends by supporting the newer backend/device registry interfaces. Signed-off-by: Sergio Lopez <[email protected]>

kompute: add mul_mat_q4_k shader (llama/10097)

1812284

This is a more or less direct translation from the Metal implementation to GLSL. Signed-off-by: Sergio Lopez <[email protected]>

ggml : check tensor name lengths in gguf files (llama/10100)

1c83752

llama : fix buffer checks for mamba and rwk (llama/10111)

6b7f6be

* llama : fix buffer checks for mamba and rwk
* llama : fix missing worst case flag during reserve
* cuda : fix supports_op for norm
* disable sched SET_CAUSE

build: fix build error in Windows env with OneAPI setup (llama/10107)

72cbb25

ggml : remove ggml_scratch (llama/10121)

6352fcd

ggml-ci

vulkan : improve ggml_vk_create_buffer error handling (llama/9898)

c28c6e8

llama : use smart pointers for ggml resources (llama/10117)

c7c5a95

llama : add simple-chat example (llama/10124)

749d287

* llama : add simple-chat example
---------
Co-authored-by: Xuan Son Nguyen <[email protected]>

metal : minor fixup in FA kernel (llama/10143)

384ee00

* metal : minor fixup in FA kernel ggml-ci
* metal : use the unrolled loop variable
* metal : remove unused var

ggml : move CPU backend to a separate file (llama/10144)

63f7286

CANN: adjust backend registry refactor. (llama/10158)

fa240b2

Remove buffer->iface.get_name, which was used in CANN, as it was removed in the backend registry refactor PR.

metal : move dequantize templates to beginning of MSL source (llama/0)

e75a453

metal : simplify f16 and f32 dequant kernels (llama/0)

e72fc8a

cuda : clear error after changing peer access (llama/10153)

801fdc2

fix build break on arm64 linux (llama/10166)

03b75f4

This fixes the build break from the recent changes to move the CPU backend to separate files ggml-org/llama.cpp#10144

ggml : fix q4xx mat mul, increase ggml_aligned_malloc alignment (llama/10167)

c7655fe

ggml : fix gelu tables initialization (llama/10172)

45ecfd9

ggml : fix arch check in bf16_to_fp32 (llama/10164)

7580d7e

ggml : adjust is_first_call init value (llama/10193)

354191f

ggml-ci

ggerganov and others added 21 commits November 15, 2024 08:37

fix q4_0_8_8 format for corrupted tokens issue (llama/10198)

3cae70b

Co-authored-by: EC2 Default User <[email protected]>

ggml : add ggml-cpu.h to the public headers (llama/10204)

44c0abf

metal : improve clarity (minor) (llama/10171)

2b11c93

metal : opt-in compile flag for BF16 (llama/10218)

6998ecf

* metal : opt-in compile flag for BF16 ggml-ci
* ci : use BF16 ggml-ci
* swift : switch back to v12
* metal : has_float -> use_float ggml-ci
* metal : fix BF16 check in MSL ggml-ci

ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213)

d54b0d2
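The commit title alone does not show the arithmetic, so the following is only a generic sketch of how a zero division can appear when an element count is small. It assumes a chunk size is derived by integer-dividing the element count `ne` by a number of parts, and that the chunk size is later used as a divisor. The function names and the `parts` parameter are hypothetical and are not taken from the ggml CUDA source.

```python
def floor_chunk(ne: int, parts: int) -> int:
    """Elements per part using plain integer division (rounds down).

    Hypothetical helper: when ne < parts the result is 0.
    """
    return ne // parts


def ceil_chunk(ne: int, parts: int) -> int:
    """Elements per part using ceiling division.

    Hypothetical helper: the result is at least 1 whenever ne > 0.
    """
    return (ne + parts - 1) // parts


def chunks_needed(ne: int, chunk: int) -> int:
    """Number of chunks of size `chunk` needed to cover `ne` elements.

    Raises ZeroDivisionError when chunk == 0.
    """
    return (ne + chunk - 1) // chunk
```

With `ne = 3` and `parts = 8`, `floor_chunk` returns 0, so passing that value to `chunks_needed` divides by zero, while `ceil_chunk` returns 1 and the later division is safe. Whether the actual fix uses ceiling division, padding, or a clamp is not stated in this PR.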

metal : hide debug messages from normal log

065fc31

metal : fix F32 accumulation in FA vec kernel (llama/10232)

be6999e

metal : fix build and some more comments (llama/10229)

e47d0eb

metal : reorder write loop in mul mat kernel + style (llama/10231)

8536022

* metal : reorder write loop
* metal : int -> short, style ggml-ci

vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (llama/10226)

db5507a

metal : more precise Q*K in FA vec kernel (llama/10247)

c4c4d88

vulkan: Throttle the number of shader compiles during the build step. (llama/10222)

b606ad2

Fixes #9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes" errors on Linux. This change applies the same throttling we use for multithreaded pipeline creation.
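The throttling idea described here can be illustrated with a counting semaphore that caps how many jobs run at once. This is a minimal sketch, not the code from the Vulkan shader generator: `run_throttled` and `max_concurrent` are hypothetical names, and plain callables stand in for the glslc invocations.

```python
import threading


def run_throttled(jobs, max_concurrent):
    """Run every callable in `jobs` on its own thread, but let at most
    `max_concurrent` of them execute at the same time.

    Returns (results, peak), where `peak` is the highest number of jobs
    observed running simultaneously.
    """
    gate = threading.Semaphore(max_concurrent)  # limits concurrent jobs
    lock = threading.Lock()                     # protects the counters
    results = [None] * len(jobs)
    state = {"active": 0, "peak": 0}

    def worker(index, job):
        with gate:  # blocks while max_concurrent jobs are already running
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            try:
                results[index] = job()
            finally:
                with lock:
                    state["active"] -= 1

    threads = [
        threading.Thread(target=worker, args=(i, job))
        for i, job in enumerate(jobs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, state["peak"]
```

In a build step, each job would typically launch one compiler process and wait for it, so the semaphore bounds the number of child processes, and the pipes they need, that exist at any moment.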

sync : ggml

3c337b2

whisper : fix build (#0)

3df5e16

talk-llama : sync llama.cpp

d93631c

ggerganov force-pushed the sync branch 3 times, most recently from 64fc546 to be609e3 Compare November 15, 2024 08:35

ggerganov added 3 commits November 15, 2024 13:47

build : fixes

19927ad

whisper : include ggml-cpu.h (#0)

463849a

cmake : fix ppc64 check (#0)

f94863e

ggerganov force-pushed the sync branch from 558a43a to f94863e Compare November 15, 2024 11:47

ggerganov merged commit e23721f into master Nov 15, 2024
87 of 89 checks passed

ggerganov deleted the sync branch November 15, 2024 13:21
