Releases · agray3/llama.cpp
b3901
perplexity : fix integer overflow (#9783)

* perplexity : fix integer overflow
* perplexity : keep n_vocab as int and make appropriate casts
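For context, a minimal sketch of the bug class this fixes (sizes and names are illustrative, not the actual perplexity.cpp code): an offset into the logits buffer computed as `n_vocab * i` in 32-bit `int` arithmetic overflows on long runs with large vocabularies, and the fix described in the commit message is to keep `n_vocab` as `int` but cast before multiplying.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative sizes: a large vocab and a long evaluation run.
    const int n_vocab = 128256;
    const int n_ctx   = 32768;

    // Bug pattern: n_vocab * i computed in 32-bit int arithmetic overflows
    // once the product exceeds INT_MAX (~2.1e9); here the true value of the
    // last-row offset is ~4.2e9.
    //
    // Fix pattern per the commit message: keep n_vocab as int, but cast one
    // operand so the multiplication itself happens in 64-bit arithmetic.
    const int64_t offset = (int64_t) n_vocab * (n_ctx - 1);

    printf("last-row offset: %lld floats\n", (long long) offset);
    return 0;
}
```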
b3774
llama : support IBM Granite architecture (#9412)

* feat(gguf-py): Add Granite model and params to gguf-py
* feat(convert_hf_to_gguf): Add registration and param setup for Granite
* feat(llama.cpp): Add config parsing for Granite multiplier params
* feat(llama.cpp): First pass at a full port of Granite's deviations from llama. Something is still not working right, since the results are mostly terrible, but it occasionally produces relevant results at this point, so _something_ is working.
* fix(llama.cpp): Determine Granite language 3b instruct by vocab size
* fix(convert_hf_to_gguf): Use LlamaModel as the base for GraniteModel, since the defaults in LlamaModel are needed for Granite as well
* fix(llama.cpp): Switch Granite param names to use _scale for consistency, since the other scalar multipliers are called *_scale
* fix(convert_hf_to_gguf/gguf-py): _multiplier -> _scale; the transformers names with _multiplier are now converted to the _scale equivalents during conversion
* fix(llama.cpp): Use a separate switch clause for granite in llm_load_hparams

Branch: GraniteLM
Signed-off-by: Gabe Goodhart <[email protected]>
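The Granite multipliers mentioned above are plain scalar scales applied at fixed points in an otherwise llama-shaped forward pass. A minimal sketch of the residual case, with illustrative field names that do not claim to match llama.cpp's actual hparams symbols:

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: field names are assumptions, not llama.cpp's hparams.
struct granite_scales {
    float embedding_scale; // multiplies token embeddings at the input
    float attention_scale; // used in place of the usual 1/sqrt(head_dim)
    float residual_scale;  // damps each residual branch before the add
    float logit_scale;     // divides the final logits
};

// Granite-style residual connection: out = x + branch(x) * residual_scale,
// versus plain llama's out = x + branch(x).
static std::vector<float> granite_residual_add(const std::vector<float> & x,
                                               const std::vector<float> & branch_out,
                                               const granite_scales & s) {
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] + branch_out[i] * s.residual_scale;
    }
    return out;
}
```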
b3577
llama : model-based max number of graph nodes calculation (#8970)

* llama : model-based max number of graph nodes calculation
* Update src/llama.cpp

Co-authored-by: slaren <[email protected]>
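The idea, sketched below with illustrative constants (the floor and per-tensor ratio actually chosen in the PR may differ): derive the graph-node cap from the model's tensor count rather than a single hard-coded maximum, so very large models do not overflow the graph while small ones do not pay for an oversized allocation.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch only: constants are assumptions, not the values from the PR.
static size_t max_graph_nodes(size_t n_model_tensors) {
    const size_t min_nodes        = 8192; // floor so small models still get a sane graph
    const size_t nodes_per_tensor = 8;    // rough budget of graph nodes per model tensor
    return std::max(min_nodes, n_model_tensors * nodes_per_tensor);
}
```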
b3549
scripts : sync cann files (#0)
b3342
common : preallocate sampling token data vector (#8363)

Calling `emplace_back` repeatedly is slower than preallocating the vector to the vocab size and inserting the data directly. Some rudimentary profiling with `chrono` shows this improves the performance of this block of code from ~500us/op to ~40us/op. Overall, this slightly improves sampling performance, with a more substantial impact on the `examples/lookahead` implementation -- I am able to see a ~10% performance boost in lookahead inference.
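A standalone sketch of the before/after pattern, where `token_data` stands in for the real `llama_token_data` and the surrounding code is illustrative:

```cpp
#include <cstdint>
#include <vector>

struct token_data {
    int32_t id;
    float   logit;
    float   p;
};

// Before (slower): grow the vector one element at a time. Even with the
// capacity reserved, the per-call bookkeeping of emplace_back adds up
// when this runs once per sampled token.
static std::vector<token_data> collect_slow(const float * logits, int n_vocab) {
    std::vector<token_data> cur;
    cur.reserve(n_vocab);
    for (int id = 0; id < n_vocab; ++id) {
        cur.emplace_back(token_data{id, logits[id], 0.0f});
    }
    return cur;
}

// After (faster, per the commit's ~500us -> ~40us measurement): size the
// vector up front and write each slot directly by index.
static std::vector<token_data> collect_fast(const float * logits, int n_vocab) {
    std::vector<token_data> cur(n_vocab);
    for (int id = 0; id < n_vocab; ++id) {
        cur[id] = token_data{id, logits[id], 0.0f};
    }
    return cur;
}
```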
b3082
Improve hipBLAS support in CMake (#7696)

* Improve hipBLAS support in CMake. This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK.
* Set ROCM_PATH correctly
b3072
llama : avoid double token-to-piece cache (#7654)
b3044
ggml : fix loongarch build (O2 issue) (#7636)
b3008
metal : add GGML_OP_REPEAT kernels (#7557)
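For reference, a plain-C++ illustration of the op's semantics in the 1-D case; ggml's real op generalizes this to N-D tensors, and the kernels added here presumably let graphs containing GGML_OP_REPEAT run on the Metal backend:

```cpp
#include <cstdio>
#include <vector>

// Repeat (tile) src until it fills dst_len elements, matching the
// modulo-indexing semantics of a 1-D repeat.
static std::vector<float> repeat_1d(const std::vector<float> & src, size_t dst_len) {
    std::vector<float> dst(dst_len);
    for (size_t i = 0; i < dst_len; ++i) {
        dst[i] = src[i % src.size()];
    }
    return dst;
}

int main() {
    const std::vector<float> src = {1.0f, 2.0f, 3.0f};
    const std::vector<float> dst = repeat_1d(src, 6); // {1,2,3,1,2,3}
    for (float v : dst) printf("%g ", v);
    printf("\n");
    return 0;
}
```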
b2987
add shared library build to the Windows release package (#7438)