Skip to content

Conversation

@jan-service-account
Copy link

Updates dev branch with latest release (b6724) from ggml-org/llama.cpp

sfallah and others added 9 commits October 9, 2025 09:39
…odules (ggml-org#16367)

* model: EmbeddingGemma sentence-transformers dense linear projections support

* model: add support for EmbeddingGemma SentenceTransformers dense linear projections

Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.

See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/

* model: add support for EmbeddingGemma SentenceTransformers dense linear projections

- converting model with dense-layers is optional
- introduced dense config params

* Update convert_hf_to_gguf.py

Co-authored-by: Daniel Bevenius <[email protected]>

* fixed formatting issues

* Update src/llama-graph.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims

* fix python lint

* fix ubuntu gcc build warning

* - fixed thread-safety test
- moved asserts to load_hparams

* - tidying up code
- simplifying graph-context expecting both dense weights

* minor : add TODO

---------

Co-authored-by: Daniel Bevenius <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* refactor to support soft_max_ext

* fix error and support soft_max_back

* rm unused functions

* fix format issue

---------

Co-authored-by: Zhang Jianyu <[email protected]>
* CANN: improve ACL graph matching

Record `ne` and `nb` information for src tensors and include them in the
graph matching check. This enhances the robustness of ACL graph matching
by preventing incorrect matches when src tensors share the same data
address but differ in shape or stride.

* CANN: add op_params match
* model-conversion : add support for SentenceTransformers

This commit adds support for models that use SentenceTransformer layers.

The motivation for this is that if converted model includes any of the
numbered layers specified in the original models repository then these
changes enable these models to be used and verified. Currently the
model-conversion only support the base model output without any of
the additional transformation layers.

Usage:
Convert the model that also includes the SentenceTransformer layers:
```console
(venv) $ export EMBEDDING_MODEL_PATH="~/google/embeddinggemma-300M"
(venv) make embedding-convert-model
```

Verify the produced embeddings from the converted model against the
original model embeddings:
```console
(venv) make embedding-verify-logits-st
```

The original model can be run using SentenceTransformer:
```console
(venv) make embedding-run-original-model-st
```

Run the converted model using "SentenceTransformer" layers whic
enables pooling and normalization:
```console
(venv) make embedding-run-converted-model-st
```

* add model-conversion example requirements

* add support for -st flag in embedding model conversion

This commit add support for the -st flag in the embedding model
conversion script. This will enable models to be converted using
sentence transformers dense layers.
* fix: let the model think in plaintext

* chore: npm run format + npm run build
* minor : code style

* server : fix prompt similarity calculation

* server : initial host-memory prompt caching

* cont

* server : refactor

* cont

* cont : make the server task of the slot const

* cont : minor [no ci]

* server : cache prompts and checkpoints only for completion tasks

* server : improve prompt caching logic

* cont : fix check for number of cached prompts [no ci]

* server : improve caching logic, add -cram CLI arg

* server : print prompt mismatch info

* cont : better naming [no ci]

* server : improve prompt cache loading logic

* server : add option to debug the slot contents (ggml-org#16482)

* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>

* server : add option to disable prompt cache

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
* ggml-cpu: optimize norm operation to use intrinsics or Accelerate

          rename function

          add endif macro comment

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Aaron Teo <[email protected]>

* implement s390x SIMD suggested by @taronaeo

* add TODO comment

* tidy up spaces

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Aaron Teo <[email protected]>
@jan-service-account jan-service-account merged commit 7034082 into dev Oct 10, 2025
3 checks passed
@jan-service-account jan-service-account deleted the update-dev-from-master-2025-10-10-00-33 branch October 10, 2025 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.