Labels: enhancement (New feature or request)
Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Add ability to choose a tokenizer before embedding
Motivation
Currently, the embedding example has a fairly complete flow that works very well with sentence-transformers/all-MiniLM-L6-v2 and other small models. Unfortunately, it does not work well with models outside the DistilBERT architecture; for example, all-mpnet-base-v2 performs poorly compared with the Python implementation. I suspect this is due to the tokenizer: llama.cpp's llama_tokenize is aimed at LLMs, not sentence embedders. It would be very valuable to make this part modular.
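For illustration, here is a minimal sketch (using Hugging Face transformers, not existing llama.cpp code) of the reference tokenization the Python pipeline applies for all-mpnet-base-v2; being able to plug in this kind of model-specific tokenizer before embedding is what the request is about:

```python
# Minimal sketch: show the token ids the Python pipeline feeds the model,
# produced by the model's own tokenizer rather than llama_tokenize.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

encoded = tok("This is an example sentence")
print(encoded["input_ids"])                              # model-specific token ids
print(tok.convert_ids_to_tokens(encoded["input_ids"]))   # corresponding sub-word tokens
```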
Possible Implementation
No response