
Prefixes for embedding models (possibly also others) #39

@schubidoo

Issue Description

Some embedding models require a task prefix on their input, notably Nomic (one of the models on the embedding model list in aichat). For other models a prefix is only recommended. However, when Nomic is set as the embedder, no such prefix is sent, at least according to the --loglevel debug output. It seems that only the raw chunks are sent when the embeddings are created, and only the raw query at retrieval time, both without a prefix.

Nomic expects:

"search_document:" as prefix during the creation of embedding vectors, and
"search_query:" during the retrieval process

It also supports "clustering" and "classification" as prefixes.
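To make the expected behavior concrete, here is a minimal sketch of how the prefixes could be applied before text is sent to the embedder. It is not aichat's code: the names `NOMIC_TASK_PREFIXES`, `add_task_prefix`, `prepare_chunks`, and `prepare_query` are hypothetical, and the single space after the colon is an assumption based on the examples in the Nomic model card.

```python
# Hypothetical helper, not part of aichat: prepends a Nomic task prefix.
# The four task names come from the list above; the trailing space after
# the colon is an assumption taken from the model card examples.
NOMIC_TASK_PREFIXES = {
    "search_document": "search_document: ",
    "search_query": "search_query: ",
    "clustering": "clustering: ",
    "classification": "classification: ",
}


def add_task_prefix(text: str, task: str) -> str:
    """Return text with the prefix for the given task prepended."""
    if task not in NOMIC_TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    prefix = NOMIC_TASK_PREFIXES[task]
    # Avoid double-prefixing text that already carries the prefix.
    if text.startswith(prefix):
        return text
    return prefix + text


def prepare_chunks(chunks: list[str]) -> list[str]:
    """Prefix document chunks before their embedding vectors are created."""
    return [add_task_prefix(chunk, "search_document") for chunk in chunks]


def prepare_query(query: str) -> str:
    """Prefix a user query before it is embedded for retrieval."""
    return add_task_prefix(query, "search_query")
```

With something like this in place, the debug output would show the prefixed strings being sent instead of the bare chunks and queries. Since other models use different prefixes or none at all, the mapping would probably need to be configurable per model.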

Not sure how critical this is, but the documentation on Hugging Face (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) says:

Important: the text prompt must include a task instruction prefix, instructing the model which task is being performed.

For example, if you are implementing a RAG application, you embed your documents as search_document: and embed your user queries as search_query:.

I classified this as a bug based on that description, but it is certainly not a critically breaking one...

Labels: bug, enhancement
