Support embeddings models #3252
Conversation
Review comment on this diff context:

```python
from pydantic_ai.models.instrumented import InstrumentationSettings
from pydantic_ai.providers import infer_provider

KnownEmbeddingModelName = TypeAliasType(
```

Add a test like this one to verify this is up to date:

```python
def test_known_model_names():  # pragma: lax no cover
```
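A hedged sketch of what such a test might look like; the import path and the Literal-based alias are assumptions by analogy with `KnownModelName`, not this PR's actual code:

```python
# Sketch only: assumes KnownEmbeddingModelName is a TypeAliasType over a
# Literal of "provider:model" strings, by analogy with KnownModelName.
from typing import get_args

from pydantic_ai.embeddings import KnownEmbeddingModelName  # assumed import path


def test_known_embedding_model_names():
    # TypeAliasType exposes the aliased type via __value__; for a Literal
    # alias, get_args() returns the string values.
    names = get_args(KnownEmbeddingModelName.__value__)
    assert all(':' in name for name in names)
```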
Thanks for starting this, and please do let me know if you need help :) One thing you might want to support from the start: embedding models have a limit on how many tokens of input they can handle, and most providers will raise an error when it is exceeded. All of this is well explained in the OpenAI cookbook. I would not necessarily truncate like the cookbook does and would still just raise, but I would be grateful to have `count_tokens` and `max_input_tokens` available from the model side. The only difficulty I see with this is that not all providers expose their tokenizers; Ollama, for example, does not. But it would still be nice to have for the providers that do support it, as it's a crucial step when you are trying to chunk a document for embedding. Edit: I am not suggesting that calling
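As a concrete illustration of the pattern being requested, here is a sketch using tiktoken directly rather than this PR's API; the limit value is OpenAI's commonly documented one for the `text-embedding-3` models:

```python
# Sketch of the requested behavior, not part of this PR: count tokens up
# front and raise instead of truncating. This only works for providers
# that publish their tokenizers (OpenAI via tiktoken; Ollama does not).
import tiktoken

MAX_INPUT_TOKENS = 8191  # commonly documented limit for text-embedding-3-*


def check_embedding_input(text: str) -> str:
    encoding = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(encoding.encode(text))
    if n_tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Input is {n_tokens} tokens; the model accepts at most {MAX_INPUT_TOKENS}"
        )
    return text
```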
gvanrossum left a comment:
I would like to be able to comment on the API, but there are no tests showing how to call it.
@gvanrossum I'll make some progress on the PR today, but this is the API as it stands today:

```python
import asyncio

from pydantic_ai.embeddings import Embedder

embedder = Embedder("openai:text-embedding-3-large")


async def main():
    result = await embedder.embed("Hello, world!")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

With Azure OpenAI you currently have to create the model and provider manually, but we'll make that more convenient:

```python
import asyncio

from pydantic_ai.embeddings import Embedder
from pydantic_ai.embeddings.openai import OpenAIEmbeddingModel
from pydantic_ai.providers.azure import AzureProvider

model = OpenAIEmbeddingModel("text-embedding-3-large", provider=AzureProvider())
embedder = Embedder(model)


async def main():
    result = await embedder.embed("Hello, world!")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```
Nice. Do you have a bulk API too? That's essential for typeagent. |
@gvanrossum Yep, the
@gvanrossum In case you'd like to give it a try pre-release, I've made some progress today, including support for
Unfortunately I haven't managed to get to this this week. Next week should be better.
Following this PR now!
It might be nice to be able to do this as a single function call so you don't always need to create the embedder ahead of time, but I'm not sure if this fits with the rest of Pydantic AI?
@stuartaxonHO I personally don't think it's worth adding a helper function when the "verbose" version is just
Think I was spoiled by the litellm version
I like @DouweM's current approach, as initializing the embedder ahead of time will always reduce embedding latency.
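For comparison, the "verbose" version without a helper is only a couple of lines; a sketch assuming the `Embedder` API shown earlier in the thread:

```python
from pydantic_ai.embeddings import Embedder

# Create the embedder once at import time and reuse it; constructing it
# per call would repeat model/provider setup on every request.
embedder = Embedder("openai:text-embedding-3-large")


async def embed_text(text: str):
    return await embedder.embed(text)
```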
@tomaarsen Note that I moved the
Review comment on this diff context:

```python
class CohereEmbeddingSettings(EmbeddingSettings, total=False):
    """Settings used for a Cohere embedding model request."""

    # ALL FIELDS MUST BE `cohere_` PREFIXED SO YOU CAN MERGE THEM WITH OTHER MODELS.
```
If this is the case, should we just make `cohere` a top-level field? Are there any fields that are shared between providers? I guess this isn't that important.
@dmontagu I'd prefer for it to be consistent with `ModelSettings`, where we took this prefix route.
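To illustrate the design choice (the field names below are made up for the example): because every provider prefixes its fields, settings dicts for different providers can be merged without key collisions, matching how `ModelSettings` merging works:

```python
# Hypothetical field names, for illustration only: provider-prefixed keys
# let per-provider settings coexist in one merged dict.
cohere_settings = {"cohere_input_type": "search_document"}
openai_settings = {"openai_dimensions": 256}

merged = {**cohere_settings, **openai_settings}
assert set(merged) == {"cohere_input_type", "openai_dimensions"}
```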
Do we need to include any utilities for actually doing queries on embeddings? I guess it's mostly going to be vector manipulation like numpy stuff, or otherwise integrating with vector DBs, and maybe we don't want that in the library (yet?), but it seems like an obvious need for anyone working with these. I'm fine if we just start with the API wrappers, but it feels like something that could definitely merit inclusion at some point.
Speaking as a typical user, I can say that I do not need (and would not expect) Pydantic AI to do the similarity calculations or any of the vector stuff. That would typically happen at the vector DB level, or in custom code for special needs.
Same here. |
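For anyone who does need similarity in their own code, the typical calculation is a few lines of numpy (a sketch, not part of this PR):

```python
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    va, vb = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```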
This might already be covered, but one of the really annoying things is how all the low-level APIs return data in different formats (though it's understandable there). LiteLLM helps the user by translating everything into the OpenAI format; Langchain doesn't, and everyone doing an embedding has to write code to work out where to get the floats from. It would be good if Pydantic AI avoided this by converting to a common format. It's great being able to call models of any name and provider, but if the user then also has to fiddle with formats on the other end, some of the utility is lost.
@stuaxo We return a
@ggozad @gvanrossum Makes sense, appreciate the feedback.
Started this in collaboration with @DouweM. I'd like to ensure consensus on the API design before adding the remaining providers, Logfire instrumentation, docs, and tests.
This is inspired by the approach in haiku.rag, though we adapted it to be a bit closer to how the `Agent` APIs are used (and how you can override the model, settings, etc.).

Closes #58
Example:
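A minimal sketch, mirroring the `Embedder` usage shown elsewhere in this thread:

```python
import asyncio

from pydantic_ai.embeddings import Embedder

embedder = Embedder("openai:text-embedding-3-large")


async def main():
    # embed() returns the embedding result for a single input string.
    result = await embedder.embed("Hello, world!")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```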
To do:
- `Embedder.embed_sync`
- `count_tokens`
- `max_input_tokens`
- `logfire.instrument_pydantic_ai()`: Instrument Pydantic AI embedders and embedding models (logfire#1575)
- `ModelAPIError`