[Platform][Ollama] Add prompt cache #416
Conversation
$result = $agent->call($messages, [
    'prompt_cache_key' => 'chat',
]);

echo $result->getContent().\PHP_EOL;

$secondResult = $agent->call($messages, [
    'prompt_cache_key' => 'chat',
]);

echo $secondResult->getContent().\PHP_EOL;
How can we ensure that it really uses the cache and does not just return the exact same answer twice?
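One way to check this (a sketch only: it assumes the cached and cached_prompt_count metadata keys added in this PR are populated on a cache hit, and that the agent result exposes the same metadata as the platform result) is to assert on the second result's metadata instead of comparing the generated contents:

$secondResult = $agent->call($messages, [
    'prompt_cache_key' => 'chat',
]);

// The metadata, not the content, is the reliable signal of a cache hit.
assert(true === $secondResult->getMetadata()->get('cached'));
assert($secondResult->getMetadata()->get('cached_prompt_count') > 0);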
->arrayNode('ollama')
    ->children()
        ->scalarNode('host_url')->defaultValue('http://127.0.0.1:11434')->end()
        ->scalarNode('cache')->end()
We might end up with the same cache repeated again and again in every platform.
Should we introduce a cache config key at a higher level?
That's a good question. IMHO we should introduce a "root" key for it and allow overriding it per platform. @OskarStark @chr-hertel, any idea?
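For illustration, the configuration tree could look like this (a sketch only; apart from the ollama, host_url and cache nodes already in this PR, the surrounding structure and the $rootNode variable are assumptions):

// Root-level default cache pool, overridable per platform.
$rootNode
    ->children()
        ->scalarNode('cache')->info('Default cache service used by every platform')->end()
        ->arrayNode('ollama')
            ->children()
                ->scalarNode('host_url')->defaultValue('http://127.0.0.1:11434')->end()
                ->scalarNode('cache')->info('Overrides the root-level cache for Ollama only')->end()
            ->end()
        ->end()
    ->end();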
$metadata->add('cached', true);
$metadata->add('prompt_cache_key', $options['prompt_cache_key']);
$metadata->add('cached_prompt_count', $data['prompt_eval_count']);
$metadata->add('cached_completion_count', $data['eval_count']);
Wouldn't it make sense to group this data into a DTO, like there is TokenUsage, and then add that DTO to the metadata, or perhaps even reuse said DTO?
Not convinced about the benefits of using an object here; we're only storing an integer, so I don't see the benefit, to be honest 🤔
@OskarStark @chr-hertel Any thoughts?
I agree, it would be great to have an object like CacheUsage, similar to TokenUsage.
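For illustration, a minimal sketch of such a value object (the CacheUsage name and its fields are assumptions, loosely modelled on the TokenUsage idea):

// Hypothetical value object grouping the cache-related metadata in one entry.
final readonly class CacheUsage
{
    public function __construct(
        public string $promptCacheKey,
        public ?int $cachedPromptCount = null,
        public ?int $cachedCompletionCount = null,
    ) {
    }
}

// It could then replace the separate metadata keys with a single entry:
$metadata->add('cache_usage', new CacheUsage(
    $options['prompt_cache_key'],
    $data['prompt_eval_count'] ?? null,
    $data['eval_count'] ?? null,
));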
$firstCall = $platform->invoke(new Ollama(Ollama::LLAMA_3_2), [
    'messages' => [
        [
            'role' => 'user',
            'content' => 'Say hello world',
        ],
    ],
    'model' => 'llama3.2',
], [
    'prompt_cache_key' => 'foo',
]);

$result = $firstCall->getResult();

$this->assertSame('Hello world', $result->getContent());
$this->assertSame(10, $result->getMetadata()->get('cached_prompt_count'));
$this->assertSame(10, $result->getMetadata()->get('cached_completion_count'));

$secondCall = $platform->invoke(new Ollama(Ollama::LLAMA_3_2), [
    'messages' => [
        [
            'role' => 'user',
            'content' => 'Say hello world',
        ],
    ],
    'model' => 'llama3.2',
], [
    'prompt_cache_key' => 'foo',
]);

$secondResult = $secondCall->getResult();

$this->assertSame('Hello world', $secondResult->getContent());
Suggested change: invoke both calls first, then fetch the results and assert:

$firstCall = $platform->invoke(new Ollama(Ollama::LLAMA_3_2), [
    'messages' => [
        [
            'role' => 'user',
            'content' => 'Say hello world',
        ],
    ],
    'model' => 'llama3.2',
], [
    'prompt_cache_key' => 'foo',
]);

$secondCall = $platform->invoke(new Ollama(Ollama::LLAMA_3_2), [
    'messages' => [
        [
            'role' => 'user',
            'content' => 'Say hello world',
        ],
    ],
    'model' => 'llama3.2',
], [
    'prompt_cache_key' => 'foo',
]);

$firstResult = $firstCall->getResult();
$secondResult = $secondCall->getResult();

$this->assertSame('Hello world', $firstResult->getContent());
$this->assertSame(10, $firstResult->getMetadata()->get('cached_prompt_count'));
$this->assertSame(10, $firstResult->getMetadata()->get('cached_completion_count'));
$this->assertSame('Hello world', $secondResult->getContent());
Let's zoom a bit out here, for two reasons:
Ollama does "context caching" and/or K/V caching: it stores the X latest messages for the model window (or pending tokens to speed up TTFT). It is not a cache that returns the generated response if the request already exists.
Well, because that's the one that I use the most and the easiest to implement first, but we can integrate it for every platform if that's the question; we just need to use the API contract, and both Anthropic and OpenAI already do it natively 🤔 If the question is "could we implement it at the platform layer for every platform without relying on API calls?", that's not a big deal to be honest, and we could easily integrate it 🙂
What do you think about having it as a decorator?
I like the idea of
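For illustration, such a decorator could look roughly like this (a sketch only: the PlatformInterface name and the invoke() signature are assumptions, not the real contract, and a plain PSR-6 pool stands in for the cache):

use Psr\Cache\CacheItemPoolInterface;

// Hypothetical bridge-agnostic decorator: caches results by prompt_cache_key.
final readonly class CachedPlatform implements PlatformInterface
{
    public function __construct(
        private PlatformInterface $inner,
        private CacheItemPoolInterface $cache,
    ) {
    }

    public function invoke(object $model, array|string|object $input, array $options = []): mixed
    {
        // Without a cache key, fall through to the decorated platform.
        if (!isset($options['prompt_cache_key'])) {
            return $this->inner->invoke($model, $input, $options);
        }

        $item = $this->cache->getItem($options['prompt_cache_key']);

        if ($item->isHit()) {
            return $item->get();
        }

        $result = $this->inner->invoke($model, $input, $options);
        $this->cache->save($item->set($result));

        return $result;
    }
}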
if ('ollama' === $type) {
    $arguments = [
        $platform['host_url'],
        new Reference('http_client', ContainerInterface::NULL_ON_INVALID_REFERENCE),
Suggested change: also pass the model catalog reference:

new Reference('http_client', ContainerInterface::NULL_ON_INVALID_REFERENCE),
new Reference('ai.platform.model_catalog.ollama'),
if (\array_key_exists('cache', $platform)) {
    $arguments[] = new Reference($platform['cache'], ContainerInterface::NULL_ON_INVALID_REFERENCE);
}
$arguments is not used
Changes here would also belong in CachedPlatform, so every bridge can benefit from this decorator.
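In that spirit, the wiring could stay bridge-agnostic, e.g. (hypothetical factory and variable names, for illustration only):

// Any bridge's platform gets wrapped by the same decorator.
$ollama = new CachedPlatform(PlatformFactory::create($hostUrl, $httpClient), $cachePool);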
Hi 👋🏻
This PR aims to introduce a caching layer for the Ollama platform (as OpenAI, Anthropic, and others already do).