
Commit e92a73d

Merge pull request #14831 from otaviofbrito/minor/fix-context-caching-vertex
Vertex AI Context Caching: use Vertex AI API v1 instead of v1beta1 and accept 'cachedContent' param
2 parents 6964b5a + 2e7d9d1 commit e92a73d

3 files changed: +83 -16 lines


docs/my-website/docs/providers/vertex.md

Lines changed: 71 additions & 4 deletions
@@ -815,6 +815,77 @@ Use Vertex AI context caching is supported by calling provider api directly. (Un
 
 [**Go straight to provider**](../pass_through/vertex_ai.md#context-caching)
 
+#### 1. Create the Cache
+
+First, create the cache by sending a `POST` request to the `cachedContents` endpoint via the LiteLLM proxy.
+
+<Tabs>
+<TabItem value="proxy" label="PROXY">
+
+```bash
+curl http://0.0.0.0:4000/vertex_ai/v1/projects/{project_id}/locations/{location}/cachedContents \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $LITELLM_KEY" \
+  -d '{
+    "model": "projects/{project_id}/locations/{location}/publishers/google/models/gemini-2.5-flash",
+    "displayName": "example_cache",
+    "contents": [{
+      "role": "user",
+      "parts": [{
+        "text": ".... a long book to be cached"
+      }]
+    }]
+  }'
+```
+
+</TabItem>
+</Tabs>
+
+#### 2. Get the Cache Name from the Response
+
+Vertex AI will return a response containing the `name` of the cached content. This name is the identifier for your cached data.
+
+```json
+{
+  "name": "projects/12341234/locations/{location}/cachedContents/123123123123123",
+  "model": "projects/{project_id}/locations/{location}/publishers/google/models/gemini-2.5-flash",
+  "createTime": "2025-09-23T19:13:50.674976Z",
+  "updateTime": "2025-09-23T19:13:50.674976Z",
+  "expireTime": "2025-09-23T20:13:50.655988Z",
+  "displayName": "example_cache",
+  "usageMetadata": {
+    "totalTokenCount": 1246,
+    "textCount": 5132
+  }
+}
+```
+
+#### 3. Use the Cached Content
+
+Use the `name` from the response as `cachedContent` or `cached_content` in subsequent API calls to reuse the cached information. This is passed in the body of your request to `/chat/completions`.
+
+<Tabs>
+<TabItem value="proxy" label="PROXY">
+
+```bash
+curl http://0.0.0.0:4000/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $LITELLM_KEY" \
+  -d '{
+    "cachedContent": "projects/545201925769/locations/us-central1/cachedContents/4511135542628319232",
+    "model": "gemini-2.5-flash",
+    "messages": [
+      {
+        "role": "user",
+        "content": "what is the book about?"
+      }
+    ]
+  }'
+```
+
+</TabItem>
+</Tabs>
 
 ## Pre-requisites
 * `pip install google-cloud-aiplatform` (pre-installed on proxy docker image)
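
For anyone scripting step 1 rather than running curl by hand, here is a minimal Python sketch of the same cache-creation call against the LiteLLM proxy. It assumes the `requests` package, a proxy at `http://0.0.0.0:4000`, and a `LITELLM_KEY` environment variable; the project and location values are placeholders, not values from this PR.

```python
import os

import requests

# Placeholders: substitute your own project, location, and proxy address.
BASE_URL = "http://0.0.0.0:4000"
PROJECT_ID = "my-project"
LOCATION = "us-central1"

resp = requests.post(
    f"{BASE_URL}/vertex_ai/v1/projects/{PROJECT_ID}/locations/{LOCATION}/cachedContents",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['LITELLM_KEY']}",
    },
    json={
        "model": f"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/gemini-2.5-flash",
        "displayName": "example_cache",
        "contents": [
            {"role": "user", "parts": [{"text": ".... a long book to be cached"}]}
        ],
    },
)
resp.raise_for_status()

# The returned `name` is what you later pass as cachedContent / cached_content.
cache_name = resp.json()["name"]
print(cache_name)
```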
@@ -2724,7 +2795,3 @@ Once that's done, when you deploy the new container in the Google Cloud Run serv
 
 
 s/o @[Darien Kindlund](https://www.linkedin.com/in/kindlund/) for this tutorial
-
-
-
-
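
To exercise step 3 of the new section from Python, the OpenAI SDK pointed at the LiteLLM proxy can attach the cache reference through `extra_body`, which merges extra fields into the request JSON just like the curl body above. This is a sketch under that assumption; the cache name and API key below are placeholders.

```python
from openai import OpenAI

# Sketch only: assumes an OpenAI-compatible LiteLLM proxy at this address.
client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-placeholder")

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "what is the book about?"}],
    # extra_body merges additional fields into the request body,
    # so cachedContent reaches the proxy exactly as in the curl example.
    extra_body={
        "cachedContent": "projects/{project_id}/locations/{location}/cachedContents/{cache_id}"
    },
)
print(response.choices[0].message.content)
```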

litellm/llms/vertex_ai/gemini/transformation.py

Lines changed: 10 additions & 4 deletions
@@ -537,7 +537,11 @@ def sync_transform_request_body(
             logging_obj=logging_obj,
         )
     else:  # [TODO] implement context caching for gemini as well
-        cached_content = optional_params.pop("cached_content", None)
+        cached_content = None
+        if "cached_content" in optional_params:
+            cached_content = optional_params.pop("cached_content")
+        elif "cachedContent" in optional_params:
+            cached_content = optional_params.pop("cachedContent")
 
     return _transform_request_body(
         messages=messages,
@@ -584,7 +588,11 @@ async def async_transform_request_body(
             logging_obj=logging_obj,
         )
     else:  # [TODO] implement context caching for gemini as well
-        cached_content = optional_params.pop("cached_content", None)
+        cached_content = None
+        if "cached_content" in optional_params:
+            cached_content = optional_params.pop("cached_content")
+        elif "cachedContent" in optional_params:
+            cached_content = optional_params.pop("cachedContent")
 
     return _transform_request_body(
         messages=messages,
@@ -649,5 +657,3 @@ def _transform_system_message(
         return SystemInstructions(parts=system_content_blocks), messages
 
     return None, messages
-
-
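
The behaviour introduced above can be illustrated in isolation. `_pop_cached_content` below is a hypothetical standalone helper, not part of the LiteLLM codebase; it simply mirrors the new pop logic so the two accepted spellings are easy to see.

```python
from typing import Optional


def _pop_cached_content(optional_params: dict) -> Optional[str]:
    """Hypothetical helper mirroring the new logic: accept either the
    snake_case or the camelCase spelling and remove it from the params."""
    if "cached_content" in optional_params:
        return optional_params.pop("cached_content")
    if "cachedContent" in optional_params:
        return optional_params.pop("cachedContent")
    return None


# Both spellings resolve to the same cache reference.
params = {"cachedContent": "projects/123/locations/us-central1/cachedContents/456", "temperature": 0.2}
assert _pop_cached_content(params) == "projects/123/locations/us-central1/cachedContents/456"
assert "cachedContent" not in params  # popped, so not forwarded as a model parameter

params = {"cached_content": "projects/123/locations/us-central1/cachedContents/456"}
assert _pop_cached_content(params) == "projects/123/locations/us-central1/cachedContents/456"
```

Popping (rather than just reading) the key matters because whatever remains in `optional_params` is treated as ordinary request parameters downstream.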

litellm/llms/vertex_ai/vertex_llm_base.py

Lines changed: 2 additions & 8 deletions
@@ -271,17 +271,11 @@ def _ensure_access_token(
 
     def is_using_v1beta1_features(self, optional_params: dict) -> bool:
         """
-        VertexAI only supports ContextCaching on v1beta1
-
         use this helper to decide if request should be sent to v1 or v1beta1
 
-        Returns v1beta1 if context caching is enabled
-        Returns v1 in all other cases
+        Returns true if any beta feature is enabled
+        Returns false in all other cases
         """
-        if "cached_content" in optional_params:
-            return True
-        if "CachedContent" in optional_params:
-            return True
         return False
 
     def _check_custom_proxy(
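
Finally, a hypothetical illustration (not LiteLLM source) of what the change to `is_using_v1beta1_features` means for routing: with the `cached_content` / `CachedContent` checks removed, a request carrying a cache reference is no longer pushed onto `v1beta1`.

```python
# Hypothetical illustration only; class and helper names are made up.

class VertexRouterSketch:
    def is_using_v1beta1_features(self, optional_params: dict) -> bool:
        # Mirrors the method after this commit: no beta-only feature is
        # currently detected, so this always returns False.
        return False

    def select_api_version(self, optional_params: dict) -> str:
        # Hypothetical helper for illustration only.
        return "v1beta1" if self.is_using_v1beta1_features(optional_params) else "v1"


router = VertexRouterSketch()
params = {"cached_content": "projects/123/locations/us-central1/cachedContents/456"}
# Before this commit the cached_content key forced v1beta1; now it routes to v1.
print(router.select_api_version(params))  # -> "v1"
```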
