Commit 236e349

Merge pull request #346 from Portkey-AI/chore/gemini-thinking
update docs for gemini thinking
2 parents ad42bb8 + 078af20 commit 236e349

File tree: 2 files changed (+48 additions, −5 deletions)

integrations/llms/gemini.mdx

Lines changed: 47 additions & 4 deletions
@@ -238,8 +238,19 @@ Grounding is invoked by passing the `google_search` tool (for newer models like
 If you mix regular tools with grounding tools, vertex might throw an error saying only one tool can be used at a time.
 </Warning>
 
-## thinking models
 
+## Extended Thinking (Reasoning Models) (Beta)
+
+<Note>
+The assistant's thinking response is returned in the `response_chunk.choices[0].delta.content_blocks` array, not the `response.choices[0].message.content` string.
+</Note>
+
+Models like `gemini-2.5-flash-preview-04-17` support [extended thinking](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude#claude-3-7-sonnet).
+This is similar to OpenAI's reasoning models, but here you also get the model's reasoning as it processes the request.
+
+Note that you will have to set [`strict_open_ai_compliance=False`](/product/ai-gateway/strict-open-ai-compliance) in the headers to use this feature.
+
+### Single turn conversation
 <CodeGroup>
 ```py Python
 from portkey_ai import Portkey
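The hunk above says `strict_open_ai_compliance=False` must be set "in the headers". As a quick illustration, here is a hedged sketch of what that header might look like on a raw HTTP request; the header name is my assumption based on Portkey's `x-portkey-*` naming convention and should be verified against the linked strict_open_ai_compliance docs page.

```python
# Hedged sketch: headers for a raw HTTP call through the Portkey gateway.
# The strict-compliance header spelling below is an assumption following the
# x-portkey-* convention, not a value confirmed by this diff.
headers = {
    "x-portkey-api-key": "YOUR_PORTKEY_API_KEY",            # placeholder
    "x-portkey-strict-open-ai-compliance": "false",          # assumed spelling
}

print(headers["x-portkey-strict-open-ai-compliance"])
```

With the SDKs shown in this diff, the same flag is passed as the `strict_open_ai_compliance` client option instead of a hand-built header.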
@@ -273,6 +284,16 @@ If you mix regular tools with grounding tools, vertex might throw an error sayin
     ]
 )
 print(response)
+# in case of streaming responses you'd have to parse the response_chunk.choices[0].delta.content_blocks array
+# response = portkey.chat.completions.create(
+#     ...same config as above but with stream=True
+# )
+# for chunk in response:
+#     if chunk.choices[0].delta:
+#         content_blocks = chunk.choices[0].delta.get("content_blocks")
+#         if content_blocks is not None:
+#             for content_block in content_blocks:
+#                 print(content_block)
 ```
 ```ts NodeJS
 import Portkey from 'portkey-ai';
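The commented streaming logic in the Python hunk above can be sketched as a small standalone parser. This is a minimal sketch under an assumption: the fake chunks below are plain dicts standing in for the SDK's typed response objects, and `make_fake_stream` is a hypothetical stand-in for a real streamed response.

```python
# Minimal sketch of collecting content blocks from a streamed response.
# The chunk dicts below are assumed shapes mirroring the
# response_chunk.choices[0].delta.content_blocks structure described in the docs.

def make_fake_stream():
    """Hypothetical stand-in for a streamed chat.completions response."""
    yield {"choices": [{"delta": {"content_blocks": [
        {"type": "thinking", "thinking": "Let me reason about this..."}]}}]}
    yield {"choices": [{"delta": {"content_blocks": [
        {"type": "text", "text": "The answer is 42."}]}}]}
    yield {"choices": [{"delta": {}}]}  # a final chunk may carry no blocks

def collect_content_blocks(stream):
    """Gather every content block emitted across all chunks."""
    blocks = []
    for chunk in stream:
        delta = chunk["choices"][0]["delta"]
        content_blocks = delta.get("content_blocks")
        if content_blocks is not None:
            blocks.extend(content_blocks)
    return blocks

blocks = collect_content_blocks(make_fake_stream())
print([b["type"] for b in blocks])  # → ['thinking', 'text']
```

The same shape applies to the NodeJS snippets that follow; only the optional-chaining syntax differs.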
@@ -307,7 +328,18 @@ If you mix regular tools with grounding tools, vertex might throw an error sayin
     ]
   });
   console.log(response);
-
+  // in case of streaming responses you'd have to parse the response_chunk.choices[0].delta.content_blocks array
+  // const response = await portkey.chat.completions.create({
+  //   ...same config as above but with stream: true
+  // });
+  // for await (const chunk of response) {
+  //   if (chunk.choices[0].delta?.content_blocks) {
+  //     for (const contentBlock of chunk.choices[0].delta.content_blocks) {
+  //       console.log(contentBlock);
+  //     }
+  //   }
+  // }
+}
 // Call the function
 getChatCompletionFunctions();
 ```
@@ -349,6 +381,17 @@ If you mix regular tools with grounding tools, vertex might throw an error sayin
   });
 
   console.log(response)
+  // in case of streaming responses you'd have to parse the response_chunk.choices[0].delta.content_blocks array
+  // const response = await openai.chat.completions.create({
+  //   ...same config as above but with stream: true
+  // });
+  // for await (const chunk of response) {
+  //   if (chunk.choices[0].delta?.content_blocks) {
+  //     for (const contentBlock of chunk.choices[0].delta.content_blocks) {
+  //       console.log(contentBlock);
+  //     }
+  //   }
+  // }
 }
 await getChatCompletionFunctions();
 ```
@@ -421,10 +464,10 @@ If you mix regular tools with grounding tools, vertex might throw an error sayin
 </CodeGroup>
 
 <Note>
-To disable thinking for gemini models like `google.gemini-2.5-flash-preview-04-17`, you are required to explicitly set `budget_tokens` to `0` and `type` to `disabled`.
+To disable thinking for gemini models like `gemini-2.5-flash-preview-04-17`, you are required to explicitly set `budget_tokens` to `0`.
 ```json
 "thinking": {
-    "type": "disabled",
+    "type": "enabled",
     "budget_tokens": 0
 }
 ```
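The corrected note in the hunk above is a little counter-intuitive: thinking is disabled by keeping `"type": "enabled"` and zeroing the budget. A hedged sketch of a request payload carrying that fragment — the `model` and `messages` values here are illustrative placeholders; only the `thinking` object mirrors the JSON in the diff.

```python
# Sketch of a request body that disables thinking for a Gemini model.
# Per the updated docs, "type" stays "enabled"; budget_tokens: 0 does the
# disabling. Model name and messages are placeholders for illustration.
payload = {
    "model": "gemini-2.5-flash-preview-04-17",
    "messages": [{"role": "user", "content": "Hello"}],
    "thinking": {
        "type": "enabled",   # not "disabled" -- the zero budget disables it
        "budget_tokens": 0,
    },
}

print(payload["thinking"])
```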

integrations/llms/vertex-ai.mdx

Lines changed: 1 addition & 1 deletion
@@ -265,7 +265,7 @@ curl --location 'https://api.portkey.ai/v1/chat/completions' \
 <Note>
 The assistant's thinking response is returned in the `response_chunk.choices[0].delta.content_blocks` array, not the `response.choices[0].message.content` string.
 
-Gemini models do no return their chain-of-thought-messages, so content_blocks are not required for Gemini models.
+Gemini models do not support plugging the reasoning back into multi-turn conversations, so you don't need to send the thinking message back to the model.
 </Note>
 
 Models like `google.gemini-2.5-flash-preview-04-17` and `anthropic.claude-3-7-sonnet@20250219` support [extended thinking](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude#claude-3-7-sonnet).
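Since, per the note above, Gemini does not accept its reasoning back in later turns, the assistant message you append for the next turn only needs the text blocks. A minimal sketch, assuming content blocks shaped like the dicts below (the real SDK types may differ):

```python
# Sketch: build the assistant message for the next turn from content blocks,
# dropping "thinking" blocks since Gemini does not accept them back.
# The block dicts are assumed shapes, not verbatim SDK types.

def assistant_message_for_next_turn(content_blocks):
    """Keep only text blocks when echoing the assistant turn back."""
    text = "".join(
        b.get("text", "") for b in content_blocks if b.get("type") == "text"
    )
    return {"role": "assistant", "content": text}

blocks = [
    {"type": "thinking", "thinking": "step 1... step 2..."},
    {"type": "text", "text": "Final answer."},
]
print(assistant_message_for_next_turn(blocks))
# → {'role': 'assistant', 'content': 'Final answer.'}
```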
