Retry Causes Concatenation #2210
Description
DISCLAIMER: I personally validated that the unexpected and unwanted behavior exists, but I used an LLM for the root-cause analysis below, which I have not yet validated.
- This is actually a bug report.
- I am not getting good LLM Results
- I have tried asking for help in the community on discord or discussions and have not received a response.
- I have tried searching the documentation and have not found an answer.
What Model are you using?
- gpt-3.5-turbo
- gpt-4-turbo
- gpt-4
- Other (please specify)
Gemini 2.5 Flash via google-genai SDK (GENAI_STRUCTURED_OUTPUTS mode)
Describe the bug
When a Gemini response is truncated due to hitting the output token limit (`finish_reason=MAX_TOKENS`), the `GENAI_STRUCTURED_OUTPUTS` code path does not detect this. Instead of raising `IncompleteOutputException` (non-retryable), instructor tries to parse the truncated JSON, gets a `ValidationError` (retryable), and enters the retry loop. Each retry appends the full truncated output to the prompt via `reask_genai_structured_outputs`, causing exponential prompt growth:
| Attempt | Prompt tokens | Output tokens | finish_reason |
|---|---|---|---|
| 1 | 15 | 1 | MAX_TOKENS |
| 2 | 450 | 4 | MAX_TOKENS |
| 3 | 1,227 | 1 | MAX_TOKENS |
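The growth can be sketched with a toy simulation. This is illustrative only, not instructor's actual reask logic, and the token counts are hypothetical; it uses a deliberately simplified linear model, while the growth observed above is even steeper, possibly because the reask text itself quotes the growing payload:

```python
# Toy model of the failure mode (not instructor code): each retry appends
# the previous truncated output plus reask framing to the conversation, so
# the prompt re-sent to the model keeps growing instead of staying constant.
def simulate_prompts(base_prompt: int, reask_overhead: int, attempts: int) -> list[int]:
    prompt = base_prompt
    history = []
    for _ in range(attempts):
        history.append(prompt)
        # the next attempt re-sends everything so far plus the reask payload
        prompt = prompt + reask_overhead
    return history

# Hypothetical numbers: 15-token prompt, ~400 extra tokens appended per reask.
print(simulate_prompts(15, 400, 3))  # → [15, 415, 815]
```

With a non-retryable exception raised on the first truncation, the list would stop at one entry instead of climbing on every attempt.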
Other providers already check for truncation before parsing: OpenAI checks `finish_reason == "length"` and Anthropic checks `stop_reason == "max_tokens"`, both in `function_calls.py`. The `parse_genai_structured_outputs` method is missing this check.
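The guard the other provider paths apply can be sketched in isolation (the names and the exception class here are illustrative stand-ins, not instructor's actual API):

```python
# Sketch of the pre-parse truncation guard (hypothetical names).
class IncompleteOutputException(Exception):
    """Output hit the token limit; retrying with the same limit cannot succeed."""

# Per-provider markers for "output was cut off at the token limit".
TRUNCATION_MARKERS = {
    "openai": "length",         # choice.finish_reason
    "anthropic": "max_tokens",  # message.stop_reason
    "genai": "MAX_TOKENS",      # candidate.finish_reason
}

def guard_truncation(provider: str, finish_reason: str) -> None:
    # Raise before JSON parsing, so a truncated payload is never mistaken
    # for a retryable ValidationError.
    if finish_reason == TRUNCATION_MARKERS[provider]:
        raise IncompleteOutputException(f"{provider}: output truncated")

guard_truncation("openai", "stop")  # normal completion: no exception
try:
    guard_truncation("genai", "MAX_TOKENS")
except IncompleteOutputException as e:
    print(e)  # → genai: output truncated
```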
In production with the default 65,536 max-output-token limit, this burns ~590K output tokens and ~920K prompt tokens per failure when the model happens to generate long string content (e.g. repetitive text inside a `{"text": "..."}` schema).
To Reproduce
```python
import os

import instructor
from instructor.core.exceptions import InstructorRetryException
from google import genai
from pydantic import BaseModel

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
structured_client = instructor.from_genai(
    client, mode=instructor.Mode.GENAI_STRUCTURED_OUTPUTS
)


class Response(BaseModel):
    text: str


try:
    result = structured_client.chat.completions.create(
        model="gemini-2.5-flash",
        response_model=Response,
        messages=[
            {"role": "user", "content": "List all prime numbers between 1 and 500."}
        ],
        max_retries=3,
        generation_config={"max_tokens": 5},  # force truncation
    )
except InstructorRetryException as e:
    for attempt in e.failed_attempts:
        resp = attempt.completion
        candidate = resp.candidates[0]
        usage = resp.usage_metadata
        print(
            f"Attempt {attempt.attempt_number}: "
            f"finish_reason={candidate.finish_reason}, "
            f"prompt_tokens={usage.prompt_token_count}, "
            f"exception={type(attempt.exception).__name__}"
        )
```

Output:

```
Attempt 1: finish_reason=FinishReason.MAX_TOKENS, prompt_tokens=15, exception=ValidationError
Attempt 2: finish_reason=FinishReason.MAX_TOKENS, prompt_tokens=450, exception=ValidationError
Attempt 3: finish_reason=FinishReason.MAX_TOKENS, prompt_tokens=1227, exception=ValidationError
```
Expected behavior
`parse_genai_structured_outputs` should check `finish_reason` before parsing and raise `IncompleteOutputException` when the response was truncated, matching the behavior of all other provider paths. Suggested fix:
```python
# In instructor/processing/function_calls.py, parse_genai_structured_outputs:
@classmethod
def parse_genai_structured_outputs(cls, completion, validation_context=None, strict=None):
    from google.genai import types

    if (
        hasattr(completion, "candidates")
        and completion.candidates
        and completion.candidates[0].finish_reason == types.FinishReason.MAX_TOKENS
    ):
        raise IncompleteOutputException(last_completion=completion)

    return cls.model_validate_json(
        completion.text, context=validation_context, strict=strict
    )
```

Versions

- instructor==1.14.4
- google-genai==1.46.0
- Python 3.12