
Respect Retry-After header in OpenAI retry decorator#20813

Open
debu-sinha wants to merge 3 commits into run-llama:main from debu-sinha:fix/retry-after-header

Conversation

@debu-sinha
Contributor

Fixes #15649

Description

The OpenAI retry decorator for both LLM and embeddings integrations currently uses a fixed exponential backoff for all retryable errors. When the server responds with a 429 status and includes a Retry-After header, the client should wait the server-specified duration instead of guessing with exponential backoff.

Without this fix, the retry loop can either wait too long (wasting time when the server says "try again in 1 second" but backoff says "wait 30 seconds") or not long enough (retrying before the rate limit window resets, burning through all retries uselessly).

Changes

New: _WaitRetryAfter wait strategy (added to both embeddings and LLM utils.py):

  • Subclasses tenacity's wait_base to integrate cleanly with the existing retry stack
  • On RateLimitError: extracts Retry-After from response.headers (httpx.Headers, case-insensitive)
  • Caps the wait at 120 seconds to prevent a misbehaving server from stalling indefinitely
  • Falls back to the existing exponential backoff for all other errors, missing headers, or unparseable values

New: _parse_retry_after helper:

  • Extracts and validates the Retry-After header value
  • Handles edge cases: missing response, missing headers, non-numeric values, negative values, empty strings

No breaking changes: The function signature of create_retry_decorator() is unchanged. Existing behavior is preserved for non-RateLimitError exceptions and when the Retry-After header is absent.

Files Changed

| File | Change |
| --- | --- |
| llama-index-integrations/embeddings/.../openai/utils.py | Added _WaitRetryAfter, _parse_retry_after, updated create_retry_decorator |
| llama-index-integrations/llms/.../openai/utils.py | Same changes (LLM counterpart) |
| llama-index-integrations/embeddings/.../tests/test_retry_after.py | 17 new tests |
| llama-index-integrations/llms/.../tests/test_retry_after.py | 18 new tests |

Testing

35 new unit and integration tests covering:

  • Header parsing: integer, float, zero, missing, non-numeric (HTTP-date), negative, empty, case-insensitive, no response object
  • Wait strategy: uses header value, caps at 120s maximum, falls back for missing header, falls back for non-RateLimitError, falls back for unparseable header, falls back when outcome is None
  • Integration: full decorator stack respects Retry-After, retries exhaust at max_retries, non-RateLimitError still retries with exponential backoff
```shell
# Embeddings: 31 passed (6 existing + 17 new + 8 utils)
cd llama-index-integrations/embeddings/llama-index-embeddings-openai
pytest tests/ -v

# LLM: 41 passed (18 new + 1 existing retry + 22 existing utils)
cd llama-index-integrations/llms/llama-index-llms-openai
pytest tests/test_retry_after.py tests/test_openai_utils.py tests/test_openai.py::test_completion_model_with_retry -v
```

All existing tests pass unchanged.

Context

This is a follow-up to #14801 / PR #20712 (token-bucket rate limiter) which added proactive rate limiting. This PR addresses the reactive side: when a 429 does occur, the client now waits the exact amount of time the server specifies rather than guessing.

Azure OpenAI inherits from OpenAI (class AzureOpenAI(OpenAI)) so this fix applies to Azure automatically.

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Feb 27, 2026
Member

@AstraBert AstraBert left a comment


Looks good; as usual, you need to bump the versions of the integrations you have modified in order for them to be published

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 2, 2026
The retry decorator for both OpenAI LLM and embeddings integrations
previously used a fixed exponential backoff for all retryable errors,
including RateLimitError. When the server sends a Retry-After header,
the client should wait the specified duration instead of guessing with
exponential backoff.

This adds a custom tenacity wait strategy (_WaitRetryAfter) that
extracts the Retry-After header from RateLimitError responses and
uses it as the sleep duration, capped at 120 seconds. For all other
errors or when the header is missing, it falls back to the existing
exponential backoff behavior.

Fixes run-llama#15649

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha debu-sinha force-pushed the fix/retry-after-header branch from 77b500b to 571f4e8 Compare March 2, 2026 17:16
@debu-sinha
Contributor Author

Good call, bumped both packages:

  • llama-index-embeddings-openai: 0.5.1 -> 0.5.2
  • llama-index-llms-openai: 0.6.23 -> 0.6.24

Also rebased on latest main.



Development

Successfully merging this pull request may close these issues.

[Bug]: Indexing hit rate limit error and keeps endless retrying
