
Respect Retry-After header in OpenAI retry decorator#20813

Open
debu-sinha wants to merge 3 commits into run-llama:main from debu-sinha:fix/retry-after-header

Conversation

@debu-sinha
Contributor

Fixes #15649

Description

The OpenAI retry decorator for both LLM and embeddings integrations currently uses a fixed exponential backoff for all retryable errors. When the server responds with a 429 status and includes a Retry-After header, the client should wait the server-specified duration instead of guessing with exponential backoff.

Without this fix, the retry loop can either wait too long (wasting time when the server says "try again in 1 second" but backoff says "wait 30 seconds") or not long enough (retrying before the rate limit window resets, burning through all retries uselessly).

Changes

New: _WaitRetryAfter wait strategy (added to both embeddings and LLM utils.py):

  • Subclasses tenacity's wait_base to integrate cleanly with the existing retry stack
  • On RateLimitError: extracts Retry-After from response.headers (httpx.Headers, case-insensitive)
  • Caps the wait at 120 seconds to prevent a misbehaving server from stalling indefinitely
  • Falls back to the existing exponential backoff for all other errors, missing headers, or unparseable values

New: _parse_retry_after helper:

  • Extracts and validates the Retry-After header value
  • Handles edge cases: missing response, missing headers, non-numeric values, negative values, empty strings

No breaking changes: The function signature of create_retry_decorator() is unchanged. Existing behavior is preserved for non-RateLimitError exceptions and when the Retry-After header is absent.

Files Changed

| File | Change |
| --- | --- |
| llama-index-integrations/embeddings/.../openai/utils.py | Added _WaitRetryAfter, _parse_retry_after, updated create_retry_decorator |
| llama-index-integrations/llms/.../openai/utils.py | Same changes (LLM counterpart) |
| llama-index-integrations/embeddings/.../tests/test_retry_after.py | 17 new tests |
| llama-index-integrations/llms/.../tests/test_retry_after.py | 18 new tests |

Testing

35 new unit and integration tests covering:

  • Header parsing: integer, float, zero, missing, non-numeric (HTTP-date), negative, empty, case-insensitive, no response object
  • Wait strategy: uses header value, caps at 120s maximum, falls back for missing header, falls back for non-RateLimitError, falls back for unparseable header, falls back when outcome is None
  • Integration: full decorator stack respects Retry-After, retries exhaust at max_retries, non-RateLimitError still retries with exponential backoff
```shell
# Embeddings: 31 passed (6 existing + 17 new + 8 utils)
cd llama-index-integrations/embeddings/llama-index-embeddings-openai
pytest tests/ -v

# LLM: 41 passed (18 new + 1 existing retry + 22 existing utils)
cd llama-index-integrations/llms/llama-index-llms-openai
pytest tests/test_retry_after.py tests/test_openai_utils.py tests/test_openai.py::test_completion_model_with_retry -v
```

All existing tests pass unchanged.

Context

This is a follow-up to #14801 / PR #20712 (token-bucket rate limiter) which added proactive rate limiting. This PR addresses the reactive side: when a 429 does occur, the client now waits the exact amount of time the server specifies rather than guessing.

Azure OpenAI inherits from OpenAI (class AzureOpenAI(OpenAI)) so this fix applies to Azure automatically.

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Feb 27, 2026
Member

@AstraBert AstraBert left a comment


Looks good; as usual, you need to bump the versions of the integrations you have modified in order for them to be published

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 2, 2026
The retry decorator for both OpenAI LLM and embeddings integrations
previously used a fixed exponential backoff for all retryable errors,
including RateLimitError. When the server sends a Retry-After header,
the client should wait the specified duration instead of guessing with
exponential backoff.

This adds a custom tenacity wait strategy (_WaitRetryAfter) that
extracts the Retry-After header from RateLimitError responses and
uses it as the sleep duration, capped at 120 seconds. For all other
errors or when the header is missing, it falls back to the existing
exponential backoff behavior.

Fixes run-llama#15649

Signed-off-by: debu-sinha <debusinha2009@gmail.com>
@debu-sinha debu-sinha force-pushed the fix/retry-after-header branch from 77b500b to 571f4e8 Compare March 2, 2026 17:16
@debu-sinha
Contributor Author

Good call, bumped both packages:

  • llama-index-embeddings-openai: 0.5.1 -> 0.5.2
  • llama-index-llms-openai: 0.6.23 -> 0.6.24

Also rebased on latest main.



Development

Successfully merging this pull request may close these issues.

[Bug]: Indexing hit rate limit error and keeps endless retrying
