# Rate Limit Handling Implementation

This document describes the rate limit handling implementation added to MCP as a Judge to handle `litellm.RateLimitError` with exponential backoff.

## Overview

The implementation uses the popular `tenacity` library to provide robust retry logic with exponential backoff specifically for rate limit errors from LiteLLM. This addresses the issue where OpenAI and other LLM providers return rate limit errors when token limits are exceeded.

## Implementation Details

### Dependencies Added

- **tenacity>=8.0.0**: Popular Python retry library with decorators

### Files Modified

1. **`pyproject.toml`**: Added tenacity dependency
2. **`src/mcp_as_a_judge/llm/llm_client.py`**: Added rate limit handling with retry logic
3. **`tests/test_rate_limit_handling.py`**: Comprehensive tests for rate limit handling
4. **`examples/rate_limit_demo.py`**: Demonstration script

### Key Features

#### Retry Configuration

```python
@retry(
    retry=retry_if_exception_type(litellm.RateLimitError),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=2, min=2, max=120),
    reraise=True,
)
```

- **Max attempts**: 5 in total. tenacity's `stop_after_attempt(5)` counts the initial call as attempt 1, so there are up to 4 retries
- **Base delay**: 2 seconds
- **Max delay**: 120 seconds (2 minutes)
- **Exponential multiplier**: 2.0
- **Jitter**: not included. tenacity's `wait_exponential` is deterministic; substitute `wait_random_exponential` if jitter is needed

#### Delay Pattern

The exponential backoff roughly doubles the delay between attempts:
- Attempt 1: Immediate
- Attempt 2: ~2 seconds delay
- Attempt 3: ~4 seconds delay
- Attempt 4: ~8 seconds delay
- Attempt 5: ~16 seconds delay

Total maximum wait time: ~30 seconds across all retries (each individual wait is capped at 120 seconds).
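
For reference, the schedule above is just doubling arithmetic; a quick illustrative check (tenacity computes the actual waits internally):

```python
# Illustrative only: approximate waits for the 4 retries, starting at the
# 2-second base, doubling each time, clamped to the 120-second cap.
base, cap, retries = 2, 120, 4
delays = [min(base * 2**i, cap) for i in range(retries)]
print(delays, sum(delays))  # [2, 4, 8, 16] 30
```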

### Error Handling

#### Rate Limit Errors
- **Specific handling**: `litellm.RateLimitError` is caught and retried with exponential backoff
- **Logging**: Each retry attempt is logged with timing information
- **Final failure**: After all retries are exhausted, a clear error message is provided

#### Other Errors
- **No retry**: Non-rate-limit errors (e.g., authentication, validation) fail immediately
- **Preserved behavior**: Existing error handling for other exception types is unchanged
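
From the caller's side, the split looks roughly like this (a sketch; exactly which exception type `generate_text` surfaces depends on how it wraps errors):

```python
import logging

import litellm

logger = logging.getLogger(__name__)


async def call_with_handling(client, messages):
    try:
        return await client.generate_text(messages)
    except litellm.RateLimitError:
        # Reached only after all retry attempts are exhausted
        # (reraise=True propagates the original exception).
        logger.error("Rate limit exceeded after retries")
        raise
    except Exception:
        # Non-rate-limit errors were never retried; they surface on the
        # first failure.
        logger.error("LLM generation failed")
        raise
```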

### Code Structure

#### New Method: `_generate_text_with_retry`

```python
@retry(...)
async def _generate_text_with_retry(self, completion_params: dict[str, Any]) -> Any:
    """Generate text with retry logic for rate limit errors."""
```

This method is decorated with tenacity retry logic and handles the actual LiteLLM completion call.
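
Put together, the wrapped call looks roughly like this (a sketch; the real method body in `llm_client.py` may differ):

```python
from typing import Any

import litellm
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)


class LLMClient:
    @retry(
        retry=retry_if_exception_type(litellm.RateLimitError),
        stop=stop_after_attempt(5),
        wait=wait_exponential(multiplier=2, min=2, max=120),
        reraise=True,
    )
    async def _generate_text_with_retry(
        self, completion_params: dict[str, Any]
    ) -> Any:
        """Generate text with retry logic for rate limit errors."""
        # tenacity re-invokes this coroutine on litellm.RateLimitError;
        # any other exception propagates immediately.
        return await litellm.acompletion(**completion_params)
```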

#### Modified Method: `generate_text`

The main `generate_text` method now:
1. Builds completion parameters
2. Calls `_generate_text_with_retry` for the actual LLM call
3. Handles response parsing
4. Provides specific error messages for rate limit vs. other errors (see the sketch after this list)
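
A simplified sketch of that flow, shown standalone for brevity (parameter building and response parsing here are assumptions, not the exact source):

```python
import litellm


async def generate_text(self, messages: list[dict[str, str]]) -> str:
    # 1. Build completion parameters.
    completion_params = {"model": self.config.model_name, "messages": messages}
    try:
        # 2. Delegate the actual LLM call to the retry-wrapped method.
        response = await self._generate_text_with_retry(completion_params)
    except litellm.RateLimitError as err:
        # 4a. Rate limit errors reach this point only after retries
        #     are exhausted.
        raise RuntimeError(f"Rate limit exceeded after retries: {err}") from err
    except Exception as err:
        # 4b. Everything else failed on the first attempt.
        raise RuntimeError(f"LLM generation failed: {err}") from err
    # 3. Parse the response.
    return response.choices[0].message.content
```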

## Usage Examples

### Automatic Retry on Rate Limits

```python
from mcp_as_a_judge.llm.llm_client import LLMClient
from mcp_as_a_judge.llm.llm_integration import LLMConfig, LLMVendor

config = LLMConfig(
    api_key="your-api-key",
    model_name="gpt-4",
    vendor=LLMVendor.OPENAI,
)

client = LLMClient(config)
messages = [{"role": "user", "content": "Hello!"}]

# This will automatically retry on rate limit errors
try:
    response = await client.generate_text(messages)
    print(f"Success: {response}")
except Exception as e:
    print(f"Failed after retries: {e}")
```

### Error Types

#### Rate Limit Error (with retries)
```
ERROR: Rate limit exceeded after retries: litellm.RateLimitError: RateLimitError: OpenAIException - Request too large for gpt-4.1...
```

#### Other Errors (immediate failure)
```
ERROR: LLM generation failed: Invalid API key
```

## Testing

### Test Coverage

The implementation includes comprehensive tests:

1. **Successful retry**: Rate limit errors followed by success (sketched below)
2. **Retry exhaustion**: All retries fail with rate limit errors
3. **Non-retryable errors**: Other errors fail immediately without retries
4. **Successful generation**: Normal operation without retries
5. **Timing verification**: Exponential backoff timing validation
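
A sketch of the first case, mocking the LiteLLM call so it fails twice and then succeeds (fixture names are hypothetical, and the `RateLimitError` constructor arguments vary across litellm versions):

```python
from unittest.mock import AsyncMock

import litellm
import pytest


@pytest.mark.asyncio
async def test_retries_then_succeeds(client, ok_response, monkeypatch):
    # `client` and `ok_response` are hypothetical fixtures: an LLMClient
    # and a canned successful completion.
    rate_limit = litellm.RateLimitError(
        message="rate limited", llm_provider="openai", model="gpt-4"
    )
    mock_call = AsyncMock(side_effect=[rate_limit, rate_limit, ok_response])
    monkeypatch.setattr(litellm, "acompletion", mock_call)

    # With the real wait policy this sleeps a few seconds; tests can
    # also patch the wait to zero to keep the suite fast.
    response = await client._generate_text_with_retry({"messages": []})

    assert response is ok_response
    assert mock_call.await_count == 3
```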

### Running Tests

```bash
# Run rate limit specific tests
uv run pytest tests/test_rate_limit_handling.py -v

# Run all LLM-related tests
uv run pytest tests/ -k "llm" --tb=short

# Run the demo
uv run python examples/rate_limit_demo.py
```

## Benefits

1. **Resilience**: Automatic recovery from temporary rate limit issues
2. **User Experience**: Reduces failed requests due to rate limiting
3. **Efficiency**: Exponential backoff prevents overwhelming the API
4. **Transparency**: Clear logging and error messages
5. **Selective**: Only retries appropriate errors, fails fast on others

## Configuration

The retry behavior is currently hardcoded, but it could be made configurable (see the sketch after this list) by:

1. Adding retry settings to `LLMConfig`
2. Passing configuration to the retry decorator
3. Supporting environment variables for retry tuning
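
One possible shape (names here are hypothetical, not the current `LLMConfig`):

```python
from dataclasses import dataclass

import litellm
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)


@dataclass
class RetrySettings:
    # Hypothetical settings object; could live on LLMConfig or be
    # populated from environment variables.
    max_attempts: int = 5
    base_delay: float = 2.0
    max_delay: float = 120.0


def rate_limit_retry(settings: RetrySettings):
    """Build the tenacity decorator from a configured policy."""
    return retry(
        retry=retry_if_exception_type(litellm.RateLimitError),
        stop=stop_after_attempt(settings.max_attempts),
        wait=wait_exponential(
            multiplier=settings.base_delay,
            min=settings.base_delay,
            max=settings.max_delay,
        ),
        reraise=True,
    )
```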

## Monitoring

The implementation provides detailed logging:

- Debug logs for each attempt
- Warning logs for retry attempts with timing
- Error logs for final failures
- Success logs when retries succeed

This allows for monitoring and tuning of the retry behavior in production environments.
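
One way such logs can be wired in is tenacity's `before_sleep` hook (a sketch, not necessarily how `llm_client.py` does it):

```python
import logging

import litellm
from tenacity import (
    before_sleep_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logger = logging.getLogger("mcp_as_a_judge.llm")


@retry(
    retry=retry_if_exception_type(litellm.RateLimitError),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=2, min=2, max=120),
    reraise=True,
    # Emits a WARNING with the upcoming sleep time before each retry.
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
async def call_llm(params: dict) -> object:
    return await litellm.acompletion(**params)
```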