Feature/retry timeout handling #6

Open
stephane-experis wants to merge 14 commits into IBM:main from stephane-experis:feature/retry-timeout-handling

@stephane-experis

Add Automatic Retry Mechanism with Exponential Backoff

Description

This PR implements automatic retry functionality at the HTTP transport layer to handle transient failures gracefully, improving the reliability of the IBM watsonx Orchestrate ADK.

Problem Solved

  • Transient network failures causing agent operations to fail unnecessarily
  • No automatic recovery from temporary server overload (503 errors)
  • Manual retry logic required in user code for handling timeouts

Implementation Details

  • Added retry_handler.py with configurable retry logic using exponential backoff
  • Integrated retry mechanism into all BaseAPIClient HTTP methods (_get, _post, _put, _patch, _delete)
  • Support for environment variables for runtime configuration
  • Exponential backoff with jitter to prevent the thundering herd problem
  • Proper error classification to distinguish retryable vs non-retryable errors
  • Special handling for rate limits (HTTP 429) with extended backoff
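To illustrate the approach described above, here is a minimal sketch of a retry decorator with exponential backoff and jitter. It is not the code in retry_handler.py: the names `with_retry` and `RetryableError`, the parameter names, and the symmetric jitter formula are all assumptions made for illustration.

```python
import functools
import random
import time


class RetryableError(Exception):
    """Hypothetical marker exception for failures that are worth retrying."""


def with_retry(max_retries=3, interval_s=1.0, multiplier=2.0, jitter=0.2,
               sleep=time.sleep):
    """Retry the wrapped call on RetryableError (illustrative names only).

    The delay before retry n (0-based) is interval_s * multiplier**n,
    scaled by a random factor in [1 - jitter, 1 + jitter]. The symmetric
    jitter is an assumption; the PR does not state its exact formula.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except RetryableError:
                    if attempt == max_retries:
                        raise  # retry budget exhausted: surface the error
                    base = interval_s * multiplier ** attempt
                    sleep(base * (1 + random.uniform(-jitter, jitter)))
        return wrapper
    return decorator
```

With `max_retries=0` the loop runs exactly once and re-raises on failure, which matches the idea of disabling retries. The `sleep` parameter exists only so the delay can be replaced in tests.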

Testing

  • ✅ 48 unit tests covering all retry scenarios
  • ✅ Integration tests with mock server verifying retry behavior
  • ✅ Tested with environment variable configuration and overrides
  • ✅ All existing tests continue to pass

Documentation

  • Added comprehensive configuration guide (docs/RETRY_CONFIGURATION.md)
  • Added comparison with flow node-level retries (docs/RETRY_COMPARISON.md)
  • Updated README with retry functionality section
  • Added example usage in examples/client/retry_configuration_example.py
  • Created .env.example with all retry configuration options

Configuration

All retry behavior is configurable via environment variables:

  • ADK_MAX_RETRIES - Maximum retry attempts (default: 3)
  • ADK_RETRY_INTERVAL - Initial retry interval in milliseconds (default: 1000)
  • ADK_TIMEOUT - Request timeout in seconds (default: 300)
  • ADK_BACKOFF_MULTIPLIER - Exponential backoff multiplier (default: 2.0)
  • ADK_JITTER_PERCENTAGE - Jitter, expressed as a fraction, to add randomness (default: 0.2)
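As a rough sketch of how these variables could be read, the snippet below loads them into a small config object and prints the nominal backoff schedule before jitter. The `RetryConfig` class, its field names, and `base_delays_ms` are hypothetical; only the environment variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical container; defaults mirror the documented ones."""
    max_retries: int = 3
    retry_interval_ms: int = 1000
    timeout_s: int = 300
    backoff_multiplier: float = 2.0
    jitter_percentage: float = 0.2

    @classmethod
    def from_env(cls, env=None):
        # Accept any mapping so the lookup can be exercised without
        # touching the real process environment.
        env = os.environ if env is None else env
        return cls(
            max_retries=int(env.get("ADK_MAX_RETRIES", cls.max_retries)),
            retry_interval_ms=int(env.get("ADK_RETRY_INTERVAL", cls.retry_interval_ms)),
            timeout_s=int(env.get("ADK_TIMEOUT", cls.timeout_s)),
            backoff_multiplier=float(env.get("ADK_BACKOFF_MULTIPLIER", cls.backoff_multiplier)),
            jitter_percentage=float(env.get("ADK_JITTER_PERCENTAGE", cls.jitter_percentage)),
        )

    def base_delays_ms(self):
        """Nominal delay before each retry, ignoring jitter."""
        return [self.retry_interval_ms * self.backoff_multiplier ** n
                for n in range(self.max_retries)]


if __name__ == "__main__":
    print(RetryConfig.from_env().base_delays_ms())
```

Under the documented defaults the nominal delays are 1000 ms, 2000 ms, and 4000 ms; the jitter setting would then perturb each value.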

Retryable Errors

  • ✅ Network timeouts (requests.Timeout)
  • ✅ Connection errors (requests.ConnectionError, ChunkedEncodingError)
  • ✅ Server errors (HTTP 500, 502, 503, 504)
  • ✅ Rate limits (HTTP 429)

Non-Retryable Errors (Fail Fast)

  • ❌ Client errors (HTTP 400, 401, 403, 404)
  • ❌ Authentication failures
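The status-code part of this classification can be sketched as a simple set lookup. The function name `is_retryable_status` is hypothetical, and treating every status code outside the listed retryable set as fail-fast is an assumption; the PR only names 400, 401, 403, and 404 explicitly.

```python
# Status codes the PR description lists as retryable.
RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})


def is_retryable_status(status_code: int) -> bool:
    """Return True if a response with this status should be retried.

    Assumption: anything not in RETRYABLE_STATUS fails fast, which
    covers the listed client errors (400, 401, 403, 404).
    """
    return status_code in RETRYABLE_STATUS
```

Exception-based cases such as requests.Timeout and requests.ConnectionError would need a separate check on the exception type, which is omitted here.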

Backwards Compatibility

  • No breaking changes to existing API
  • Retry mechanism is transparent to existing code
  • Can be disabled by setting ADK_MAX_RETRIES=0

Code Quality

  • Follows repository conventions and code of conduct
  • Uses conventional commit format
  • Maintains existing code patterns and idioms

erifsx and others added 14 commits July 7, 2025 09:28
Signed-off-by: Eric Marcoux <eric.marcoux@ibm.com>
)

Signed-off-by: Eric Marcoux <eric.marcoux@ibm.com>
* updates verison to 1.12.1

* fix(developer-edition): hardcodes langflow version

* release of 1.12.2

---------

Signed-off-by: Eric Marcoux <eric.marcoux@ibm.com>
- Add retry_handler.py with exponential backoff and jitter
- Integrate retry decorator into BaseAPIClient for all HTTP methods
- Support environment variable configuration (ADK_MAX_RETRIES, ADK_RETRY_INTERVAL, ADK_TIMEOUT)
- Smart error classification for retryable vs non-retryable errors
- Special handling for rate limits (HTTP 429) with extended backoff

BREAKING CHANGE: None - fully backward compatible with existing code
- Add retry_handler.py with configurable retry logic
- Support environment variables (ADK_MAX_RETRIES, ADK_RETRY_INTERVAL, etc.)
- Implement exponential backoff with jitter to prevent thundering herd
- Classify retryable vs non-retryable errors (including connection errors)
- Handle rate limits with extended backoff
- Integrate retry mechanism into all BaseAPIClient HTTP methods
- Add comprehensive unit tests (48 test cases)
- Add integration tests with mock server
- Add extensive documentation and configuration guides
- Add example usage in examples/client/
- Update README with retry functionality section

3 participants