Stabilize SDK HTTP Layer - Retries, Pooling, and Error Handling #718

@theexplorist

Problem

The SDK's HTTP transport (src/aiod/calls/calls.py) has no resilience. Every API call is a single bare requests.get(): if anything goes wrong, the user's code crashes. This document proposes 4 targeted changes to the HTTP layer to fix this, with dry runs showing exactly what breaks today and what changes; a fifth change covers the Docker setup.

Change 1: Add Automatic Retries

res = requests.get(url, timeout=config.request_timeout_seconds)   # ← bare call
resources = format_response(res.json(), data_format)              # ← no status check

Dry Run:

User runs:
    aiod.datasets.get_list()
Step 1: SDK builds URL → "https://api.aiod.eu/v2/datasets?offset=0&limit=10"
Step 2: SDK calls requests.get(url, timeout=10)
Step 3: The server is momentarily overloaded, returns HTTP 503 (Service Unavailable)
Step 4: SDK calls res.json() on the 503 response
Step 5: The 503 response body is HTML: "<html><body>Service Unavailable</body></html>"
Step 6: res.json() crashes → json.decoder.JSONDecodeError: Expecting value: line 1 column 1
Step 7: User sees a cryptic traceback. Script dies.

Proposed Fix:

Up to 3 automatic retries with backoff, using tenacity, a battle-tested library for retry and backoff logic.
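To make the intended behavior concrete, here is a minimal sketch of retry with exponential backoff for the 503 scenario in the dry run. It is a hand-rolled loop using only the standard library, not tenacity itself, and it fakes the server instead of calling the real API. The names get_json_with_retries and TransientHTTPError, the set of retryable status codes, and the delay values are all assumptions made for illustration.

```python
import json
import time
from typing import Any, Callable, Tuple

RETRYABLE_STATUS = {502, 503, 504}  # assumed set of "transient" statuses


class TransientHTTPError(Exception):
    """Raised when every attempt hit a retryable status (hypothetical name)."""


def get_json_with_retries(
    fetch: Callable[[], Tuple[int, str]],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch() once, then retry up to `retries` times with exponential backoff."""
    for attempt in range(retries + 1):
        status, body = fetch()
        if status not in RETRYABLE_STATUS:
            # The status is checked before the body is parsed, so an HTML
            # 503 page never reaches json.loads().
            return json.loads(body)
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise TransientHTTPError(f"gave up after {retries} retries (last status {status})")


# Simulated server: two 503 responses with an HTML body, then a JSON payload.
responses = iter([
    (503, "<html><body>Service Unavailable</body></html>"),
    (503, "<html><body>Service Unavailable</body></html>"),
    (200, '[{"identifier": 1}]'),
])
delays = []  # record the backoff delays instead of actually sleeping
result = get_json_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

With this shape, the user in the dry run would get their data after two short waits instead of a JSONDecodeError. Handling of non-retryable errors such as 404 is out of scope for this sketch.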

Change 2: Reuse Connections (Connection Pooling)

# get_list → new connection
res = requests.get(url, timeout=config.request_timeout_seconds)
# get_asset → another new connection
res = requests.get(url, headers=_get_auth_headers(...), timeout=config.request_timeout_seconds)
# search → yet another new connection
res = requests.get(url, timeout=config.request_timeout_seconds)

Dry Run:

User runs:
    for id in [1, 2, 3, 4, 5]:
        aiod.datasets.get_asset(id)

Request 1 (id=1):
    DNS lookup for api.aiod.eu          
    TCP handshake (SYN → SYN-ACK → ACK)
    TLS handshake (certificate exchange)  
    Send HTTP GET, receive response
    Close connection 

Request 2 (id=2):
    DNS lookup for api.aiod.eu          
    TCP handshake                       
    TLS handshake                        
    Send HTTP GET, receive response      
    Close connection                  


repeat for id=3, 4, 5...

Proposed Fix:

Use a shared httpx.Client(), which keeps connections open and reuses them across requests.
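The cost difference in the dry run can be reproduced locally with only the standard library. The sketch below is not the SDK's code and does not use httpx: it starts a throwaway HTTP server on localhost, counts how many TCP connections it accepts, and compares five requests on fresh connections against five requests on one reused connection. The /datasets/{id} path is a placeholder. A shared httpx.Client would do this kind of reuse automatically.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

connections_opened = 0  # incremented once per accepted TCP connection


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows keep-alive on the server side

    def setup(self):
        # setup() runs once per connection, not once per request.
        global connections_opened
        connections_opened += 1
        super().setup()

    def do_GET(self):
        body = b"{}"
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass


server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# Pattern used today: a new connection for every request.
for asset_id in range(1, 6):
    conn = http.client.HTTPConnection(host, port)
    conn.request("GET", f"/datasets/{asset_id}")
    conn.getresponse().read()
    conn.close()
per_request = connections_opened

# Pooled pattern: one connection kept open and reused for all five requests.
connections_opened = 0
conn = http.client.HTTPConnection(host, port)
for asset_id in range(1, 6):
    conn.request("GET", f"/datasets/{asset_id}")
    conn.getresponse().read()
conn.close()
reused = connections_opened

server.shutdown()
server.server_close()
print(per_request, reused)
```

Against a real HTTPS endpoint, each avoided connection also saves the DNS lookup and the TCP and TLS handshakes listed in the dry run.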

Change 3: Centralized Error Handling

There are 3 different error-handling patterns in calls.py today, and some functions with NO error handling:

  • checks status but crashes if detail is None → TypeError
    Used in: get_asset(), delete_asset(), put_asset(), patch_asset(), get_asset_from_platform()
  • returns raw Response object on failure (user has to inspect it themselves)
    Used in: post_asset()
  • NO error handling at all, crashes on non-JSON responses
    Used in: get_list(), search(), counts(), get_content()

Proposed Fix: centralized error handling (or a response interceptor) instead of ad hoc, scattered error handling.
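One possible shape for a central handler is sketched below. Every function in calls.py would pass its response through the same function, which covers all three failure modes listed above: a missing or None detail, a raw Response handed back to the user, and a non-JSON body. AiodAPIError, parse_response, and FakeResponse are hypothetical names, and the handler only assumes the response exposes status_code and text.

```python
import json
from dataclasses import dataclass
from typing import Any


class AiodAPIError(Exception):
    """Single exception type for any failed API call (hypothetical name)."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(f"HTTP {status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


@dataclass
class FakeResponse:
    """Stand-in exposing the two attributes the handler relies on."""
    status_code: int
    text: str


def parse_response(res) -> Any:
    """Return the decoded JSON body, or raise AiodAPIError with a readable message."""
    try:
        payload, is_json = json.loads(res.text), True
    except json.JSONDecodeError:
        payload, is_json = None, False  # e.g. an HTML error page or empty body

    if 200 <= res.status_code < 300:
        if is_json:
            return payload
        raise AiodAPIError(res.status_code, "response body is not valid JSON")

    # "detail" may be missing or None, so never index into it blindly.
    detail = payload.get("detail") if isinstance(payload, dict) else None
    message = detail or res.text[:200] or "no detail provided"
    raise AiodAPIError(res.status_code, str(message))


ok = parse_response(FakeResponse(200, '{"identifier": 1}'))

try:
    parse_response(FakeResponse(503, "<html><body>Service Unavailable</body></html>"))
except AiodAPIError as exc:
    html_error = exc

try:
    parse_response(FakeResponse(404, '{"detail": null}'))
except AiodAPIError as exc:
    null_detail_error = exc
```

With this approach callers only ever see decoded data or one documented exception type, never a TypeError, a JSONDecodeError, or a raw Response.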

Change 4: One HTTP Library Instead of Two

The SDK uses requests for sync and aiohttp for async. They're two completely different libraries.

# Sync call (requests)
requests.get(url, timeout=10)

# Async call (aiohttp): different syntax, different timeout rules
async with session.get(url, timeout=10) as response:
    await response.json()

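timeout=10 does not mean the same thing in both libraries. In requests it bounds the connection attempt and each wait for data from the server, not the request as a whole. In aiohttp a bare number is treated as a total budget for the entire request, including any time spent waiting for a free connection in the pool. Same number, different behavior: in an async batch of 1000 assets, requests queued behind the connection limit can time out before they are even sent, while the equivalent sync calls do not.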
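The effect of the two timeout styles on a batch can be simulated with asyncio alone. This is a simplified model with no real network and neither library involved: a semaphore plays the connection pool, asyncio.sleep plays the request, and the pool size, request time, and timeout values are made up. In "total budget" mode the clock starts as soon as the request is queued; in the other mode it starts only once a connection is available.

```python
import asyncio
from typing import List

POOL_LIMIT = 2       # assumed connection-pool size
REQUEST_TIME = 0.1   # simulated duration of one request, in seconds
TIMEOUT = 0.25       # the same number is fed to both timeout styles


async def fake_request(pool: asyncio.Semaphore, total_budget: bool) -> str:
    async def send() -> None:
        await asyncio.sleep(REQUEST_TIME)  # stands in for the network round trip

    async def queued() -> None:
        async with pool:  # wait for a free connection, then send
            if total_budget:
                await send()
            else:
                # Clock starts only after the connection has been acquired.
                await asyncio.wait_for(send(), TIMEOUT)

    try:
        if total_budget:
            # Clock starts immediately and includes time spent queueing.
            await asyncio.wait_for(queued(), TIMEOUT)
        else:
            await queued()
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"


async def run_batch(total_budget: bool, n: int = 6) -> List[str]:
    pool = asyncio.Semaphore(POOL_LIMIT)
    tasks = (fake_request(pool, total_budget) for _ in range(n))
    return list(await asyncio.gather(*tasks))


total_results = asyncio.run(run_batch(total_budget=True))
per_request_results = asyncio.run(run_batch(total_budget=False))
print(total_results)
print(per_request_results)
```

With six requests and a pool of two, the last two requests in total-budget mode spend 0.2 s waiting for a connection and run out of budget mid-request, even though each individual request takes only 0.1 s. In the other mode all six succeed.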

Proposed Fix:

httpx does both sync and async with the same API

Change 5: Add Health Checks in docker-compose.yaml

Add health checks for the sqlserver, keycloak, and elasticsearch services.

Note -

Side Benefit: These Stability Fixes Also Unlock Content Delivery
Franz has mentioned that content delivery (trained models, weights) may be added in the future. The stability changes above aren't designed for that, but they happen to enable it as a bonus.

The problem: get_content() can't handle large files
Today, downloading content loads the entire file into RAM at once:


res = requests.get(url, timeout=config.request_timeout_seconds)
distribution = res.content  # ← waits for FULL download, holds ALL bytes in memory

httpx has native streaming support in the same library. Switching get_content() from a plain GET to client.stream(...) and reading the body in chunks is a change to a single method, so no further architectural changes are required.
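The memory difference comes from consuming the body chunk by chunk instead of as one bytes object. The sketch below shows only that consuming side, with a generator standing in for the network; save_stream and fake_download are hypothetical helpers, and the chunk size and file name are arbitrary. With httpx, the chunks would come from response.iter_bytes() inside a with client.stream("GET", url) as response: block.

```python
import os
import tempfile
from typing import Iterable, Iterator


def save_stream(chunks: Iterable[bytes], path: str) -> int:
    """Write chunks to `path` as they arrive; return the number of bytes written."""
    written = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)  # only one chunk is held in memory at a time
            written += len(chunk)
    return written


def fake_download(total_bytes: int, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Stand-in for a streamed response body: yields the payload piece by piece."""
    remaining = total_bytes
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield b"\0" * size
        remaining -= size


path = os.path.join(tempfile.mkdtemp(), "weights.bin")
size = save_stream(fake_download(1_000_000), path)
print(size, os.path.getsize(path))
```

Peak memory here is bounded by the chunk size rather than the file size, which is what matters for multi-gigabyte model weights.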

There are also some flaky tests, for which I can see issues have already been raised.
