Problem
The SDK's HTTP transport (src/aiod/calls/calls.py) has no resilience. Every API call is a single bare requests.get(); if anything goes wrong, the user's code crashes. This document proposes 4 targeted changes to fix this, with dry runs showing exactly what breaks today and what changes.
Change 1: Add Automatic Retries
res = requests.get(url, timeout=config.request_timeout_seconds) # ← bare call
resources = format_response(res.json(), data_format) # ← no status check
Dry Run:
User runs:
aiod.datasets.get_list()
Step 1: SDK builds URL → "https://api.aiod.eu/v2/datasets?offset=0&limit=10"
Step 2: SDK calls requests.get(url, timeout=10)
Step 3: The server is momentarily overloaded, returns HTTP 503 (Service Unavailable)
Step 4: SDK calls res.json() on the 503 response
Step 5: The 503 response body is HTML: "<html><body>Service Unavailable</body></html>"
Step 6: res.json() crashes → json.decoder.JSONDecodeError: Expecting value: line 1 column 1
Step 7: User sees a cryptic traceback. Script dies.
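Proposed Fix -
3 automatic retries with backoff, using tenacity, a widely used library for retry/backoff logic. The status code should also be checked before res.json() is called, so that an HTML error body never reaches the JSON parser.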
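The retry behavior can be sketched without any dependency, to make the intended semantics concrete. This is a minimal hand-written loop, not the tenacity-based implementation: `fetch`, `get_json_with_retries`, `RetriesExhausted`, the retryable status set, and the 0.5 s base delay are all assumptions made for illustration.

```python
import json
import time
from typing import Any, Callable, Tuple

# Assumption: only these statuses are treated as transient and worth retrying.
RETRYABLE_STATUSES = frozenset({502, 503, 504})


class RetriesExhausted(Exception):
    """Raised when the server keeps answering with a transient error."""


def get_json_with_retries(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch(url) until it returns a non-transient status, then parse JSON.

    fetch is a hypothetical stand-in for the real HTTP call and must return
    a (status_code, body_text) tuple.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status not in RETRYABLE_STATUSES:
            break
        if attempt < max_retries:
            # Exponential backoff: 0.5 s, 1 s, 2 s with the defaults.
            sleep(base_delay * 2 ** attempt)
    else:
        # The loop never hit `break`, so every attempt was a transient error.
        raise RetriesExhausted(
            f"{url} still returned HTTP {status} after {max_retries} retries"
        )
    if status >= 400:
        # Non-transient errors are reported without touching the JSON parser.
        raise RuntimeError(f"HTTP {status} for {url}")
    return json.loads(body)
```

With tenacity, the loop would be replaced by its `retry` decorator combined with `stop_after_attempt` and `wait_exponential`; the status check before parsing stays the same.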
Change 2: Reuse Connections (Connection Pooling)
# get_list → new connection
res = requests.get(url, timeout=config.request_timeout_seconds)
# get_asset → another new connection
res = requests.get(url, headers=_get_auth_headers(...), timeout=config.request_timeout_seconds)
# search → yet another new connection
res = requests.get(url, timeout=config.request_timeout_seconds)
Dry Run:
User runs:
for id in [1, 2, 3, 4, 5]:
    aiod.datasets.get_asset(id)
Request 1 (id=1):
DNS lookup for api.aiod.eu
TCP handshake (SYN → SYN-ACK → ACK)
TLS handshake (certificate exchange)
Send HTTP GET, receive response
Close connection
Request 2 (id=2):
DNS lookup for api.aiod.eu
TCP handshake
TLS handshake
Send HTTP GET, receive response
Close connection
repeat for id=3, 4, 5...
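Proposed Fix -
httpx.Client(): built-in persistent connections, so the DNS lookup and the TCP and TLS handshakes happen once and later requests to the same host reuse the open connection.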
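Pooling only helps if every call function goes through the same client instance. One way to arrange that is sketched below; `SharedClient` is a hypothetical helper, not existing SDK code, and the client is duck-typed (anything with a `close()` method) so the sketch runs without httpx installed.

```python
from typing import Any, Callable, Optional


class SharedClient:
    """Lazily creates one client and hands the same instance to every caller."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._client: Optional[Any] = None

    def get(self) -> Any:
        # Create the client on first use, then keep returning the same one
        # so that its connection pool is actually reused.
        if self._client is None:
            self._client = self._factory()
        return self._client

    def close(self) -> None:
        # Release pooled connections; a later get() builds a fresh client.
        if self._client is not None:
            self._client.close()
            self._client = None
```

In the SDK this might be wired up as `SharedClient(lambda: httpx.Client(timeout=config.request_timeout_seconds))`, with get_list(), get_asset(), and search() all calling `.get()` on it instead of calling requests.get() directly.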
Change 3: Centralized Error Handling
There are 3 different error-handling patterns in calls.py today, one of which is no error handling at all:
- Checks status but crashes if detail is None → TypeError
  Used in: get_asset(), delete_asset(), put_asset(), patch_asset(), get_asset_from_platform()
- Returns raw Response object on failure (user has to inspect it themselves)
  Used in: post_asset()
- NO error handling at all, crashes on non-JSON responses
  Used in: get_list(), search(), counts(), get_content()
Proposed Fix - Centralized error handling (or a response interceptor) that every call goes through, instead of ad hoc, scattered error handling.
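A rough sketch of what a single response handler could look like is below. `ApiError` and `handle_response` are hypothetical names; the function takes the status code and body text rather than a library-specific response object, so it covers the three failure modes listed above (detail is None, raw response returned, non-JSON body) in one place.

```python
import json
from typing import Any

_INVALID = object()  # sentinel: body could not be decoded as JSON


class ApiError(Exception):
    """Single exception type raised for every failed API call."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(f"HTTP {status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


def _decode(text: str) -> Any:
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return _INVALID


def handle_response(status_code: int, text: str) -> Any:
    """Return the decoded JSON body on success, raise ApiError otherwise."""
    payload = _decode(text)
    if 200 <= status_code < 300:
        if not text.strip():
            return None  # success with an empty body
        if payload is _INVALID:
            raise ApiError(status_code, "response body is not valid JSON")
        return payload
    # Error path: use "detail" when present, but never assume it is a string.
    detail = payload.get("detail") if isinstance(payload, dict) else None
    if detail is None:
        detail = text.strip()[:200] or "no detail provided"
    raise ApiError(status_code, str(detail))
```

Each function in calls.py would then end with something like `return handle_response(res.status_code, res.text)`, so users only ever need to catch one exception type.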
Change 4: One HTTP Library Instead of Two
The SDK uses requests for sync and aiohttp for async. They're two completely different libraries.
# Sync call (requests)
requests.get(url, timeout=10)
# Async call (aiohttp): different syntax, different timeout rules
async with session.get(url, timeout=10) as response:
    await response.json()
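The same timeout=10 behaves differently in the two libraries. In requests it bounds the connection attempt and each wait for data from the server, not the whole download. In aiohttp a total timeout of 10 seconds covers the whole request, including time spent waiting for a free connection from the pool, so requests queued behind others in a large batch can hit the limit before they are even sent. Same number, different behavior: an async batch of 1000 assets can time out where the equivalent sync calls do not.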
Proposed Fix:
httpx does both sync (httpx.Client) and async (httpx.AsyncClient) with the same API, including the same timeout configuration.
Change 5: Add Health Checks in docker-compose.yaml
Add health checks for sqlserver, keycloak, and elasticsearch.
Note -
Side Benefit: These Stability Fixes Also Unlock Content Delivery
Franz has mentioned that content delivery (trained models, weights) may be added in the future. The stability changes above aren't designed for that but they happen to enable it as a free bonus.
The problem: get_content() can't handle large files
Today, downloading content loads the entire file into RAM at once:
res = requests.get(url, timeout=config.request_timeout_seconds)
distribution = res.content # ← waits for FULL download, holds ALL bytes in memory
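httpx has native streaming support in the same library: switching the call to client.stream("GET", url) and reading the response in chunks avoids holding the whole file in memory. Hence no architectural changes are required for future content delivery.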
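The chunked write itself is small. A minimal sketch, assuming a hypothetical helper `save_chunks` that accepts any iterable of byte chunks (so it can be exercised without a network call):

```python
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], destination: Path) -> int:
    """Write chunks to destination one at a time and return the byte count.

    Only one chunk is held in memory at any moment, regardless of file size.
    """
    written = 0
    with destination.open("wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
            written += len(chunk)
    return written
```

With httpx, the chunks would come from `response.iter_bytes()` inside a `with client.stream("GET", url) as response:` block; how get_content() should expose the result (file path versus bytes) is left open here.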
Also, there are some flaky tests for which I can see issues have already been raised.