Stabilize SDK HTTP Layer - Retries, Pooling, and Error Handling #718

## Problem
The SDK's HTTP transport (`src/aiod/calls/calls.py`) has no resilience. Every API call is a single bare `requests.get()`: if anything goes wrong, the user's code crashes. This document proposes 4 targeted changes to the HTTP layer to fix this (plus a related docker-compose change), with dry runs showing exactly what breaks today.

### Change 1: Add Automatic Retries
```
res = requests.get(url, timeout=config.request_timeout_seconds)   # ← bare call
resources = format_response(res.json(), data_format)              # ← no status check
```

#### Dry Run:
```
User runs:
    aiod.datasets.get_list()
Step 1: SDK builds URL → "https://api.aiod.eu/v2/datasets?offset=0&limit=10"
Step 2: SDK calls requests.get(url, timeout=10)
Step 3: The server is momentarily overloaded, returns HTTP 503 (Service Unavailable)
Step 4: SDK calls res.json() on the 503 response
Step 5: The 503 response body is HTML: "<html><body>Service Unavailable</body></html>"
Step 6: res.json() crashes → json.decoder.JSONDecodeError: Expecting value: line 1 column 1
Step 7: User sees a cryptic traceback. Script dies.
```

#### Proposed Fix -
Add 3 automatic retries with backoff using tenacity, a widely used library for retry and backoff logic.
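
To make the intended behavior concrete, here is a minimal stdlib-only sketch of "3 retries with exponential backoff". It is a hand-rolled stand-in for what tenacity would provide, not the proposed implementation. `TransientHTTPError`, `RETRYABLE_STATUS`, `call_with_retries`, and `make_flaky_send` are hypothetical names, and the set of retryable status codes is an assumption (the proposal does not list them).

```python
import random
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

# Assumption: which statuses count as transient. Not specified in the proposal.
RETRYABLE_STATUS = frozenset({429, 502, 503, 504})


class TransientHTTPError(Exception):
    """Hypothetical exception raised for a retryable HTTP status."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"transient HTTP {status_code}")
        self.status_code = status_code


def call_with_retries(
    send: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(); on a transient failure, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return send()
        except (TransientHTTPError, ConnectionError, TimeoutError):
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            # Exponential backoff (0.5s, 1s, 2s, ...) plus a little jitter.
            sleep(base_delay * 2**attempt + random.uniform(0, 0.1))
    raise AssertionError("unreachable")


def make_flaky_send(statuses: Iterable[int]) -> Callable[[], int]:
    """Simulate a server that answers with the given statuses, in order."""
    remaining = iter(statuses)

    def send() -> int:
        status = next(remaining)
        if status in RETRYABLE_STATUS:
            raise TransientHTTPError(status)
        return status

    return send


if __name__ == "__main__":
    # Two 503s followed by a 200: the caller only ever sees the success.
    print(call_with_retries(make_flaky_send([503, 503, 200]), base_delay=0.01))
```

In the dry run above, Step 3 would become a short wait and a second attempt instead of a crash. Only idempotent calls (GET) are obviously safe to retry this way; whether POST/PUT/PATCH should be retried is a separate decision.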

### Change 2: Reuse Connections (Connection Pooling)

```
# get_list → new connection
res = requests.get(url, timeout=config.request_timeout_seconds)
# get_asset → another new connection
res = requests.get(url, headers=_get_auth_headers(...), timeout=config.request_timeout_seconds)
# search → yet another new connection
res = requests.get(url, timeout=config.request_timeout_seconds)
```

#### Dry Run:

```
User runs:
    for id in [1, 2, 3, 4, 5]:
        aiod.datasets.get_asset(id)

Request 1 (id=1):
    DNS lookup for api.aiod.eu          
    TCP handshake (SYN → SYN-ACK → ACK)
    TLS handshake (certificate exchange)  
    Send HTTP GET, receive response
    Close connection 

Request 2 (id=2):
    DNS lookup for api.aiod.eu          
    TCP handshake                       
    TLS handshake                        
    Send HTTP GET, receive response      
    Close connection                  


repeat for id=3, 4, 5...
```
#### Proposed Fix -
Use a shared `httpx.Client()`, which keeps connections open and reuses them across requests.
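
The difference in the dry run can be measured without touching the real API. The sketch below is stdlib-only and uses a toy local server rather than httpx: it counts how many TCP connections the server accepts when each call opens its own connection versus when one connection is shared. `CountingHandler`, `count_connections`, and the `/datasets/{id}` path are hypothetical and only serve the demonstration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Toy local server that counts how many TCP connections it accepts."""

    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections alive by default
    connections = 0

    def setup(self) -> None:
        super().setup()  # runs once per accepted connection
        CountingHandler.connections += 1

    def do_GET(self) -> None:
        body = b"{}"
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        pass  # keep the demo quiet


def count_connections(reuse: bool, ids=(1, 2, 3, 4, 5)) -> int:
    """Fetch one path per id and return how many connections the server saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    shared = http.client.HTTPConnection("127.0.0.1", port) if reuse else None
    try:
        for asset_id in ids:
            conn = shared or http.client.HTTPConnection("127.0.0.1", port)
            conn.request("GET", f"/datasets/{asset_id}")
            conn.getresponse().read()  # drain the body so the socket can be reused
            if shared is None:
                conn.close()  # today's pattern: one connection per call
    finally:
        if shared is not None:
            shared.close()
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("one connection per call:", count_connections(reuse=False))
    print("shared connection:", count_connections(reuse=True))
```

With five ids, the per-call variant opens five connections and the shared variant opens one. Against a real HTTPS endpoint, every avoided connection also saves a TCP and TLS handshake.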

### Change 3: Centralized Error Handling

Error handling in calls.py is inconsistent today. Functions fall into 3 groups:

* Checks the status, but crashes with a TypeError if detail is None.
  Used in: get_asset(), delete_asset(), put_asset(), patch_asset(), get_asset_from_platform()
* Returns the raw Response object on failure (the user has to inspect it themselves).
  Used in: post_asset()
* No error handling at all; crashes on non-JSON responses.
  Used in: get_list(), search(), counts(), get_content()

#### Proposed Fix -
Centralized error handling (or a response interceptor) instead of ad hoc, scattered error handling.
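
One possible shape for such a central handler is sketched below. It is an illustration under assumptions, not the SDK's design: `ApiError`, `parse_response`, and `FakeResponse` are hypothetical names, and the handler only relies on a response exposing `status_code` and `text`, which both requests and httpx responses do.

```python
import json
from dataclasses import dataclass
from typing import Any


class ApiError(Exception):
    """Hypothetical single exception type for every failed API call."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(f"HTTP {status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


@dataclass
class FakeResponse:
    """Stand-in exposing only the two attributes the handler reads."""

    status_code: int
    text: str


def parse_response(res: Any) -> Any:
    """Return the decoded JSON body, or raise ApiError with a readable message."""
    try:
        payload = json.loads(res.text)
        is_json = True
    except json.JSONDecodeError:
        payload, is_json = None, False  # e.g. an HTML error page

    if 200 <= res.status_code < 300:
        if not res.text.strip():
            return None  # success with an empty body
        if not is_json:
            raise ApiError(res.status_code, "response body is not valid JSON")
        return payload

    # Failure path: tolerate a missing, null, or non-dict "detail".
    detail = payload.get("detail") if isinstance(payload, dict) else None
    message = str(detail) if detail else (res.text[:200] or "no detail provided")
    raise ApiError(res.status_code, message)


if __name__ == "__main__":
    print(parse_response(FakeResponse(200, '{"name": "example"}')))
    try:
        parse_response(FakeResponse(503, "<html><body>Service Unavailable</body></html>"))
    except ApiError as exc:
        print(exc)
```

Every function in calls.py would then pass its response through the same handler, so the three cases listed above (a `None` detail, a raw Response handed back to the user, a non-JSON body) all surface as one exception type.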

### Change 4: One HTTP Library Instead of Two

The SDK uses requests for sync and aiohttp for async. They're two completely different libraries.

```
# Sync call (requests)
requests.get(url, timeout=10)

# Async call (aiohttp): different syntax, different timeout rules
async with session.get(url, timeout=10) as response:
    await response.json()
```

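`timeout=10` does not mean the same thing in both libraries. In requests it bounds the connect phase and each wait for data from the server, not the request as a whole. aiohttp treats the value as a total budget for the request, and that budget also covers time spent waiting for a free connection from the pool. Same number, different behavior: in an async batch of 1000 assets, requests queued behind the connection limit can time out, while the equivalent sync calls don't.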
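
The effect of where the timeout clock starts can be shown with a toy asyncio model. This is a simulation of the two budget styles, not a reproduction of either library: a semaphore stands in for the connection limit, `asyncio.sleep` stands in for the request, and `run_batch` with its parameters is a hypothetical helper.

```python
import asyncio


async def run_batch(
    n: int,
    pool_size: int,
    work: float,
    budget: float,
    budget_includes_queue: bool,
) -> int:
    """Simulate n requests through a limited pool; return how many time out."""
    pool = asyncio.Semaphore(pool_size)  # stands in for a connection limit

    async def pooled_request() -> None:
        async with pool:               # wait for a free "connection"
            await asyncio.sleep(work)  # the request itself

    async def one() -> None:
        if budget_includes_queue:
            # Total budget: the clock starts before waiting for the pool.
            await asyncio.wait_for(pooled_request(), budget)
        else:
            # Per-request budget: the clock starts once a connection is held.
            async with pool:
                await asyncio.wait_for(asyncio.sleep(work), budget)

    results = await asyncio.gather(
        *(one() for _ in range(n)), return_exceptions=True
    )
    return sum(isinstance(r, asyncio.TimeoutError) for r in results)


if __name__ == "__main__":
    args = dict(n=6, pool_size=2, work=0.2, budget=0.3)
    print("total budget:", asyncio.run(run_batch(budget_includes_queue=True, **args)))
    print("per-request budget:", asyncio.run(run_batch(budget_includes_queue=False, **args)))
```

With six requests, a pool of two, and each request well inside the budget on its own, only the first two finish under the total-budget style; under the per-request style all six finish.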

#### Proposed Fix:
httpx covers both cases with the same API: `httpx.Client` for sync and `httpx.AsyncClient` for async.

### Change 5: Add Health Checks in docker-compose.yaml

Add health checks for sqlserver, keycloak, and elasticsearch.

#### Note -
Side benefit: these stability fixes also unlock content delivery.

Franz has mentioned that content delivery (trained models, weights) may be added in the future. The stability changes above aren't designed for that, but they enable it as a free bonus.

The problem: get_content() can't handle large files. Today, downloading content loads the entire file into RAM at once:
```
res = requests.get(url, timeout=config.request_timeout_seconds)
distribution = res.content  # ← waits for FULL download, holds ALL bytes in memory
```
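httpx supports streaming natively (for example via `client.stream(...)`, or `stream=True` on `client.send(...)`). Since it is the same library, switching get_content() to a streamed download is a small, local change, and no further architectural changes are required.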
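
The consumer side of a streamed download can be sketched with the stdlib alone. `save_chunks` and `fake_download` below are hypothetical helpers. With httpx, the chunks would presumably come from `response.iter_bytes()` inside a `client.stream("GET", url)` block; here a generator stands in for the response body.

```python
import os
import tempfile
from typing import Iterable, Iterator


def save_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write chunks to path as they arrive; return the total bytes written."""
    total = 0
    with open(path, "wb") as fh:
        for chunk in chunks:  # only one chunk is held in memory at a time
            fh.write(chunk)
            total += len(chunk)
    return total


def fake_download(size: int, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Hypothetical stand-in for a streamed response body."""
    for start in range(0, size, chunk_size):
        yield b"\0" * min(chunk_size, size - start)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "weights.bin")
        written = save_chunks(fake_download(1_000_000), target)
        print(written, os.path.getsize(target))
```

Peak memory stays around one chunk regardless of file size, which is what a future content-delivery feature for large model weights would need.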

There are also some flaky tests, for which I can see issues have already been raised.
