fix: Address some issues in the split pdf logic (#165)
We've encountered some bugs in the split pdf code. For one, these
requests are not retried. With the new `split_pdf_allow_failed=False`
behavior, this means one transient network error can interrupt the whole
doc. We've also had some asyncio warnings such as `... was never
awaited`.
This PR adds retries, cleans up the logic, and gives us a much better
base for the V2 client release.
# Changes
## Return a "dummy request" in the split BeforeRequestHook
When the BeforeRequestHook was called, we would split the doc into N
requests, issue coroutines for N-1 of them, and return the last one for
the SDK to run itself. This left two code paths for recombining the results.
Instead, the BeforeRequest can return a dummy request that will get a
200. This takes us straight to the AfterSuccessHook, which awaits all of
the splits and builds the response.
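Roughly, the new flow can be sketched like this (all names here are hypothetical stand-ins, not the SDK's actual hook classes): the hook schedules a task for every split, hands the SDK a dummy request that will get a 200, and the after-success step awaits everything and builds the combined response.

```
import asyncio

async def partition_page(page_num: int) -> dict:
    """Stand-in for one split request to the partition endpoint."""
    await asyncio.sleep(0)  # pretend to do network I/O
    return {"page": page_num, "elements": [f"element-{page_num}"]}

class SplitPdfHook:
    def __init__(self) -> None:
        self.tasks = []

    def before_request(self, num_pages: int) -> str:
        # Schedule a coroutine for *every* split; none are handed back
        # to the SDK to run itself.
        self.tasks = [
            asyncio.ensure_future(partition_page(n))
            for n in range(1, num_pages + 1)
        ]
        # The SDK gets a dummy request that always succeeds, taking us
        # straight to the AfterSuccessHook.
        return "DUMMY_REQUEST"

    async def after_success(self) -> list:
        # The single recombination path: await all splits and flatten
        # them into the final response body.
        results = await asyncio.gather(*self.tasks)
        return [el for page in results for el in page["elements"]]

async def run_split(num_pages: int) -> list:
    hook = SplitPdfHook()
    hook.before_request(num_pages)  # SDK "runs" the dummy, gets a 200
    return await hook.after_success()
```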
## Add retries to the split requests
This is a copy of the autogenerated code in `retry.py`, which will work
for the async calls. At some point, we should be able to reuse the SDK
for this so we aren't hardcoding the retry config values here. Need to
work with Speakeasy on this.
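As a rough sketch of what that retry loop does (hypothetical names and values; the real config is copied from the generated `retry.py`), an async exponential-backoff wrapper might look like:

```
import asyncio

# Transient upstream statuses worth retrying.
RETRYABLE_STATUSES = {502, 503, 504}

async def request_with_retries(send, max_retries: int = 3,
                               initial_interval: float = 0.5,
                               backoff_factor: float = 2.0) -> int:
    """`send` is an async callable returning a status code.

    Retries transient statuses with exponential backoff, returning the
    last status once we succeed or run out of attempts.
    """
    delay = initial_interval
    for attempt in range(max_retries + 1):
        status = await send()
        if status not in RETRYABLE_STATUSES or attempt == max_retries:
            return status
        # Back off before the next attempt.
        await asyncio.sleep(delay)
        delay *= backoff_factor
```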
## Clean up error handling
When the retries fail and we do have to bubble up an error, we pass it
to `create_failure_response` before returning to the SDK. This sets a
500 status code on the response so that the SDK does not see a
502/503/504 and proceed to retry the entire doc.
## Set an httpx timeout
Many of the failing requests right now are hi_res calls. This is because
the default httpx client timeout is 5 seconds, and we immediately throw
a ReadTimeout. For now, set this timeout to 10 minutes. This should be
sufficient in the splitting code, where page size per request will be
controlled. This is another hardcoded value that should go away once
we're able to send our splits back into `sdk.general.partition`.
# Testing
Any pipelines that have failed consistently should work now. For more
fine grained testing, I tend to mock up my local server to return a
retryable error for specific pages, a certain number of times. In the
`general_partition` function, I add something like
```
global num_bounces  # Initialize this somewhere up above

page_num = form_params.starting_page_number or 1

if num_bounces > 0 and page_num == 3:
    num_bounces -= 1
    logger.info(page_num)
    raise HTTPException(status_code=502, detail="BOUNCE")
```
Then, send a SDK request to your local server and verify that the split
request for page 3 of your doc is retrying up to the number of times you
want.
Also, setting the max concurrency to 15 should reproduce the issue.
Choose a 50+ page PDF and try the following with the current 0.25.5
branch. It will likely fail with `ServerError: {}`. Then try a local pip
install off this branch.
```
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

s = UnstructuredClient(
    api_key_auth="my-api-key",
)

filename = "some-large-pdf"
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
        split_pdf_page=True,
        strategy="hi_res",
        split_pdf_allow_failed=False,
        split_pdf_concurrency_level=15,
    ),
)

resp = s.general.partition(req)

if num_elements := len(resp.elements):
    print(f"Succeeded with {num_elements}")
```