fix: fix "not a valid pdf error" in parallel mode (#186)
This was a subtle bug in the retry logic. When we get a 500 from `requests.post`, we try again.
However, the pdf was stored in a BytesIO, which had already been read to the end the first time we
sent it. The retry therefore sends an empty file, which results in a 400 response that masks the
original error.
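A minimal sketch of the failure mode, not the actual code from this change. `fake_post` and
`post_with_retry` are hypothetical stand-ins for `requests.post` and the retry loop, and the status
codes are hard-coded to mimic the behavior described above. It assumes the retry can be made safe by
rewinding the stream with `seek(0)` before each attempt.

```python
import io


def fake_post(file_obj, fail):
    """Hypothetical stand-in for requests.post: reads the upload and returns a status code."""
    body = file_obj.read()
    if not body:
        # An empty upload is rejected as "not a valid PDF"
        return 400
    return 500 if fail else 200


def post_with_retry(file_obj, rewind, attempts=2):
    """Hypothetical retry loop; returns the status code seen on each attempt."""
    statuses = []
    for _ in range(attempts):
        if rewind:
            # Move back to the start so the full payload is sent again
            file_obj.seek(0)
        statuses.append(fake_post(file_obj, fail=True))
    return statuses


payload = b"%PDF-1.4 placeholder bytes"

# Without rewinding, the second attempt reads an exhausted stream and the 500 turns into a 400
print(post_with_retry(io.BytesIO(payload), rewind=False))  # [500, 400]

# With rewinding, the original 500 is preserved across retries
print(post_with_retry(io.BytesIO(payload), rewind=True))  # [500, 500]
```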
Steps to verify:
* First, check out `main`
* Start up the api in parallel mode
```
export UNSTRUCTURED_PARALLEL_MODE_ENABLED=true
export UNSTRUCTURED_PARALLEL_MODE_URL=http://localhost:8000/general/v0/general
make run-web-app
```
* Insert a 500 error into `prepline_general/api/general.py:partition_pdf_splits()`
```
# If it's small enough, just process locally
if len(pdf_pages) <= pages_per_pdf:
    raise HTTPException(status_code=500)  # Throw an error here
    return partition(
        file=file, file_filename=file_filename, content_type=content_type, **partition_kwargs
    )
```
* Send a document and see that the 500 is hidden behind a 400 error
```
$ curl 'http://localhost:8000/general/v0/general' --header 'Accept: application/json' --form files=@sample-docs/layout-parser-paper-fast.pdf
{"detail":"layout-parser-paper-fast.pdf does not appear to be a valid PDF"}
```
* Switch to this branch and do it again - you should now get a 500 `Internal server error` response