RAG API taking extremely long to process documents over 1mb? #4081
What happened?
When I upload a 1.3 MB PDF file as a test, it takes an hour or more to get into the RAG database. It also won't let me cancel the upload once it starts, so the file just sits there for an hour, blocking any other file from being uploaded in the UI. This makes uploading any document of decent length unusable.
What browsers are you seeing the problem on?
Chrome
Relevant log output
2024-09-12 19:17:35,652 - multipart.multipart - DEBUG - Calling on_part_data with data[0:1]
2024-09-12 19:17:35,653 - multipart.multipart - DEBUG - Calling on_part_data with data[48218:130477]
2024-09-12 19:17:35,655 - multipart.multipart - DEBUG - Calling on_part_data with data[0:10103]
2024-09-12 19:17:35,655 - multipart.multipart - DEBUG - Calling on_part_data with data[0:1]
2024-09-12 19:17:35,655 - multipart.multipart - DEBUG - Calling on_part_data with data[10104:36314]
2024-09-12 19:17:35,656 - multipart.multipart - DEBUG - Calling on_part_data with data[0:1]
Replies: 1 comment 3 replies
This is not the expected experience at all; I can upload 10+ MB PDFs rather quickly. The main thing that affects upload time is the amount of text content in the files, regardless of file size or type. Is this a text-heavy PDF you are trying to upload? Are there any other files that work for you? First, be sure everything else is configured correctly, too: https://www.librechat.ai/docs/configuration/rag_api
Hey Danny,
Good news!!
I found the source of this issue.
It was the "PDF_EXTRACT_IMAGES" parameter in our .env file; this needs to be set to false.
This parameter gets read into the "extract_images" parameter of PyPDFLoader.
When it was set to True, it took 45 seconds or more per page to get through the loader.load() function in main.py in the rag_api repo. When it is set to False, that function completes in half a second or less.
Thank you so much for posting the video of you uploading it on your side.
This was exactly the confirmation we needed to know the problem was definitely coming from our side!
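For anyone who wants to check the same thing on their own setup, here is a minimal sketch of reading the flag and timing a load call. It is not the actual rag_api code: the `env_flag` and `timed` helpers and the set of accepted truthy strings are assumptions made for illustration, and the PyPDFLoader usage appears only in comments because it needs LangChain installed and a real PDF on disk.

```python
import os
import time

# Assumption: which strings count as "true". rag_api may parse the value differently.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean-like environment variable; fall back to default if unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    extract_images = env_flag("PDF_EXTRACT_IMAGES", default=False)
    print(f"PDF_EXTRACT_IMAGES -> {extract_images}")

    # With LangChain installed, the slow step could be timed like this
    # ("sample.pdf" is a hypothetical path):
    #
    #   from langchain_community.document_loaders import PyPDFLoader
    #   loader = PyPDFLoader("sample.pdf", extract_images=extract_images)
    #   docs, seconds = timed(loader.load)
    #   print(f"{len(docs)} pages in {seconds:.1f}s")
```

Running the timing once with the variable set to True and once with False should make it clear whether image extraction is what slows down the upload.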