RAG API taking extremely long to process documents over 1mb? #4081

ggomp2885 · 2024-09-16T22:12:26Z

ggomp2885
Sep 16, 2024

What happened?

When I upload a 1.3 MB PDF file as a test, it takes an hour or more to get into the RAG database. I also can't cancel the upload once it starts, so the file sits there for the whole hour and blocks any other file from being uploaded in the UI.
I have to force-kill the entire LibreChat app and restart it to reset this.

This makes uploading any document of decent length unusable.
Is there a setting that I am missing?

Steps to Reproduce

Open LibreChat
Set embedding model to: Azure, "text-embedding-3-small"
Attempt to upload a PDF file of 1 MB or larger to the rag_api

What browsers are you seeing the problem on?

Chrome

Relevant log output

2024-09-12 19:17:35,652 - multipart.multipart - DEBUG - Calling on_part_data with data[0:1]
2024-09-12 19:17:35,653 - multipart.multipart - DEBUG - Calling on_part_data with data[48218:130477]
2024-09-12 19:17:35,655 - multipart.multipart - DEBUG - Calling on_part_data with data[0:10103]
2024-09-12 19:17:35,655 - multipart.multipart - DEBUG - Calling on_part_data with data[0:1]
2024-09-12 19:17:35,655 - multipart.multipart - DEBUG - Calling on_part_data with data[10104:36314]
2024-09-12 19:17:35,656 - multipart.multipart - DEBUG - Calling on_part_data with data[0:1]



danny-avila · 2024-09-17T13:46:38Z

danny-avila
Sep 17, 2024
Maintainer

This is not the expected experience at all; I can upload 10+ MB PDFs rather quickly. The main thing that affects upload time is the amount of text content in the files, regardless of file size or type.

Is this a text-heavy PDF you are trying to upload? Are there any other files that work for you?

First, make sure everything else is configured correctly, too: https://www.librechat.ai/docs/configuration/rag_api

3 replies

ggomp2885 Sep 18, 2024
Author

Hey Danny,
Thanks so much for your response!
Here is the example we are working with.

Even with just the first 5 pages of this document, it takes 3 min 30 s to get into the rag_api.
Example here: first_5_pages.pdf

Whereas chatgpt.com (as a reference point) can upload the same 5 pages in 1 second.

I'm attaching both my LibreChat .env file and my rag_api .env file here:
librechat_env.txt
rag_api_env.txt

We are running this locally on a very large server, using the "npm run backend" command.
Would you expect this to make any difference in rag_api speed versus running it as a Docker container?
Is 3 min 30 s for 5 pages of text the expected experience?

danny-avila Sep 19, 2024
Maintainer

No, it's not expected at all. Watch my video: it takes approx. 1 second as well with the PDF you attached. No clue what the bottleneck is for you.

Recording.2024-09-19.105912.mp4

ggomp2885 Sep 21, 2024
Author

Hey Danny,
Good news!!
We found the source of this issue.
It was the "PDF_EXTRACT_IMAGES" parameter in our .env file, which needs to be set to false.

This parameter is passed to the "extract_images" parameter of PyPDFLoader.

When it was set to True, the loader.load() call in main.py in the rag_api repo took 45 seconds or more per page. When it is set to False, the same call completes in half a second or less.

Thank you so much for posting the video of the upload on your side.
This was exactly the confirmation we needed to know the problem was definitely coming from our side!
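
For anyone hitting the same slowdown, here is a minimal sketch of how the flag could be checked and the loader timed outside of LibreChat. It assumes the PyPDFLoader in question is the one from langchain_community.document_loaders and that the .env value (for example PDF_EXTRACT_IMAGES=false) is read as a case-insensitive boolean. The helper names env_flag and timed_load are hypothetical and are not taken from the rag_api code.

```python
import os
import time


def env_flag(name, default=False):
    """Read a boolean-like environment variable (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def timed_load(path, extract_images):
    """Load a PDF and return (number_of_documents, seconds_elapsed).

    Assumption: PyPDFLoader comes from langchain_community. It is imported
    lazily so that env_flag can be used without langchain installed.
    """
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(path, extract_images=extract_images)
    start = time.perf_counter()
    docs = loader.load()
    return len(docs), time.perf_counter() - start


if __name__ == "__main__":
    # Report how the flag would be interpreted in the current environment.
    print("PDF_EXTRACT_IMAGES ->", env_flag("PDF_EXTRACT_IMAGES"))
```

Running timed_load on the same file once with extract_images=True and once with extract_images=False should show whether image extraction is the bottleneck on a given machine, assuming langchain-community and its PDF dependencies are installed.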

Answer selected by ggomp2885
