- 
                Notifications
    You must be signed in to change notification settings 
- Fork 41
FEATURE: Native PDF support #1127
Conversation
This amends it so we use PDF Reader gem to extract text from PDFs
| if upload.local? | ||
| Discourse.store.path_for(upload) | ||
| else | ||
| Discourse.store.download_safe(upload, max_file_size_kb: MAX_PDF_SIZE)&.path | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like we are cleaning up this file after downloading it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we clean up anywhere though? this is a bit tricky, no clean pattern for this.
| context "when improving PDF extraction with LLM" do | ||
| it "can properly simulate a file" do | ||
| if ENV["CI"] | ||
| skip "This test requires imagemagick is installed with ghostscript support - which is not available in CI" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm alittle confused here because CI runs tests with the discourse/discourse_test:release image that contains the same imagemagick binary as production. If this feature doesn't work in CI, it would mean that it won't work in production as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried and it was missing gs... which is odd, I guess we can double check here, was sure @xfalcox added it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tgxworld looks like this regressed on discourse/discourse_docker#889. Can you help us there?
Co-authored-by: Alan Guo Xiang Tan <[email protected]>
| @tgxworld pushed the fixes except for upload cleanup, not sure there is a good pattern for it in core | 
| Needs a fix for regression introduced at discourse/discourse_docker#889 making we miss gs and poppler | 
| I think it is fine to merge this now cause the pdf OCR using gs was behind a hidden setting and remains like so after the merge of this change. | 
This amends it so we use PDF Reader gem to extract text from PDFs