Skip to content
This repository was archived by the owner on Jul 22, 2025. It is now read-only.

Conversation

@SamSaffron
Copy link
Member

This amends it so we use PDF Reader gem to extract text from PDFs

if upload.local?
Discourse.store.path_for(upload)
else
Discourse.store.download_safe(upload, max_file_size_kb: MAX_PDF_SIZE)&.path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like we are cleaning up this file after downloading it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we clean up anywhere though? this is a bit tricky, no clean pattern for this.

context "when improving PDF extraction with LLM" do
it "can properly simulate a file" do
if ENV["CI"]
skip "This test requires imagemagick is installed with ghostscript support - which is not available in CI"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm alittle confused here because CI runs tests with the discourse/discourse_test:release image that contains the same imagemagick binary as production. If this feature doesn't work in CI, it would mean that it won't work in production as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried and it was missing gs... which is odd, I guess we can double check here, was sure @xfalcox added it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgxworld looks like this regressed on discourse/discourse_docker#889. Can you help us there?

@SamSaffron
Copy link
Member Author

@tgxworld pushed the fixes except for upload cleanup, not sure there is a good pattern for it in core

@xfalcox
Copy link
Member

xfalcox commented Feb 17, 2025

Needs a fix for regression introduced at discourse/discourse_docker#889 making we miss gs and poppler

@SamSaffron
Copy link
Member Author

I think it is fine to merge this now cause the pdf OCR using gs was behind a hidden setting and remains like so after the merge of this change.

@SamSaffron SamSaffron merged commit ce79a18 into main Feb 17, 2025
6 checks passed
@SamSaffron SamSaffron deleted the pdf_fixes branch February 17, 2025 22:22
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants