FEATURE: PDF support for rag pipeline #1118
(this starts by defining the extraction routines)
| ) | ||
| end | ||
| if %w[png jpg jpeg].include?(upload.extension) |
As part of my comment above: we can easily check for supported extensions if we create a helper somewhere.
| if %w[png jpg jpeg].include?(upload.extension) | |
| if FileHelper.ai_supported_images.include?(upload.extension) |
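A minimal sketch of the kind of helper the reviewer is suggesting. The class below is a standalone stand-in written for illustration: the method name `ai_supported_images` comes from the suggestion above, while `ai_supported_image?` and the normalization it does are assumptions, not something the PR confirms.

```ruby
# Hypothetical stand-in for the helper proposed in the review comment.
# It only centralizes the extension list so call sites stop repeating it.
class FileHelper
  AI_SUPPORTED_IMAGES = %w[png jpg jpeg].freeze

  # Returns the list of image extensions the RAG pipeline accepts.
  def self.ai_supported_images
    AI_SUPPORTED_IMAGES
  end

  # Assumed convenience predicate: tolerates upper case and a leading dot.
  def self.ai_supported_image?(extension)
    ai_supported_images.include?(extension.to_s.downcase.delete_prefix("."))
  end
end
```

With a helper like this, the check in the diff becomes a single call, and adding a new image type means touching one constant.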
| class AiPersona < ActiveRecord::Base | ||
| # TODO remove this line 01-1-2025 | ||
| self.ignored_columns = %i[commands allow_chat mentionable] | ||
| # TODO remove this line 01-10-2025 |
Did you mean 02-10-2025? Although, this date has also passed now, should we update it to something further in the future?
Oh dates ... this is D-M-Y ... generally I try to stick with that for code comments. Dates are hard; maybe for comments like this we should go with October-2025, which is a lot less ambiguous.
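To make the ambiguity concrete, here is a small sketch using only Ruby's standard `Date` class. It parses the same string from the TODO comment under both readings and then formats it the way the reply proposes.

```ruby
require "date"

stamp = "01-10-2025"

# Day-month-year reading (the author's intent): 1 October 2025.
as_dmy = Date.strptime(stamp, "%d-%m-%Y")

# Month-day-year reading (how the reviewer first read it): 10 January 2025.
as_mdy = Date.strptime(stamp, "%m-%d-%Y")

# Month name plus year leaves no room for either misreading.
unambiguous = as_dmy.strftime("%B-%Y")
```

Both parses succeed without error, which is exactly why numeric dates in TODO comments are easy to misread.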
| get acceptedFileTypes() { | ||
| if (this.args?.allowPdfsAndImages) { | ||
| return ".txt,.md,.pdf,.png,.jpg,.jpeg"; | ||
| } else { | ||
| return ".txt,.md"; | ||
| } | ||
| } |
Should we also ensure these extensions are in siteSettings.authorized_extensions_for_staff? We can use authorizedExtensions from discourse/lib/uploads.
Yeah this does raise issues if you half do the change, we already have a problem with txt/md which is probably more serious.
I think the right thing to do here is for the plugin to be allowed to override everything and just upload the things it wants. Not sure.
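One way to picture the reviewer's first option is to intersect the extensions the uploader wants with the ones the site authorizes. The sketch below is written in Ruby for illustration only; the function name, the pipe-delimited format of the setting value, and the `*` wildcard handling are all assumptions, and the real check would live in the frontend next to `acceptedFileTypes`.

```ruby
# Hypothetical sketch: filter the uploader's wanted extensions by a
# pipe-delimited authorized-extensions setting value (assumed format).
def accepted_file_types(authorized_setting, allow_pdfs_and_images:)
  wanted = allow_pdfs_and_images ? %w[txt md pdf png jpg jpeg] : %w[txt md]

  authorized =
    authorized_setting.split("|").map { |ext| ext.strip.downcase.delete_prefix(".") }

  # Assumption: a "*" entry means every extension is authorized.
  allowed = authorized.include?("*") ? wanted : wanted & authorized

  # Same shape as the original getter: ".ext,.ext,..."
  allowed.map { |ext| ".#{ext}" }.join(",")
end
```

This also shows the concern raised in the reply: with a setting that omits `txt` and `md`, the result is empty even for the plain-text case, which is why letting the plugin bypass the site-wide list is on the table.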
| end | ||
| end | ||
| puts |
trailing empty puts
| puts |
this was deliberate, but I will clean up more stuff in the eval system to make this stuff clearer.
Still thinking how to better add our self-hosted LLMs to evals, but this is looking great!
Co-authored-by: Joffrey JAFFEUX <[email protected]>
Co-authored-by: Keegan George <[email protected]>
This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:
1. LLM Model Association for RAG and Personas:
- Adds `rag_llm_model_id` to both the `ai_personas` and `ai_tools` tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds `default_llm_id` and `question_consolidator_llm_id` to `ai_personas`.
- Includes a migration (`20250210032345_migrate_persona_to_llm_model_id.rb`) to populate the new `default_llm_id` and `question_consolidator_llm_id` columns in `ai_personas` based on the existing `default_llm` and `question_consolidator_llm` string columns, and a post migration to remove the latter.
- The `AiPersona` and `AiTool` models now `belong_to` an `LlmModel` via `rag_llm_model_id`. The `LlmModel.proxy` method now accepts an `LlmModel` instance instead of just an identifier. `AiPersona` now has `default_llm_id` and `question_consolidator_llm_id` attributes.
- The serializers (`AiCustomToolSerializer`, `AiCustomToolListSerializer`, `LocalizedAiPersonaSerializer`) have been updated to include the new `rag_llm_model_id`, `default_llm_id` and `question_consolidator_llm_id` attributes.
2. PDF and Image Support for RAG:
- Adds a site setting, `ai_rag_pdf_images_enabled`, to control whether PDF and image files can be indexed for RAG. This defaults to `false`.
- `RagDocumentFragmentsController` now checks the `ai_rag_pdf_images_enabled` setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
- Adds a utility, `DiscourseAi::Utils::PdfToImages`, which uses ImageMagick (`magick`) to convert PDF pages into individual PNG images. A maximum PDF size and conversion timeout are enforced.
- A second utility, `DiscourseAi::Utils::ImageToText`, is included to handle OCR for the images and PDFs.
- The `DigestRagUpload` job now handles PDF and image uploads. It uses `PdfToImages` and `ImageToText` to extract text and create document fragments.
- The upload UI accepts PDF and image files when `ai_rag_pdf_images_enabled` is true. The UI text is adjusted to indicate supported file types.
3. Refactoring and Improvements:
- `DiscourseAi::Configuration::LlmEnumerator` now provides a `values_for_serialization` method, which returns a simplified array of LLM data (id, name, vision_enabled) suitable for use in serializers. This avoids exposing unnecessary details to the frontend.
- `AiHelper::Assistant` now takes optional `helper_llm` and `image_caption_llm` parameters in its constructor, allowing for greater flexibility.
- `DiscourseAi::Completions::Endpoints::Base` now formats raw request payloads as pretty JSON for easier auditing.
4. Testing:
/evals
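For readers who want a feel for the PDF-to-image step described under "PDF and Image Support for RAG", here is a rough sketch under stated assumptions. It is not the PR's `PdfToImages` implementation: the function names, the size limit, the timeout value, the density, and the output file pattern are all hypothetical. The only thing taken from the description is the overall shape (shell out to `magick`, enforce a maximum size and a timeout, produce one PNG per page).

```ruby
require "timeout"

# Hypothetical limits; the PR states limits exist but not their values here.
MAX_PDF_BYTES = 100 * 1024 * 1024
CONVERT_TIMEOUT_SECONDS = 120

# Builds the ImageMagick command. A %04d pattern in the output name makes
# magick write one numbered PNG per PDF page.
def pdf_to_png_command(pdf_path, out_dir, density: 150)
  ["magick", "-density", density.to_s, pdf_path, File.join(out_dir, "page-%04d.png")]
end

# Converts a PDF to PNG files and returns their paths in page order.
def convert_pdf(pdf_path, out_dir, max_bytes: MAX_PDF_BYTES, timeout: CONVERT_TIMEOUT_SECONDS)
  raise ArgumentError, "PDF exceeds #{max_bytes} bytes" if File.size(pdf_path) > max_bytes

  pid = Process.spawn(*pdf_to_png_command(pdf_path, out_dir))
  begin
    status = Timeout.timeout(timeout) { Process.wait2(pid).last }
  rescue Timeout::Error
    # Kill and reap the child so a stuck conversion does not linger.
    Process.kill("KILL", pid)
    Process.wait(pid)
    raise
  end
  raise "magick exited with status #{status.exitstatus}" unless status.success?

  Dir.glob(File.join(out_dir, "page-*.png")).sort
end
```

Spawning the process directly, rather than wrapping a blocking call in a timeout, lets the sketch kill the child when the deadline passes. The resulting PNG paths would then be handed to an OCR step such as the `ImageToText` utility the description mentions.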