This repository was archived by the owner on Jul 22, 2025. It is now read-only.

Commit 5e80f93

FEATURE: PDF support for rag pipeline (#1118)
This PR introduces several enhancements and refactorings to the AI Persona and RAG (Retrieval-Augmented Generation) functionalities within the discourse-ai plugin. Here's a breakdown of the changes:

**1. LLM Model Association for RAG and Personas:**

- **New Database Columns:** Adds `rag_llm_model_id` to both the `ai_personas` and `ai_tools` tables. This allows specifying a dedicated LLM for RAG indexing, separate from the persona's primary LLM. Adds `default_llm_id` and `question_consolidator_llm_id` to `ai_personas`.
- **Migration:** Includes a migration (`20250210032345_migrate_persona_to_llm_model_id.rb`) that populates the new `default_llm_id` and `question_consolidator_llm_id` columns in `ai_personas` from the existing `default_llm` and `question_consolidator_llm` string columns, and a post migration that removes the string columns.
- **Model Changes:** The `AiPersona` and `AiTool` models now `belongs_to` an `LlmModel` via `rag_llm_model_id`. The `LlmModel.proxy` method now accepts an `LlmModel` instance instead of just an identifier. `AiPersona` now has `default_llm_id` and `question_consolidator_llm_id` attributes.
- **UI Updates:** The AI Persona and AI Tool editors in the admin panel now allow selecting an LLM for RAG indexing (if PDF/image support is enabled). The RAG options component displays an LLM selector.
- **Serialization:** The serializers (`AiCustomToolSerializer`, `AiCustomToolListSerializer`, `LocalizedAiPersonaSerializer`) have been updated to include the new `rag_llm_model_id`, `default_llm_id` and `question_consolidator_llm_id` attributes.

**2. PDF and Image Support for RAG:**

- **Site Setting:** Introduces a new hidden site setting, `ai_rag_pdf_images_enabled`, to control whether PDF and image files can be indexed for RAG. This defaults to `false`.
- **File Upload Validation:** The `RagDocumentFragmentsController` now checks the `ai_rag_pdf_images_enabled` setting and allows PDF, PNG, JPG, and JPEG files if enabled. Error handling is included for cases where PDF/image indexing is attempted with the setting disabled.
- **PDF Processing:** Adds a new utility class, `DiscourseAi::Utils::PdfToImages`, which uses ImageMagick (`magick`) to convert PDF pages into individual PNG images. A maximum PDF size and conversion timeout are enforced.
- **Image Processing:** A new utility class, `DiscourseAi::Utils::ImageToText`, is included to handle OCR for the images and PDFs.
- **RAG Digestion Job:** The `DigestRagUpload` job now handles PDF and image uploads. It uses `PdfToImages` and `ImageToText` to extract text and create document fragments.
- **UI Updates:** The RAG uploader component now accepts PDF and image file types if `ai_rag_pdf_images_enabled` is true. The UI text is adjusted to indicate supported file types.

**3. Refactoring and Improvements:**

- **LLM Enumeration:** The `DiscourseAi::Configuration::LlmEnumerator` now provides a `values_for_serialization` method, which returns a simplified array of LLM data (id, name, vision_enabled) suitable for use in serializers. This avoids exposing unnecessary details to the frontend.
- **AI Helper:** The `AiHelper::Assistant` now takes optional `helper_llm` and `image_caption_llm` parameters in its constructor, allowing for greater flexibility.
- **Bot and Persona Updates:** Several updates were made across the codebase, replacing the string-based association to an LLM with the new model-based one.
- **Audit Logs:** The `DiscourseAi::Completions::Endpoints::Base` now formats raw request payloads as pretty JSON for easier auditing.
- **Eval Script:** An evaluation script is included.

**4. Testing:**

- The PR introduces a new eval system for LLMs, which allows us to test how functionality works across various LLM providers. It lives in `/evals`.
1 parent e2afbc2 commit 5e80f93
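
The commit message says `PdfToImages` shells out to ImageMagick and enforces a maximum PDF size and a conversion timeout, but that class is not shown in the visible part of the diff. The following is only a minimal sketch of what such a guard could look like. The constant values, the `check_pdf_size!` and `magick_command` helpers, the `PdfTooLarge` error, and the chosen `-density` value are all assumptions made for illustration, not the plugin's actual code.

```ruby
# Hypothetical limits: the commit only states that a maximum size and a
# timeout exist, not what their values are.
MAX_PDF_BYTES = 100 * 1024 * 1024
CONVERT_TIMEOUT_SECONDS = 120

class PdfTooLarge < StandardError
end

# Reject oversized PDFs before any conversion work starts.
def check_pdf_size!(bytes, max_bytes: MAX_PDF_BYTES)
  raise PdfTooLarge, "PDF is #{bytes} bytes, limit is #{max_bytes}" if bytes > max_bytes
  bytes
end

# Build the argv for an ImageMagick call that renders each PDF page to a
# numbered PNG file. Returning an array (not a string) avoids shell quoting.
def magick_command(pdf_path, output_dir, density: 150)
  ["magick", "-density", density.to_s, pdf_path, File.join(output_dir, "page-%04d.png")]
end
```

The argv could then be handed to something like `Open3.capture3` wrapped in a timeout of `CONVERT_TIMEOUT_SECONDS`; building it as an array keeps user-controlled file names out of a shell string.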

54 files changed: +1329 -141 lines changed


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -2,3 +2,5 @@ node_modules
 /gems
 /auto_generated
 .env
+evals/log
+evals/cases

admin/assets/javascripts/discourse/routes/admin-plugins-show-discourse-ai-tools-edit.js

Lines changed: 2 additions & 0 deletions
@@ -14,5 +14,7 @@ export default class DiscourseAiToolsEditRoute extends DiscourseRoute {
 
     controller.set("allTools", toolsModel);
     controller.set("presets", toolsModel.resultSetMeta.presets);
+    controller.set("llms", toolsModel.resultSetMeta.llms);
+    controller.set("settings", toolsModel.resultSetMeta.settings);
   }
 }

admin/assets/javascripts/discourse/routes/admin-plugins-show-discourse-ai-tools-new.js

Lines changed: 2 additions & 0 deletions
@@ -11,5 +11,7 @@ export default class DiscourseAiToolsNewRoute extends DiscourseRoute {
 
     controller.set("allTools", toolsModel);
     controller.set("presets", toolsModel.resultSetMeta.presets);
+    controller.set("llms", toolsModel.resultSetMeta.llms);
+    controller.set("settings", toolsModel.resultSetMeta.settings);
   }
 }

admin/assets/javascripts/discourse/templates/admin-plugins/show/discourse-ai-tools/edit.hbs

Lines changed: 2 additions & 0 deletions
@@ -3,5 +3,7 @@
     @tools={{this.allTools}}
     @model={{this.model}}
     @presets={{this.presets}}
+    @llms={{this.llms}}
+    @settings={{this.settings}}
   />
 </section>

admin/assets/javascripts/discourse/templates/admin-plugins/show/discourse-ai-tools/new.hbs

Lines changed: 2 additions & 0 deletions
@@ -3,5 +3,7 @@
     @tools={{this.allTools}}
     @model={{this.model}}
     @presets={{this.presets}}
+    @llms={{this.llms}}
+    @settings={{this.settings}}
   />
 </section>

app/controllers/discourse_ai/admin/ai_personas_controller.rb

Lines changed: 16 additions & 6 deletions
@@ -32,10 +32,19 @@ def index
           }
         end
         llms =
-          DiscourseAi::Configuration::LlmEnumerator
-            .values(allowed_seeded_llms: SiteSetting.ai_bot_allowed_seeded_models)
-            .map { |hash| { id: hash[:value], name: hash[:name] } }
-        render json: { ai_personas: ai_personas, meta: { tools: tools, llms: llms } }
+          DiscourseAi::Configuration::LlmEnumerator.values_for_serialization(
+            allowed_seeded_llm_ids: SiteSetting.ai_bot_allowed_seeded_models_map,
+          )
+        render json: {
+          ai_personas: ai_personas,
+          meta: {
+            tools: tools,
+            llms: llms,
+            settings: {
+              rag_pdf_images_enabled: SiteSetting.ai_rag_pdf_images_enabled,
+            },
+          },
+        }
       end
 
       def new
@@ -187,15 +196,16 @@ def ai_persona_params
             :priority,
             :top_p,
             :temperature,
-            :default_llm,
+            :default_llm_id,
             :user_id,
             :max_context_posts,
             :vision_enabled,
             :vision_max_pixels,
             :rag_chunk_tokens,
             :rag_chunk_overlap_tokens,
             :rag_conversation_chunks,
-            :question_consolidator_llm,
+            :rag_llm_model_id,
+            :question_consolidator_llm_id,
             :allow_chat_channel_mentions,
             :allow_chat_direct_messages,
             :allow_topic_mentions,
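
The controller now delegates to `LlmEnumerator.values_for_serialization`, which the commit message describes as returning only `id`, `name` and `vision_enabled` for each LLM. Its body is not part of the visible diff, so the following is a rough sketch of the idea under stated assumptions: `LlmRecord` is a hypothetical stand-in for the real model, and the rule for filtering seeded models (drop them unless their id is in the allowed list) is a guess at the intent, not confirmed behavior.

```ruby
# Hypothetical stand-in for an LLM record; the real model has more fields.
LlmRecord = Struct.new(:id, :display_name, :vision_enabled, :seeded, :api_key, keyword_init: true)

# Return only the fields a frontend selector needs, so secrets such as
# api_key never reach the serialized payload.
def values_for_serialization(models, allowed_seeded_llm_ids: nil)
  allowed = Array(allowed_seeded_llm_ids).map(&:to_s)
  models
    .reject { |m| m.seeded && !allowed.include?(m.id.to_s) } # assumed filtering rule
    .map { |m| { id: m.id, name: m.display_name, vision_enabled: m.vision_enabled } }
end
```

Returning plain hashes with a fixed key set makes it hard to leak a new sensitive column by accident, since every exposed field has to be listed explicitly.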

app/controllers/discourse_ai/admin/ai_tools_controller.rb

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ def ai_tool_params
             :summary,
             :rag_chunk_tokens,
             :rag_chunk_overlap_tokens,
+            :rag_llm_model_id,
             rag_uploads: [:id],
             parameters: [:name, :type, :description, :required, enum: []],
           )

app/controllers/discourse_ai/admin/rag_document_fragments_controller.rb

Lines changed: 1 addition & 0 deletions
@@ -49,6 +49,7 @@ def upload_file
       def validate_extension!(filename)
         extension = File.extname(filename)[1..-1] || ""
         authorized_extensions = %w[txt md]
+        authorized_extensions.concat(%w[pdf png jpg jpeg]) if SiteSetting.ai_rag_pdf_images_enabled
         if !authorized_extensions.include?(extension)
           raise Discourse::InvalidParameters.new(
                   I18n.t(
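
The extension check in the hunk above can be exercised outside Rails. This is a small standalone sketch of the same logic, with the site setting replaced by a `pdf_images_enabled:` keyword argument; the constant and method names are invented for the example.

```ruby
BASE_EXTENSIONS = %w[txt md].freeze
RICH_EXTENSIONS = %w[pdf png jpg jpeg].freeze

# Mirror of the controller logic: text formats are always allowed, PDF and
# image formats only when the feature flag is on.
def authorized_extensions(pdf_images_enabled:)
  extensions = BASE_EXTENSIONS.dup
  extensions.concat(RICH_EXTENSIONS) if pdf_images_enabled
  extensions
end

def extension_allowed?(filename, pdf_images_enabled:)
  # File.extname returns "" for names without an extension, and ""[1..-1]
  # is nil, hence the fallback to an empty string.
  extension = File.extname(filename)[1..-1] || ""
  authorized_extensions(pdf_images_enabled: pdf_images_enabled).include?(extension)
end
```

Note that the comparison is case-sensitive, so in this sketch a file named `scan.PDF` would be rejected even with the flag enabled.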

app/jobs/regular/digest_rag_upload.rb

Lines changed: 33 additions & 2 deletions
@@ -28,7 +28,7 @@ def execute(args)
 
       # Check if this is the first time we process this upload.
       if fragment_ids.empty?
-        document = get_uploaded_file(upload)
+        document = get_uploaded_file(upload: upload, target: target)
         return if document.nil?
 
         RagDocumentFragment.publish_status(upload, { total: 0, indexed: 0, left: 0 })
@@ -163,7 +163,38 @@ def first_chunk(text, chunk_tokens:, tokenizer:, splitters: ["\n\n", "\n", ".",
       [buffer, split_char]
     end
 
-    def get_uploaded_file(upload)
+    def get_uploaded_file(upload:, target:)
+      if %w[pdf png jpg jpeg].include?(upload.extension) && !SiteSetting.ai_rag_pdf_images_enabled
+        raise Discourse::InvalidAccess.new(
+                "The setting ai_rag_pdf_images_enabled is false, can not index images and pdfs.",
+              )
+      end
+      if upload.extension == "pdf"
+        pages =
+          DiscourseAi::Utils::PdfToImages.new(
+            upload: upload,
+            user: Discourse.system_user,
+          ).uploaded_pages
+
+        return(
+          DiscourseAi::Utils::ImageToText.as_fake_file(
+            uploads: pages,
+            llm_model: target.rag_llm_model,
+            user: Discourse.system_user,
+          )
+        )
+      end
+
+      if %w[png jpg jpeg].include?(upload.extension)
+        return(
+          DiscourseAi::Utils::ImageToText.as_fake_file(
+            uploads: [upload],
+            llm_model: target.rag_llm_model,
+            user: Discourse.system_user,
+          )
+        )
+      end
+
       store = Discourse.store
       @file ||=
         if store.external?
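
The new `get_uploaded_file` is essentially a three-way dispatch on the upload's extension, guarded by the site setting. The sketch below isolates that decision as a pure function so the branches are easy to see and test. The route symbols, the `IndexingDisabled` error and the `extraction_route` name are made up for the example; the real job raises `Discourse::InvalidAccess` and calls `PdfToImages` and `ImageToText` directly.

```ruby
IMAGE_EXTENSIONS = %w[png jpg jpeg].freeze

class IndexingDisabled < StandardError
end

# Decide how an upload's text would be extracted, mirroring the order of
# checks in the job: feature flag first, then PDF, then images, then the
# existing plain-text path.
def extraction_route(extension, pdf_images_enabled:)
  rich = extension == "pdf" || IMAGE_EXTENSIONS.include?(extension)
  raise IndexingDisabled, "PDF/image indexing is disabled" if rich && !pdf_images_enabled

  return :pdf_pages_then_image_to_text if extension == "pdf"
  return :image_to_text if IMAGE_EXTENSIONS.include?(extension)

  :plain_text
end
```

Checking the flag before branching means a PDF or image that was uploaded while the setting was on cannot be indexed after the setting is turned off.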

app/models/ai_persona.rb

Lines changed: 46 additions & 42 deletions
@@ -1,8 +1,8 @@
 # frozen_string_literal: true
 
 class AiPersona < ActiveRecord::Base
-  # TODO remove this line 01-1-2025
-  self.ignored_columns = %i[commands allow_chat mentionable]
+  # TODO remove this line 01-10-2025
+  self.ignored_columns = %i[default_llm question_consolidator_llm]
 
   # places a hard limit, so per site we cache a maximum of 500 classes
   MAX_PERSONAS_PER_SITE = 500
@@ -12,7 +12,7 @@ class AiPersona < ActiveRecord::Base
   validates :system_prompt, presence: true, length: { maximum: 10_000_000 }
   validate :system_persona_unchangeable, on: :update, if: :system
   validate :chat_preconditions
-  validate :allowed_seeded_model, if: :default_llm
+  validate :allowed_seeded_model, if: :default_llm_id
   validates :max_context_posts, numericality: { greater_than: 0 }, allow_nil: true
   # leaves some room for growth but sets a maximum to avoid memory issues
   # we may want to revisit this in the future
@@ -30,6 +30,10 @@ class AiPersona < ActiveRecord::Base
   belongs_to :created_by, class_name: "User"
   belongs_to :user
 
+  belongs_to :default_llm, class_name: "LlmModel"
+  belongs_to :question_consolidator_llm, class_name: "LlmModel"
+  belongs_to :rag_llm_model, class_name: "LlmModel"
+
   has_many :upload_references, as: :target, dependent: :destroy
   has_many :uploads, through: :upload_references
 
@@ -62,7 +66,7 @@ def self.persona_users(user: nil)
             user_id: persona.user_id,
             username: persona.user.username_lower,
             allowed_group_ids: persona.allowed_group_ids,
-            default_llm: persona.default_llm,
+            default_llm_id: persona.default_llm_id,
             force_default_llm: persona.force_default_llm,
             allow_chat_channel_mentions: persona.allow_chat_channel_mentions,
             allow_chat_direct_messages: persona.allow_chat_direct_messages,
@@ -157,12 +161,12 @@ def class_instance
       user_id
       system
       mentionable
-      default_llm
+      default_llm_id
       max_context_posts
       vision_enabled
       vision_max_pixels
       rag_conversation_chunks
-      question_consolidator_llm
+      question_consolidator_llm_id
       allow_chat_channel_mentions
       allow_chat_direct_messages
       allow_topic_mentions
@@ -302,7 +306,7 @@ def chat_preconditions
     if (
          allow_chat_channel_mentions || allow_chat_direct_messages || allow_topic_mentions ||
            force_default_llm
-       ) && !default_llm
+       ) && !default_llm_id
       errors.add(:default_llm, I18n.t("discourse_ai.ai_bot.personas.default_llm_required"))
     end
   end
@@ -332,13 +336,12 @@ def ensure_not_system
   end
 
   def allowed_seeded_model
-    return if default_llm.blank?
+    return if default_llm_id.blank?
 
-    llm = LlmModel.find_by(id: default_llm.split(":").last.to_i)
-    return if llm.nil?
-    return if !llm.seeded?
+    return if default_llm.nil?
+    return if !default_llm.seeded?
 
-    return if SiteSetting.ai_bot_allowed_seeded_models.include?(llm.id.to_s)
+    return if SiteSetting.ai_bot_allowed_seeded_models_map.include?(default_llm.id.to_s)
 
     errors.add(:default_llm, I18n.t("discourse_ai.llm.configuration.invalid_seeded_model"))
   end
@@ -348,36 +351,37 @@ def allowed_seeded_model
 #
 # Table name: ai_personas
 #
-#  id                          :bigint           not null, primary key
-#  name                        :string(100)      not null
-#  description                 :string(2000)     not null
-#  system_prompt               :string(10000000) not null
-#  allowed_group_ids           :integer          default([]), not null, is an Array
-#  created_by_id               :integer
-#  enabled                     :boolean          default(TRUE), not null
-#  created_at                  :datetime         not null
-#  updated_at                  :datetime         not null
-#  system                      :boolean          default(FALSE), not null
-#  priority                    :boolean          default(FALSE), not null
-#  temperature                 :float
-#  top_p                       :float
-#  user_id                     :integer
-#  default_llm                 :text
-#  max_context_posts           :integer
-#  vision_enabled              :boolean          default(FALSE), not null
-#  vision_max_pixels           :integer          default(1048576), not null
-#  rag_chunk_tokens            :integer          default(374), not null
-#  rag_chunk_overlap_tokens    :integer          default(10), not null
-#  rag_conversation_chunks     :integer          default(10), not null
-#  question_consolidator_llm   :text
-#  tool_details                :boolean          default(TRUE), not null
-#  tools                       :json             not null
-#  forced_tool_count           :integer          default(-1), not null
-#  allow_chat_channel_mentions :boolean          default(FALSE), not null
-#  allow_chat_direct_messages  :boolean          default(FALSE), not null
-#  allow_topic_mentions        :boolean          default(FALSE), not null
-#  allow_personal_messages     :boolean          default(TRUE), not null
-#  force_default_llm           :boolean          default(FALSE), not null
+#  id                           :bigint           not null, primary key
+#  name                         :string(100)      not null
+#  description                  :string(2000)     not null
+#  system_prompt                :string(10000000) not null
+#  allowed_group_ids            :integer          default([]), not null, is an Array
+#  created_by_id                :integer
+#  enabled                      :boolean          default(TRUE), not null
+#  created_at                   :datetime         not null
+#  updated_at                   :datetime         not null
+#  system                       :boolean          default(FALSE), not null
+#  priority                     :boolean          default(FALSE), not null
+#  temperature                  :float
+#  top_p                        :float
+#  user_id                      :integer
+#  max_context_posts            :integer
+#  vision_enabled               :boolean          default(FALSE), not null
+#  vision_max_pixels            :integer          default(1048576), not null
+#  rag_chunk_tokens             :integer          default(374), not null
+#  rag_chunk_overlap_tokens     :integer          default(10), not null
+#  rag_conversation_chunks      :integer          default(10), not null
+#  tool_details                 :boolean          default(TRUE), not null
+#  tools                        :json             not null
+#  forced_tool_count            :integer          default(-1), not null
+#  allow_chat_channel_mentions  :boolean          default(FALSE), not null
+#  allow_chat_direct_messages   :boolean          default(FALSE), not null
+#  allow_topic_mentions         :boolean          default(FALSE), not null
+#  allow_personal_messages      :boolean          default(TRUE), not null
+#  force_default_llm            :boolean          default(FALSE), not null
+#  rag_llm_model_id             :bigint
+#  default_llm_id               :bigint
+#  question_consolidator_llm_id :bigint
 #
 # Indexes
 #
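
The removed validation code parsed the old string column with `default_llm.split(":").last.to_i`, which suggests the legacy value ended in `:<id>`. The data migration named in the commit message is not shown here, so the following is only a sketch of how such a value could be mapped to the new integer column; `legacy_llm_to_id` is a hypothetical helper, and treating unparseable values as `nil` is an assumption.

```ruby
# Convert a legacy string identifier whose last ":"-separated segment is a
# numeric id (for example "custom:12") into that integer id. Returns nil
# when nothing usable is present.
def legacy_llm_to_id(value)
  return nil if value.nil? || value.strip.empty?

  id = value.split(":").last.to_i # non-numeric text converts to 0
  id.zero? ? nil : id
end
```

Mapping unparseable strings to `nil` rather than `0` keeps the new foreign key from pointing at a record that cannot exist.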
