This repository was archived by the owner on Jul 22, 2025. It is now read-only.

Commit 3e74eea

FEATURE: add context and llm controls to researcher, fix username filter (#1401)
- Adds context length controls to researcher (max tokens per post and batch)
- Allow picking LLM for researcher
- Fix bug where unicode usernames were not working
- Fix documentation of OR logic
1 parent 4f980d5 commit 3e74eea

5 files changed: +131 -32 lines changed

config/locales/server.en.yml

Lines changed: 9 additions & 0 deletions
@@ -344,6 +344,15 @@ en:
 searching: "Searching for: '%{query}'"
 tool_options:
   researcher:
+    researcher_llm:
+      name: "LLM"
+      description: "Language model to use for research (default to current persona's LLM)"
+    max_tokens_per_batch:
+      name: "Maximum tokens per batch"
+      description: "Maximum number of tokens to use for each batch in the research"
+    max_tokens_per_post:
+      name: "Maximum tokens per post"
+      description: "Maximum number of tokens to use for each post in the research"
     max_results:
       name: "Maximum number of results"
       description: "Maximum number of results to include in a filter"

lib/personas/tools/researcher.rb

Lines changed: 52 additions & 25 deletions
@@ -31,26 +31,28 @@ def signature

 def filter_description
   <<~TEXT
-    Filter string to target specific content.
-    - Supports user (@username)
-    - post_type:first - only includes first posts in topics
-    - post_type:reply - only replies in topics
-    - date ranges (after:YYYY-MM-DD, before:YYYY-MM-DD for posts; topic_after:YYYY-MM-DD, topic_before:YYYY-MM-DD for topics)
-    - categories (category:category1,category2 or categories:category1,category2)
-    - tags (tag:tag1,tag2 or tags:tag1,tag2)
-    - groups (group:group1,group2 or groups:group1,group2)
-    - status (status:open, status:closed, status:archived, status:noreplies, status:single_user)
-    - keywords (keywords:keyword1,keyword2) - searches for specific words within post content using full-text search
-    - topic_keywords (topic_keywords:keyword1,keyword2) - searches for keywords within topics, returns all posts from matching topics
-    - topics (topic:topic_id1,topic_id2 or topics:topic_id1,topic_id2) - target specific topics by ID
-    - max_results (max_results:10) - limits the maximum number of results returned (optional)
-    - order (order:latest, order:oldest, order:latest_topic, order:oldest_topic, order:likes) - controls result ordering (optional, defaults to latest posts)
-
-    Multiple filters can be combined with spaces for AND logic. Example: '@sam after:2023-01-01 tag:feature'
-
-    Use OR to combine filter segments for inclusive logic.
-    Example: 'category:feature,bug OR tag:feature-tag' - includes posts in feature OR bug categories, OR posts with feature-tag tag
-    Example: '@sam category:bug' - includes posts by @sam AND in bug category
+    Filter string to target specific content. Space-separated filters use AND logic, OR creates separate filter groups.
+
+    **Filters:**
+    - username:user1 or usernames:user1,user2 - posts by specific users
+    - group:group1 or groups:group1,group2 - posts by users in specific groups
+    - post_type:first|reply - first posts only or replies only
+    - keywords:word1,word2 - full-text search in post content
+    - topic_keywords:word1,word2 - full-text search in topics (returns all posts from matching topics)
+    - topic:123 or topics:123,456 - specific topics by ID
+    - category:name1 or categories:name1,name2 - posts in categories (by name/slug)
+    - tag:tag1 or tags:tag1,tag2 - posts in topics with tags
+    - after:YYYY-MM-DD, before:YYYY-MM-DD - filter by post creation date
+    - topic_after:YYYY-MM-DD, topic_before:YYYY-MM-DD - filter by topic creation date
+    - status:open|closed|archived|noreplies|single_user - topic status filters
+    - max_results:N - limit results (per OR group)
+    - order:latest|oldest|latest_topic|oldest_topic|likes - sort order
+
+    **OR Logic:** Each OR group processes independently - filters don't cross boundaries.
+
+    Examples:
+    - 'username:sam after:2023-01-01' - sam's posts after date
+    - 'max_results:50 category:bugs OR tag:urgent' - (≤50 bug posts) OR (all urgent posts)
   TEXT
 end
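
The grouping rule in the new description (spaces mean AND, OR starts a separate group) can be illustrated with a small standalone sketch. This is not the plugin's parser: `split_filter_groups` is a hypothetical helper, and it assumes OR always appears as an uppercase, whitespace-delimited token.

```ruby
# Hypothetical helper, not part of discourse-ai: splits a filter string
# into OR groups, each holding the space-separated (AND-ed) filters.
def split_filter_groups(filter)
  filter
    .split(/\s+OR\s+/) # assumption: OR is an uppercase, whitespace-delimited token
    .map { |group| group.split(/\s+/).reject(&:empty?) }
    .reject(&:empty?)
end

groups = split_filter_groups("max_results:50 category:bugs OR tag:urgent")
groups.each_with_index do |filters, i|
  puts "group #{i + 1}: #{filters.join(" AND ")}"
end
# group 1: max_results:50 AND category:bugs
# group 2: tag:urgent
```

Under this reading, `max_results:50` belongs only to the first group, which matches the "(≤50 bug posts) OR (all urgent posts)" example in the description.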

@@ -60,9 +62,11 @@ def name

   def accepted_options
     [
+      option(:researcher_llm, type: :llm),
       option(:max_results, type: :integer),
       option(:include_private, type: :boolean),
       option(:max_tokens_per_post, type: :integer),
+      option(:max_tokens_per_batch, type: :integer),
     ]
   end
 end
@@ -134,17 +138,32 @@ def description_args
 protected

 MIN_TOKENS_FOR_RESEARCH = 8000
+MIN_TOKENS_FOR_POST = 50
+
 def process_filter(filter, goals, post, &blk)
-  if llm.max_prompt_tokens < MIN_TOKENS_FOR_RESEARCH
+  if researcher_llm.max_prompt_tokens < MIN_TOKENS_FOR_RESEARCH
     raise ArgumentError,
           "LLM max tokens too low for research. Minimum is #{MIN_TOKENS_FOR_RESEARCH}."
   end
+
+  max_tokens_per_batch = options[:max_tokens_per_batch].to_i
+  if max_tokens_per_batch <= MIN_TOKENS_FOR_RESEARCH
+    max_tokens_per_batch = researcher_llm.max_prompt_tokens - 2000
+  end
+
+  max_tokens_per_post = options[:max_tokens_per_post]
+  if max_tokens_per_post.nil?
+    max_tokens_per_post = 2000
+  elsif max_tokens_per_post < MIN_TOKENS_FOR_POST
+    max_tokens_per_post = MIN_TOKENS_FOR_POST
+  end
+
   formatter =
     DiscourseAi::Utils::Research::LlmFormatter.new(
       filter,
-      max_tokens_per_batch: llm.max_prompt_tokens - 2000,
-      tokenizer: llm.tokenizer,
-      max_tokens_per_post: options[:max_tokens_per_post] || 2000,
+      max_tokens_per_batch: max_tokens_per_batch,
+      tokenizer: researcher_llm.tokenizer,
+      max_tokens_per_post: max_tokens_per_post,
     )

   results = []
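
The limit handling added to `process_filter` can be tried in isolation. The sketch below is a hypothetical standalone function (`resolve_token_limits` is not in the plugin) that mirrors the branches shown in the hunk; `max_prompt_tokens` stands in for `researcher_llm.max_prompt_tokens`, and the 32,000 figure is an arbitrary example value.

```ruby
MIN_TOKENS_FOR_RESEARCH = 8000
MIN_TOKENS_FOR_POST = 50

# Hypothetical standalone version of the limit handling in process_filter.
# `max_prompt_tokens` stands in for researcher_llm.max_prompt_tokens.
def resolve_token_limits(options, max_prompt_tokens)
  per_batch = options[:max_tokens_per_batch].to_i # nil.to_i is 0
  per_batch = max_prompt_tokens - 2000 if per_batch <= MIN_TOKENS_FOR_RESEARCH

  per_post = options[:max_tokens_per_post]
  per_post = per_post.nil? ? 2000 : [per_post, MIN_TOKENS_FOR_POST].max

  { max_tokens_per_batch: per_batch, max_tokens_per_post: per_post }
end

p resolve_token_limits({}, 32_000)
p resolve_token_limits({ max_tokens_per_batch: 8000, max_tokens_per_post: 10 }, 32_000)
p resolve_token_limits({ max_tokens_per_batch: 12_000, max_tokens_per_post: 500 }, 32_000)
```

Because the comparison is `<=`, a configured batch size of exactly 8000 is treated like an unset value and falls back to the model's prompt limit minus 2000. A per-post value below 50 is raised to 50.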
@@ -164,6 +183,14 @@ def process_filter(filter, goals, post, &blk)
   end
 end

+def researcher_llm
+  @researcher_llm ||=
+    (
+      options[:researcher_llm].present? &&
+        LlmModel.find_by(id: options[:researcher_llm].to_i)&.to_llm
+    ) || self.llm
+end
+
 def run_inference(chunk_text, goals, post, &blk)
   return if context.cancel_manager&.cancelled?
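
The fallback in `researcher_llm` (use the configured model if the option is set and resolves, otherwise the persona's own LLM) can be sketched in plain Ruby. Everything here is a stand-in: `FAKE_MODELS` replaces the `LlmModel` lookup, `pick_llm` is a hypothetical name, and strings stand in for LLM objects.

```ruby
# Stand-in registry; in the plugin the lookup is LlmModel.find_by(id: ...)&.to_llm.
FAKE_MODELS = { 1 => "persona-default-llm", 2 => "secondary-llm" }.freeze

# Hypothetical plain-Ruby version of the fallback: use the configured model
# when the option is set and resolves, otherwise the persona's own LLM.
def pick_llm(option_value, persona_llm)
  configured = option_value.to_s.strip
  found = configured.empty? ? nil : FAKE_MODELS[configured.to_i]
  found || persona_llm
end

puts pick_llm(2, "persona-default-llm")     # secondary-llm
puts pick_llm(nil, "persona-default-llm")   # persona-default-llm
puts pick_llm("999", "persona-default-llm") # persona-default-llm
```

An unknown id falls through to the persona's LLM rather than raising, which is the same outcome the `&.to_llm` and `||` chain in the hunk produces.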

@@ -179,7 +206,7 @@ def run_inference(chunk_text, goals, post, &blk)
   )

 results = []
-llm.generate(
+researcher_llm.generate(
   prompt,
   user: post.user,
   feature_name: context.feature_name,

lib/utils/research/filter.rb

Lines changed: 5 additions & 5 deletions
@@ -153,12 +153,12 @@ def self.word_to_date(str)
   end
 end

-register_filter(/\A\@(\w+)\z/i) do |relation, username, filter|
-  user = User.find_by(username_lower: username.downcase)
-  if user
-    relation.where("posts.user_id = ?", user.id)
+register_filter(/\Ausernames?:(.+)\z/i) do |relation, username, filter|
+  user_ids = User.where(username_lower: username.split(",").map(&:downcase)).pluck(:id)
+  if user_ids.empty?
+    relation.where("1 = 0")
   else
-    relation.where("1 = 0") # No results if user doesn't exist
+    relation.where("posts.user_id IN (?)", user_ids)
   end
 end
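
A likely reason the old `@username` pattern missed unicode usernames is that Ruby's `\w` matches only `[a-zA-Z0-9_]`. The sketch below compares the two patterns from the hunk outside of Rails; the constant names are introduced here for illustration, and the sample usernames are the ones used in the specs.

```ruby
OLD_PATTERN = /\A\@(\w+)\z/i         # removed in this commit
NEW_PATTERN = /\Ausernames?:(.+)\z/i # added in this commit

# In Ruby, \w matches only [a-zA-Z0-9_], so a non-ASCII username fails the old pattern.
puts OLD_PATTERN.match?("@sam")         # true
puts OLD_PATTERN.match?("@aאb")         # false
puts NEW_PATTERN.match?("username:aאb") # true

# The captured value is split on commas and lowercased before the user lookup.
names = NEW_PATTERN.match("usernames:Sam,aאb")[1].split(",").map(&:downcase)
p names
```

The new pattern accepts any non-empty value after the colon and leaves validation to the database lookup, which returns no rows for unknown names.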

spec/lib/personas/tools/researcher_spec.rb

Lines changed: 50 additions & 2 deletions
@@ -21,6 +21,54 @@

   before { SiteSetting.ai_bot_enabled = true }

+  it "uses custom researcher_llm and applies token limits correctly" do
+    # Create a second LLM model to test the researcher_llm option
+    secondary_llm_model = Fabricate(:llm_model, name: "secondary_model")
+
+    # Create test content with long text to test token truncation
+    topic = Fabricate(:topic, category: category, tags: [tag_research])
+    long_content = "zz " * 100 # This will exceed our token limit
+    _test_post =
+      Fabricate(:post, topic: topic, raw: long_content, user: user, skip_validation: true)
+
+    prompts = nil
+    responses = [["Research completed"]]
+    researcher = nil
+
+    DiscourseAi::Completions::Llm.with_prepared_responses(
+      responses,
+      llm: secondary_llm_model,
+    ) do |_, _, _prompts|
+      researcher =
+        described_class.new(
+          { filter: "category:research-category", goals: "analyze test content", dry_run: false },
+          persona_options: {
+            "researcher_llm" => secondary_llm_model.id,
+            "max_tokens_per_post" => 50, # Very small to force truncation
+            "max_tokens_per_batch" => 8000,
+          },
+          bot_user: bot_user,
+          llm: nil,
+          context: DiscourseAi::Personas::BotContext.new(user: user, post: post),
+        )
+
+      results = researcher.invoke(&progress_blk)
+
+      expect(results[:dry_run]).to eq(false)
+      expect(results[:results]).to be_present
+
+      prompts = _prompts
+    end
+
+    expect(prompts).to be_present
+
+    user_message = prompts.first.messages.find { |m| m[:type] == :user }
+    expect(user_message[:content]).to be_present
+
+    # count how many times the "zz " appears in the content (a bit of token magic, we lose a couple cause we redact)
+    expect(user_message[:content].scan("zz ").count).to eq(48)
+  end
+
   describe "#invoke" do
     it "can correctly filter to a topic id" do
       researcher =
@@ -104,7 +152,7 @@
       researcher =
         described_class.new(
           {
-            filter: "category:research-category @#{user.username}",
+            filter: "category:research-category username:#{user.username}",
             goals: "find relevant content",
             dry_run: false,
           },
@@ -129,7 +177,7 @@

       expect(results[:dry_run]).to eq(false)
       expect(results[:goals]).to eq("find relevant content")
-      expect(results[:filter]).to eq("category:research-category @#{user.username}")
+      expect(results[:filter]).to eq("category:research-category username:#{user.username}")
       expect(results[:results].first).to include("Found: Relevant content 1")
     end
   end

spec/lib/utils/research/filter_spec.rb

Lines changed: 15 additions & 0 deletions
@@ -144,6 +144,21 @@
     end
   end

+  describe "can find posts by users even with unicode usernames" do
+    before { SiteSetting.unicode_usernames = true }
+    let!(:unicode_user) { Fabricate(:user, username: "aאb") }
+
+    it "can filter by unicode usernames" do
+      post = Fabricate(:post, user: unicode_user, topic: feature_topic)
+      filter = described_class.new("username:aאb")
+      expect(filter.search.pluck(:id)).to contain_exactly(post.id)
+
+      filter = described_class.new("usernames:aאb,#{user.username}")
+      posts_ids = Post.where(user_id: [unicode_user.id, user.id]).pluck(:id)
+      expect(filter.search.pluck(:id)).to contain_exactly(*posts_ids)
+    end
+  end
+
   describe "category filtering" do
     it "correctly filters posts by categories" do
       filter = described_class.new("category:Announcements")
