Skip to content
This repository was archived by the owner on Jul 22, 2025. It is now read-only.

Commit 46fcdb6

Browse files
authored
FIX: Make summaries backfill job more resilient. (#1071)
To quickly select backfill candidates without comparing SHAs, we compare the last summarized post to the topic's highest_post_number. However, hiding or deleting a post and adding a small action will update this column, causing the job to stall and re-generate the same summary repeatedly until someone posts a regular reply. On top of this, this is not always true for topics with `best_replies`, as this last reply isn't necessarily included. Since this is not evident at first glance and each summarization strategy picks its targets differently, I'm opting to simplify the backfill logic and how we track potential candidates. The first step is dropping `content_range`, which serves no purpose and it's there because summary caching was supposed to work differently at the beginning. So instead, I'm replacing it with a column called `highest_target_number`, which tracks `highest_post_number` for topics and could track other things like channel's `message_count` in the future. Now that we have this column when selecting every potential backfill candidate, we'll check if the summary is truly outdated by comparing the SHAs, and if it's not, we just update the column and move on
1 parent f9aa2de commit 46fcdb6

File tree

16 files changed

+150
-117
lines changed

16 files changed

+150
-117
lines changed

app/jobs/regular/fast_track_topic_gist.rb

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ def execute(args)
1616
gist = summarizer.existing_summary
1717
return if gist.present? && (!gist.outdated || gist.created_at >= 5.minutes.ago)
1818

19-
summarizer.force_summarize(Discourse.system_user)
19+
summarizer.summarize(Discourse.system_user)
2020
end
2121
end
2222
end

app/jobs/scheduled/summaries_backfill.rb

Lines changed: 21 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -18,18 +18,34 @@ def execute(_args)
1818
backfill_candidates(gist_t)
1919
.limit(current_budget(gist_t))
2020
.each do |topic|
21-
DiscourseAi::Summarization.topic_gist(topic).force_summarize(system_user)
21+
strategy = DiscourseAi::Summarization.topic_gist(topic)
22+
try_summarize(strategy, system_user, topic)
2223
end
2324
end
2425

2526
complete_t = AiSummary.summary_types[:complete]
2627
backfill_candidates(complete_t)
2728
.limit(current_budget(complete_t))
2829
.each do |topic|
29-
DiscourseAi::Summarization.topic_summary(topic).force_summarize(system_user)
30+
strategy = DiscourseAi::Summarization.topic_summary(topic)
31+
try_summarize(strategy, system_user, topic)
3032
end
3133
end
3234

35+
def try_summarize(strategy, user, topic)
36+
existing_summary = strategy.existing_summary
37+
38+
if existing_summary.blank? || existing_summary.outdated
39+
strategy.summarize(user)
40+
else
41+
# Hiding or deleting a post, and creating a small action alters the Topic#highest_post_number.
42+
# We use this as a quick way to select potential backfill candidates without relying on original_content_sha.
43+
# At this point, we are confident the summary doesn't need to be regenerated so something other than a regular reply
44+
# caused the number to change in the topic.
45+
existing_summary.update!(highest_target_number: topic.highest_post_number)
46+
end
47+
end
48+
3349
def backfill_candidates(summary_type)
3450
max_age_days = SiteSetting.ai_summary_backfill_topic_max_age_days
3551

@@ -45,12 +61,12 @@ def backfill_candidates(summary_type)
4561
.where(
4662
<<~SQL, # (1..1) gets stored ad (1..2).
4763
ais.id IS NULL OR (
48-
UPPER(ais.content_range) < topics.highest_post_number + 1
49-
AND ais.created_at < (current_timestamp - INTERVAL '5 minutes')
64+
ais.highest_target_number < topics.highest_post_number
65+
AND ais.updated_at < (current_timestamp - INTERVAL '5 minutes')
5066
)
5167
SQL
5268
)
53-
.order("ais.created_at DESC NULLS FIRST, topics.last_posted_at DESC")
69+
.order("ais.updated_at DESC NULLS FIRST, topics.last_posted_at DESC")
5470
end
5571

5672
def current_budget(type)

app/models/ai_summary.rb

Lines changed: 22 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,9 @@
11
# frozen_string_literal: true
22

33
class AiSummary < ActiveRecord::Base
4+
# TODO remove this line 01-3-2025
5+
self.ignored_columns = %i[content_range]
6+
47
belongs_to :target, polymorphic: true
58

69
enum :summary_type, { complete: 0, gist: 1 }
@@ -15,14 +18,20 @@ def self.store!(strategy, llm_model, summary, og_content, human:)
1518
target_id: strategy.target.id,
1619
target_type: strategy.target.class.name,
1720
algorithm: llm_model.name,
18-
content_range: (content_ids.first..content_ids.last),
21+
highest_target_number: strategy.highest_target_number,
1922
summarized_text: summary,
2023
original_content_sha: build_sha(content_ids.join),
2124
summary_type: strategy.type,
2225
origin: !!human ? origins[:human] : origins[:system],
2326
},
2427
unique_by: %i[target_id target_type summary_type],
25-
update_only: %i[summarized_text original_content_sha algorithm origin content_range],
28+
update_only: %i[
29+
summarized_text
30+
original_content_sha
31+
algorithm
32+
origin
33+
highest_target_number
34+
],
2635
)
2736
.first
2837
.then { AiSummary.find_by(id: _1["id"]) }
@@ -45,17 +54,17 @@ def outdated
4554
#
4655
# Table name: ai_summaries
4756
#
48-
# id :bigint not null, primary key
49-
# target_id :integer not null
50-
# target_type :string not null
51-
# content_range :int4range
52-
# summarized_text :string not null
53-
# original_content_sha :string not null
54-
# algorithm :string not null
55-
# created_at :datetime not null
56-
# updated_at :datetime not null
57-
# summary_type :integer default("complete"), not null
58-
# origin :integer
57+
# id :bigint not null, primary key
58+
# target_id :integer not null
59+
# target_type :string not null
60+
# summarized_text :string not null
61+
# original_content_sha :string not null
62+
# algorithm :string not null
63+
# created_at :datetime not null
64+
# updated_at :datetime not null
65+
# summary_type :integer default("complete"), not null
66+
# origin :integer
67+
# highest_target_number :integer not null
5968
#
6069
# Indexes
6170
#

app/serializers/ai_topic_summary_serializer.rb

Lines changed: 1 addition & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -13,15 +13,6 @@ def can_regenerate
1313
end
1414

1515
def new_posts_since_summary
16-
# Postgres uses discrete range types for int4range, which means
17-
# (1..2) is stored as (1...3).
18-
#
19-
# We use Range#max to work around this, which in the case above always returns 2.
20-
# Be careful with using Range#end here, it could lead to unexpected results as:
21-
#
22-
# (1..2).end => 2
23-
# (1...3).end => 3
24-
25-
object.target.highest_post_number.to_i - object.content_range&.max.to_i
16+
object.target.highest_post_number.to_i - object.highest_target_number.to_i
2617
end
2718
end
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
# frozen_string_literal: true
2+
class AddHighestTargetNumberToAiSummary < ActiveRecord::Migration[7.2]
3+
def up
4+
add_column :ai_summaries, :highest_target_number, :integer, null: false
5+
6+
execute <<~SQL
7+
UPDATE ai_summaries SET highest_target_number = GREATEST(UPPER(content_range) - 1, 1)
8+
SQL
9+
end
10+
11+
def down
12+
drop_column :ai_summaries, :highest_target_number
13+
end
14+
end
Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
# frozen_string_literal: true
2+
class DropAiSummariesContentRange < ActiveRecord::Migration[7.2]
3+
DROPPED_COLUMNS ||= { ai_summaries: %i[content_range] }
4+
5+
def up
6+
DROPPED_COLUMNS.each { |table, columns| Migration::ColumnDropper.execute_drop(table, columns) }
7+
end
8+
9+
def down
10+
raise ActiveRecord::IrreversibleMigration
11+
end
12+
end

lib/summarization/fold_content.rb

Lines changed: 2 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,8 @@ def initialize(llm, strategy, persist_summaries: true)
2121
# @param &on_partial_blk { Block - Optional } - The passed block will get called with the LLM partial response alongside a cancel function.
2222
# Note: The block is only called with results of the final summary, not intermediate summaries.
2323
#
24+
# This method doesn't care if we already have an up to date summary. It always regenerate.
25+
#
2426
# @returns { AiSummary } - Resulting summary.
2527
def summarize(user, &on_partial_blk)
2628
base_summary = ""
@@ -68,11 +70,6 @@ def delete_cached_summaries!
6870
AiSummary.where(target: strategy.target, summary_type: strategy.type).destroy_all
6971
end
7072

71-
def force_summarize(user, &on_partial_blk)
72-
@persist_summaries = true
73-
summarize(user, &on_partial_blk)
74-
end
75-
7673
private
7774

7875
attr_reader :persist_summaries

lib/summarization/strategies/chat_messages.rb

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,10 @@ def type
88
AiSummary.summary_types[:complete]
99
end
1010

11+
def highest_target_number
12+
nil # We don't persist so we can return nil.
13+
end
14+
1115
def initialize(target, since)
1216
super(target)
1317
@since = since

lib/summarization/strategies/hot_topic_gists.rb

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,10 @@ def feature
1212
"gists"
1313
end
1414

15+
def highest_target_number
16+
target.highest_post_number
17+
end
18+
1519
def targets_data
1620
op_post_number = 1
1721

@@ -38,6 +42,7 @@ def targets_data
3842
.limit(20)
3943
.pluck(:post_number)
4044
end
45+
4146
posts_data =
4247
Post
4348
.where(topic_id: target.id)

lib/summarization/strategies/topic_summary.rb

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,10 @@ def type
88
AiSummary.summary_types[:complete]
99
end
1010

11+
def highest_target_number
12+
target.highest_post_number
13+
end
14+
1115
def targets_data
1216
posts_data =
1317
(target.has_summary? ? best_replies : pick_selection).pluck(

0 commit comments

Comments
 (0)