Skip to content

feat(async): Add daily task to archive stale Slack channels#206

Open
rgibert wants to merge 18 commits into
mainfrom
rgibert/archive-stale-channels
Open

feat(async): Add daily task to archive stale Slack channels#206
rgibert wants to merge 18 commits into
mainfrom
rgibert/archive-stale-channels

Conversation

@rgibert

@rgibert rgibert commented May 22, 2026

Copy link
Copy Markdown
Member

Add a django-q2 scheduled task that runs daily to find bot-created Slack channels where the workspace retention policy has deleted all message history. For each stale channel, posts an archival notice and archives the channel.

The task queries ExternalLink records (type=SLACK) to find channels Firetower created, checks each via conversations_info and conversations_history(limit=1), and archives any channel with zero remaining messages. Per-channel errors are isolated so one failure doesn't abort the run. If the Slack client is unavailable (no bot token), the task disables its own schedule.

SlackService changes: added archive_channel(), a limit parameter to get_channel_history (avoids paginating all messages when we only need to check existence), and is_archived to get_channel_info.

Resolves RELENG-20

Agent transcript: https://claudescope.sentry.dev/share/onjkTdIu9bD3cgBvGExA2meMCtHFG8S8cpHdLO1DEk8

Add a django-q2 scheduled task that runs daily to find bot-created
Slack channels where the workspace retention policy has deleted all
message history. For each stale channel, posts an archival notice
and archives the channel.

Changes:
- Add archive_channel() method and limit param to get_channel_history
  on SlackService
- Add archive_stale_channels task with per-channel error isolation
- Self-disables the schedule if Slack client is unavailable
- Add is_archived to get_channel_info return dict
- Data migration to register the schedule

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/SeyXSRtESuCIjsSjmOGgO3MM_MZHcZftkU8Td81uFP8
@rgibert rgibert self-assigned this May 22, 2026
@linear-code

linear-code Bot commented May 22, 2026

Copy link
Copy Markdown

RELENG-20

Comment thread src/firetower/incidents/tasks.py Outdated
Comment thread src/firetower/incidents/tasks.py Outdated
Comment thread src/firetower/incidents/tasks.py Outdated
Comment thread src/firetower/incidents/tasks.py Outdated
Three bugs fixed:

1. get_channel_history(limit=1) swallowed exceptions and returned [],
   which the task treated as "no messages" and archived active channels.
   Now exceptions propagate and the per-channel handler skips the channel.

2. The archival notice was posted before calling archive_channel. If
   archiving failed, the notice remained in the channel misleading users.
   Now the notice timestamp is captured and the message is deleted on
   failed archive.

3. A failed archive after a successful notice post would permanently
   prevent future archiving -- the bot's own notice became "history"
   causing the channel to be skipped on every subsequent run. Fixed by
   deleting the notice on failure (issue 2 fix).

Also adds SlackService.delete_message() wrapping chat_delete.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/Sq3ianFbApEXKqxUP2-TSgUOWGcOaf_XYnSDxYWDsKo
Comment thread src/firetower/incidents/tasks.py Outdated
Comment thread src/firetower/incidents/tasks/archive.py
Add a 2-second delay between channels to stay under Slack's Tier 3
rate limit (~50 req/min). Without this, workspaces with >25 incident
channels would routinely hit rate limits on every daily run.

Also skip archiving when post_message fails (returns None) instead of
archiving the channel with no warning to users.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/tRJBNFSwD_KABR2S1OuvR2T7oKVr_Hwgtfk-OUCIRyw
Comment thread src/firetower/incidents/tasks.py Outdated
Comment thread src/firetower/integrations/services/slack.py
Comment thread src/firetower/integrations/services/slack.py
Two issues fixed:

1. If delete_message failed after a failed archive, the bot's own notice
   remained as the only message in the channel, causing every future run
   to skip it permanently. Now the history check uses limit=5 and filters
   out bot messages (by bot_id), so leftover bot notices don't prevent
   future archival attempts.

2. If archive_channel raised a non-SlackApiError (e.g. ConnectionError),
   the exception hit the outer handler which never called delete_message.
   Restructured so the archive attempt has its own try/except that always
   reaches the notice cleanup on any failure mode.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/niq15OUwbdrxpEKAdaoIszEmbWDRe369PsqC1_qlA9Q
Comment thread src/firetower/incidents/tests/test_tasks.py
Comment thread src/firetower/incidents/tasks.py Outdated
rgibert added 2 commits May 22, 2026 12:06
The previous bot_id filter ignored messages from any bot, which could
skip messages from other integrations that legitimately indicate
channel activity. Now uses auth_test to resolve the bot's own identity
and only filters messages matching that specific bot_id. Messages from
other bots are treated as real activity.

Also adds a cached bot_id property to SlackService backed by auth_test,
called once per task run.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/l1u4EK0km7aDME7RkCQJB5vxCzc3c61oPc4oOSn7rws
@rgibert rgibert marked this pull request as ready for review May 22, 2026 17:46
@rgibert rgibert requested a review from a team as a code owner May 22, 2026 17:46
Comment thread src/firetower/integrations/services/slack.py
Comment thread src/firetower/integrations/services/slack.py
Comment thread src/firetower/incidents/tasks/archive.py
Comment thread src/firetower/incidents/tasks.py Outdated
Comment thread src/firetower/incidents/tasks/archive.py
Comment thread src/firetower/incidents/tasks/archive.py Outdated
rgibert added 2 commits May 25, 2026 11:00
Filter to only terminal-status incidents (Done, Cancelled), fetch full
channel history instead of only the 5 most recent messages, and check
thread replies for human activity before archiving.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/et_CT8X1CKbstT_wBeP1lKRqDdU55TAd5nPbC_Yfhbc
Move the None bot_id guard to an early return instead of folding it
into the list comprehension where it caused every message to match,
preventing any channel from being archived.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/3m8ZgzizSqg0HSivytiWzMW0Yuh3h10sQe0WTQSV0lc
Comment thread src/firetower/incidents/tasks/archive.py
Comment on lines 464 to 467
)
return response.get("messages", [])
messages: list[dict[str, Any]] = []
cursor: str | None = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The paginated path in get_channel_history swallows exceptions and returns [], causing the archive_stale_channels task to incorrectly archive channels on API failure.
Severity: HIGH

Suggested Fix

The try...except block in the paginated loop of get_channel_history should re-raise the exception after logging it. This will allow the calling archive_stale_channels task to catch the exception and correctly skip the channel, preventing it from being archived due to a transient API error. The non-paginated path already propagates exceptions, and the paginated path should behave consistently.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/firetower/integrations/services/slack.py#L464-L467

Potential issue: When `get_channel_history` is called without a `limit`, it uses a
paginated approach. This path contains a `try...except` block that catches all
exceptions from the `conversations_history` API call, logs them, and returns an empty
list of messages. The calling task, `archive_stale_channels`, interprets this empty list
as a sign of inactivity and proceeds to archive the channel. This is incorrect behavior,
as a transient API error (like a timeout or rate limit) could cause a recently active
channel to be archived. The intended behavior is for the exception to propagate so the
task can skip the channel instead of archiving it.

Also affects:

  • src/firetower/incidents/tasks.py:134~134
  • src/firetower/integrations/services/slack.py:503~506

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 2862233. Both get_channel_history and get_thread_replies now re-raise exceptions and raise RuntimeError on not-ok responses instead of silently returning empty results. The archive task's per-channel except Exception handler correctly skips the channel.

— Claude Code

Comment thread src/firetower/incidents/tasks/archive.py Outdated
…nd get_thread_replies

Both methods previously caught exceptions during pagination and silently
returned partial or empty results. This caused archive_stale_channels to
misinterpret API failures as empty channels and proceed with archival.

Re-raise exceptions and convert not-ok responses to RuntimeError so
callers can handle failures explicitly. The archive task's existing
per-channel exception handler correctly skips channels on error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/a5qK8wNvZdTLJLyKzl8BHPN-eSCiT2_xylllokU8FvA

for i, link in enumerate(links):
if i > 0:
time.sleep(ARCHIVE_CHANNEL_DELAY_SECONDS)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Archiver exceeds worker timeout

Medium Severity

The daily archiver sleeps two seconds before each Slack link after the first, while Q_CLUSTER uses a 180-second task timeout. With dozens of terminal-incident Slack ExternalLink rows, sleep alone can exceed the limit, so django-q may terminate the run before every channel is scanned.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3968031. Configure here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. The 2s delay is intentional to stay under Slack's Tier 3 rate limit (~50 req/min). The task is idempotent -- if it times out, the next daily run picks up where it left off since already-archived channels are skipped via is_archived check. If channel volume grows enough to make this a real problem, we can batch into smaller chunks or move to an async queue, but that's premature for the current scale.

— Claude Code

Comment thread src/firetower/integrations/services/slack.py Outdated
Migrations 0019_schedule_archive_stale_channels and 0022_actionitem_last_nag
are both leaf nodes since they branched independently from 0018.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/w5CpGaCMagADnNwSkzQ0Dx4_A2PavH46xMEUMheurAE
Comment thread src/firetower/incidents/tests/test_tasks.py
Comment thread src/firetower/integrations/services/slack.py Outdated
The limited (non-paginated) code path did not check response.ok,
inconsistent with the paginated path that raises on not-ok.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/W6kyKz5sBzFSZC-n_g2M6ITMi83QFYcQTub0LnjNfJU
]
if non_own_messages:
skipped += 1
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System messages block stale archival

Medium Severity

The stale-channel check treats any history item whose bot_id is not Firetower’s as blocking archival. Slack system events (e.g. channel_join) usually have no bot_id, so they are counted as foreign activity and the channel is skipped even when only Firetower bot posts remain besides those events.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 033c52d. Configure here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional. System messages (channel_join, channel_purpose, etc.) lack bot_id and are treated as non-own activity. This is the safer direction -- we'd rather skip a channel than wrongly archive one with real activity obscured by system events. If this causes too many false-skips in practice, we can add a subtype filter in a follow-up.

— Claude Code

Comment on lines +71 to +72
if self._bot_id is not None:
return self._bot_id

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The bot_id property does not cache None results, leading to repeated API calls if the initial lookup fails or returns no ID.
Severity: LOW

Suggested Fix

Modify the bot_id property to explicitly set self._bot_id to a sentinel value or None within the try...except block and after a successful API call, even if the result is None. This ensures that failed lookups are cached and not retried.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/firetower/integrations/services/slack.py#L71-L72

Potential issue: The `bot_id` property in the `SlackService` does not cache `None`
results. If the `auth.test` API call fails or returns a response without a `bot_id`, the
property returns `None` but does not update the `self._bot_id` cache attribute.
Consequently, subsequent accesses to the `bot_id` property will re-trigger the same API
call instead of returning a cached `None` value. This is inefficient for long-lived
`SlackService` instances, as it results in repeated, unnecessary network requests upon
failure.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation but low impact here. The archive task early-returns when bot_id is None (archive.py:36-38), so subsequent access never happens. The SlackService is also short-lived per task invocation. Can address with a sentinel pattern in a follow-up if needed.

— Claude Code

Comment thread src/firetower/incidents/tasks/archive.py

@sentry-warden sentry-warden Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transient Slack client failure permanently disables the archive schedule

If slack.client is falsy on any single run (e.g. during a token rotation or config drift), the task sets repeats=0 on the schedule and exits — there is no code path that ever re-enables it. The daily job is permanently killed by a single transient unavailability.

Evidence
  • archive_stale_channels sets Schedule.objects.filter(name='archive_stale_channels').update(repeats=0) whenever slack.client is falsy and returns immediately.
  • SCHEDULES in __init__.py sets repeats: -1 only at initial registration; no migration, signal, or task re-sets it back after a failure.
  • SlackService.__init__ sets self.client = None when BOT_TOKEN is absent from settings, which can happen transiently during deploys or secret rotations.
  • test_disables_schedule_when_no_client confirms the repeats=0 write but has no corresponding test for re-enabling the schedule.

Identified by Warden find-bugs

…ssage

Add iter_channel_history generator that yields pages. The archive task
now checks each page as it arrives and stops immediately when a non-own
message is found, avoiding full history materialization for channels
that have human activity. get_channel_history is refactored to delegate
to iter_channel_history.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/FSaAgirqP-GdciE-zXxNpCgXygfdFjKrnVcSo3KTmzY

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f36375b. Configure here.

replies = slack.get_thread_replies(channel_id, msg["ts"])
if replies:
has_activity = True
break

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread bot replies ignored

Medium Severity

When deciding if a channel is stale, top-level history treats any message whose bot_id is not Firetower’s as activity, but thread follow-up uses get_thread_replies, which drops replies that have a bot_id. A channel whose visible history is only Firetower bot posts can still be archived even when another bot has active thread replies.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f36375b. Configure here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional asymmetry. The top-level check is conservative: any non-Firetower message (including other bots) counts as activity to avoid wrongful archival. The thread check uses get_thread_replies which filters to human-only replies -- its purpose is to detect human discussion in threads, not bot-to-bot chatter. A channel with only Firetower top-level messages and only other-bot thread replies has no human activity and is a valid archive candidate.

— Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +118 to +122
except Exception:
errored += 1
logger.exception(
f"Failed to archive channel {channel_id} "
f"(incident {link.incident.incident_number}), "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: A SlackApiError during channel archival is logged twice: once in the archive_channel function and again in the calling task, leading to redundant log entries.
Severity: LOW

Suggested Fix

Remove the redundant logging. Either the archive_channel function should not log the error and only return False, letting the caller handle all logging, or the caller should not log the exception again after archive_channel has already logged it.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/firetower/incidents/tasks/archive.py#L118-L122

Potential issue: When the `slack.archive_channel` function fails with a `SlackApiError`,
it logs the error and returns `False`. The calling code in `archive.py` detects this
`False` return value, raises a `RuntimeError`, and then immediately catches that
exception in an outer block. This `except Exception` block then logs the exception a
second time, resulting in redundant logging for the same failure event. While this does
not affect functionality, it creates unnecessary noise in the logs.

Also affects:

  • src/firetower/integrations/services/slack.py:580~590

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant