feat(async): Add daily task to archive stale Slack channels by rgibert · Pull Request #206 · getsentry/firetower

rgibert · 2026-05-22T14:58:24Z

Add a django-q2 scheduled task that runs daily to find bot-created Slack channels where the workspace retention policy has deleted all message history. For each stale channel, posts an archival notice and archives the channel.

The task queries ExternalLink records (type=SLACK) to find channels Firetower created, checks each via conversations_info and conversations_history(limit=1), and archives any channel with zero remaining messages. Per-channel errors are isolated so one failure doesn't abort the run. If the Slack client is unavailable (no bot token), the task disables its own schedule.

SlackService changes: added archive_channel(), a limit parameter to get_channel_history (avoids paginating all messages when we only need to check existence), and is_archived to get_channel_info.

Resolves RELENG-20

Agent transcript: https://claudescope.sentry.dev/share/onjkTdIu9bD3cgBvGExA2meMCtHFG8S8cpHdLO1DEk8

Add a django-q2 scheduled task that runs daily to find bot-created Slack channels where the workspace retention policy has deleted all message history. For each stale channel, posts an archival notice and archives the channel. Changes: - Add archive_channel() method and limit param to get_channel_history on SlackService - Add archive_stale_channels task with per-channel error isolation - Self-disables the schedule if Slack client is unavailable - Add is_archived to get_channel_info return dict - Data migration to register the schedule Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/SeyXSRtESuCIjsSjmOGgO3MM_MZHcZftkU8Td81uFP8

linear-code · 2026-05-22T15:00:34Z

RELENG-20

Three bugs fixed: 1. get_channel_history(limit=1) swallowed exceptions and returned [], which the task treated as "no messages" and archived active channels. Now exceptions propagate and the per-channel handler skips the channel. 2. The archival notice was posted before calling archive_channel. If archiving failed, the notice remained in the channel misleading users. Now the notice timestamp is captured and the message is deleted on failed archive. 3. A failed archive after a successful notice post would permanently prevent future archiving -- the bot's own notice became "history" causing the channel to be skipped on every subsequent run. Fixed by deleting the notice on failure (issue 2 fix). Also adds SlackService.delete_message() wrapping chat_delete. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/Sq3ianFbApEXKqxUP2-TSgUOWGcOaf_XYnSDxYWDsKo

Add a 2-second delay between channels to stay under Slack's Tier 3 rate limit (~50 req/min). Without this, workspaces with >25 incident channels would routinely hit rate limits on every daily run. Also skip archiving when post_message fails (returns None) instead of archiving the channel with no warning to users. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/tRJBNFSwD_KABR2S1OuvR2T7oKVr_Hwgtfk-OUCIRyw

Two issues fixed: 1. If delete_message failed after a failed archive, the bot's own notice remained as the only message in the channel, causing every future run to skip it permanently. Now the history check uses limit=5 and filters out bot messages (by bot_id), so leftover bot notices don't prevent future archival attempts. 2. If archive_channel raised a non-SlackApiError (e.g. ConnectionError), the exception hit the outer handler which never called delete_message. Restructured so the archive attempt has its own try/except that always reaches the notice cleanup on any failure mode. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/niq15OUwbdrxpEKAdaoIszEmbWDRe369PsqC1_qlA9Q

The previous bot_id filter ignored messages from any bot, which could skip messages from other integrations that legitimately indicate channel activity. Now uses auth_test to resolve the bot's own identity and only filters messages matching that specific bot_id. Messages from other bots are treated as real activity. Also adds a cached bot_id property to SlackService backed by auth_test, called once per task run. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/l1u4EK0km7aDME7RkCQJB5vxCzc3c61oPc4oOSn7rws

…n.md

Filter to only terminal-status incidents (Done, Cancelled), fetch full channel history instead of only the 5 most recent messages, and check thread replies for human activity before archiving. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/et_CT8X1CKbstT_wBeP1lKRqDdU55TAd5nPbC_Yfhbc

Move the None bot_id guard to an early return instead of folding it into the list comprehension where it caused every message to match, preventing any channel from being archived. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/3m8ZgzizSqg0HSivytiWzMW0Yuh3h10sQe0WTQSV0lc

sentry · 2026-05-25T15:04:44Z

+            )
+            return response.get("messages", [])
        messages: list[dict[str, Any]] = []
        cursor: str | None = None


Bug: The paginated path in get_channel_history swallows exceptions and returns [], causing the archive_stale_channels task to incorrectly archive channels on API failure.
_{Severity: HIGH}

Suggested Fix

The try...except block in the paginated loop of get_channel_history should re-raise the exception after logging it. This will allow the calling archive_stale_channels task to catch the exception and correctly skip the channel, preventing it from being archived due to a transient API error. The non-paginated path already propagates exceptions, and the paginated path should behave consistently.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/firetower/integrations/services/slack.py#L464-L467 Potential issue: When `get_channel_history` is called without a `limit`, it uses a paginated approach. This path contains a `try...except` block that catches all exceptions from the `conversations_history` API call, logs them, and returns an empty list of messages. The calling task, `archive_stale_channels`, interprets this empty list as a sign of inactivity and proceeds to archive the channel. This is incorrect behavior, as a transient API error (like a timeout or rate limit) could cause a recently active channel to be archived. The intended behavior is for the exception to propagate so the task can skip the channel instead of archiving it.

Also affects:

src/firetower/incidents/tasks.py:134~134

src/firetower/integrations/services/slack.py:503~506

Fixed in 2862233. Both get_channel_history and get_thread_replies now re-raise exceptions and raise RuntimeError on not-ok responses instead of silently returning empty results. The archive task's per-channel except Exception handler correctly skips the channel.

— Claude Code

…-channels $ Conflicts: $ src/firetower/integrations/tests/test_slack.py Agent transcript: https://claudescope.sentry.dev/share/iSza-nGWOPbydesSCvXfK4sBlje7eLRJtcNdU6DoM9M

…nd get_thread_replies Both methods previously caught exceptions during pagination and silently returned partial or empty results. This caused archive_stale_channels to misinterpret API failures as empty channels and proceed with archival. Re-raise exceptions and convert not-ok responses to RuntimeError so callers can handle failures explicitly. The archive task's existing per-channel exception handler correctly skips channels on error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/a5qK8wNvZdTLJLyKzl8BHPN-eSCiT2_xylllokU8FvA

cursor · 2026-07-03T18:40:23Z

+
+    for i, link in enumerate(links):
+        if i > 0:
+            time.sleep(ARCHIVE_CHANNEL_DELAY_SECONDS)


Archiver exceeds worker timeout

Medium Severity

The daily archiver sleeps two seconds before each Slack link after the first, while Q_CLUSTER uses a 180-second task timeout. With dozens of terminal-incident Slack ExternalLink rows, sleep alone can exceed the limit, so django-q may terminate the run before every channel is scanned.

^{Reviewed by Cursor Bugbot for commit 3968031. Configure here.}

Acknowledged. The 2s delay is intentional to stay under Slack's Tier 3 rate limit (~50 req/min). The task is idempotent -- if it times out, the next daily run picks up where it left off since already-archived channels are skipped via is_archived check. If channel volume grows enough to make this a real problem, we can batch into smaller chunks or move to an async queue, but that's premature for the current scale.

— Claude Code

Migrations 0019_schedule_archive_stale_channels and 0022_actionitem_last_nag are both leaf nodes since they branched independently from 0018. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/w5CpGaCMagADnNwSkzQ0Dx4_A2PavH46xMEUMheurAE

…finition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/cLAxS24igE7SbOnKuxTEfdrRfxRz3d36cpv83xmbmQY

The limited (non-paginated) code path did not check response.ok, inconsistent with the paginated path that raises on not-ok. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/W6kyKz5sBzFSZC-n_g2M6ITMi83QFYcQTub0LnjNfJU

cursor · 2026-07-03T19:08:35Z

+            ]
+            if non_own_messages:
+                skipped += 1
+                continue


System messages block stale archival

Medium Severity

The stale-channel check treats any history item whose bot_id is not Firetower’s as blocking archival. Slack system events (e.g. channel_join) usually have no bot_id, so they are counted as foreign activity and the channel is skipped even when only Firetower bot posts remain besides those events.

^{Reviewed by Cursor Bugbot for commit 033c52d. Configure here.}

Intentional. System messages (channel_join, channel_purpose, etc.) lack bot_id and are treated as non-own activity. This is the safer direction -- we'd rather skip a channel than wrongly archive one with real activity obscured by system events. If this causes too many false-skips in practice, we can add a subtype filter in a follow-up.

— Claude Code

sentry · 2026-07-03T19:09:24Z

+        if self._bot_id is not None:
+            return self._bot_id


Bug: The bot_id property does not cache None results, leading to repeated API calls if the initial lookup fails or returns no ID.
_{Severity: LOW}

Suggested Fix

Modify the bot_id property to explicitly set self._bot_id to a sentinel value or None within the try...except block and after a successful API call, even if the result is None. This ensures that failed lookups are cached and not retried.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/firetower/integrations/services/slack.py#L71-L72 Potential issue: The `bot_id` property in the `SlackService` does not cache `None` results. If the `auth.test` API call fails or returns a response without a `bot_id`, the property returns `None` but does not update the `self._bot_id` cache attribute. Consequently, subsequent accesses to the `bot_id` property will re-trigger the same API call instead of returning a cached `None` value. This is inefficient for long-lived `SlackService` instances, as it results in repeated, unnecessary network requests upon failure.

Valid observation but low impact here. The archive task early-returns when bot_id is None (archive.py:36-38), so subsequent access never happens. The SlackService is also short-lived per task invocation. Can address with a sentinel pattern in a follow-up if needed.

— Claude Code

sentry-warden

Transient Slack client failure permanently disables the archive schedule

If slack.client is falsy on any single run (e.g. during a token rotation or config drift), the task sets repeats=0 on the schedule and exits — there is no code path that ever re-enables it. The daily job is permanently killed by a single transient unavailability.

Evidence

archive_stale_channels sets Schedule.objects.filter(name='archive_stale_channels').update(repeats=0) whenever slack.client is falsy and returns immediately.
SCHEDULES in __init__.py sets repeats: -1 only at initial registration; no migration, signal, or task re-sets it back after a failure.
SlackService.__init__ sets self.client = None when BOT_TOKEN is absent from settings, which can happen transiently during deploys or secret rotations.
test_disables_schedule_when_no_client confirms the repeats=0 write but has no corresponding test for re-enabling the schedule.

_{Identified by Warden find-bugs}

…ssage Add iter_channel_history generator that yields pages. The archive task now checks each page as it arrives and stops immediately when a non-own message is found, avoiding full history materialization for channels that have human activity. get_channel_history is refactored to delegate to iter_channel_history. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/FSaAgirqP-GdciE-zXxNpCgXygfdFjKrnVcSo3KTmzY

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit f36375b. Configure here.}

cursor · 2026-07-03T19:55:22Z

+                    replies = slack.get_thread_replies(channel_id, msg["ts"])
+                    if replies:
+                        has_activity = True
+                        break


Thread bot replies ignored

Medium Severity

When deciding if a channel is stale, top-level history treats any message whose bot_id is not Firetower’s as activity, but thread follow-up uses get_thread_replies, which drops replies that have a bot_id. A channel whose visible history is only Firetower bot posts can still be archived even when another bot has active thread replies.

Additional Locations (1)

src/firetower/integrations/services/slack.py#L588-L597

^{Reviewed by Cursor Bugbot for commit f36375b. Configure here.}

Intentional asymmetry. The top-level check is conservative: any non-Firetower message (including other bots) counts as activity to avoid wrongful archival. The thread check uses get_thread_replies which filters to human-only replies -- its purpose is to detect human discussion in threads, not bot-to-bot chatter. A channel with only Firetower top-level messages and only other-bot thread replies has no human activity and is a valid archive candidate.

— Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

sentry · 2026-07-03T20:02:28Z

+            except Exception:
+                errored += 1
+                logger.exception(
+                    f"Failed to archive channel {channel_id} "
+                    f"(incident {link.incident.incident_number}), "


Bug: A SlackApiError during channel archival is logged twice: once in the archive_channel function and again in the calling task, leading to redundant log entries.
_{Severity: LOW}

Suggested Fix

Remove the redundant logging. Either the archive_channel function should not log the error and only return False, letting the caller handle all logging, or the caller should not log the exception again after archive_channel has already logged it.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: src/firetower/incidents/tasks/archive.py#L118-L122 Potential issue: When the `slack.archive_channel` function fails with a `SlackApiError`, it logs the error and returns `False`. The calling code in `archive.py` detects this `False` return value, raises a `RuntimeError`, and then immediately catches that exception in an outer block. This `except Exception` block then logs the exception a second time, resulting in redundant logging for the same failure event. While this does not affect functionality, it creates unnecessary noise in the logs.

Also affects:

src/firetower/integrations/services/slack.py:580~590

rgibert self-assigned this May 22, 2026