feat(async): Add daily task to archive stale Slack channels#206
feat(async): Add daily task to archive stale Slack channels#206rgibert wants to merge 18 commits into
Conversation
Add a django-q2 scheduled task that runs daily to find bot-created Slack channels where the workspace retention policy has deleted all message history. For each stale channel, posts an archival notice and archives the channel. Changes: - Add archive_channel() method and limit param to get_channel_history on SlackService - Add archive_stale_channels task with per-channel error isolation - Self-disables the schedule if Slack client is unavailable - Add is_archived to get_channel_info return dict - Data migration to register the schedule Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/SeyXSRtESuCIjsSjmOGgO3MM_MZHcZftkU8Td81uFP8
Three bugs fixed: 1. get_channel_history(limit=1) swallowed exceptions and returned [], which the task treated as "no messages" and archived active channels. Now exceptions propagate and the per-channel handler skips the channel. 2. The archival notice was posted before calling archive_channel. If archiving failed, the notice remained in the channel misleading users. Now the notice timestamp is captured and the message is deleted on failed archive. 3. A failed archive after a successful notice post would permanently prevent future archiving -- the bot's own notice became "history" causing the channel to be skipped on every subsequent run. Fixed by deleting the notice on failure (issue 2 fix). Also adds SlackService.delete_message() wrapping chat_delete. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/Sq3ianFbApEXKqxUP2-TSgUOWGcOaf_XYnSDxYWDsKo
Add a 2-second delay between channels to stay under Slack's Tier 3 rate limit (~50 req/min). Without this, workspaces with >25 incident channels would routinely hit rate limits on every daily run. Also skip archiving when post_message fails (returns None) instead of archiving the channel with no warning to users. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/tRJBNFSwD_KABR2S1OuvR2T7oKVr_Hwgtfk-OUCIRyw
Two issues fixed: 1. If delete_message failed after a failed archive, the bot's own notice remained as the only message in the channel, causing every future run to skip it permanently. Now the history check uses limit=5 and filters out bot messages (by bot_id), so leftover bot notices don't prevent future archival attempts. 2. If archive_channel raised a non-SlackApiError (e.g. ConnectionError), the exception hit the outer handler which never called delete_message. Restructured so the archive attempt has its own try/except that always reaches the notice cleanup on any failure mode. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/niq15OUwbdrxpEKAdaoIszEmbWDRe369PsqC1_qlA9Q
The previous bot_id filter ignored messages from any bot, which could skip messages from other integrations that legitimately indicate channel activity. Now uses auth_test to resolve the bot's own identity and only filters messages matching that specific bot_id. Messages from other bots are treated as real activity. Also adds a cached bot_id property to SlackService backed by auth_test, called once per task run. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/l1u4EK0km7aDME7RkCQJB5vxCzc3c61oPc4oOSn7rws
Filter to only terminal-status incidents (Done, Cancelled), fetch full channel history instead of only the 5 most recent messages, and check thread replies for human activity before archiving. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/et_CT8X1CKbstT_wBeP1lKRqDdU55TAd5nPbC_Yfhbc
Move the None bot_id guard to an early return instead of folding it into the list comprehension where it caused every message to match, preventing any channel from being archived. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/3m8ZgzizSqg0HSivytiWzMW0Yuh3h10sQe0WTQSV0lc
| ) | ||
| return response.get("messages", []) | ||
| messages: list[dict[str, Any]] = [] | ||
| cursor: str | None = None |
There was a problem hiding this comment.
Bug: The paginated path in get_channel_history swallows exceptions and returns [], causing the archive_stale_channels task to incorrectly archive channels on API failure.
Severity: HIGH
Suggested Fix
The try...except block in the paginated loop of get_channel_history should re-raise the exception after logging it. This will allow the calling archive_stale_channels task to catch the exception and correctly skip the channel, preventing it from being archived due to a transient API error. The non-paginated path already propagates exceptions, and the paginated path should behave consistently.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/firetower/integrations/services/slack.py#L464-L467
Potential issue: When `get_channel_history` is called without a `limit`, it uses a
paginated approach. This path contains a `try...except` block that catches all
exceptions from the `conversations_history` API call, logs them, and returns an empty
list of messages. The calling task, `archive_stale_channels`, interprets this empty list
as a sign of inactivity and proceeds to archive the channel. This is incorrect behavior,
as a transient API error (like a timeout or rate limit) could cause a recently active
channel to be archived. The intended behavior is for the exception to propagate so the
task can skip the channel instead of archiving it.
Also affects:
src/firetower/incidents/tasks.py:134~134src/firetower/integrations/services/slack.py:503~506
There was a problem hiding this comment.
Fixed in 2862233. Both get_channel_history and get_thread_replies now re-raise exceptions and raise RuntimeError on not-ok responses instead of silently returning empty results. The archive task's per-channel except Exception handler correctly skips the channel.
— Claude Code
…-channels $ Conflicts: $ src/firetower/integrations/tests/test_slack.py Agent transcript: https://claudescope.sentry.dev/share/iSza-nGWOPbydesSCvXfK4sBlje7eLRJtcNdU6DoM9M
…nd get_thread_replies Both methods previously caught exceptions during pagination and silently returned partial or empty results. This caused archive_stale_channels to misinterpret API failures as empty channels and proceed with archival. Re-raise exceptions and convert not-ok responses to RuntimeError so callers can handle failures explicitly. The archive task's existing per-channel exception handler correctly skips channels on error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/a5qK8wNvZdTLJLyKzl8BHPN-eSCiT2_xylllokU8FvA
|
|
||
| for i, link in enumerate(links): | ||
| if i > 0: | ||
| time.sleep(ARCHIVE_CHANNEL_DELAY_SECONDS) |
There was a problem hiding this comment.
Archiver exceeds worker timeout
Medium Severity
The daily archiver sleeps two seconds before each Slack link after the first, while Q_CLUSTER uses a 180-second task timeout. With dozens of terminal-incident Slack ExternalLink rows, sleep alone can exceed the limit, so django-q may terminate the run before every channel is scanned.
Reviewed by Cursor Bugbot for commit 3968031. Configure here.
There was a problem hiding this comment.
Acknowledged. The 2s delay is intentional to stay under Slack's Tier 3 rate limit (~50 req/min). The task is idempotent -- if it times out, the next daily run picks up where it left off since already-archived channels are skipped via is_archived check. If channel volume grows enough to make this a real problem, we can batch into smaller chunks or move to an async queue, but that's premature for the current scale.
— Claude Code
Migrations 0019_schedule_archive_stale_channels and 0022_actionitem_last_nag are both leaf nodes since they branched independently from 0018. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/w5CpGaCMagADnNwSkzQ0Dx4_A2PavH46xMEUMheurAE
…finition Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/cLAxS24igE7SbOnKuxTEfdrRfxRz3d36cpv83xmbmQY
The limited (non-paginated) code path did not check response.ok, inconsistent with the paginated path that raises on not-ok. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/W6kyKz5sBzFSZC-n_g2M6ITMi83QFYcQTub0LnjNfJU
| ] | ||
| if non_own_messages: | ||
| skipped += 1 | ||
| continue |
There was a problem hiding this comment.
System messages block stale archival
Medium Severity
The stale-channel check treats any history item whose bot_id is not Firetower’s as blocking archival. Slack system events (e.g. channel_join) usually have no bot_id, so they are counted as foreign activity and the channel is skipped even when only Firetower bot posts remain besides those events.
Reviewed by Cursor Bugbot for commit 033c52d. Configure here.
There was a problem hiding this comment.
Intentional. System messages (channel_join, channel_purpose, etc.) lack bot_id and are treated as non-own activity. This is the safer direction -- we'd rather skip a channel than wrongly archive one with real activity obscured by system events. If this causes too many false-skips in practice, we can add a subtype filter in a follow-up.
— Claude Code
| if self._bot_id is not None: | ||
| return self._bot_id |
There was a problem hiding this comment.
Bug: The bot_id property does not cache None results, leading to repeated API calls if the initial lookup fails or returns no ID.
Severity: LOW
Suggested Fix
Modify the bot_id property to explicitly set self._bot_id to a sentinel value or None within the try...except block and after a successful API call, even if the result is None. This ensures that failed lookups are cached and not retried.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/firetower/integrations/services/slack.py#L71-L72
Potential issue: The `bot_id` property in the `SlackService` does not cache `None`
results. If the `auth.test` API call fails or returns a response without a `bot_id`, the
property returns `None` but does not update the `self._bot_id` cache attribute.
Consequently, subsequent accesses to the `bot_id` property will re-trigger the same API
call instead of returning a cached `None` value. This is inefficient for long-lived
`SlackService` instances, as it results in repeated, unnecessary network requests upon
failure.
There was a problem hiding this comment.
Valid observation but low impact here. The archive task early-returns when bot_id is None (archive.py:36-38), so subsequent access never happens. The SlackService is also short-lived per task invocation. Can address with a sentinel pattern in a follow-up if needed.
— Claude Code
There was a problem hiding this comment.
Transient Slack client failure permanently disables the archive schedule
If slack.client is falsy on any single run (e.g. during a token rotation or config drift), the task sets repeats=0 on the schedule and exits — there is no code path that ever re-enables it. The daily job is permanently killed by a single transient unavailability.
Evidence
archive_stale_channelssetsSchedule.objects.filter(name='archive_stale_channels').update(repeats=0)wheneverslack.clientis falsy and returns immediately.SCHEDULESin__init__.pysetsrepeats: -1only at initial registration; no migration, signal, or task re-sets it back after a failure.SlackService.__init__setsself.client = NonewhenBOT_TOKENis absent from settings, which can happen transiently during deploys or secret rotations.test_disables_schedule_when_no_clientconfirms therepeats=0write but has no corresponding test for re-enabling the schedule.
Identified by Warden find-bugs
…ssage Add iter_channel_history generator that yields pages. The archive task now checks each page as it arrives and stops immediately when a non-own message is found, avoiding full history materialization for channels that have human activity. get_channel_history is refactored to delegate to iter_channel_history. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/FSaAgirqP-GdciE-zXxNpCgXygfdFjKrnVcSo3KTmzY
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit f36375b. Configure here.
| replies = slack.get_thread_replies(channel_id, msg["ts"]) | ||
| if replies: | ||
| has_activity = True | ||
| break |
There was a problem hiding this comment.
Thread bot replies ignored
Medium Severity
When deciding if a channel is stale, top-level history treats any message whose bot_id is not Firetower’s as activity, but thread follow-up uses get_thread_replies, which drops replies that have a bot_id. A channel whose visible history is only Firetower bot posts can still be archived even when another bot has active thread replies.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit f36375b. Configure here.
There was a problem hiding this comment.
Intentional asymmetry. The top-level check is conservative: any non-Firetower message (including other bots) counts as activity to avoid wrongful archival. The thread check uses get_thread_replies which filters to human-only replies -- its purpose is to detect human discussion in threads, not bot-to-bot chatter. A channel with only Firetower top-level messages and only other-bot thread replies has no human activity and is a valid archive candidate.
— Claude Code
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| except Exception: | ||
| errored += 1 | ||
| logger.exception( | ||
| f"Failed to archive channel {channel_id} " | ||
| f"(incident {link.incident.incident_number}), " |
There was a problem hiding this comment.
Bug: A SlackApiError during channel archival is logged twice: once in the archive_channel function and again in the calling task, leading to redundant log entries.
Severity: LOW
Suggested Fix
Remove the redundant logging. Either the archive_channel function should not log the error and only return False, letting the caller handle all logging, or the caller should not log the exception again after archive_channel has already logged it.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/firetower/incidents/tasks/archive.py#L118-L122
Potential issue: When the `slack.archive_channel` function fails with a `SlackApiError`,
it logs the error and returns `False`. The calling code in `archive.py` detects this
`False` return value, raises a `RuntimeError`, and then immediately catches that
exception in an outer block. This `except Exception` block then logs the exception a
second time, resulting in redundant logging for the same failure event. While this does
not affect functionality, it creates unnecessary noise in the logs.
Also affects:
src/firetower/integrations/services/slack.py:580~590


Add a django-q2 scheduled task that runs daily to find bot-created Slack channels where the workspace retention policy has deleted all message history. For each stale channel, posts an archival notice and archives the channel.
The task queries ExternalLink records (type=SLACK) to find channels Firetower created, checks each via conversations_info and conversations_history(limit=1), and archives any channel with zero remaining messages. Per-channel errors are isolated so one failure doesn't abort the run. If the Slack client is unavailable (no bot token), the task disables its own schedule.
SlackService changes: added archive_channel(), a limit parameter to get_channel_history (avoids paginating all messages when we only need to check existence), and is_archived to get_channel_info.
Resolves RELENG-20
Agent transcript: https://claudescope.sentry.dev/share/onjkTdIu9bD3cgBvGExA2meMCtHFG8S8cpHdLO1DEk8