Skip to content

Conversation

@everettbu
Copy link
Contributor

@everettbu everettbu commented Jul 29, 2025

Test 8

…0528)

Since we'll be doing more with issue occurrences split out the concept
of incidents into it's own logic module, as well as incident_occurrence
into it's own module

Part of GH-80527
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR implements a significant refactoring of the monitor incident creation and issue occurrence logic within Sentry's cron monitoring system. The changes extract complex functionality from the monolithic mark_failed.py file into two new dedicated modules:

  1. incidents.py - Contains the try_incident_threshold function that centralizes incident creation logic, handling different monitor status states and failure thresholds
  2. incident_occurrence.py - Houses functions for creating issue occurrences in Kafka, generating human-readable failure messages, and building monitor environment context

The refactoring also introduces a new SimpleCheckIn TypedDict in types.py to optimize database queries by using lightweight dictionary objects instead of full Django model instances when processing check-ins.

The main mark_failed function now delegates incident threshold handling to the new try_incident_threshold function, creating a cleaner separation of concerns. This modular approach improves code organization by giving each module a single, focused responsibility: mark_failed.py handles monitor state updates, incidents.py manages incident threshold logic, and incident_occurrence.py handles issue creation and messaging.

This refactoring maintains identical functionality while making the codebase more maintainable, testable, and following the single responsibility principle more closely. The changes also include performance optimizations through the use of .values() queries with the new SimpleCheckIn type.

Confidence score: 3/5

• This refactoring appears safe but has some potential concurrency and database consistency concerns that need attention
• The score reflects database transaction safety issues in incidents.py and a potential bug in incident_occurrence.py, plus the lack of comprehensive testing for concurrent scenarios
• Files needing more attention: src/sentry/monitors/logic/incidents.py for database race conditions and src/sentry/monitors/logic/incident_occurrence.py for the context building bug

4 files reviewed, 2 comments

Edit Code Review Bot Settings | Greptile


def get_failure_reason(failed_checkins: Sequence[SimpleCheckIn]):
"""
Builds a humam readible string from a list of failed check-ins.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: 'humam' is misspelled, should be 'human'

Suggested change
Builds a humam readible string from a list of failed check-ins.
Builds a human readible string from a list of failed check-ins.

"id": str(monitor_environment.monitor.guid),
"slug": str(monitor_environment.monitor.slug),
"name": monitor_environment.monitor.name,
"config": monitor_environment.monitor.config,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Using monitor_environment.monitor.config instead of the modified config variable defeats the purpose of the schedule_type transformation on line 162

Suggested change
"config": monitor_environment.monitor.config,
"config": config,

greptile-apps[bot]

This comment was marked as duplicate.

@GitHoobar
Copy link

Review Summary

🏷️ Draft Comments (6)

Skipped posting 6 draft comments that were valid but scored below your review threshold (>=13/15). Feel free to update them here.

src/sentry/monitors/logic/incident_occurrence.py (2)

168-168: get_monitor_environment_context returns the original monitor.config dict, not the copied/modified config, so changes to schedule_type are ignored and the returned context may be inconsistent.

📊 Impact Scores:

  • Production Impact: 4/5
  • Fix Specificity: 5/5
  • Urgency Impact: 3/5
  • Total Score: 12/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In src/sentry/monitors/logic/incident_occurrence.py, line 168, the function `get_monitor_environment_context` returns the original `monitor.config` instead of the local `config` variable that may have been modified (e.g., with a humanized `schedule_type`). Change line 168 to return `"config": config,` so that the returned context includes the correct, possibly modified configuration.

139-143: get_failure_reason uses checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys() in a loop, causing O(n*m) time for large failed_checkins; this can be O(n) by using in on the dict directly.

📊 Impact Scores:

  • Production Impact: 2/5
  • Fix Specificity: 5/5
  • Urgency Impact: 2/5
  • Total Score: 9/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In src/sentry/monitors/logic/incident_occurrence.py, lines 139-143, the code uses `checkin["status"] in HUMAN_FAILURE_STATUS_MAP.keys()` inside a loop, which is inefficient for large datasets. Change this to `checkin["status"] in HUMAN_FAILURE_STATUS_MAP` to improve performance by reducing time complexity from O(n*m) to O(n).

src/sentry/monitors/logic/incidents.py (1)

40-48: MonitorCheckIn.objects.filter(...).values(...).order_by(...).values(...) fetches all matching check-ins, then slices and reverses in Python, causing unnecessary DB load and memory use for large datasets.

📊 Impact Scores:

  • Production Impact: 3/5
  • Fix Specificity: 5/5
  • Urgency Impact: 2/5
  • Total Score: 10/15

🤖 AI Agent Prompt (Copy & Paste Ready):

Optimize database query and memory usage in src/sentry/monitors/logic/incidents.py lines 40-48. The current code fetches all matching check-ins, then slices and reverses in Python, which is inefficient for large datasets. Refactor so that only the required number of check-ins are fetched from the database, and then reverse the result in Python. Replace the current logic with a query that slices before converting to a list and reversing.

src/sentry/monitors/logic/mark_failed.py (2)

76-79: mark_failed returns the result of try_incident_threshold, but if try_incident_threshold raises an exception (e.g., due to DB or logic error), the monitor environment will have already been updated, potentially leaving the system in an inconsistent state.

📊 Impact Scores:

  • Production Impact: 3/5
  • Fix Specificity: 5/5
  • Urgency Impact: 3/5
  • Total Score: 11/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In src/sentry/monitors/logic/mark_failed.py, lines 76-79, the function `mark_failed` calls `try_incident_threshold` after updating the monitor environment. If `try_incident_threshold` raises an exception, the monitor environment may be left in an inconsistent state. Please wrap the call to `try_incident_threshold` in a try/except block, log the exception, and return False if an error occurs. Insert the following code:

    try:
        return try_incident_threshold(failed_checkin, failure_issue_threshold, received)
    except Exception as e:
        logger.exception("Failed to process incident threshold after marking monitor as failed")
        return False

This should replace the current direct return statement.

52-76: mark_failed performs two sequential DB updates (update, then refresh_from_db) for every failed check-in, which can cause significant DB load and contention under high check-in volume.

📊 Impact Scores:

  • Production Impact: 3/5
  • Fix Specificity: 2/5
  • Urgency Impact: 2/5
  • Total Score: 7/15

🤖 AI Agent Prompt (Copy & Paste Ready):

Optimize DB update pattern in src/sentry/monitors/logic/mark_failed.py lines 52-76. The current code performs a filter().update() followed by refresh_from_db(), causing two DB roundtrips and potential contention under high load. Refactor to use select_for_update() in a transaction, update the fields directly, and save, reducing DB load and ensuring atomicity. Replace the current update/refresh logic with a single transaction block as shown.

src/sentry/monitors/types.py (1)

83-85: processing_key property may return a string with 'None' as the environment if environment is missing, leading to incorrect grouping of check-ins.

📊 Impact Scores:

  • Production Impact: 3/5
  • Fix Specificity: 5/5
  • Urgency Impact: 3/5
  • Total Score: 11/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In src/sentry/monitors/types.py, lines 83-85, the `processing_key` property may return a string with 'None' as the environment if the 'environment' key is missing from the payload, which can cause incorrect grouping of check-ins. Update the code so that if 'environment' is missing or None, an empty string is used instead. Only change these lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants