-
Notifications
You must be signed in to change notification settings - Fork 11
fix(worker): Improve Sentry visibility for worker task crashes #636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add signal handlers for Sentry initialization in forked workers - Flush Sentry events on graceful worker shutdown (SIGTERM) - Set task context (args/kwargs) at task start for debugging - Capture task failures to Sentry with failure_type tag - Capture hard timeouts to Sentry with immediate flush before SIGKILL - Clean up redundant comments in exception handlers This ensures crash data is sent to Sentry even when workers die unexpectedly, making it easier to debug production issues.
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (68.96%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. 📢 Thoughts on this report? Let us know! |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #636 +/- ##
==========================================
- Coverage 93.90% 93.89% -0.02%
==========================================
Files 1286 1286
Lines 46804 46833 +29
Branches 1517 1517
==========================================
+ Hits 43953 43973 +20
- Misses 2542 2551 +9
Partials 309 309
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Address Bugbot review: _capture_failure_to_sentry could throw and block super().on_failure(), metrics, and flow logging. Changes: - Move Sentry capture AFTER super().on_failure() (consistent with on_timeout) - Wrap Sentry capture in try-except for resilience - Add safe_str() helper to handle objects with broken __str__
- Wrap _capture_hard_timeout_to_sentry in on_timeout with try-except - Add _safe_str helper to celery_config.py for signal handler - Ensures metrics and flow logging are not blocked by Sentry failures
…y events CeleryIntegration already captures task failures automatically. We now only enrich the context with task args/kwargs/retries without calling capture_exception, avoiding duplicate events per Sentry issue #433.
The _enrich_sentry_context_for_failure method was setting Sentry context in on_failure, but CeleryIntegration captures exceptions when they're raised (before on_failure is called), so the enrichment was ineffective. Changes: - Remove _enrich_sentry_context_for_failure method and its call - Add 'retries' to set_sentry_task_context at task_prerun where it will be captured with exception events - Update docstring to clarify why context is set at prerun This addresses the review feedback that Sentry context enrichment occurs after the exception is already captured.
Summary
Improves Sentry visibility when worker tasks crash, making it easier to debug production issues.
Changes
Signal Handlers (celery_config.py)
worker_process_init- Initialize Sentry in each forked worker processworker_shutting_down- Flush pending Sentry events before shutdown (SIGTERM)task_prerun- Set task context (args/kwargs) for all Sentry eventsFailure/Timeout Capture (tasks/base.py)
on_failurehook - Capture exceptions to Sentry withfailure_type=task_failuretagon_timeouthook - Capture hard timeouts to Sentry with immediate flush (before SIGKILL)Cleanup
Why
When worker tasks crash (especially hard timeouts leading to SIGKILL), Sentry trace data wasn't being sent, making it difficult to debug production issues. These changes ensure:
Testing
test_sample_task_hard_timeoutandtest_sample_task_failureNote
Improves observability of worker task failures and timeouts via Sentry.
celery_config.py: re-initialize Sentry per forked worker (worker_process_init), flush events on shutdown (worker_shutting_down), and set task context (task_prerun) with safe stringified args/kwargsBaseCodecovRequest.on_timeout, capture hard timeouts to Sentry withfailure_type=hard_timeoutand immediateflush, plus metrics incrementsinitialize_sentry,is_sentry_enabled) andsentry_sdkusageWritten by Cursor Bugbot for commit 1adefc4. This will update automatically on new commits. Configure here.