feat: add failure metrics to kinesis outbox event processing#59
feat: add failure metrics to kinesis outbox event processing#59
Conversation
…r failed kinesis events with error code metadata
| emit_metric('journaled.worker.batch_process', value: total_events, worker_id:) | ||
| emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:) | ||
| emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:) | ||
| emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:) |
There was a problem hiding this comment.
| emit_metric('journaled.worker.batch_process', value: total_events, worker_id:) | |
| emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:) | |
| emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:) | |
| emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:) | |
| emit_metric('journaled.worker.batch.processed', value: total_events, worker_id:) | |
| emit_metric('journaled.worker.batch.sent', value: stats[:succeeded], worker_id:) | |
| emit_metric('journaled.worker.batch.failed_permanently', value: stats[:failed_permanently], worker_id:) | |
| emit_metric('journaled.worker.batch.failed_transiently', value: stats[:failed_transiently], worker_id:) |
🤷 nothing says we can't have more than 3 . in a stat name -- but mainly i renamed process to processed.
There was a problem hiding this comment.
Though I'm wondering if we need batch_ at all. We're emitting stats in batches, but the stats themselves are counts of individual events:
| emit_metric('journaled.worker.batch_process', value: total_events, worker_id:) | |
| emit_metric('journaled.worker.batch_sent', value: stats[:succeeded], worker_id:) | |
| emit_metric('journaled.worker.batch_failed_permanently', value: stats[:failed_permanently], worker_id:) | |
| emit_metric('journaled.worker.batch_failed_transiently', value: stats[:failed_transiently], worker_id:) | |
| emit_metric('journaled.event.processed', value: total_events, worker_id:) | |
| emit_metric('journaled.event.sent', value: stats[:succeeded], worker_id:) | |
| emit_metric('journaled.event.failed_permanently', value: stats[:failed_permanently], worker_id:) | |
| emit_metric('journaled.event.failed_transiently', value: stats[:failed_transiently], worker_id:) |
There was a problem hiding this comment.
Also thoughts on event.failed_permanently and event.failed_transiently becoming event.failed and event.errored? (Aligning with DJ terminology)
There was a problem hiding this comment.
Yeah all sounds good! will update
| metrics = calculate_queue_metrics | ||
| emit_metric('journaled.worker.queue_total_count', value: metrics[:total_count], worker_id:) | ||
| emit_metric('journaled.worker.queue_workable_count', value: metrics[:workable_count], worker_id:) | ||
| emit_metric('journaled.worker.queue_erroring_count', value: metrics[:erroring_count], worker_id:) |
There was a problem hiding this comment.
| emit_metric('journaled.worker.queue_erroring_count', value: metrics[:erroring_count], worker_id:) | |
| emit_metric('journaled.worker.queue_failed_count', value: metrics[:erroring_count], worker_id:) |
Should it be "failed count"? If it's a batch query it can really only count the number of hard-failed rows, not the number of transiently-erroring rows, right?
smudge
left a comment
There was a problem hiding this comment.
TAFN - two points on naming conventions.
Summary
This adds more visibility into both batch and individual kinesis errors by emitting metrics with event information and error codes. This will allow us to track rate limiting errors more closely, and have insight into other errors that occur.
/no-platform