# Bugzilla disables webhooks after too many errors

- Status: implemented
- Date: 2023-11-27

Tracking issues:
- [21 - Setting up Webhooks with 'ANY' Product](https://github.com/mozilla/jira-bugzilla-integration/issues/21)
- [82 - Exception in JBI can delay webhook delivery](https://github.com/mozilla/jira-bugzilla-integration/issues/82)
- [181 - Can we ping Bugzilla to see if the webhook is enabled?](https://github.com/mozilla/jira-bugzilla-integration/issues/181)
- [710 - System is able to "drop" erring messages without user interference](https://github.com/mozilla/jira-bugzilla-integration/issues/710)
- [730 - Establish convention for capturing system incidents](https://github.com/mozilla/jira-bugzilla-integration/issues/730)
- [743 - Create alerts for when bugs fail to sync](https://github.com/mozilla/jira-bugzilla-integration/issues/743)

## Context and Problem Statement
When Bugzilla receives too many error responses from JBI, it stops triggering webhook calls for the entire project, causing data to stop syncing. Frequently, these errors are due to JBI being unable to process a payload because of a configuration error (or incomplete configuration) in Jira or Bugzilla, or a mismatch of data for a single bug. These outages can last multiple days in some cases.

We don't want the entire sync process to stop because of this. We have identified several options to solve this problem.
| 18 | + |
| 19 | +## Decision Drivers |
| 20 | +- Amount of initial engineering effort |
| 21 | +- Amount of maintenance effort |
| 22 | +- Overall performance of JBI (how quickly is data able to move) |
| 23 | +- How intuitive the solution is to the users that depend on the data (will picking the easiest option solve their needs?) |
| 24 | + |
## Proposed Solution
We propose using a file share (or a data bucket) as a dead letter queue (DLQ). Events will be retried every 12 hours for up to 7 days, after which they will be dropped. Errors will be logged for each event that cannot be processed, and alerts can be triggered from those logs to let Jira and Bugzilla admins know there is a problem.

See the diagram below for a detailed flow of data (a condensed code sketch of the same flow follows the breakdown). Note: the diagram is designed to show the flow of data, not to be representative of coding patterns or infrastructure.

![Image detailing the bucket data structure. DLQ bucket > folders with names like project-bug_id > json files with names like action-[id-]timestamp.json](./003-bucket.drawio.jpg "Proposed Solution Flow Chart")
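
As a small illustration of the file naming shown in the image, a helper along these lines could build an object path for a queued event. The function name, arguments, and timestamp format here are assumptions for illustration only.

```python
from datetime import datetime, timezone


def dlq_object_path(project_key: str, bug_id: int, action: str, event_id: str | None = None) -> str:
    """Build a path like `<project>-<bug_id>/<action>-[<id>-]<timestamp>.json` (illustrative)."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    event_part = f"{event_id}-" if event_id else ""
    return f"{project_key}-{bug_id}/{action}-{event_part}{timestamp}.json"


# e.g. dlq_object_path("myproject", 1234567, "comment") -> "myproject-1234567/comment-20231127T090000.json"
```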
| 34 | + |
| 35 | + |
| 36 | +<details> |
| 37 | + |
| 38 | + <summary>Breakdown of flowchart</summary> |
| 39 | + |
| 40 | + 1. JBI receives a payload from Bugzilla or the Retry Scheduler. |
| 41 | + 1. JBI will always return 200/OK for a response. |
| 42 | + 1. If the bug is private, discard the event and log why. |
| 43 | + 1. If the bug cannot be found in the Bugzilla API, discard the event and log why. |
| 44 | + 1. If an associated action cannot be found for the event, discard the event and log why. |
| 45 | + 1. If a matching Jira issue cannot be found, and the event is not creating one, discard the event and log why. |
| 46 | + 1. If there is a mismatch between project keys in the event and Jira, discard the event and log why. |
| 47 | + 1. If there is already an event for this bug in the DLQ, do not try to process this event and skip to the Error Event Handler. |
| 48 | + 1. Write updated data to Jira's API. |
| 49 | + 1. If successful, delete any associated items in DLQ. |
| 50 | + 1. If error is returned, continue to Error Event Handler |
| 51 | + 1. Handle errors in Error Event Handler |
| 52 | + 1. Write an error to the logs, which may be forwarded to an alerting mechanism. |
| 53 | + 1. Write an updated event file to the DLQ if the original event is less than 7 days old. |
| 54 | + 1. If we have exceeded 7 days from the original event, delete associated DLQ items. |
| 55 | + 1. The retry scheduler runs every 12 hours and will re-send events to JBI. Oldest events will be processed first. An additional parameter will be provided that notes these are events to reprocess. |
| 56 | +</details> |
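
The error-handling branch of this flow can be sketched roughly as below. This is a minimal sketch assuming the DLQ is a mounted file share at a hypothetical path and that the event payload carries `project_key`, `bug_id`, `action`, and `last_change_time` fields; the real field names and storage layer may differ.

```python
import json
import logging
from datetime import datetime, timedelta, timezone
from pathlib import Path

log = logging.getLogger("jbi")

DLQ_ROOT = Path("/mnt/dlq")    # hypothetical mount point for the file share / bucket
RETENTION = timedelta(days=7)  # retry window from the proposed solution


def bug_folder(event: dict) -> Path:
    return DLQ_ROOT / f"{event['project_key']}-{event['bug_id']}"


def handle_error(event: dict, error: Exception | None, reason: str) -> None:
    """Error Event Handler: log the failure, then park or expire the event."""
    log.error("Failed to sync bug %s: %s (%s)", event["bug_id"], reason, error)
    original = datetime.fromisoformat(event["last_change_time"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - original >= RETENTION:
        clear_dlq(event)  # more than 7 days old: drop it and its queued siblings
        return
    folder = bug_folder(event)
    folder.mkdir(parents=True, exist_ok=True)
    filename = f"{event['action']}-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json"
    (folder / filename).write_text(json.dumps(event))


def clear_dlq(event: dict) -> None:
    """Delete queued items for a bug, e.g. after a successful write to Jira."""
    folder = bug_folder(event)
    if folder.exists():
        for item in folder.glob("*.json"):
            item.unlink()
```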

### Pros:
 - Avoids the problem of accidentally overwriting newer data with older data
 - Avoids making users correct data manually if something is misconfigured
 - Gives users a whole work week to update potentially misconfigured settings
 - Low maintenance effort
 - Mid-low engineering effort
 - High performance of JBI
 - Intuitive solution with alerting via error logs

### Cons:
 - Additional infrastructure for the DLQ file share or data bucket
 - Events will wait up to 12 hours to be reprocessed

### Notes:
 - This relies on using the ``last_change_time`` property from Bugzilla webhook payloads.
 - Also relies on checking the ``issue.comment.updated`` and ``updated`` properties in the Jira API.
 - This will cause a bit more latency in event processing, but nothing noticeable to users.
 - This will cause more API calls to Jira. We should consider rate limits.
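
A minimal sketch of the staleness check these notes imply, assuming both sides provide ISO-8601 timestamps; the exact Jira fields to compare and the skip-if-newer behaviour are assumptions:

```python
from datetime import datetime


def would_overwrite_newer_data(bugzilla_last_change_time: str, jira_updated: str) -> bool:
    """Return True when the Jira side was updated after the Bugzilla event,
    meaning applying this (older) event could clobber newer data."""
    bz = datetime.fromisoformat(bugzilla_last_change_time.replace("Z", "+00:00"))
    jira = datetime.fromisoformat(jira_updated.replace("Z", "+00:00"))
    return jira > bz


# e.g. would_overwrite_newer_data("2023-11-27T10:00:00+00:00", "2023-11-27T11:30:00+00:00") -> True
```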

## Considered Options
For all of these options, we will return a successful 200 response to Bugzilla's webhook calls. Note: we have to return a 200 specifically because of how Bugzilla's webhook functionality works (it checks for a 200, not just any OK response).
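
As a sketch of the "always return 200" behaviour in a FastAPI handler (the route path and `process_event` helper are illustrative, not JBI's actual code):

```python
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
log = logging.getLogger("jbi")


def process_event(payload: dict) -> None:
    """Placeholder for the real sync logic (may raise on failure)."""


@app.post("/bugzilla_webhook")  # illustrative path; the real route may differ
async def bugzilla_webhook(request: Request) -> JSONResponse:
    payload = await request.json()
    try:
        process_event(payload)
    except Exception:
        # Swallow the error here so Bugzilla sees a 200; real handling (logging,
        # DLQ, alerting) is what the options below are about.
        log.exception("Failed to process webhook payload")
    return JSONResponse(status_code=200, content={"status": "accepted"})
```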

### Option 1: Log the failure and move on
JBI will log that we couldn't process a specific payload, along with relevant IDs (bug ID, Jira issue ID, comment ID, etc.) so further investigation can be done if needed.
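
For example, using Python's standard logging with structured extra fields (the field names and values are illustrative):

```python
import logging

logger = logging.getLogger("jbi")

logger.error(
    "Unable to process webhook payload",
    extra={
        "bug_id": 1234567,            # illustrative values
        "jira_issue_key": "ABC-123",
        "comment_id": 987,
        "reason": "project key mismatch",
    },
)
```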

**Decision Drivers**
- Amount of initial engineering effort: very low
- Amount of maintenance effort: very low
- Overall performance of JBI: high
- How intuitive the solution is: low - users will notice data is missing but see status pages that look green

**Pros:**
- The simplest solution

**Cons:**
- Will not alert people to data loss (without additional alerting functionality)
- Still requires engineers to investigate further if needed

### Option 2: Ask a human to do something
JBI will alert users that data could not be synced. This could happen through an immediate IM alert or email, a scheduled (daily?) report, or a well-formed log that an alerting workflow picks up. We should be able to identify which users to notify from the project configuration in Bugzilla, or use a distribution list if sending an IM or email directly.

**Decision Drivers**
- Amount of initial engineering effort: low
- Amount of maintenance effort: low
- Overall performance of JBI: high
- How intuitive the solution is: high

**Pros:**
- Removes need for engineering to investigate
- Alerts users directly that there is a problem

**Cons:**
- Alerts can be noisy and cause notification fatigue

### Option 3: Queue retries internally
Create a persistence layer within the JBI containers that will queue and retry jobs for a specific length of time (2 hours? 2 days?). This could be done with an internal cache (redis) or database (postgres) within the container. After retries exceed the maximum time length, an error would be logged and the data would be dropped.
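
A self-contained sketch of this option, using an in-memory list purely for illustration (a redis or postgres store inside the container would replace it); the 2-hour cutoff and payload shape are placeholders:

```python
import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable

log = logging.getLogger("jbi")
MAX_AGE = timedelta(hours=2)  # placeholder retry window


@dataclass
class PendingEvent:
    payload: dict
    first_seen: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Lives inside the container process, so it is lost on restart/redeploy (a key con below).
retry_queue: list[PendingEvent] = []


def retry_pending(process: Callable[[dict], None]) -> None:
    """Retry queued events; log and drop anything older than MAX_AGE."""
    still_pending: list[PendingEvent] = []
    for item in retry_queue:
        if datetime.now(timezone.utc) - item.first_seen > MAX_AGE:
            log.error("Dropping event for bug %s after the retry window", item.payload.get("bug_id"))
            continue
        try:
            process(item.payload)
        except Exception:
            still_pending.append(item)  # keep it for the next pass
    retry_queue[:] = still_pending
```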

**Decision Drivers**
- Amount of initial engineering effort: high, creating more complex containers
- Amount of maintenance effort: moderate, increased complexity of testing and debugging
- Overall performance of JBI: high
- How intuitive the solution is: low - users will notice data is missing but see status pages that look green

**Pros:**
- Allows for retries up to a designated amount of time
- Keeping all services within the container makes end-to-end testing and debugging easier (compared to option 4)

**Cons:**
- Increases complexity of the containers
- Data will not persist across container restarts (i.e. redeploys)
- High effort for engineers to build and maintain
- Less intuitive to users and engineers; we would need to report cache/queue metrics to the logs
- Data could be processed out of order, causing newer updates to get lost

### Option 4: Use a simple DLQ (dead letter queue)
We would always return 200, but any events that fail to process internally would get sent to a DLQ and be replayed later if needed. This could be a storage bucket, kubernetes volume, or table in a database. A scheduled kubernetes job would then run to pick these up and reprocess them (every 4 hours, for example).

After too many failed attempts, the payload would be marked as unprocessable (setting a flag in the table, or updating the file name).
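
A sketch of what the scheduled job could look like for the file-based variant. The directory, attempt limit, and internal replay endpoint are assumptions; in particular, it assumes the replay endpoint reports failures with a non-2xx status, unlike the public webhook route that always returns 200.

```python
import json
from pathlib import Path

import requests

DLQ_DIR = Path("/mnt/dlq")          # hypothetical mounted volume or synced bucket
REPLAY_URL = "http://jbi/replay"    # hypothetical internal endpoint that reports real errors
MAX_ATTEMPTS = 5                    # placeholder


def reprocess_dlq() -> None:
    """Replay queued events oldest first; mark files that keep failing as unprocessable."""
    for path in sorted(DLQ_DIR.rglob("*.json"), key=lambda p: p.stat().st_mtime):
        event = json.loads(path.read_text())
        attempts = event.get("attempts", 0)
        if attempts >= MAX_ATTEMPTS:
            path.rename(path.with_suffix(".unprocessable"))  # "mark as unprocessable"
            continue
        response = requests.post(REPLAY_URL, json=event, timeout=30)
        if response.ok:
            path.unlink()            # reprocessed successfully
        else:
            event["attempts"] = attempts + 1
            path.write_text(json.dumps(event))
```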

**Decision Drivers**
- Amount of initial engineering effort: mid, minimal added infrastructure (database/table or k8s volume or storage bucket)
- Amount of maintenance effort: mid-low (mid if we spin up a new database)
- Overall performance of JBI: high
- How intuitive the solution is: high to engineers, low to users

**Pros:**
- Durable and expandable solution
- Does not reduce JBI throughput
- Intuitive to engineers

**Cons:**
- Added infrastructure
- Not intuitive to end users unless we build additional reporting so they know why an update didn't come over
- Data could be processed out of order, causing newer updates to get lost

### Option 5: Use a dedicated queue solution
We would have a dedicated service that accepts all API calls from Bugzilla and puts them into a queue (apache kafka, rabbitMQ, etc.). JBI would shift to being a downstream service and process these events asynchronously. Any events that fail to process would get sent to a DLQ (dead letter queue) from which they could be replayed later if needed.

There are plenty of existing solutions we could use to solve this problem from a technical perspective. A separate ADR would be written to identify the best option if we choose to go this route.

**Decision Drivers**
- Amount of initial engineering effort: high, building out more infrastructure
- Amount of maintenance effort: high, maintaining more infrastructure
- Overall performance of JBI: highest, event driven
- How intuitive the solution is: high - we'll have reporting on queue metrics

**Pros:**
- Most durable solution
- Does not reduce JBI throughput
- Intuitive to users and engineers; we can see and report on data in the queue

**Cons:**
- Most complex solution
- Highest effort for engineers to build and maintain

### Option 6: A combination of the above
Example: we could create a simple external DLQ (a table in postgres) for reprocessing and then alert users if the DLQ grows too quickly.

### Miscellaneous options that we thought about
- Using a postgres or redis server to store data. This would mean another server to maintain, plus coordinating maintenance downtime.
- Using a sqlite (or similar) file to store data. This doesn't work well in a scalable solution that will have multiple pods and threads running.
- Using a queue (kafka, pub/sub, etc.) but only as the DLQ and not as a work queue. There is a chance for data to be processed out of order with this approach if events come in too quickly.

## Links
- [What is event streaming?](https://kafka.apache.org/documentation/#intro_streaming) - Documentation from Apache Kafka