# Bugzilla disables webhooks after too many errors

- Status: implemented
- Date: 2023-11-27

Tracking issues:
- [21 - Setting up Webhooks with 'ANY' Product](https://github.com/mozilla/jira-bugzilla-integration/issues/21)
- [82 - Exception in JBI can delay webhook delivery](https://github.com/mozilla/jira-bugzilla-integration/issues/82)
- [181 - Can we ping Bugzilla to see if the webhook is enabled?](https://github.com/mozilla/jira-bugzilla-integration/issues/181)
- [710 - System is able to "drop" erring messages without user interference](https://github.com/mozilla/jira-bugzilla-integration/issues/710)
- [730 - Establish convention for capturing system incidents](https://github.com/mozilla/jira-bugzilla-integration/issues/730)
- [743 - Create alerts for when bugs fail to sync](https://github.com/mozilla/jira-bugzilla-integration/issues/743)

## Context and Problem Statement
When Bugzilla receives too many error responses from JBI, it stops triggering webhook calls for the entire project, causing data to stop syncing. Frequently, these errors are due to a configuration error in Jira or Bugzilla: JBI is unable to process a payload because of incorrect or incomplete configuration in Jira, or because of a data mismatch for a single bug. These outages can last multiple days in some cases.

We don't want the entire sync process to stop because of this. We have identified several options to solve this problem.

## Decision Drivers
- Amount of initial engineering effort
- Amount of maintenance effort
- Overall performance of JBI (how quickly is data able to move)
- How intuitive the solution is to the users that depend on the data (will picking the easiest option solve their needs?)

## Proposed Solution
We propose to use a file share (or a data bucket) as a dead-letter queue. Failed events will be retried every 12 hours for up to 7 days, after which they will be dropped. Errors will be logged for each event that cannot be processed. Alerts can be triggered based on these logs to let Jira and Bugzilla admins know there is a problem.

See the diagram below for a detailed flow of data. Note: this is designed to show the flow of data, not to represent coding patterns or infrastructure.

![Flow chart detailing the data flow, see expandable below for full details](./003.drawio.jpg "Proposed Solution Flow Chart")

![Image detailing the bucket data structure. DLQ bucket > folders with names like project-bug_id > json files with names like action-[id-]timestamp.json](./003-bucket.drawio.jpg "Proposed Bucket Structure")
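To make the bucket layout concrete, a minimal sketch of how keys could be built is shown below. The function names and the `storage.put` interface are placeholders for this illustration; the naming scheme itself comes from the diagram above (folders named `project-bug_id`, files named `action-[id-]timestamp.json`).

```python
import json
from datetime import datetime, timezone


def dlq_key(project: str, bug_id: int, action: str) -> str:
    """Build a DLQ object key like 'project-bug_id/action-timestamp.json'."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{project}-{bug_id}/{action}-{timestamp}.json"


def write_to_dlq(storage, project: str, bug_id: int, action: str, payload: dict) -> str:
    """Serialize a failed event and store it under its DLQ key.

    `storage` stands in for whichever file share or bucket client is used;
    it only needs a put(key, data) method for this sketch.
    """
    key = dlq_key(project, bug_id, action)
    storage.put(key, json.dumps(payload).encode("utf-8"))
    return key
```

Grouping files by `project-bug_id` keeps two operations from the flowchart cheap: checking whether a bug already has pending items in the DLQ, and deleting all of its items once a sync finally succeeds.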
<details>

<summary>Breakdown of flowchart</summary>

1. JBI receives a payload from Bugzilla or the Retry Scheduler.
1. JBI will always return 200/OK as a response.
1. If the bug is private, discard the event and log why.
1. If the bug cannot be found in the Bugzilla API, discard the event and log why.
1. If an associated action cannot be found for the event, discard the event and log why.
1. If a matching Jira issue cannot be found, and the event is not creating one, discard the event and log why.
1. If there is a mismatch between project keys in the event and Jira, discard the event and log why.
1. If there is already an event for this bug in the DLQ, do not try to process this event and skip to the Error Event Handler.
1. Write updated data to Jira's API.
   1. If successful, delete any associated items in the DLQ.
   1. If an error is returned, continue to the Error Event Handler.
1. Handle errors in the Error Event Handler.
   1. Write an error to the logs, which may be forwarded to an alerting mechanism.
   1. Write an updated event file to the DLQ if the original event is less than 7 days old.
   1. If we have exceeded 7 days from the original event, delete associated DLQ items.
1. The retry scheduler runs every 12 hours and will re-send events to JBI. Oldest events will be processed first. An additional parameter will be provided that notes these are events to reprocess. (See the sketch below.)
</details>
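As a rough Python sketch of the flow above (every name here, such as `dlq`, `jira.sync`, and the `event` attributes, is a placeholder for illustration rather than actual JBI code):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # events older than this are dropped for good


def handle_webhook(event, dlq, jira, log) -> int:
    """Process one event from Bugzilla or the Retry Scheduler."""
    try:
        if event.should_discard():           # private bug, missing bug/action/issue, key mismatch
            log.info("Discarding event for bug %s: %s", event.bug_id, event.discard_reason)
        elif dlq.has_items_for(event.bug_id) and not event.is_retry:
            # Older failures are still pending for this bug; don't process it out of order.
            handle_error(event, dlq, log, reason="bug already has items in the DLQ")
        else:
            jira.sync(event)                 # write the updated data to Jira's API
            dlq.delete_items_for(event.bug_id)
    except Exception as exc:
        handle_error(event, dlq, log, reason=str(exc))
    return 200                               # Bugzilla always receives a 200


def handle_error(event, dlq, log, reason: str) -> None:
    """Log the failure, then keep the event for retry or expire it."""
    log.error("Failed to sync bug %s: %s", event.bug_id, reason)
    if datetime.now(timezone.utc) - event.received_at < RETENTION:
        dlq.store(event)                     # picked up by the retry scheduler every 12 hours
    else:
        dlq.delete_items_for(event.bug_id)   # give up after 7 days
```

The retry scheduler then only needs to list the DLQ oldest-first and re-post each stored event to this same handler with the retry flag set.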
### Pros:
- Avoids the problem of accidentally overwriting newer data with older data
- Avoids making users correct data manually if something is misconfigured
- Gives users a whole work week to update potentially misconfigured settings
- Low maintenance effort
- Mid-low engineering effort
- High performance of JBI
- Intuitive solution with alerting via error logs

### Cons:
- Additional infrastructure for the DLQ file share or data bucket
- Events will wait up to 12 hours to be reprocessed

### Notes:
- This relies on using the `last_change_time` property from Bugzilla webhook payloads.
- It also relies on checking the `issue.comment.updated` and `updated` properties in the Jira API (see the sketch below).
- This will cause a bit more latency in event processing, but nothing noticeable to users.
- This will cause more API calls to Jira. We should consider rate limits.
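As a rough illustration of that staleness check, something like the comparison below could run before writing to Jira. The field paths and the simplified timestamp parsing are assumptions for this sketch, not the exact JBI implementation.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp (simplified; real values may need more robust parsing)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def event_is_stale(bugzilla_payload: dict, jira_issue: dict) -> bool:
    """True if Jira already holds data newer than this Bugzilla event.

    Compares the webhook's `last_change_time` with the Jira issue's `updated`
    field (for comments, `issue.comment.updated` would be checked instead), so
    a delayed retry never overwrites fresher data with an older payload.
    """
    bug_changed = parse_ts(bugzilla_payload["bug"]["last_change_time"])
    jira_updated = parse_ts(jira_issue["fields"]["updated"])
    return bug_changed < jira_updated
```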

## Considered Options
For all of these options, we will return a successful 200 response to Bugzilla's webhook calls. Note: we have to return a 200 because of Bugzilla's webhook functionality (it checks for 200 specifically, not just any OK response).

### Option 1: Log the failure and move on
JBI will log that we couldn't process a specific payload, along with relevant IDs (bug id, Jira issue id, comment id, etc.) so further investigation can be done if needed.
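For example, with the standard library logger (the logger name and field paths below are illustrative):

```python
import logging

logger = logging.getLogger("jbi.failures")


def log_unprocessable(payload: dict, reason: str) -> None:
    """Record enough identifiers that an engineer can investigate later."""
    logger.error(
        "Could not process webhook payload: %s",
        reason,
        extra={
            "bug_id": payload.get("bug", {}).get("id"),
            "jira_issue_id": payload.get("jira_issue_id"),
            "comment_id": payload.get("comment", {}).get("id"),
        },
    )
```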

**Decision Drivers**
- Amount of initial engineering effort: very low
- Amount of maintenance effort: very low
- Overall performance of JBI: high
- How intuitive the solution is: low - users will notice data is missing but see status pages that look green

**Pros:**
- The simplest solution

**Cons:**
- Will not alert people to data loss (without additional alerting functionality)
- Still requires engineers to investigate further if needed

### Option 2: Ask a human to do something
JBI will alert users that data could not be synced. This could happen through an immediate IM alert or email, a scheduled (daily?) report, or a well-formed log that an alerting workflow picks up. If alerting via IM or email directly, we should be able to identify which users to contact based on the project configuration in Bugzilla or a distribution list.

**Decision Drivers**
- Amount of initial engineering effort: low
- Amount of maintenance effort: low
- Overall performance of JBI: high
- How intuitive the solution is: high

**Pros:**
- Removes the need for engineering to investigate
- Alerts users directly that there is a problem

**Cons:**
- Alerts can be noisy and cause notification fatigue

### Option 3: Queue retries internally
Create a persistence layer within the JBI containers that will queue and retry jobs for a specific length of time (2 hours? 2 days?). This could be done with an internal cache (redis) or database (postgres) within the container. After retries exceed the maximum time, an error would be logged and the data would be dropped.
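A minimal sketch of the redis variant is shown below; the key name, the two-hour cut-off, and the `process` callback are placeholders.

```python
import json
import logging
import time

import redis  # assumes the redis-py client is available inside the container

QUEUE_KEY = "jbi:retry-queue"   # placeholder key name
MAX_AGE_SECONDS = 2 * 60 * 60   # e.g. retry for up to 2 hours

r = redis.Redis()
logger = logging.getLogger("jbi.retry")


def enqueue_failed_event(payload: dict) -> None:
    """Store the failed payload along with the time it first failed."""
    r.rpush(QUEUE_KEY, json.dumps({"first_failed_at": time.time(), "payload": payload}))


def retry_pending_events(process) -> None:
    """Retry each queued event once per pass; drop anything past the cut-off."""
    for _ in range(r.llen(QUEUE_KEY)):
        raw = r.lpop(QUEUE_KEY)
        if raw is None:
            break
        item = json.loads(raw)
        if time.time() - item["first_failed_at"] > MAX_AGE_SECONDS:
            logger.error("Dropping event after exceeding the retry window: %s", item["payload"])
            continue
        try:
            process(item["payload"])
        except Exception:
            r.rpush(QUEUE_KEY, raw)  # push back for the next retry pass
```

Because the redis instance lives inside the container, anything in this queue is lost on a redeploy, which is one of the cons listed below.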

**Decision Drivers**
- Amount of initial engineering effort: high, creating more complex containers
- Amount of maintenance effort: moderate, increased complexity of testing and debugging
- Overall performance of JBI: high
- How intuitive the solution is: low - users will notice data is missing but see status pages that look green

**Pros:**
- Allows for retries up to a designated amount of time
- Keeping all services within the container makes end-to-end testing and debugging easier (compared to option 4)

**Cons:**
- Increases complexity of the containers
- Data will not persist across container restarts (i.e. redeploys)
- High effort for engineers to build and maintain
- Less intuitive to users and engineers; we would need to report cache/queue metrics to the logs
- Data could be processed out of order, causing newer updates to get lost

### Option 4: Use a simple DLQ (dead letter queue)
We would always return 200, but any events that fail to process internally would get sent to a DLQ and be replayed later if needed. This could be a storage bucket, kubernetes volume, or table in a database. A scheduled kubernetes job would then run periodically (every 4 hours, for example) to pick these up and reprocess them.

After too many failed attempts, the payload would be marked as unprocessable (setting a flag in the table, or updating the file name).
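For the file-name variant, the marking step could be as small as the sketch below (the `MAX_ATTEMPTS` value and the `storage.rename` interface are placeholders):

```python
MAX_ATTEMPTS = 42  # placeholder retry budget


def mark_if_unprocessable(storage, key: str, attempts: int) -> None:
    """Rename a DLQ file once it has exceeded its retry budget.

    The scheduled reprocessing job would skip anything prefixed with
    'unprocessable-', so the payload is kept for inspection but no longer retried.
    """
    if attempts >= MAX_ATTEMPTS:
        folder, _, filename = key.rpartition("/")
        storage.rename(key, f"{folder}/unprocessable-{filename}")
```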

**Decision Drivers**
- Amount of initial engineering effort: mid, minimal added infrastructure (database/table or k8s volume or storage bucket)
- Amount of maintenance effort: mid-low (mid if we spin up a new database)
- Overall performance of JBI: high
- How intuitive the solution is: high for engineers, low for users

**Pros:**
- Durable and expandable solution
- Does not reduce JBI throughput
- Intuitive to engineers

**Cons:**
- Added infrastructure
- Not intuitive to end users unless we build additional reporting so they know why an update didn't come over
- Data could be processed out of order, causing newer updates to get lost

### Option 5: Use a dedicated queue solution
We would have a dedicated service that accepts all API calls from Bugzilla and puts them into a queue (Apache Kafka, RabbitMQ, etc.). JBI would shift to being a downstream service and process these events asynchronously. Any events that fail to process would get sent to a DLQ (dead letter queue) and could be replayed later if needed.

There are plenty of existing solutions we could use to solve this problem from a technical perspective. A separate ADR would be done to identify the best answer if we choose to go this route.
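To make the shape of this option concrete, the intake service would do little more than the sketch below (using the kafka-python client as an example; the topic name and broker address are placeholders, and the actual technology choice belongs to the follow-up ADR):

```python
import json

from kafka import KafkaProducer  # kafka-python, used here only as an example broker client

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # placeholder broker address
    value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
)


def accept_webhook(payload: dict) -> int:
    """Acknowledge Bugzilla immediately and hand the event to the queue.

    JBI, now a downstream consumer, reads from this topic and syncs to Jira
    asynchronously; events that fail there are published to a dead-letter topic.
    """
    producer.send("bugzilla-webhook-events", value=payload)  # placeholder topic name
    return 200  # Bugzilla still only ever sees a 200
```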

**Decision Drivers**
- Amount of initial engineering effort: high, building out more infrastructure
- Amount of maintenance effort: high, maintaining more infrastructure
- Overall performance of JBI: highest, event driven
- How intuitive the solution is: high - we'll have reporting on queue metrics

**Pros:**
- Most durable solution
- Does not reduce JBI throughput
- Intuitive to users and engineers, we can see and report on data in the queue

**Cons:**
- Most complex solution
- Highest effort for engineers to build and maintain

### Option 6: A combination of the above
Example: we could create a simple external DLQ (a table in postgres) for re-processing and then alert users if the DLQ grows too quickly.

### Miscellaneous options that we thought about
- Using a postgres or redis server to store data. This would mean another server to maintain and more maintenance downtime to coordinate.
- Using a sqlite (or similar) file to store data. This doesn't work well in a scalable solution that will have multiple pods and threads running.
- Using a queue (Kafka, pub/sub, etc.) but only as the DLQ and not as a work queue. There is a chance for data to be processed out of order with this approach if events come in too quickly.

## Links
- [What is event streaming?](https://kafka.apache.org/documentation/#intro_streaming) - Documentation from Apache Kafka

docs/adrs/003.drawio.jpg

146 KB
Loading

0 commit comments

Comments
 (0)