Commit b5035b2

design: add More Zero-Downtime Upgrades design doc

# More Zero-Downtime Upgrades

We currently do zero-downtime upgrades, where we hydrate an `environmentd` and
its `clusterd` processes at a new version before cutting over to that version.
This ensures that the computation running on clusters is ready when we cut
over, so there is no downtime in processing and no lag in freshness. However,
due to an implementation detail of how cutting over from one version to the
next works (see context below), we have an interval of tens of seconds
(typically below 30 seconds) where the environment is not reachable by the
user.

This document proposes a design where we can achieve true zero-downtime
upgrades for DML and DQL queries, with the caveat that we still have a window
(again on the order of tens of seconds) where users cannot issue DDL.

Achieving true zero-downtime upgrades for everything should be our ultimate
goal, but for reasons explained below this is a decently hard engineering
challenge, so we have to get there incrementally, as laid out here.

## Goals

- True zero-downtime upgrades for DML and DQL

The lived experience for users should be: **no perceived downtime for DQL and
DML, and an error message about DDL not being allowed during the version
cutover window, which should be on the order of tens of seconds**.

A meta goal of this document is to sketch what we need for our larger,
long-term goals, and to lay out an incremental path for getting there,
extracting incremental customer value along the way.

## Non-Goals

- True zero-downtime upgrades for DDL
- High Availability, that is, no downtime during upgrades or at any other time
- Physical Isolation

## Context

I will sketch the current approach as well as some of the related and desirable
features.

### Current Zero-Downtime Upgrade Procedure

1. New `environmentd` starts with a higher `deploy_generation` and a higher
   `version`
2. Boots in read-only mode: opens the catalog in read-only mode, spawns
   `clusterd` processes at the new version, hydrates dataflows, everything is
   kept in read-only mode
3. Signals readiness: once clusters report hydrated and caught up
4. Orchestrator triggers promotion: the new `environmentd` opens the catalog in
   read/write mode, which writes its `deploy_generation`/`version` to the
   catalog, **then halts and is restarted in read/write mode by the
   orchestrator**
5. Old `environmentd` is fenced out: it notices the newer version in the
   catalog and **halts**
6. New `environmentd` re-establishes connections to clusters and brings them
   out of read-only mode
7. Cutover: network routes are updated to point at the new generation, and the
   orchestrator reaps the old deployment

Why there's downtime:

- The old `environmentd` exits immediately when it is fenced
- The new `environmentd` transitions from read-only to read/write using the
  **halt-and-restart** approach
- Network routes are updated to point at the new process
- During this window, the environment is unreachable
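
To make the source of that downtime concrete, here is a minimal sketch, with
invented names rather than the actual Materialize code, of how the fenced-out
process reacts today: on noticing a newer deploy generation it simply halts,
dropping every open session. The proposal below changes exactly this reaction.

```rust
// Minimal sketch (invented types, not the actual Materialize code) of the
// current fencing reaction in the old `environmentd`: halt immediately.

struct Generation(u64);

enum CatalogEvent {
    /// Another deployment wrote a newer deploy generation to the catalog.
    FencedBy { deploy_generation: Generation },
    /// An ordinary catalog update.
    Update,
}

fn handle_catalog_event(my_generation: &Generation, event: CatalogEvent) {
    match event {
        CatalogEvent::FencedBy { deploy_generation }
            if deploy_generation.0 > my_generation.0 =>
        {
            // Today: exit right away. All open SQL sessions are dropped and the
            // environment stays unreachable until the new generation has been
            // restarted in read/write mode and routes point at it.
            eprintln!("fenced by deploy generation {}, halting", deploy_generation.0);
            std::process::exit(0);
        }
        _ => {
            // Not fenced: keep serving traffic.
        }
    }
}
```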

### Catalog Migrations

When a new version first establishes itself in the durable catalog by writing
down its version and deploy generation, it sometimes has to modify the schema
or contents of parts of the catalog. We call these _catalog migrations_. They
are a large part of why an older version currently cannot read or interact with
the catalog once a newer version has "touched" it.

There are largely two types of migrations:

- _Streaming Migrations_: an entry of the catalog is modified without looking
  at the whole catalog. This can be adding fields or some other way of changing
  the contents.
- _Global Migrations_: we modify the catalog after having observed all of its
  current content. This can be adding certain entries once we see that they are
  not there, or updating entries with some globally learned fact.

As the names suggest, the first kind is easier to apply to a stream of catalog
changes as you read them, while the second kind can only be applied to a full
snapshot.
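
To illustrate the distinction, here is a minimal sketch with illustrative types
only (not Materialize's actual catalog code) of the two migration shapes: one
that can rewrite entries one at a time, and one that needs the full snapshot
before it can decide what to change.

```rust
// Minimal sketch (illustrative types, not the real catalog code) of the two
// migration kinds described above.

/// A single durable catalog entry, e.g. the serialized definition of an object.
struct CatalogEntry {
    id: u64,
    payload: String,
}

/// Streaming migration: rewrites one entry at a time, so it can be applied to
/// a stream of catalog updates as they are read.
fn streaming_migration(mut entry: CatalogEntry) -> CatalogEntry {
    // e.g. add a new field with a default value
    entry.payload.push_str(", \"new_field\": null");
    entry
}

/// Global migration: needs the full snapshot before it can decide what to do,
/// e.g. inserting an entry only if it does not already exist.
fn global_migration(snapshot: &mut Vec<CatalogEntry>) {
    if !snapshot.iter().any(|e| e.id == 42) {
        snapshot.push(CatalogEntry {
            id: 42,
            payload: "\"builtin\": true".into(),
        });
    }
}
```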
87+
88+
### Why Halt-and-Restart?
89+
90+
An alternative that we considered would have been to let the read-only
91+
`environmentd` follow all the catalog changes that the leader `environmentd` is
92+
doing, live apply any needed migrations to in-memory state, and then at the end
93+
apply migrations durably before transitioning from read-only to read/write
94+
_within_ the running process.
95+
96+
There are two main things that made us choose the halt-and-restart approach, and
97+
they still hold today:
98+
99+
- We think it is decently hard to follow catalog changes (in the read-only
100+
`environmentd`) _and_ incrementally apply migrations.
101+
- We liked the "clean slate" that the halt-and-restart approach gives. The
102+
`environmentd` that is coming up at the new version in read/write mode
103+
behaves like an `environmentd` that had to restart for any number of
104+
other reasons. It's exactly the same code path.
105+
106+
In the context of this new proposal, I now think that even with a _within
107+
process_ cutover, this might take on the order of seconds to get everything in
108+
place. Which would mean we still have on the order of seconds of downtime
109+
instead of the "almost zero" that I propose below.

### Related Features

- **High Availability**: No downtime, during upgrades or at any other time.
  Thinking in terms of our current architecture, this requires multiple
  `environmentd`-flavored processes, so that in case of a failure the load can
  be immediately moved to another `environmentd`. This requires the same
  engineering work that I propose in this document, plus more work on top.
  Ultimately we want to achieve this, but I think we can get there
  incrementally.
- **Physical Isolation**: Closely related to the above, where we also have
  multiple `environmentd`-flavored processes, but here in order to isolate the
  query/control workload of one use case from other use cases. This also
  requires much of the same work as this proposal and High Availability, plus
  other, more specific work on top.

The thing that both of these features require is the ability to follow catalog
changes and react to them, applying their implications.

The work that they share with this proposal is the work around enabling a
number of components to handle concurrent instances (see below).

Further, to achieve true zero-downtime upgrades for DQL, DML, **and DDL**, we
need something like High Availability to work across multiple versions. The
hard part here is forward/backward compatibility, so that multiple instances of
`environmentd` at different versions can interact with the catalog and builtin
tables.

## Proposal

I propose that we change from our current upgrade procedure to this flow:

1. New `environmentd` starts with a higher `deploy_generation`/`version`
2. Boots in read-only mode: opens the catalog in read-only mode, spawns
   `clusterd` processes at the new version, hydrates dataflows, everything is
   kept in read-only mode
3. Signals readiness: once clusters report hydrated and caught up
4. Orchestrator triggers promotion: the new `environmentd` opens the catalog in
   read/write mode, which writes its `deploy_generation`/`version` to the
   catalog, **then halts and is restarted in read/write mode**
5. Old `environmentd` notices the new version in the catalog and enters
   **lame-duck mode** (see the sketch after this list): it does not halt, none
   of its cluster processes are reaped, and it can still serve queries, but it
   cannot apply writes to the catalog anymore and so cannot process DDL
6. New `environmentd` re-establishes connections to clusters and brings them
   out of read-only mode
7. Cutover: once orchestration determines that the new-version `environmentd`
   is ready to serve queries, we update network routes
8. Eventually: resources of the old-version deployment are reaped
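
For contrast with the current behavior sketched in the Context section, here is
a minimal sketch, again with invented names only, of how step 5 changes:
instead of halting, the fenced `environmentd` flips into lame-duck mode and
only gives up the ability to write to the catalog.

```rust
// Minimal sketch (invented types, not the actual Materialize code) of the
// proposed reaction to being fenced: enter lame-duck mode instead of halting.

struct Generation(u64);

enum Mode {
    /// Normal leader: serves DQL, DML, and DDL.
    ReadWrite,
    /// Lame duck: keeps serving DQL and DML off the last catalog snapshot it
    /// understands, but rejects anything that would write to the catalog.
    LameDuck,
}

enum CatalogEvent {
    FencedBy { deploy_generation: Generation },
    Update,
}

fn handle_catalog_event(my_generation: &Generation, mode: &mut Mode, event: CatalogEvent) {
    match event {
        CatalogEvent::FencedBy { deploy_generation }
            if deploy_generation.0 > my_generation.0 =>
        {
            // Proposed: do not halt, do not reap cluster processes. Just stop
            // writing to the catalog; the orchestrator cuts routes over and
            // reaps this deployment later.
            *mode = Mode::LameDuck;
        }
        _ => {
            // Keep processing catalog updates as before.
        }
    }
}
```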

The big difference from before is that the fenced deployment is not immediately
halted/reaped but can still serve queries. This includes DML and DQL (so
INSERTs and SELECTs), but not DDL.

We cut over network routes once the new deployment is fully ready, so any
residual downtime is the route change itself. During that window the old
deployment still accepts connections but rejects DDL with an error message.
When cutting over, we drop connections and rely on reconnects to reach the new
version.

This modified flow requires a number of changes in different components. I will
sketch these below, but each of these sections will require a small-ish design
document of its own or at the very least a thorough GitHub issue.

## Lame-Duck `environmentd` at Old Version

The observation that makes this proposal work is that neither the schema nor
the contents of user collections (so tables, sources, etc.) change between
versions. So both the old version and the new version _should_ be able to
collaborate in writing down source data, accepting INSERTs for tables, and
serving SELECT queries.

DDL will not work because the new-version deployment will potentially have
migrated the catalog (applied catalog migrations), so the old version cannot be
allowed to write to it anymore. And it wouldn't currently be able to read the
newer changes either. Backward/forward compatibility for catalog changes is one
of the things we need to get true zero-downtime upgrades working for all types
of queries.

Once an `environmentd` notices that there is a newer version in the catalog, it
enters lame-duck mode, where it does not allow writes to the catalog anymore
and serves the DQL/DML workload off of the catalog snapshot that it has. An
important detail to figure out here is what happens when the old-version
`environmentd` process crashes while we're in the lame-duck phase. If the since
of the catalog shard has advanced, it will not be able to restart and read the
catalog at the old version that it understands. We might have to do some
trickery around holding back the since of the catalog shard around upgrades, or
similar.
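
As a rough illustration of what lame-duck mode means at the session layer, here
is a minimal sketch with made-up statement and response types; it is not
Materialize's actual coordinator code, only the shape of the decision.

```rust
// Minimal sketch (made-up types, not the actual coordinator code) of statement
// handling while in lame-duck mode: reads and blind writes pass, DDL is
// rejected with an error.

enum Statement {
    Select(String),                 // DQL
    Insert(String),                 // DML, a blind write to a user table
    CreateMaterializedView(String), // DDL, requires a catalog write
}

enum Response {
    Rows(Vec<String>),
    Ok,
    Error(String),
}

fn handle_statement(lame_duck: bool, stmt: Statement) -> Response {
    match stmt {
        // Reads are served off the catalog snapshot this process already has.
        Statement::Select(_) => Response::Rows(vec![]),
        // Blind writes to user tables do not touch the catalog, so they are
        // still allowed (read-then-write needs the OCC loop described later).
        Statement::Insert(_) => Response::Ok,
        // Anything that must write to the catalog is rejected while the new
        // deployment owns it.
        Statement::CreateMaterializedView(_) if lame_duck => Response::Error(
            "DDL is temporarily unavailable during a version upgrade; please retry shortly"
                .to_string(),
        ),
        Statement::CreateMaterializedView(_) => Response::Ok,
    }
}
```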

TODO: It could even be the case that builtin tables are compatible for writing
between the versions, because of how persist schema backward compatibility
works. We have to audit whether it would work for both the old and the new
version to write at the same time. This is important for builtin tables that
are not derived from catalog state, for example `mz_sessions`, the audit log,
and probably others.

TODO: Figure out whether we want to allow background tasks to keep writing.
This includes, but is not limited to, storage usage collection.

## Change How Old-Version Processes are Reaped

Currently, the fenced-out `environmentd` halts itself when the new version
fences it via the catalog, and a newly establishing `environmentd` will reap
replica processes (`clusterd`) of any version older than itself.

We can't keep this behavior, because the lame-duck `environmentd` still needs
all of its processes alive to serve traffic.

Instead, we need to change the orchestration logic to determine when the
new-version `environmentd` is ready to serve traffic, then cut over, and only
eventually reap the processes of the old version.
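
A minimal sketch of the intended ordering, with an invented `Orchestrator`
trait standing in for whatever the real orchestration layer exposes; the point
is only that reaping the old generation happens strictly after routes point at
the new one.

```rust
// Minimal sketch (invented trait and types) of the proposed reap ordering:
// wait for readiness, switch routes, and only then reap the old generation.

struct GenerationId(u64);

trait Orchestrator {
    /// Has the new-version `environmentd` reported that it can serve queries?
    fn is_ready_to_serve(&self, generation: &GenerationId) -> bool;
    /// Point network routes at the given generation.
    fn route_traffic_to(&mut self, generation: &GenerationId);
    /// Tear down `environmentd` and `clusterd` processes of a generation.
    fn reap(&mut self, generation: &GenerationId);
}

fn cut_over<O: Orchestrator>(orch: &mut O, old: GenerationId, new: GenerationId) {
    // The old generation keeps serving DQL/DML in lame-duck mode while we wait.
    while !orch.is_ready_to_serve(&new) {
        std::thread::sleep(std::time::Duration::from_millis(100));
    }
    // Residual downtime is only the route change plus client reconnects.
    orch.route_traffic_to(&new);
    // Reaping happens last, and can be delayed further to drain connections.
    orch.reap(&old);
}
```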

## Get Orchestration Ready for Managing Version Cutover

We need to update the orchestration logic to use the new flow as outlined
above.

TODO: What's the latency incurred by cutting over between `environmentd`s when
everything is ready? That's how close to zero we will get with this approach.

## Get Builtin Tables Ready for Concurrent Writers

There are two types of builtin tables:

1. Tables that mirror catalog state
2. Other tables, which are largely append-only

We have to audit and find out which is which. For builtin tables that mirror
catalog state we can use the self-correcting approach that we also use for
materialized views and for a number of builtin storage-managed collections. For
the others, we have to see whether concurrent append-only writers would work.

One of the thorny tables here is probably storage-usage data.
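
For the tables that mirror catalog state, the self-correcting approach boils
down to repeatedly diffing the desired contents (derived from the catalog)
against what is durably in the collection and appending only the difference, so
concurrent writers converge rather than conflict. A minimal sketch of that
correction step, using plain in-memory maps instead of the real persist-backed
machinery:

```rust
// Minimal sketch (plain data structures, not the real persist-backed
// implementation) of a self-correcting writer for a builtin table that mirrors
// catalog state. Two writers running this loop against the same desired state
// converge on the same contents.

use std::collections::HashMap;

/// Multiset of rows, keyed by row contents with a count (a retraction is a
/// negative diff).
type Contents = HashMap<String, i64>;

/// Compute the updates (row, diff) that turn `actual` into `desired`.
fn correction(desired: &Contents, actual: &Contents) -> Vec<(String, i64)> {
    let mut updates = Vec::new();
    for (row, want) in desired {
        let have = actual.get(row).copied().unwrap_or(0);
        if want - have != 0 {
            updates.push((row.clone(), want - have));
        }
    }
    for (row, have) in actual {
        if !desired.contains_key(row) {
            updates.push((row.clone(), -have));
        }
    }
    updates
}

/// Apply a batch of updates to the collection contents.
fn apply(actual: &mut Contents, updates: &[(String, i64)]) {
    for (row, diff) in updates {
        *actual.entry(row.clone()).or_insert(0) += diff;
        if actual[row] == 0 {
            actual.remove(row);
        }
    }
}
```

In the real system the `actual` side comes from reading the collection back,
and a concurrent writer simply means the next round of the loop computes a
smaller, eventually empty, correction.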

## Get Builtin Sources Ready for Concurrent Writers

Similar to tables, there are two types:

1. "Differential sources", where the controller knows what the desired state is
   and updates the contents to match it when needed
2. Append-only sources

Here again, we have to audit what works with concurrent writers and whether
append-only sources remain correct.

NOTE: Builtin sources are also called storage-managed collections, because
writing to them is mediated by the storage controller, specifically by async
background tasks that it starts for this purpose.

## Concurrent User-Table Writes

We need to get user tables ready for concurrent writers. Blind writes are easy.
Read-then-write is hard, but we can do OCC (optimistic concurrency control),
with a loop that does:

1. Subscribe to changes
2. Read the latest changes, derive writes
3. Get a write timestamp based on the latest read timestamp
4. Attempt the write; go back to 2 if that fails, succeed otherwise

We could even render that as a dataflow and attempt the read-then-write
directly on a cluster; if you squint, this is almost a "one-shot continual
task" (if you're familiar with that concept). But we could initially keep that
loop in `environmentd`, closer to how the current implementation works.
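
Here is a minimal sketch of that OCC loop, written against an invented
`TableHandle` trait with compare-and-append semantics; the real implementation
would go through persist and the timestamp oracle, and all names here are
illustrative.

```rust
// Minimal sketch (invented `TableHandle`, not the real persist/table API) of
// the OCC read-then-write loop: read at the latest timestamp, derive writes,
// and attempt to append at the expected upper. If another writer got there
// first, the append fails with the newer upper and we retry.

type Timestamp = u64;
type Row = String;

trait TableHandle {
    /// Snapshot of the table as of `as_of`.
    fn read_at(&self, as_of: Timestamp) -> Vec<Row>;
    /// Current upper frontier: the smallest timestamp not yet written.
    fn upper(&self) -> Timestamp;
    /// Append `updates` if the upper is still `expected_upper`; on a mismatch,
    /// return the actual upper so the caller can retry.
    fn compare_and_append(
        &mut self,
        updates: Vec<(Row, i64)>,
        expected_upper: Timestamp,
    ) -> Result<(), Timestamp>;
}

/// `derive` maps a snapshot to the updates we want to make (e.g. the effect of
/// an UPDATE or DELETE statement). In the real loop we would subscribe to
/// changes instead of re-reading a full snapshot on every retry.
fn read_then_write<T, F>(table: &mut T, derive: F)
where
    T: TableHandle,
    F: Fn(&[Row]) -> Vec<(Row, i64)>,
{
    let mut expected_upper = table.upper();
    loop {
        // 2. Read the latest state and derive the writes from it.
        let snapshot = table.read_at(expected_upper.saturating_sub(1));
        let updates = derive(&snapshot);
        // 3./4. Attempt to write just past what we read; retry on conflict.
        match table.compare_and_append(updates, expected_upper) {
            Ok(()) => return,
            Err(actual_upper) => expected_upper = actual_upper,
        }
    }
}
```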

## Ensure Sources/Sinks are Ready for Concurrent Instances

I think they are ready, but sources will have a bit of a fight over who gets to
read, for example for Postgres sources. Kafka is already fine because it
supports concurrently running ingestion instances.

Kafka sinks use a transactional producer ID, so they would also fight over it,
but they settle down once the old-version cluster processes are eventually
reaped.

TODO: Figure out how big the impact of the above-mentioned squabbling would be.

## Critical Persist Handles for Everyone / Concurrent Instances of StorageCollections

`StorageCollections`, the component that is responsible for managing the
critical since handles of storage collections, currently has a single critical
handle per collection, and when a new version comes online it "takes ownership"
of that handle.

We have to figure out how that works in the lame-duck phase. This is closely
related (if not identical) to how multiple instances of `environmentd` have to
hold handles for Physical Isolation and High Availability.

Hand-waving, I can imagine that we extend the ID of critical handles to be a
pair of `(<id>, <opaque>)`, where the opaque part is something that only the
user of persist (that is, Materialize) understands. This can be an identifier
of a version or of different `environmentd` instances. We then acquire critical
since handles per version, and we introduce a mechanism that allows retiring
since handles of older versions (or given opaques), which would require a
persist API that allows enumerating the IDs (or opaque parts) of the critical
handles on a given persist shard.
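
To make the hand-waving slightly more concrete, one possible shape for such an
API could look like the sketch below. This is purely hypothetical and not the
existing persist interface; every name in it is invented for illustration.

```rust
// Purely hypothetical API sketch, not the existing persist interface: critical
// since handles keyed by (handle id, opaque), where the opaque encodes the
// deploy generation / `environmentd` instance that owns the handle.

struct ShardId(String);
struct CriticalReaderId(String);

/// Opaque value that only the persist user (Materialize) interprets, e.g. the
/// deploy generation of the `environmentd` that registered the handle.
struct Opaque(u64);

struct CriticalSinceHandle {
    id: CriticalReaderId,
    opaque: Opaque,
    // ... since frontier, etc.
}

trait PersistShardClient {
    /// Register (or re-acquire) a critical since handle for this (id, opaque)
    /// pair. Both the old and the new generation can hold one concurrently.
    fn register_critical_reader(
        &mut self,
        shard: &ShardId,
        id: CriticalReaderId,
        opaque: Opaque,
    ) -> CriticalSinceHandle;

    /// Enumerate all critical readers of a shard, so that orchestration can
    /// find handles belonging to retired generations.
    fn list_critical_readers(&self, shard: &ShardId) -> Vec<(CriticalReaderId, Opaque)>;

    /// Retire the handle of an old generation once its deployment is reaped,
    /// releasing its hold on the since.
    fn expire_critical_reader(&mut self, shard: &ShardId, id: CriticalReaderId, opaque: Opaque);
}
```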

One side effect of this is that we could get rid of a special case where
`StorageCollections` acquires leased read handles when in read-only mode, and
instead acquire critical since handles at the new version even in read-only
mode.

TODO: Figure this out! It's the one change required by this proposal that
affects durable state, which is always something we need to think carefully
about.

## Persist Pubsub

Currently, `environmentd` acts as the persist pub-sub server. It's the
component that keeps the latency between persist writes and readers noticing
those writes snappy.

When we have both the old and new `environmentd` handling traffic, and both
writing at different times, we would see an increase in read latency, because
writes from one `environmentd` (and its cluster processes) are not routed to
the other `environmentd` and vice versa.

We have to somehow address this, or accept that we will have degraded latency
for the short period where both the old and the new `environmentd` serve
traffic (the lame-duck phase).

## Alternatives

### Read-Only `environmentd` Follows Catalog Changes

Instead of the halt-and-restart approach, the new-version `environmentd` would
listen to catalog changes, apply migrations live, and when promoted fence out
the old `environmentd` and bring all its in-memory structures, controllers, and
the like out of read-only mode.

This would still have the downside that coming out of read-only mode can take
on the order of seconds, but it would shorten the window during which DDL is
not available.

As described above, this part is hard.

### Fully Operational `environmentd` at Different Versions

This is an extension of the above, and it is basically the sum of all the
different work streams mentioned above (High Availability, Physical Isolation,
following catalog changes and applying migrations).

The only observable downtime would be when we cut over how traffic gets routed
to the different `environmentd`s.

In the fullness of time we want this, but it's a decent amount of work,
especially the forward/backward compatibility across a range of versions, on
top of the other hard challenges.

## Open Questions

- How much downtime is incurred by rerouting network traffic?
