# More Zero-Downtime Upgrades

We currently do zero-downtime upgrades, where we hydrate an `environmentd` and
its `clusterd` processes at a new version before cutting over to that version.
This means that computation running on clusters is ready when we cut over, so
there is no downtime in processing and no lag in freshness. However, due to an
implementation detail of how cutting over from one version to the next works
(see context below), we have an interval of tens of seconds (typically below
30 seconds) where an environment is not reachable by the user.

This document proposes a design where we can achieve true zero-downtime
upgrades for DML and DQL queries, with the caveat that we still have a window
(again on the order of tens of seconds) where users cannot issue DDL.

Achieving true zero-downtime upgrades for everything should be our ultimate
goal, but for reasons explained below that is a decently hard engineering
challenge, so we have to get there incrementally, as laid out here.

## Goals

- True zero-downtime upgrades for DML and DQL

The lived experience for users should be: **no perceived downtime for DQL and
DML, and an error message about DDL not being allowed during version cutover,
a window that should last on the order of tens of seconds**.

A meta goal of this document is to sketch what we need for our larger,
long-term goals, and to lay out an incremental path for getting there,
extracting incremental customer value along the way.

## Non-Goals

- True zero-downtime upgrades for DDL
- High Availability, i.e. no downtime during upgrades or at any other time
- Physical Isolation

## Context

I will sketch the current approach as well as some of the related and
desirable features.

### Current Zero-Downtime Upgrade Procedure

1. New `environmentd` starts with a higher `deploy_generation` and a higher
   `version`
2. Boots in read-only mode: opens the catalog in read-only mode, spawns
   `clusterd` processes at the new version, hydrates dataflows; everything is
   kept in read-only mode
3. Signals readiness: once clusters report hydrated and caught up
4. Orchestrator triggers promotion: the new `environmentd` opens the catalog
   in read/write mode, which writes its `deploy_generation`/`version` to the
   catalog, **then halts and is restarted in read/write mode by the
   orchestrator**
5. Old `environmentd` is fenced out: it notices the newer version in the
   catalog and **halts**
6. New `environmentd` re-establishes connections to clusters and brings them
   out of read-only mode
7. Cutover: network routes are updated to point at the new generation, and
   the orchestrator reaps the old deployment

Why there's downtime:

- The old `environmentd` exits immediately when it is fenced
- The new `environmentd` transitions from read-only to read/write using the
  **halt-and-restart** approach
- Network routes are updated to point at the new process
- During this window, the environment is unreachable

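To make the fencing step concrete, here is a minimal Rust sketch with made-up
types (`DeployGeneration` and `Environmentd` are illustrations, not the real
code): an `environmentd` considers itself fenced as soon as it observes a
higher deploy generation in the durable catalog.

```rust
/// Hypothetical, simplified view of the fencing check. In reality the
/// deploy generation is stored in (and read from) the durable catalog.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct DeployGeneration(u64);

struct Environmentd {
    my_generation: DeployGeneration,
}

impl Environmentd {
    /// Called whenever we observe the generation currently recorded in the
    /// durable catalog. Returns true if we have been fenced out.
    fn is_fenced(&self, catalog_generation: DeployGeneration) -> bool {
        catalog_generation > self.my_generation
    }
}

fn main() {
    let old = Environmentd { my_generation: DeployGeneration(7) };
    // The new deployment wrote generation 8 to the catalog during promotion.
    assert!(old.is_fenced(DeployGeneration(8)));
    // Today the old process halts at this point; under this proposal it
    // would enter lame-duck mode instead (see below).
}
```
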
### Catalog Migrations

When a new version first establishes itself in the durable catalog by writing
down its version and deploy generation, it sometimes has to modify the schema
or contents of parts of the catalog. We call these _catalog migrations_. They
are a large part of why we currently can't have an older version read or
interact with the catalog once a new version has "touched" it.

There are largely two types of migrations:

- _Streaming Migrations_: a catalog entry is modified without looking at the
  rest of the catalog, for example by adding fields or otherwise changing its
  contents.
- _Global Migrations_: we modify the catalog after having observed all of its
  current content. This can be adding certain entries once we see that they
  are not there, or updating entries with some globally learned fact.

As the name suggests, the first kind is easier to apply to a stream of catalog
changes as you read them, while the second kind can only be applied to a full
snapshot.

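As a sketch of the distinction, with made-up types (the real migration
framework looks different): a streaming migration maps over entries one at a
time, while a global migration needs the full snapshot before it can decide
what to change.

```rust
// Illustrative only: made-up catalog entry type and migration shapes.
#[derive(Debug, Clone)]
struct CatalogEntry {
    id: u64,
    // Imagine a serialized proto here; new fields get added over time.
    comment: Option<String>,
}

/// Streaming migration: can be applied to each entry independently,
/// e.g. while reading a stream of catalog updates.
fn streaming_migration(entry: &mut CatalogEntry) {
    // Example: backfill a newly added field with a default value.
    entry.comment.get_or_insert_with(String::new);
}

/// Global migration: needs to observe the entire snapshot before deciding
/// what to change, e.g. "add entry X if it is missing".
fn global_migration(snapshot: &mut Vec<CatalogEntry>) {
    if !snapshot.iter().any(|e| e.id == 42) {
        snapshot.push(CatalogEntry { id: 42, comment: None });
    }
}

fn main() {
    let mut snapshot = vec![CatalogEntry { id: 1, comment: None }];
    for entry in snapshot.iter_mut() {
        streaming_migration(entry);
    }
    global_migration(&mut snapshot);
    assert_eq!(snapshot.len(), 2);
}
```
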
### Why Halt-and-Restart?

An alternative that we considered would have been to let the read-only
`environmentd` follow all the catalog changes that the leader `environmentd`
is making, apply any needed migrations to in-memory state live, and then at
the end apply the migrations durably before transitioning from read-only to
read/write _within_ the running process.

There are two main things that made us choose the halt-and-restart approach,
and they still hold today:

- We think it is decently hard to follow catalog changes (in the read-only
  `environmentd`) _and_ incrementally apply migrations.
- We liked the "clean slate" that the halt-and-restart approach gives. The
  `environmentd` that is coming up at the new version in read/write mode
  behaves like an `environmentd` that had to restart for any number of other
  reasons. It's exactly the same code path.

In the context of this new proposal, I now think that even with a _within
process_ cutover, getting everything in place might take on the order of
seconds, which would mean we'd still have seconds of downtime instead of the
"almost zero" that I propose below.

### Related Features

- **High Availability**: No downtime, during upgrades or at any other time.
  Thinking in terms of our current architecture, this requires multiple
  `environmentd`-flavored processes, so that in case of a failure the load can
  be immediately moved to another `environmentd`. This requires the same
  engineering work that I propose in this document, plus more work on top.
  Ultimately we want to achieve this, but I think we can get there
  incrementally.
- **Physical Isolation**: Closely related to the above, where we also have
  multiple `environmentd`-flavored processes, but here in order to isolate the
  query/control workload of one use case from other use cases. This also
  requires much of the same work as this proposal and High Availability, plus
  other, more specific work on top.

What both of these features require is the ability to follow catalog changes
and react to them, applying their implications.

The work that they share with this proposal is the work around enabling a
number of components to handle concurrent instances (see below).

Further, to achieve true zero-downtime upgrades for DQL, DML, **and DDL**, we
need something like High Availability to work across multiple versions. The
hard part here is forward/backward compatibility, so that multiple instances
of `environmentd` at different versions can interact with the catalog and
builtin tables.

## Proposal

I propose that we change from our current upgrade procedure to this flow:

1. New `environmentd` starts with a higher `deploy_generation`/`version`
2. Boots in read-only mode: opens the catalog in read-only mode, spawns
   `clusterd` processes at the new version, hydrates dataflows; everything is
   kept in read-only mode
3. Signals readiness: once clusters report hydrated and caught up
4. Orchestrator triggers promotion: the new `environmentd` opens the catalog
   in read/write mode, which writes its `deploy_generation`/`version` to the
   catalog, **then halts and is restarted in read/write mode**
5. Old `environmentd` notices the new version in the catalog and enters
   **lame-duck mode**: it does not halt and none of its cluster processes are
   reaped; it can still serve queries but can no longer apply writes to the
   catalog, so it cannot process DDL queries
6. New `environmentd` re-establishes connections to clusters and brings them
   out of read-only mode
7. Cutover: once orchestration determines that the new-version `environmentd`
   is ready to serve queries, we update network routes
8. Eventually: resources of the old-version deployment are reaped

The big difference from before is that the fenced deployment is not
immediately halted/reaped but can still serve queries. This includes DML and
DQL (so INSERTs and SELECTs), but not DDL.

We cut over network routes once the new deployment is fully ready, so any
residual downtime is the route change itself. During that window the old
deployment still accepts connections but rejects DDL with an error message.
When cutting over, we drop connections and rely on reconnects to reach the new
version.

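A hedged sketch of how the orchestration side of this flow could be modeled
as a small state machine (all names here are hypothetical; the real
orchestrator would also have to handle failures and rollback):

```rust
// Hypothetical orchestration states for a version cutover; the real
// orchestrator would encode this differently.
#[derive(Debug, PartialEq)]
enum UpgradeState {
    /// New deployment is hydrating in read-only mode.
    Hydrating,
    /// New deployment reported "ready"; trigger promotion.
    Promoting,
    /// New deployment owns the catalog; old one is in lame-duck mode.
    LameDuck,
    /// Routes point at the new deployment; reap the old one eventually.
    CutOver,
}

fn step(state: UpgradeState, new_ready: bool, new_serving: bool) -> UpgradeState {
    match state {
        UpgradeState::Hydrating if new_ready => UpgradeState::Promoting,
        UpgradeState::Promoting => UpgradeState::LameDuck,
        UpgradeState::LameDuck if new_serving => UpgradeState::CutOver,
        other => other,
    }
}

fn main() {
    let mut state = UpgradeState::Hydrating;
    state = step(state, true, false); // clusters hydrated and caught up
    state = step(state, true, false); // promotion: catalog now owned by new version
    state = step(state, true, true); // new environmentd serves queries: flip routes
    assert_eq!(state, UpgradeState::CutOver);
}
```
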
This modified flow requires a number of changes in different components. I
will sketch these below, but each of these sections will require a small-ish
design document of its own or at the very least a thorough GitHub issue.

## Lame-Duck `environmentd` at Old Version

The observation that makes this proposal work is that neither the schema nor
the contents of user collections (tables, sources, etc.) change between
versions. So both the old version and the new version _should_ be able to
collaborate in writing down source data, accepting INSERTs into tables, and
serving SELECT queries.

DDL will not work because the new-version deployment will potentially have
migrated the catalog (applied catalog migrations), so the old version cannot
be allowed to write to it anymore. And it wouldn't currently be able to read
newer changes either. Backward/forward compatibility for catalog changes is
one of the things we need before true zero-downtime upgrades can work for all
types of queries.

Once an `environmentd` notices that there is a newer version in the catalog,
it enters lame-duck mode, where it does not allow writes to the catalog
anymore and serves DQL/DML workloads off of the catalog snapshot that it has.
An important detail to figure out here is what happens when the old-version
`environmentd` process crashes while we're in the lame-duck phase. If the
since of the catalog shard has advanced, it will not be able to restart and
read the catalog at the old version that it understands. We might have to do
some trickery around holding back the since of the catalog shard during
upgrades, or similar.

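As a sketch of the admission logic in lame-duck mode (hypothetical types; the
real coordinator is far more involved): DQL/DML keeps flowing while DDL is
rejected with an error.

```rust
// Minimal sketch of lame-duck query admission, with made-up types.
enum Statement {
    Select,
    Insert,
    CreateTable, // stands in for all DDL
}

struct Coordinator {
    lame_duck: bool,
}

impl Coordinator {
    /// Called when we notice a newer deploy generation in the catalog.
    /// Instead of halting (the current behavior), remember that we may no
    /// longer write to the catalog.
    fn enter_lame_duck(&mut self) {
        self.lame_duck = true;
    }

    fn admit(&self, stmt: &Statement) -> Result<(), String> {
        match stmt {
            Statement::Select | Statement::Insert => Ok(()),
            Statement::CreateTable if self.lame_duck => Err(
                "DDL is not allowed while a version upgrade is in progress".into(),
            ),
            Statement::CreateTable => Ok(()),
        }
    }
}

fn main() {
    let mut coord = Coordinator { lame_duck: false };
    coord.enter_lame_duck();
    assert!(coord.admit(&Statement::Select).is_ok());
    assert!(coord.admit(&Statement::CreateTable).is_err());
}
```
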
TODO: It could even be the case that builtin tables are compatible for writing
between the versions, because of how persist schema backward compatibility
works. We have to audit whether it would work for both the old and the new
version to write at the same time. This is important for builtin tables that
are not derived from catalog state, for example `mz_sessions`, the audit log,
and probably others.

TODO: Figure out whether we want to allow background tasks to keep writing.
This includes, but is not limited to, storage-usage collection.

## Change How Old-Version Processes are Reaped

Currently, the fenced-out `environmentd` halts itself when the new version
fences it via the catalog, and a newly establishing `environmentd` will reap
replica processes (`clusterd`) of any version older than itself.

We can't keep this behavior, because the lame-duck `environmentd` still needs
all its processes alive to serve traffic.

Instead, we need to change the orchestration logic to determine when the
new-version `environmentd` is ready to serve traffic, then cut over, and
eventually reap the processes of the old version.

## Get Orchestration Ready for Managing Version Cutover

We need to update the orchestration logic to use the new flow as outlined
above.

TODO: What's the latency incurred by cutting over between `environmentd`s
once everything is ready? That's how close to zero we will get with this
approach.

## Get Builtin Tables Ready for Concurrent Writers

There are two types of builtin tables:

1. Tables that mirror catalog state
2. Other tables, which are largely append-only

We have to audit and find out which is which. For builtin tables that mirror
catalog state we can use the self-correcting approach that we also use for
materialized views and for a number of builtin storage-managed collections,
as sketched below. For others, we have to see whether concurrent append-only
writers would work.

One of the thorny tables here is probably storage-usage data.

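As a sketch of the self-correcting idea (illustrative types only; the real
implementation works against persist shards at specific timestamps): each
writer continually computes the desired contents and appends only the
difference between desired and actual, so concurrent writers converge on the
same contents instead of clobbering each other.

```rust
use std::collections::BTreeMap;

// Illustrative only: collection contents as row -> multiplicity.
type Contents = BTreeMap<String, i64>;

/// Compute the update (row -> diff) that turns `actual` into `desired`.
fn correction(desired: &Contents, actual: &Contents) -> Contents {
    let mut diff = Contents::new();
    for (row, d) in desired {
        let a = actual.get(row).copied().unwrap_or(0);
        if d - a != 0 {
            diff.insert(row.clone(), d - a);
        }
    }
    for (row, a) in actual {
        if !desired.contains_key(row) {
            diff.insert(row.clone(), -a);
        }
    }
    diff
}

fn main() {
    // The other version's environmentd left behind a stale row.
    let actual = Contents::from([("old_row".to_string(), 1)]);
    let desired = Contents::from([("new_row".to_string(), 1)]);
    let diff = correction(&desired, &actual);
    assert_eq!(diff.get("new_row"), Some(&1)); // insert the missing row
    assert_eq!(diff.get("old_row"), Some(&-1)); // retract the stale row
}
```
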
## Get Builtin Sources Ready for Concurrent Writers

Similar to tables, there are two types:

1. "Differential sources", where the controller knows what the desired state
   is and updates the content to match it when needed
2. Append-only sources

Here again, we have to audit what works with concurrent writers and whether
append-only sources remain correct.

NOTE: Builtin sources are also called storage-managed collections, because
writing to them is mediated by the storage controller, specifically by async
background tasks that it starts for this purpose.

## Concurrent User-Table Writes

We need to get user tables ready for concurrent writers. Blind writes are
easy. Read-then-write is hard, but we can do OCC (optimistic concurrency
control), with a loop that does the following (see the sketch after this
list):

1. Subscribe to changes
2. Read the latest changes, derive writes
3. Get a write timestamp based on the latest read timestamp
4. Attempt the write; go back to 2 if that fails; succeed otherwise

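A minimal sketch of such an OCC loop against a hypothetical table handle (the
names here are made up; a persist-style compare-and-append is what would back
the conflict check):

```rust
// Hypothetical table handle with a compare-and-append style API: a write
// succeeds only if the table has not advanced past the expected timestamp.
struct Table {
    value: i64,
    ts: u64,
}

impl Table {
    fn read_latest(&self) -> (i64, u64) {
        (self.value, self.ts)
    }

    /// Append `new_value` just past `expected_ts`, failing if someone else
    /// wrote in the meantime (i.e., if the timestamp has moved on).
    fn compare_and_append(&mut self, new_value: i64, expected_ts: u64) -> Result<(), ()> {
        if self.ts != expected_ts {
            return Err(()); // conflict: retry from a fresh read
        }
        self.value = new_value;
        self.ts += 1;
        Ok(())
    }
}

/// Read-then-write via optimistic concurrency: derive the write from the
/// latest read and retry on conflict.
fn read_then_write(table: &mut Table, f: impl Fn(i64) -> i64) {
    loop {
        let (value, ts) = table.read_latest(); // steps 1+2: read latest
        let new_value = f(value); // derive writes
        // step 3: write timestamp is based on the read timestamp `ts`
        if table.compare_and_append(new_value, ts).is_ok() {
            return; // step 4: success
        }
        // the write raced with another writer; loop back to step 2
    }
}

fn main() {
    let mut t = Table { value: 1, ts: 0 };
    read_then_write(&mut t, |v| v + 41); // e.g. UPDATE t SET v = v + 41
    assert_eq!(t.value, 42);
}
```
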
We can even render that loop as a dataflow and attempt the read-then-write
directly on a cluster; if you squint, this is almost a "one-shot continual
task" (if you're familiar with that concept). But we could initially keep the
loop in `environmentd`, closer to how the current implementation works.

## Ensure Sources/Sinks are Ready for Concurrent Instances

I think they are ready, but sources will have a bit of a fight over who gets
to read, for example for Postgres. Kafka is fine because it already supports
concurrently running ingestion instances.

Kafka sinks use a transactional producer ID, so they would also fight over
this, but they would settle down when the old-version cluster processes are
eventually reaped.

TODO: Figure out how big the impact of the above-mentioned squabbling would
be.

## Critical Persist Handles for Everyone / Concurrent Instances of StorageCollections

`StorageCollections`, the component that is responsible for managing the
critical since handles of storage collections, currently has a single
critical handle per collection. And when a new version comes online, it
"takes ownership" of that handle.

We have to figure out how that works in the lame-duck phase. This is closely
related (if not identical) to how multiple instances of `environmentd` have
to hold handles for Physical Isolation and High Availability.

Hand-waving, I can imagine that we extend the ID of critical handles to be a
pair of `(<id>, <opaque>)`, where the opaque is something that only the user
of persist (that is, Materialize) understands. This can be an identifier of a
version or of different `environmentd` instances. We then acquire critical
since handles per version, and we introduce a mechanism that allows retiring
since handles of older versions (or given opaques), which would require a
persist API that allows enumerating the IDs (or the opaque parts) registered
for a given persist shard.

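To make the hand-waving slightly more concrete, here is a sketch of what such
an `(id, opaque)` scheme could look like. Everything below, including the API
for enumerating and retiring handles, is hypothetical:

```rust
use std::collections::BTreeMap;

// Hypothetical: a critical since handle identified by (id, opaque), where
// the opaque names the version/instance of environmentd that owns it.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct CriticalReaderId {
    id: String,
    opaque: u64, // e.g. the deploy generation
}

#[derive(Default)]
struct ShardSinceHandles {
    // handle -> the since frontier it holds back (a single timestamp here)
    holds: BTreeMap<CriticalReaderId, u64>,
}

impl ShardSinceHandles {
    fn register(&mut self, id: CriticalReaderId, since: u64) {
        self.holds.insert(id, since);
    }

    /// Hypothetical persist API: enumerate the registered handles of a
    /// shard and retire every hold whose opaque is older than the given one.
    fn retire_older_than(&mut self, opaque: u64) {
        self.holds.retain(|id, _| id.opaque >= opaque);
    }

    /// The shard's effective since is the minimum over all holds.
    fn since(&self) -> Option<u64> {
        self.holds.values().min().copied()
    }
}

fn main() {
    let mut shard = ShardSinceHandles::default();
    // The old (generation 7) and new (generation 8) deployments each hold
    // a critical since handle on the same shard.
    shard.register(CriticalReaderId { id: "u1".into(), opaque: 7 }, 90);
    shard.register(CriticalReaderId { id: "u1".into(), opaque: 8 }, 100);
    assert_eq!(shard.since(), Some(90));
    // Once the old deployment is reaped, retire its holds; the since can
    // then advance.
    shard.retire_older_than(8);
    assert_eq!(shard.since(), Some(100));
}
```
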
One side effect of this is that we could get rid of a special case where
`StorageCollections` acquires leased read handles when in read-only mode, and
instead acquire critical since handles at the new version even in read-only
mode.

TODO: Figure this out! It's the one change required by this proposal that
affects durable state, which is always something we need to think carefully
about.

## Persist Pubsub

Currently, `environmentd` acts as the persist pub-sub server. It's the
component that keeps the latency between a persist write happening and
readers noticing that write snappy.

When we have both the old and new `environmentd` handling traffic, and both
writing at different times, we would see an increase in read latency, because
writes from one `environmentd` (and its cluster processes) are not routed to
the other `environmentd`, and vice versa.

We have to somehow address this, or accept the fact that we will have
degraded latency for the short period where both the old and new
`environmentd` serve traffic (the lame-duck phase).

## Alternatives

### Read-Only `environmentd` Follows Catalog Changes

Instead of the halt-and-restart approach, the new-version `environmentd`
would listen to catalog changes, apply migrations live, and, when promoted,
fence out the old `environmentd` and bring all its in-memory structures,
controllers, and the like out of read-only mode.

This would still have the downside that bringing it out of read-only mode can
take on the order of seconds, but we would shorten the window during which
DDL is not available.

As described above, this part is hard.

### Fully Operational `environmentd` at Different Versions

This is an extension of the above, and combines basically all the different
work streams mentioned above (High Availability, Physical Isolation,
following catalog changes and applying migrations).

The only observable downtime would be when we cut over how traffic gets
routed to the different `environmentd`s.

In the fullness of time we want this, but it's a decent amount of work,
especially the forward/backward compatibility across a range of versions, on
top of the other hard challenges.

## Open Questions

- How much downtime is incurred by rerouting network traffic?