# Zero Downtime and Database Management

Openverse practices zero-downtime deployments. This puts a handful of
constraints on our database migration and data management practices. This
document describes how to ensure migrations can be deployed with zero downtime
and how to implement and manage long-running data migrations.

Zero-downtime deployments are important to ensure service reliability.
Following the practices that enable zero-downtime deployments also promotes
best practices like ensuring changes are incremental and more easily
reversible.

## External resources

This document assumes a general understanding of relational databases,
including concepts like database tables, columns, constraints, and indexes. If
this is not something you are familiar with,
[the Wikipedia article on relational databases](https://en.wikipedia.org/wiki/Relational_database)
is a good starting point.

Django's
[database migration documentation](https://docs.djangoproject.com/en/4.1/topics/migrations/)
also contains helpful background knowledge, though this document takes a more
general approach than addressing only Django-specific scenarios.

## Terms

- "Zero-downtime deployment": An application deployment that does not result in
  any period of time during which a service is inaccessible. For the purposes
  of Openverse, these require running two versions of the application at once
  that share the same underlying database infrastructure.
- "Schema": The structure of a database; the tables, columns, and their types.
- "Downtime deployment": An application deployment that does result in a period
  of time during which a service is inaccessible. The Openverse project goes to
  great lengths to avoid these. They are often caused when a new version of an
  application is incompatible with the underlying infrastructure of the
  previously deployed version.
- "Database migration": A change to the schema of a database. Common migrations
  include the addition or removal of tables and columns.
- "Data transformation": A change to the data held in a database that is not
  itself a database migration (but can be related to one). Common examples
  include backfilling data to remove null values from a column or moving data
  between two related columns.
- "Data migration": A data transformation that is executed as part of a Django
  migration.
- "Long-running data transformation": A data transformation that lasts longer
  than a few seconds. Long-running data transformations are commonly caused by
  the modification of massive amounts of data, especially data in indexed
  columns.

## How zero-downtime deployments work

To understand the motivations for these best practices, it is important to
understand how zero-downtime deployments are implemented. Openverse uses the
[blue-green deployment strategy](https://en.wikipedia.org/wiki/Blue-green_deployment).
The blue-green strategy requires running the new version of the application and
the previous version at the same time for the duration of the deployment. This
allows us to replace the multiple, load-balanced instances of our application
one-by-one. As a result, we are able to verify the health of the instances
running the new version before fully replacing our entire cluster of
application instances with the new version. At all times during a successful
deployment process, both versions of the application must be fully operable,
healthy, and able to handle requests. The load balancer will send requests to
both the previous and new versions of the application during the entire time of
the deployment, which can be several minutes. This requires both versions of
the application to be strictly compatible with the underlying database schema.

## What causes downtime during a deployment?

The most common cause of downtime during a deployment is a database schema
incompatibility between the previous and new versions of the application. The
classic example of a schema incompatibility involves column name changes.
Imagine there is a column on a table of audio files called "length", but we
want to change the column name to specify the expected units, to make it
clearer for new contributors. If we simply change the name of the column to
"length_ms", then when the new version of the application deploys, it will
apply the migration to change the name. The new version will, of course, work
just fine in this case. However, during deployments, the previous version of
the application will still be running for a period of time. Requests by the
previous version of the application to retrieve the "length" column will fail
catastrophically because the "length" column will no longer exist! It has been
renamed to "length_ms". If we prevented the new version of the application from
applying the migration, the same issue would arise, but for the new version, as
the "length_ms" column would not yet exist. This, in addition to column
data-type changes, is the most common reason why downtime would be required
during a deployment process that is otherwise capable of deploying without
downtime. When schema incompatibilities arise between the new and previous
versions of an application, it is impossible to safely serve requests from both
using the same underlying database.

Other causes are variations on this same pattern: a shared dependency is
neither forward nor backward compatible between two subsequent versions of the
application.

> **Note**: This issue of incompatibility only applies to _subsequent_ versions
> of an application because only subsequent versions are ever deployed
> simultaneously with the same underlying support infrastructure. So long as
> there is at least one version between them, application versions may, and
> indeed sometimes do, have fundamental incompatibilities with each other and
> could not be simultaneously deployed.

## How to achieve zero-downtime deployments

Sometimes you need to change the name of a column or introduce some other
non-backwards-compatible change to the database schema. Luckily, this is still
possible, even with zero-downtime deployments, though admittedly the process is
more tedious.

Continuing with the column name change case study, the following approach must
be followed (a code sketch of the first two steps follows the list).

1. Create a new column with the desired name and data type. The new column must
   be nullable and should default to null. This step should happen with a new
   version of the application that continues to use the existing column.
1. If the column is written to by the application, deploy a new version that
   starts writing new or updated data to both columns. It should read the data
   from the new column and only fall back to the old column if the new column
   is not yet populated.
1. Use a data transformation management command to move data from the previous
   column to the new column. To find the rows that need updating, iterate
   through the table by querying for rows that do not yet have a value in the
   new column. Because the version of the application running at this point is
   writing to and reading from the new column (falling back to the old for
   reads when necessary), the query will eventually return zero rows.
1. Once the data transformation is complete, deploy a new version of the
   application that removes the old column and the fallback reads to it and
   only uses the new column. Also add the corresponding constraints to the new
   column if required, e.g. non-nullable, a default value, etc.
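
As an illustration, here is a minimal sketch of the first two steps. The
`Audio` model, its fields, and the accessor methods are hypothetical, not the
actual Openverse schema.

```python
# Minimal sketch of steps 1 and 2; the ``Audio`` model and its fields are
# hypothetical illustrations, not the actual Openverse schema.
from django.db import models


class Audio(models.Model):
    # Old column, still used by the previous application version.
    length = models.IntegerField(null=True)
    # Step 1: new column, nullable and defaulting to null.
    length_ms = models.IntegerField(null=True, default=None)

    @property
    def duration_ms(self):
        # Step 2, reads: prefer the new column, falling back to the old one
        # until the data transformation has backfilled every row.
        return self.length_ms if self.length_ms is not None else self.length

    def update_duration(self, value_ms):
        # Step 2, writes: write both columns so that either running version
        # (and a potential rollback) continues to see up-to-date data.
        self.length = value_ms
        self.length_ms = value_ms
        self.save(update_fields=["length", "length_ms"])
```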

To reiterate, yes, this is a much more tedious process. However, the benefits
of this approach are listed below.

Relatively similar processes and patterns can be applied to other
"downtime-causing" database changes. These are covered in
[this GitHub gist](https://gist.github.com/majackson/493c3d6d4476914ca9da63f84247407b)
with specific instructions for handling them in a Django context.

### Benefits of this approach

#### Zero-downtime

The entire point, of course. This benefits everyone who depends on the
application's uptime and reliability.

#### Reversibility

If the new version of the application has a critical bug, whether related to
the data changes or not, we can revert each step to the previous version
without issue or data loss. Even during the data transformation process, if you
have to revert to the first version (or even earlier) that doesn't use the new
column, the old column will still have up-to-date data and no user data will be
lost, because the version of the application running is updating both columns.
This would complicate the data migration process, however, as previous versions
of the application will not be updating the new column, which would likely
require deleting the data from the new column and starting the data migration
process over from the start. This can consume a lot of time, but it is overall
less of a headache than data loss or fully broken deployments.

#### Intentionality and expediency

Due to the great lengths required to change a column name, the process will
inevitably cause contributors to ask themselves: is this worth it? While
changing the name of a column can be helpful to disambiguate the data in the
column, using a model attribute alias can be just as helpful without any of the
disruption or time of a data transformation. These kinds of questions prompt us
to make expedient choices that deliver features, bug fixes, and developer
experience improvements faster.

#### Shorter deployment times

Ideally, maintainers orchestrating a production deployment of the service are
keenly aware of the progress of the deployment. This is only a realistic and
sustainable expectation, however, if deployments take a "short" amount of time.
What "short" means is up for debate, but an initial benchmark can be the
Openverse production frontend deployments, which currently take about 10
minutes. Anything longer than this seems generally unreasonable to expect
someone to keep a very close eye on. Sticking to zero-downtime deployments
helps keep short deployments the norm. Even though it sometimes asks us to
deploy more _often_, those deployments can—and in all likelihood, should—be
spread over multiple days. This makes the expectation of keeping a close watch
on the deployment more sustainable long-term and helps encourage us to deploy
more often. In turn, this means new features and bug fixes get to production
sooner.

#### Possibility to throttle

Management commands that iterate over data progressively can be throttled to
prevent excessive load on the database or other related services that need to
be accessed.
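
For instance, a throttled backfill loop in a management command might look like
the following sketch; the batch size, sleep interval, command, model, and
import path are all illustrative assumptions.

```python
# Sketch of a throttled data transformation command; batch size, sleep
# interval, and the ``Audio`` model import are illustrative assumptions.
import time

from django.core.management.base import BaseCommand

from api.models import Audio  # hypothetical import path

BATCH_SIZE = 1000
SLEEP_SECONDS = 2


class Command(BaseCommand):
    help = "Backfill length_ms from length, throttled to limit database load."

    def handle(self, *args, **options):
        while True:
            # Only rows the transformation has not yet touched (see the
            # idempotency rules below); skip rows with no source value so
            # they cannot be re-fetched forever.
            batch = list(
                Audio.objects.filter(
                    length_ms__isnull=True, length__isnull=False
                )[:BATCH_SIZE]
            )
            if not batch:
                break
            for audio in batch:
                audio.length_ms = audio.length
            Audio.objects.bulk_update(batch, ["length_ms"])
            # Pause between batches so other queries are not starved.
            time.sleep(SLEEP_SECONDS)
```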

#### Unit testing

Management command data migrations can be far more easily unit tested using our
existing tools and fixture utilities.
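
As a sketch of what such a test could look like, assuming pytest-django and the
hypothetical command and model from the sketch above:

```python
# Sketch of a unit test for the hypothetical backfill command above,
# assuming pytest-django; the command and model names are illustrative.
import pytest
from django.core.management import call_command

from api.models import Audio  # hypothetical import path


@pytest.mark.django_db
def test_backfill_is_idempotent():
    audio = Audio.objects.create(length=2000, length_ms=None)

    # Running the command twice must leave the data in the same state.
    call_command("backfill_length_ms")
    call_command("backfill_length_ms")

    audio.refresh_from_db()
    assert audio.length_ms == 2000
```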

### Long-running migrations

Sometimes long-running schema changes are unavoidable. In these cases, provided
that the instructions above are followed to prevent the need for downtime, it
is reasonable to take alternative approaches to deploying the migration.

At the moment we do not have specific recommendations or policies regarding
these hopefully rare instances. If you come across the need for this, please
carefully consider the reasons why it is necessary in the particular case and
document the steps taken to prepare and deploy the migration. Please update
this document with any general findings or advice, as applicable.

## Django management command based data transformations

### Why use management commands for data transformations instead of Django migrations?

Django comes with a built-in data transformation feature that allows executing
data transformations during the migration process. Transformations are
described in Django's ORM and executed in a single pass at migration time. If
you want to move data between two columns, Django makes it trivial to do so
with these "data migrations".
[Documentation for this Django feature is available here](https://docs.djangoproject.com/en/4.1/topics/migrations/#data-migrations).
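
For reference, a minimal data migration of this kind, moving data between the
two columns from the case study, might look like the following sketch; the app
label, dependency, and field names are illustrative.

```python
# Minimal sketch of a Django data migration; the app label, dependency, and
# field names are illustrative.
from django.db import migrations
from django.db.models import F


def copy_length_forward(apps, schema_editor):
    # Use the historical model, per Django's data migration documentation.
    Audio = apps.get_model("api", "Audio")
    Audio.objects.filter(length_ms__isnull=True).update(length_ms=F("length"))


class Migration(migrations.Migration):
    dependencies = [("api", "0002_add_length_ms")]

    operations = [
        # No-op on reverse so the migration remains reversible.
        migrations.RunPython(copy_length_forward, migrations.RunPython.noop),
    ]
```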

When considering the potential issues with using Django migrations for data
transformations under our current deployment strategy, keep in mind the
following details:

- Migrations are run _at the time of deployment_ by the first instance of the
  new version of the application that runs in the pool.
  - **Note**: This specific detail will only be the case once we've fully
    migrated to ECS based deployments. For now, one of the people deploying the
    application manually runs the migrations before deploying. The effect is
    the same though: we end up with a version of the application running
    against a database schema that it's not entirely configured to work with.
    Whether that is an issue depends solely on whether the practices described
    in this document regarding migrations have been followed.
- Deployments should be timely so that developers are able to reasonably
  monitor their progress and have clear expectations for how long a deployment
  should take. Ideally a full production deployment should not take much longer
  than 10 minutes once the Docker images are built. Those minutes are already
  spent by the process ECS undergoes to deploy a new version of the
  application.

With those two key details in mind, the main deficiency of using migrations for
data transformations may already be evident: time. Django migration based data
transformations dealing with certain smaller tables may not take very long, and
in some cases this issue might not be applicable. However, because it is
extremely difficult to predetermine the amount of time a migration will take,
even data transformations for small datasets should still heed the
recommendation to use management commands. In particular, it can be difficult
to predict how tables with indexes (especially unique constraints) will perform
during a SQL data migration.

Realistically (and provided it is avoidable), any Django migration that takes
longer than 30 or so seconds is not acceptable for our current deployment
strategy. Because the vast majority of data transformations will take longer
than a few seconds, there is a strong, blanket recommendation against using
them. Exceptions to this recommendation may exist, however. If you're working
on an issue that involves a data transformation, and you think a migration is
truly the best tool for the job and can demonstrate that it will not take
longer than 30 seconds in production, then please include these details in the
PR.

### General rules for data transformations

These rules apply to data transformations executed as management commands or
otherwise.

#### Data transformations must be [idempotent](https://en.wikipedia.org/wiki/Idempotence)

This rule particularly applies to management commands because they can
theoretically be run multiple times, either by accident or as an attempt to
recover from or continue after a failure.

Idempotency is important for data transformations because it prevents
unnecessary duplicate processing of data. Idempotency can be achieved in three
ways (a sketch of the first follows the list):

1. By checking the state of the data and only applying the transformation to
   rows for which the transformation has not yet been applied. For example, if
   moving data between two columns, only process rows for which the new column
   is null. Once data has been moved for a row, it will no longer be null and
   will be excluded from the query.
1. By checking a timestamp available for each row before which it is known that
   data transformations have already been applied.
1. By caching a list of identifiers for already processed rows in Redis.
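
A minimal sketch of the first approach, reusing the hypothetical columns from
the case study: the queryset itself makes the operation idempotent, because
processed rows no longer match the filter.

```python
# Sketch of the first approach; the ``Audio`` model is a hypothetical
# illustration. Processed rows fall out of the filter, so re-running the
# transformation repeats no work.
unprocessed = Audio.objects.filter(length_ms__isnull=True, length__isnull=False)
for audio in unprocessed.iterator():
    audio.length_ms = audio.length
    audio.save(update_fields=["length_ms"])
```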

#### Data transformations should not be destructive

Data transformations should avoid being destructive, if possible. Sometimes
this is unavoidable because data needs to be updated "in place". In these
cases, it is imperative to save a list of modified rows (for example, in a
Redis set) so that the transformation can be reversed if necessary.
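
For example, an in-place transformation might record each modified row's
identifier and prior value before overwriting it; the key names, model, and
transformation below are illustrative assumptions.

```python
# Sketch of recording modified rows during an in-place transformation so it
# can be reversed; key names, model, and transformation are illustrative.
import redis

from api.models import Audio  # hypothetical import path

r = redis.Redis()  # connection configuration omitted

for audio in Audio.objects.iterator():
    cleaned = audio.title.strip()
    if cleaned == audio.title:
        continue
    # Remember which row changed and what its previous value was.
    r.sadd("transform:strip_titles:modified_ids", audio.id)
    r.hset("transform:strip_titles:old_values", audio.id, audio.title)
    audio.title = cleaned
    audio.save(update_fields=["title"])
```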

### If a Django migration _must_ be used

In the rare case where a Django migration must be used, keep in mind that a
[non-atomic migration](https://docs.djangoproject.com/en/4.1/howto/writing-migrations/#non-atomic-migrations)
can make it easier to recover from unexpected errors without causing the entire
transformation process to be reversed.
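
A minimal sketch of such a migration, reusing the hypothetical
`copy_length_forward` function from the earlier data migration sketch; the
dependency is likewise illustrative.

```python
# Sketch of a non-atomic migration; ``copy_length_forward`` is the
# hypothetical function from the earlier sketch. With ``atomic = False``,
# Django does not wrap the whole migration in a single transaction, so a
# failure partway through does not roll back the rows already processed.
from django.db import migrations


class Migration(migrations.Migration):
    atomic = False

    dependencies = [("api", "0002_add_length_ms")]

    operations = [
        migrations.RunPython(copy_length_forward, migrations.RunPython.noop),
    ]
```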
