# Zero Downtime and Database Management

Openverse practices zero-downtime deployments. This puts a handful of
constraints on our database migration and data management practices. This
document describes how to ensure migrations can be deployed with zero downtime
and how to implement and manage long-running data migrations.

Zero-downtime deployments are important to ensure service reliability. Following
the practices that enable zero-downtime deployments also promotes best practices
like ensuring changes are incremental and more easily reversible.

## External resources

This document assumes a general understanding of relational databases, including
concepts like database tables, columns, constraints, and indexes. If this is not
something you are familiar with,
[the Wikipedia article on relational databases](https://en.wikipedia.org/wiki/Relational_database)
is a good starting point.

Django's
[database migration documentation](https://docs.djangoproject.com/en/4.1/topics/migrations/)
also contains helpful background knowledge, though this document takes a more
general approach rather than addressing only Django-specific scenarios.

## Terms

- "Zero-downtime deployment": An application deployment that does not result in
  any period of time during which a service is inaccessible. For the purposes of
  Openverse, these require running two versions of the application at once that
  share the same underlying database infrastructure.
- "Schema": The structure of a database: the tables, columns, and their types.
- "Downtime deployment": An application deployment that does result in a period
  of time during which a service is inaccessible. The Openverse project goes to
  great lengths to avoid these. They are often caused when a new version of an
  application is incompatible with the underlying infrastructure of the
  previously deployed version.
- "Database migration": A change to the schema of a database. Common migrations
  include the addition or removal of tables and columns.
- "Data transformation": A change to the data held in a database that is not
  itself a database migration (though it may be related to one). Common examples
  include backfilling data to remove null values from a column or moving data
  between two related columns.
- "Data migration": A data transformation that is executed as part of a Django
  migration.
- "Long-running data transformation": A data transformation that lasts longer
  than a few seconds. Long-running data transformations are commonly caused by
  the modification of massive amounts of data, especially data in indexed
  columns.

## How zero-downtime deployments work

To understand the motivations behind these best practices, it is important to
understand how zero-downtime deployments are implemented. Openverse uses the
[blue-green deployment strategy](https://en.wikipedia.org/wiki/Blue-green_deployment).
The blue-green strategy requires running the new version of the application
alongside the previous version for the duration of the deployment. This allows
us to replace the multiple, load-balanced instances of our application
one-by-one. As a result, we are able to verify the health of the instances
running the new version before fully replacing our entire cluster of
application instances with the new version. At all times during a successful
deployment, both versions of the application must be fully operable, healthy,
and able to handle requests. The load-balancer will send requests to both the
previous and new versions of the application for the entire duration of the
deployment, which can be several minutes. This requires both versions of the
application to be strictly compatible with the underlying database schema.

## What causes downtime during a deployment?

The most common cause of downtime during a deployment is a database schema
incompatibility between the previous and new versions of the application. The
classic example of a schema incompatibility involves column name changes.
Imagine there is a column on a table of audio files called "length", and we
want to change the column name to specify the expected units, to make it
clearer for new contributors. If we simply change the name of the column to
"length_ms", then when the new version of the application deploys, it will apply
the migration to change the name. The new version will, of course, work just
fine in this case. However, during the deployment, the previous version of the
application will still be running for a period of time. Requests by the previous
version of the application to retrieve the "length" column will fail
catastrophically because the "length" column will no longer exist! It has been
renamed to "length_ms". If we prevented the new version of the application from
applying the migration, the same issue would arise, but for the new version, as
the "length_ms" column would not yet exist. This, along with column data-type
changes, is the most common reason why downtime would be required during a
deployment process that is otherwise capable of deploying without downtime. When
schema incompatibilities arise between the new and previous versions of an
application, it is impossible to safely serve requests from both using the same
underlying database.

Other causes are variations on this same pattern: a shared dependency is neither
forward nor backward compatible between two subsequent versions of the
application.

> **Note**: This issue of incompatibility only applies to _subsequent_ versions
> of an application because only subsequent versions are ever deployed
> simultaneously with the same underlying support infrastructure. So long as
> there is at least one version between them, application versions may, and
> indeed sometimes do, have fundamental incompatibilities with each other and
> could not be simultaneously deployed.

## How to achieve zero-downtime deployments

Sometimes you need to change the name of a column or introduce some other
non-backwards-compatible change to the database schema. Luckily, this is still
possible, even with zero-downtime deployments, though admittedly the process is
more tedious.

Continuing with the column name change case study, the following approach must
be taken.

1. Create a new column with the desired name and data type. The new column must
   be nullable and should default to null. This step should happen with a new
   version of the application that continues to use the existing column.
1. If the column is written to by the application, deploy a new version that
   starts writing new or updated data to both columns. It should read the data
   from the new column and only fall back to the old column if the new column is
   not yet populated.
1. Use a data transformation management command to move data from the previous
   column to the new column. To find the rows that need updating, iterate
   through the table by querying for rows that do not yet have a value in the
   new column. Because the version of the application running at this point is
   writing to and reading from the new column (falling back to the old for reads
   when necessary), the query will eventually return zero rows.
1. Once the data transformation is complete, deploy a new version of the
   application that removes the old column and the fallback reads to it and only
   uses the new column. Also add any constraints the new column requires, e.g.
   non-nullable, a default value, etc.

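The query-and-update loop from step 3 can be sketched in plain Python. The
`length`/`length_ms` names follow the case study above, and the
seconds-to-milliseconds conversion is purely illustrative; in the real
application this logic would live in a Django management command and the query
would be an ORM filter on the null new column, not a list comprehension over
dicts:

```python
# Sketch of the step-3 backfill loop, with dicts standing in for ORM rows.
BATCH_SIZE = 2  # illustrative; real batches would be much larger

def rows_needing_backfill(table):
    """Rows whose new column has not been populated yet."""
    return [row for row in table if row["length_ms"] is None]

def backfill(table):
    """Copy values from the old column to the new one, batch by batch.

    The query-then-update loop is naturally idempotent: once a row's new
    column is populated, it stops matching the query and is skipped on reruns.
    """
    while batch := rows_needing_backfill(table)[:BATCH_SIZE]:
        for row in batch:
            # Hypothetical conversion: the old column stored seconds.
            row["length_ms"] = row["length"] * 1000

table = [
    {"id": 1, "length": 3, "length_ms": None},
    {"id": 2, "length": 5, "length_ms": 5000},  # already backfilled
    {"id": 3, "length": 7, "length_ms": None},
]
backfill(table)
```

Because the loop re-queries for unpopulated rows each iteration, running the
command twice, or resuming it after a crash, does no duplicate work.
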
To reiterate: yes, this is a much more tedious process. However, the benefits of
this approach are listed below.

Relatively similar processes and patterns can be applied to other
"downtime-causing" database changes. These are covered in
[this GitHub gist](https://gist.github.com/majackson/493c3d6d4476914ca9da63f84247407b)
with specific instructions for handling them in a Django context.

### Benefits of this approach

#### Zero-downtime

The entire point, of course. This benefits everyone who depends on the
application's uptime and reliability.

#### Reversibility

If the new version of the application has a critical bug, whether related to the
data changes or not, we can revert each step to the previous version without
issue or data loss. Even during the data transformation process, because the
version of the application running is updating both columns, if you have to
revert to the first version (or even earlier) that doesn't use the new column,
the old column will still have up-to-date data and no user data will be lost.
This would complicate the data transformation process, however, as previous
versions of the application will not be updating the new column, and it would
likely be necessary to delete the data from the new column and start the data
transformation over from the beginning. This can consume a significant amount
of time but is overall less of a headache than data loss or fully broken
deployments.

#### Intentionality and expediency

Because of the great lengths required to change a column name, the process will
inevitably cause contributors to ask themselves: is this worth it? While
changing the name of a column can be helpful to disambiguate the data in the
column, using a model attribute alias can be just as helpful without any of the
disruption or time of a data transformation. These kinds of questions prompt us
to make expedient choices that deliver features, bug fixes, and developer
experience improvements faster.

#### Shorter deployment times

Ideally, maintainers orchestrating a production deployment of the service are
keenly aware of the progress of the deployment. This is only a realistic and
sustainable expectation, however, if deployments take a "short" amount of time.
What "short" means is up for debate, but an initial benchmark can be the
Openverse production frontend deployments, which currently take about 10
minutes. Anything longer than this seems generally unreasonable to expect
someone to keep a very close eye on. Sticking to zero-downtime deployments
helps keep short deployments the norm. Even though it sometimes asks us to
deploy more _often_, those deployments can, and in all likelihood should, be
spread over multiple days. This makes the expectation of keeping a close watch
on the deployment more sustainable long-term and helps encourage us to deploy
more often. In turn, this means new features and bug fixes get to production
sooner.

#### Possibility to throttle

Management commands that iterate over data progressively can be throttled to
prevent excessive load on the database or other related services that need to be
accessed.

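Such throttling can be as simple as pausing between batches. A minimal sketch,
with illustrative batch size and delay values (not actual Openverse settings)
and the sleep function injectable so the loop can be tested without waiting:

```python
import time

# Sketch of a throttled batch loop. In a real management command each batch
# would be an ORM query plus a bulk update rather than a list slice.
BATCH_SIZE = 100
DELAY_SECONDS = 0.01  # pause between batches to limit database load

def process_in_batches(items, handle_batch, sleep=time.sleep):
    """Apply handle_batch to fixed-size slices, sleeping between slices."""
    for start in range(0, len(items), BATCH_SIZE):
        handle_batch(items[start : start + BATCH_SIZE])
        sleep(DELAY_SECONDS)

processed = []
process_in_batches(list(range(250)), processed.extend, sleep=lambda _: None)
```

Raising the delay (or lowering the batch size) trades total runtime for lighter
load on the database, which is usually the right trade for a long-running
transformation.
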
#### Unit testing

Management command data transformations can be far more easily unit tested using
our existing tools and fixture utilities.

### Long-running migrations

Sometimes long-running schema changes are unavoidable. In these cases, provided
that the instructions above are followed to prevent the need for downtime, it is
reasonable to take alternative approaches to deploying the migration.

At the moment we do not have specific recommendations or policies regarding
these hopefully rare instances. If you come across the need for this, please
carefully consider the reasons why it is necessary in the particular case and
document the steps taken to prepare and deploy the migration. Please update this
document with any general findings or advice, as applicable.

## Django management command based data transformations

### Why use management commands for data transformations instead of Django migrations?

Django has a built-in data transformation feature that allows executing data
transformations during the migration process. Transformations are described
using Django's ORM and executed in a single pass at migration time. If you want
to move data between two columns, for example, Django makes it trivial to do so
with these "data migrations".
[Documentation for this Django feature is available here](https://docs.djangoproject.com/en/4.1/topics/migrations/#data-migrations).
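
For reference, such a data migration is declared with `migrations.RunPython`
and runs in a single pass when the migration is applied. The app, model, and
dependency names below are hypothetical, chosen to match the earlier case
study:

```python
from django.db import migrations
from django.db.models import F

def copy_length(apps, schema_editor):
    # Runs once, at migration time, for however long the table requires.
    Audio = apps.get_model("audio", "Audio")
    Audio.objects.filter(length_ms__isnull=True).update(length_ms=F("length"))

class Migration(migrations.Migration):
    dependencies = [("audio", "0042_add_length_ms")]  # hypothetical
    operations = [
        migrations.RunPython(copy_length, migrations.RunPython.noop),
    ]
```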

When considering the potential issues with using Django migrations for data
transformations with our current deployment strategy, keep in mind the following
details:

- Migrations are run _at the time of deployment_ by the first instance of the
  new version of the application that runs in the pool.
  - **Note**: This specific detail will only be the case once we've fully
    migrated to ECS based deployments. For now, one of the people deploying the
    application manually runs the migrations before deploying. The effect is the
    same, though: we end up with a version of the application running against a
    database schema that it's not entirely configured to work with. Whether that
    is an issue depends solely on whether the practices described in this
    document regarding migrations have been followed.
- Deployments should be timely so that developers are able to reasonably monitor
  their progress and have clear expectations for how long a deployment should
  take. Ideally a full production deployment should not take much longer than 10
  minutes once the Docker images are built. Those minutes are already spent by
  the process ECS undergoes to deploy a new version of the application.

With those two key details in mind, the main deficiency of using migrations for
data transformations may already be evident: time. Django migration based data
transformations dealing with smaller tables may not take very long, and in some
cases this issue might not apply. However, because it is extremely difficult to
predetermine the amount of time a migration will take, even data
transformations for small datasets should still heed the recommendation to use
management commands. In particular, it can be difficult to predict how tables
with indexes (especially unique constraints) will perform during a SQL data
migration.

Realistically (and provided it is avoidable), any Django migration that takes
longer than 30 or so seconds is not acceptable for our current deployment
strategy. Because the vast majority of data transformations will take longer
than a few seconds, there is a strong, blanket recommendation against using
migrations for them. Exceptions to this recommendation may exist, however. If
you're working on an issue that involves a data transformation, and you think a
migration is truly the best tool for the job and can demonstrate that it will
not take longer than 30 seconds in production, then please include these
details in the PR.

### General rules for data transformations

These rules apply to data transformations executed as management commands or
otherwise.

#### Data transformations must be [idempotent](https://en.wikipedia.org/wiki/Idempotence)

This particularly applies to management commands because they can theoretically
be run multiple times, either by accident or as an attempt to recover from or
continue after a failure.

Idempotency is important for data transformations because it prevents
unnecessary duplicate processing of data. Idempotency can be achieved in three
ways:

1. By checking the state of the data and only applying the transformation to
   rows for which the transformation has not yet been applied. For example, if
   moving data between two columns, only process rows for which the new column
   is null. Once data has been moved for a row, it will no longer be null and
   will be excluded from the query.
1. By checking a timestamp available for each row before which it is known that
   data transformations have already been applied.
1. By caching a list of identifiers for already processed rows in Redis.

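The third approach can be sketched as follows. A plain Python set stands in
here for the Redis set (a real command might use redis-py's `sadd` and
`sismember` on a named key), and the row transformation itself is purely
illustrative:

```python
# Idempotency via a cache of already-processed identifiers.
processed_ids = set()  # stand-in for a Redis set

def transform_row(row):
    row["value"] = row["value"].upper()  # illustrative transformation

def run_transformation(rows):
    """Process each row at most once, however many times this command runs."""
    for row in rows:
        if row["id"] in processed_ids:
            continue  # already handled in a previous (possibly failed) run
        transform_row(row)
        processed_ids.add(row["id"])

rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
run_transformation(rows)
run_transformation(rows)  # a second run skips every row
```

Unlike the first approach, this works even when the transformation leaves no
marker in the data itself (e.g. an in-place update).
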
#### Data transformations should not be destructive

Data transformations should avoid being destructive, if possible. Sometimes this
is unavoidable because data needs to be updated "in place". In these cases, it
is imperative to save a list of modified rows (for example, in a Redis set) so
that the transformation can be reversed if necessary.

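One way to keep an in-place transformation reversible is to record each row's
identifier and prior value before modifying it. A sketch, with a plain dict
standing in for the Redis structure and a hypothetical title-normalizing
transformation:

```python
# Record originals before destructively updating, so the change can be undone.
modified = {}  # id -> value before the transformation (stand-in for Redis)

def scrub_titles(rows):
    """Destructively normalize titles, remembering the originals."""
    for row in rows:
        if row["id"] not in modified:
            modified[row["id"]] = row["title"]  # save before overwriting
        row["title"] = row["title"].strip().lower()

def revert(rows):
    """Restore every modified row to its pre-transformation value."""
    for row in rows:
        if row["id"] in modified:
            row["title"] = modified[row["id"]]

rows = [{"id": 1, "title": "  Old Mill "}, {"id": 2, "title": "River"}]
scrub_titles(rows)
revert(rows)
```

Guarding the save with `if row["id"] not in modified` keeps the record of
originals correct even if the transformation is run more than once.
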
### If a Django migration _must_ be used

In the rare case where a Django migration must be used, keep in mind that using
a
[non-atomic migration](https://docs.djangoproject.com/en/4.1/howto/writing-migrations/#non-atomic-migrations)
can help make it easier to recover from unexpected errors without causing the
entire transformation process to be reversed.
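
In Django this is done by setting `atomic = False` on the migration class,
which disables the transaction that normally wraps each migration on databases
such as PostgreSQL, so partial progress survives an error:

```python
from django.db import migrations

class Migration(migrations.Migration):
    # Without the wrapping transaction, rows already processed stay processed
    # if a later batch fails; pair this with an idempotent operation so the
    # migration can simply be re-run after the failure is fixed.
    atomic = False
```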