Programs & Events Dashboard recently experienced a major database crash: https://phabricator.wikimedia.org/T411890
In this case, the database VM's filesystem broke, and the attached storage volume that holds the database files was also partly corrupted.
Since the rollout of the timeslice system, we're no longer worried about running out of storage on the Programs & Events Dashboard server. Once the duplicated files used during the recovery process are cleaned up, we'd have enough extra storage to hold a series of database dumps.
Ideally, we'd want to do dumps while no course updates are running, so I think we'd want to build a system around something like:
- On a weekly basis, pause the Sidekiq course update workers and other non-default workers, allow them to finish their current jobs, then do a database dump and turn them back on.
- Automatically curate the set of database dumps to keep (e.g., one that is about 6 months old, one that is about a month old, and one that is at most a week old), and delete old ones as new ones get created.
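
As a rough illustration of the curation step, here is a minimal Python sketch of one possible retention rule: always keep the newest dump, and for each target age keep the dump whose date is closest to it. The function names, the `TARGET_AGES` values (7, 30, and 182 days, taken from the example above), and the assumption that each dump can be identified by its creation date are all hypothetical; none of this exists in the Dashboard codebase.

```python
from datetime import date, timedelta

# Hypothetical target ages, based on the example in this issue:
# about a week, about a month, about six months.
TARGET_AGES = (timedelta(days=7), timedelta(days=30), timedelta(days=182))


def select_dumps_to_keep(dump_dates, today, target_ages=TARGET_AGES):
    """Return the set of dump dates to keep.

    Keeps the newest dump, plus, for each target age, the dump whose
    date is closest to (today - target age).
    """
    dump_dates = list(dump_dates)
    if not dump_dates:
        return set()
    keep = {max(dump_dates)}
    for age in target_ages:
        target = today - age
        keep.add(min(dump_dates, key=lambda d: abs(d - target)))
    return keep


def select_dumps_to_delete(dump_dates, today, target_ages=TARGET_AGES):
    """Return the set of dump dates that are not selected for keeping."""
    return set(dump_dates) - select_dumps_to_keep(dump_dates, today, target_ages)
```

With this rule, a dump can only fill an older slot if it was kept while it was younger, so the real ages of the kept dumps will drift around the targets rather than match them exactly. A real version would also need to map dates back to dump files and would probably want to refuse to delete anything if the newest dump looks incomplete.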