PITR
Enabling/disabling of the PITR slicing is done via the config value pitr.enabled:
$ pbm config --set=pitr.enabled=true
$ pbm config --set=pitr.enabled=false
Enabling PITR means that some agent on each replicaset is going to save oplog chunks to the storage defined in the config. For details on the storage layout, see PITR: storage layout. Each chunk, with some exceptions described below, captures about a 10-minute span of oplog events. For the chunks to be useful for a restore, the chunk sequence has to follow these restrictions (a sketch of the continuity checks follows the list):
- The chunk chain should form a continuous timeline, since any gap would violate the consistency of the data changes.
- The chunk chain should begin after some backup snapshot, since the oplog is, roughly speaking, a log of events applied to some data, and during the restore we need to recover this data first before replaying the events and "moving" to the specified point in time.
- The chunk timeline shouldn't overlap with any restore events, since a restore intrudes into the events timeline and makes it invalid - a chunk can no longer continue its lineage to the snapshot because the "snapshot data" was rewritten.
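To make the first two restrictions concrete, here is a minimal Go sketch (not PBM's actual code; the Chunk fields and the plain integer timestamps are hypothetical) that checks whether a chunk chain forms a continuous timeline starting at or before a given backup snapshot:

```go
package main

import "fmt"

// Chunk is a simplified stand-in for a PITR oplog chunk's metadata:
// the span of oplog time it covers (hypothetical field names).
type Chunk struct {
	StartTS uint64 // first oplog event covered by the chunk
	EndTS   uint64 // last oplog event covered by the chunk
}

// validChain reports whether the chunk chain can be used to restore to a
// point in time after the given snapshot:
//   - the first chunk must start no later than the snapshot's TS,
//   - every next chunk must start exactly where the previous one ended
//     (any gap breaks the timeline).
func validChain(snapshotTS uint64, chunks []Chunk) bool {
	if len(chunks) == 0 || chunks[0].StartTS > snapshotTS {
		return false
	}
	for i := 1; i < len(chunks); i++ {
		if chunks[i].StartTS != chunks[i-1].EndTS {
			return false // a gap or overlap in the timeline
		}
	}
	return true
}

func main() {
	snapshot := uint64(100)
	chain := []Chunk{{90, 160}, {160, 220}, {220, 300}}
	fmt.Println(validChain(snapshot, chain)) // true
	broken := []Chunk{{90, 160}, {180, 220}}
	fmt.Println(validChain(snapshot, broken)) // false: gap between 160 and 180
}
```

In PBM the chunk boundaries are BSON timestamps taken from the oplog, but the continuity idea is the same: each chunk has to pick up exactly where the previous one ended.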
On start, each agent spawns a background process that constantly (currently every 15 seconds) checks the pitr.enabled state. If it is "on", the agent checks whether anyone else in the replicaset is already doing the job (i.e. whether there is a lock for the PITR operation and it is not stale). If no one is, the agent will try to acquire the lock for the "PITR" operation and start slicing. If pitr.enabled is "off", it will send a cancellation signal to the slicing routine if it has one.
Starting the slicing, the agent will first of all do a "catch-up" - define the start point. The starting point is set to the last backup's or the last PITR chunk's TS, whichever is the most recent. It also checks that no restore has intercepted the timeline (hence that there are no restores after the most recent backup).
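A rough sketch of that catch-up logic (the type, field names, and plain integer timestamps are hypothetical simplifications):

```go
package main

import (
	"errors"
	"fmt"
)

// catchUpState is a simplified view of what the agent learns from the
// control collections before it starts slicing.
type catchUpState struct {
	lastBackupTS  uint64 // TS of the most recent backup snapshot
	lastChunkTS   uint64 // end TS of the most recent PITR chunk (0 if none)
	lastRestoreTS uint64 // TS of the most recent restore (0 if none)
}

// startPoint picks where slicing should resume: the most recent of the last
// backup and the last chunk, unless a restore has intruded into the timeline
// after the last backup (then the old chain must not be continued).
func startPoint(s catchUpState) (uint64, error) {
	if s.lastRestoreTS > s.lastBackupTS {
		return 0, errors.New("a restore happened after the last backup; a new backup is needed before PITR can resume")
	}
	if s.lastChunkTS > s.lastBackupTS {
		return s.lastChunkTS, nil
	}
	return s.lastBackupTS, nil
}

func main() {
	ts, err := startPoint(catchUpState{lastBackupTS: 100, lastChunkTS: 160})
	fmt.Println(ts, err) // 160 <nil>: continue from the last chunk

	_, err = startPoint(catchUpState{lastBackupTS: 100, lastRestoreTS: 120})
	fmt.Println(err) // restore after the backup: cannot continue the chain
}
```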
Next, the slicing routine is started. It runs an infinite loop which on each step blocks for 10 minutes, then gets an oplog slice from the last point until now, saves it to the storage, adds the chunk metadata to the pbmPITRChunks collection, and updates the "last point". There are two ways to wake up the slicing process before the 10-minute timeout. The first is to send a wake-up signal; the behaviour will be the same as with a scheduled wakeup - it's just a way not to wait up to 10 minutes. The second is to send a cancellation signal (for example, when PITR was switched off in the config). In that case, the routine will do the same - capture and save the oplog from the last point to now - but it exits after finishing this cycle.
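A sketch of that loop under the same assumptions (saveSlice here only logs; the real routine uploads the oplog data to the storage and records the chunk metadata):

```go
package main

import (
	"context"
	"log"
	"time"
)

// saveSlice stands in for "read the oplog from lastTS until now, upload it
// to the storage and add the chunk metadata to the pbmPITRChunks collection".
// Here it just logs the span and returns the new "last point".
func saveSlice(lastTS time.Time) time.Time {
	now := time.Now()
	log.Printf("saved oplog chunk %s .. %s", lastTS.Format(time.RFC3339), now.Format(time.RFC3339))
	return now
}

// slicer sketches the infinite slicing loop: each step waits up to 10 minutes
// (or less, if a wake-up signal arrives), saves a chunk, and repeats. On
// cancellation it saves one final chunk and exits.
func slicer(ctx context.Context, wake <-chan struct{}, lastTS time.Time) {
	const span = 10 * time.Minute
	for {
		exit := false
		select {
		case <-time.After(span): // scheduled wake-up
		case <-wake: // explicit wake-up: same behaviour, just no waiting for the full 10 min
		case <-ctx.Done(): // cancellation, e.g. pitr.enabled was set to false
			exit = true
		}
		lastTS = saveSlice(lastTS) // capture the oplog from the last point to now
		if exit {
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	wake := make(chan struct{})
	go slicer(ctx, wake, time.Now())

	wake <- struct{}{}      // force an early slice
	time.Sleep(time.Second) // let it run
	cancel()                // "switch PITR off": one last slice, then the routine exits
	time.Sleep(time.Second)
}
```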
After the slicing routine wakes up, it checks whether the agent still owns the lock. If the agent, along with its Mongo node, was separated from the cluster for quite some time, its lock will become stale, and some other node will acquire a new lock and continue slicing. But if the separated node returns to the cluster, we don't want to have more than one node doing the same job.
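As a sketch of that ownership check (the lockDoc type and its fields are hypothetical):

```go
package main

import "fmt"

// lockDoc is a simplified stand-in for the PITR lock kept in the control
// collection: which node owns it (PBM also tracks a heartbeat to detect
// staleness, omitted here).
type lockDoc struct {
	OwnerNode string
}

// stillOwner reports whether this agent should keep slicing. If another node
// took over the lock while this one was separated from the cluster, the
// returned error tells the slicer to stop so only one node does the job.
func stillOwner(self string, l lockDoc) error {
	if l.OwnerNode != self {
		return fmt.Errorf("the PITR lock is now owned by %s, stopping the slicer", l.OwnerNode)
	}
	return nil
}

func main() {
	fmt.Println(stillOwner("rs0/node1:27017", lockDoc{OwnerNode: "rs0/node1:27017"})) // <nil>: keep slicing
	fmt.Println(stillOwner("rs0/node1:27017", lockDoc{OwnerNode: "rs0/node2:27017"})) // lock taken over: stop
}
```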