-
Let's state the benefit of doing so: AFAIU we assume replication will take more time than local backup recovery.
-
It seems that this will be the main RFC, which unites the replicaset and cluster backup. Currently it's very unclear to me what we're doing, way too many questions.

Strict overview of motivation?

Let's first figure out why we implement this and what we want to achieve in the end. We must strictly describe the goals of the RFC (e.g. do users want Point-In-Time recovery or not; according to https://jira.vk.team/browse/TNTP-2825 they do) and the guarantees we give to users. For the guarantees we can check backup tools for other databases:
And please include the links to the associated GitHub and Jira tickets, it's very difficult to find them now.

How will the process look for the end user?

We must determine how the backup process will look for the end user. Is he going to take the tool from the SDK, configure it and start the backup? In that case the tool should automatically move the needed files to the configured servers. Then a user just calls the tool one more time and it restores the cluster from a backup? Or do we expect the user to call some vshard/aeon function that will return which files should be copied and from which server, then the user manually goes to every instance, copies files to some servers and then uses the tool to restore the cluster? Or is it going to be TCM and/or ATE? At first glance, it looks like we need all of them.

API of replicaset/cluster

Then we should define how the API of the replicaset/cluster will look, as it will be called by a user or by our tool. Will we use

Will writing the metadata (e.g. timestamp, instance info) of the backup to a file be a separate API? Or will we have

Review
It may happen in VShard, if it's done without any protections:
For that we could use the already existing
It's not possible in VShard now and I'm not sure it's possible to implement that at all while preserving safety. If that's needed, we'll have to write a careful RFC to investigate it.
-
Improved this part, hopefully addressed all the questions.
The high level API is outside of this RFC's goals. Here we only describe the Tarantool API that can be used for backup/restore by a backup agent. At this point it is not clear why we should add an extra API to fetch the instance config. It should already be known due to the config in Tarantool 3, or one can call
Thanks for the suggestion! This part was abstract, so I added concrete steps for vshard (using
-
Is it necessary? Can't we recreate a replicaset with different UUIDs, maybe even with a different replication factor?
-
Do we actually support multimaster setups? Is it possible to configure one with Tarantool 3.0 config?
-
-
How is it different from a
Does this info survive a node restart?
I think this info should be returned via
AFAIU this mode only differs in
-
Sorry, I still have a lot of questions regarding that feature)
Still, the same comment as in https://github.com/orgs/tarantool/discussions/12039. I don't think we should rely on the tt team to figure this all out if we decided to design such a document, since now it's not obvious how the restore process will look to the end user. Instead, we can invite them to review the document and think about the API together.
It also suspends all removal of outdated backups (I'd rather state that in the RFC) and it may be a problem. I'm afraid there may be users who will go to every master in the cluster and call something like:

```lua
function backup_start()
    box.backup.start()
    return box.backup.info()
end
```

After that they'll go to every instance and copy the files, and then try executing the following on all masters:

```lua
function backup_end()
    box.backup.stop()
end
```

But it may fail on some of them (e.g. no connection) and then we'll get instances which never delete the backup files, which will sooner or later lead to an incident, disk space is not infinite. If we expect the user to use the backup as follows in terms of the cluster:

```lua
function backup_start()
    local files = box.backup.start()
    -- do smth with the `fio` module
    box.backup.stop()
end
```

I'd propose introducing a force close of the backup on exit from the function call, as we do for transactions. However, it may have a performance impact, which I don't really like. If we intentionally allow the user to split
This one is not backward compatible, the first argument to `box.backup.start()`:

```lua
box.backup.start(0, {type = 'full'})
-- or
box.backup.start(nil, {type = 'full'})
```
Did I understand you correctly that if we had the following snap done before:

- ./00000000000000001111.snap
- ./00000000000000001111.xlog
- ./00000000000000002222.xlog

and a new xlog file appeared, then during an incremental backup only it will be returned:

- ./00000000000000005555.xlog

Then the question is how Tarantool will remember what files were in the last backup. Will this info survive an instance restart? If so, where will it be persisted?
I didn't parse that sentence, could you elaborate please?
```yaml
> box.backup.start()
---
- 1: ./00000000000000000777.xlog
  2: ./00000000000000001111.snap
  3: ./00000000000000001111.xlog
  4: ./00000000000000002222.xlog
  type: 'full'
```

Key-value and index-based entries in the same table look way too crutchy. In terms of `box.backup.info()` I'd rather expect:

```yaml
> box.backup.info()
---
- files:
  - ./00000000000000000777.xlog
  - ./00000000000000001111.snap
  - ./00000000000000001111.xlog
  - ./00000000000000002222.xlog
  type: 'full'
  <and so on...>
```

WDYT?
And the last question, about incremental backups. Why do we want it at all? Do we have any users who request it? It's way too complicated to implement such a thing in Tarantool, and for the client application it costs nothing: it will just send the function a list of files it already has, start a new backup and skip the files which are already persisted. That's it. I'd think twice whether we want it in core) Moreover, as you already stated, we have problems with master changes.
And why don't we want to always return that info and not introduce that
Hmm, and it'll be considered a successful full backup and will affect incremental ones after it? Even when the user doesn't like it and didn't copy the files? That's strange. Maybe we need something like
A multi-master setup doesn't exist in production, but the async cluster is obviously the most popular. And here's the question: do our clients take backups from all instances in every replicaset? I'm pretty sure they don't.
I didn't understand what is meant here, and I'm pretty sure
Will we take that info from
This is exactly what I described in the first comment, very dangerous.

Nits
If it's not difficult for you, let's please number the parts of the RFC, e.g. It's not easy to parse the font difference between chapters and it's not very obvious what subsection I'm reading.
-
This RFC is outdated. The new one is here.
Reviewers:
TOC
Changelog
- v4 12/02/2026: Kept `box.backup.start()` result backward compatible, added more info to `box.backup.info()`, dropped the requirement to make the backup from the master for a synchronous replicaset.
- v3 23/01/2026: Described incremental backup and introduced `box.backup.info()`.
- v2 04/12/2025: Described steps to backup/restore explicitly. Added technical details to make sure xlogs have only committed data in case of a synchronous replicaset. Added multimaster/asynchronous master-replica replicaset backup/restore. Described backup in case of vshard. Made misc changes to improve document structure.
- v1: Added initial version.

Links
GitHub issue #11729 (which also holds further references).
Document aims
The purpose of the document is to describe how to back up and restore Tarantool at the instance, replicaset and cluster level. We describe the API only at the instance level (vshard is an exception, it is an example of a cluster backup); all other steps should be done by a backup agent. Also we expect the `tt` CLI will provide a more user-friendly interface for backup/restore on the base of this RFC.

Not all of what is described below is how Tarantool currently works; rather it is how we plan to make it work in terms of backup/restore.
Use cases
Backup is done to restore after all data is lost.
Other known use cases:
Backup consistency
We do not elaborate here on making a replicaset/cluster backup consistent in the sense that it represents the state at some moment in global time, as there is no such notion yet. However, each replica/shard has the data at the "moment" of backup start. This moment may differ from replica to replica and from shard to shard due to network latencies, replica failures and internal events which may delay the backup start (see the technical details for asynchronous replicaset backup).
Scope
Tickets mention PITR (point-in-time recovery). The current agreement is that incremental backups will provide enough points of consistent recovery, so we don't need anything more fine-grained.
Single instance
Backup
When `box.backup.start()` is called, the WAL is rotated and the function returns a list of files required to restore to the current point. It is the last snapshot and all WAL files after it, up to the rotated one. The backup agent is supposed to copy the listed files. It may optimize backup storage and copy only new WAL files if there was no new snapshot since the last backup. After the files are copied, call `box.backup.stop()`.

Example:
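A minimal sketch of this flow from a backup agent's point of view (the destination directory and the helper function name are illustrative, not part of the proposed API):

```lua
local fio = require('fio')

-- Illustrative agent step: rotate the WAL, copy the returned files to a
-- backup directory, then let garbage collection resume.
local function backup_instance(dst_dir)
    local files = box.backup.start()
    for _, path in ipairs(files) do
        fio.copyfile(path, fio.pathjoin(dst_dir, fio.basename(path)))
    end
    box.backup.stop()
    return files
end

backup_instance('/mnt/backups/instance-001')
```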
So to backup an instance we need the next steps:

1. Call `box.backup.start()` on the instance.
2. Copy the listed files to the backup storage.
3. Call `box.backup.stop()`.

The above example is for memtx-only spaces. In case there are vinyl spaces the list of data files will include `*.vylog`, `*.index` and `*.run` files.

Technical details
In case of vinyl the list will not include `*.index` and `*.run` files not included in `*.vylog`, for example files dumped since the last snapshot. We are going to use `*.xlog` to restore data written since the last snapshot. Currently vinyl cannot dump memory during xlog recovery, which may be triggered, so restoring from such a backup is not possible now. We are going either to support dump during recovery or to require that recovery be done without xlog and the xlog be applied afterwards using tooling (tt).

Incremental backup
In case of incremental backup we should be able to back up only the difference since the last full/incremental backup. In this case the backup procedure is similar to the one described for regular backup, but the backup is started by a `box.backup.start({type = 'full'})` or `box.backup.start({type = 'incremental'})` call.

If one requests an incremental backup when there was no previous full or incremental one, then a full backup is done. Also, data files for incremental backup are kept for a limited amount of time set by the new configuration option `backup_gc_timeout`. If an incremental backup is requested after this period of time, then a full backup is done as well.

To let the client check whether a full or incremental backup was done, we add the key `type` to the table returned by `box.backup.info()`. Like in the case of starting a backup, it can have the value `full` or `incremental`. If the backup is neither full nor incremental (regular), then `type = 'default'`.
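A sketch of how a backup agent might use this (the options-table form of the call follows the examples elsewhere in this RFC; it is not the current `box.backup` API):

```lua
local log = require('log')

-- Ask for an incremental backup. Per this RFC a full backup is done instead
-- if there was no previous full/incremental backup or if backup_gc_timeout
-- has expired since the last one.
local files = box.backup.start({type = 'incremental'})
if box.backup.info().type == 'full' then
    log.info('backup chain could not be continued, a full backup was made')
end
-- ... copy `files` to the backup storage ...
box.backup.stop()
```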
Introspection

The new `box.backup.info()` function will return the list of backup files like in `box.backup.start()` and extra information. If there is no active backup then `nil` is returned.

Example:
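An illustrative console session, assuming the file list is returned under a `files` key together with the fields described below; the exact layout is a sketch, not the final format (`mode` is omitted since its value for a plain instance backup is not specified in this RFC):

```yaml
tarantool> box.backup.info()
---
- files:
  - ./00000000000000001111.snap
  - ./00000000000000001111.xlog
  - ./00000000000000002222.xlog
  vclock_start: {1: 1111}
  vclock_end: {1: 2222}
  time: 1739000000
  type: 'full'
...
```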
Here:
- `vclock_start` - vclock of the first statement in the backup files
- `vclock_end` - vclock of the last statement in the backup files
- `time` - Unix time when the backup was started
- `type` - type of the backup (default/full/incremental, check the incremental backup section)
- `mode` - check the backup of non-synchronous replicaset section

Recovery
To recover an instance one needs to put all the data files (listed on backup in `box.backup.start()`) into the working directory of the instance before it starts, if the backup is full. If the backup is incremental then one needs to put the files from the full backup and all incremental backups since it, up to the required point.
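A sketch of the restore step under these assumptions (paths are illustrative; the instance must be stopped and its working directory empty before the files are put in place):

```lua
local fio = require('fio')

-- Illustrative restore helper: copy every file of the (full) backup into the
-- instance's working directory before the instance is started.
local function restore_instance(backup_dir, work_dir)
    for _, name in ipairs(fio.listdir(backup_dir)) do
        fio.copyfile(fio.pathjoin(backup_dir, name),
                     fio.pathjoin(work_dir, name))
    end
end

restore_instance('/mnt/backups/instance-001', '/var/lib/tarantool/instance-001')
```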
Synchronous replicaset

In this case it is enough to back up only the master. We cannot have too outdated data in this case. The master can be up-to-date or not. The latter case is when there is a new term and a new master in this term, and this instance does not know it yet and considers itself a master. If the master is up-to-date then it holds all the committed data up to now. If the master is not up-to-date then the replicaset can hold newer committed data, but as the master can continue to consider itself a master only for the election timeout, the amount of this data is limited.
In this case it is enough to back up only a single replica. It can be the master, or a replica if it is in the `follow` state.

So to backup such a replicaset we need the next steps:

- … (`box.info.uuid` for example).

To restore such a replicaset we need the next steps:

- … Instance.

More technical details on replicaset recovery from a single instance backup are in #12039.
Incremental backup
An incremental backup will not be possible if the backup is started on a replica different from the one used for the previous incremental/full backup. Though the incremental backup will not fail: a full backup will be done instead.
There is another issue we need to take care of due to changing the replica being backed up. For example, we make full backup F1 at replica A, then we make full backup F2 at replica B, then we request incremental backup I3 at replica A again. I3 is the difference since F1, which by chance is not yet garbage collected. The client probably expects that I3 holds the difference from F2. Obviously the client can tackle this situation and not request an incremental backup after changing the replica being backed up.
Technical details
Without extra precautions the xlogs can contain uncommitted transactions. These transactions can be rolled back later in the replicaset history, but on restore they can be applied. So after restore we may have statements that will never be visible in the replicaset history. We can avoid that if we wait for all uncommitted transactions that get into the backup xlog to be committed. If they get rolled back then `box.backup.start()` should raise an error. There should be a special error code, so the client can retry starting the backup on this error, as the error is transient.
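A sketch of what the retry could look like on the agent side; the dedicated error code is not defined yet, so this sketch simply retries on any error from `box.backup.start()`:

```lua
local fiber = require('fiber')

-- Retry starting the backup while the transient "transactions in the backup
-- xlog were rolled back" error is raised. A real agent would check the
-- dedicated error code and propagate any other error immediately.
local function start_backup_with_retry(opts, retries)
    local ok, res
    for _ = 1, retries do
        ok, res = pcall(box.backup.start, opts)
        if ok then
            return res
        end
        fiber.sleep(0.1)
    end
    error(res)
end
```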
Non synchronous replicaset

This can be a multimaster replicaset or an asynchronous master-replica replicaset. In both cases making a backup of only a single instance of the replicaset as described above can miss some data. For example, in case of multimaster the replication can be paused due to a long-standing conflict, so instances can have different statements. If we back up only one of the instances we miss the statements from the other that are not replicated. As the conflict can exist for a long period of time, we can miss data in backups for this period.
So to backup such a replicaset we need the next steps:

- Call `box.backup.start({mode = 'replicaset'})`, then call `box.backup.info()` to get extra information for the backup. It will have `vclock_start` and `vclock_end` keys in this mode. These are the vclock range of statements present in the data files of the backup. The backup agent should check that the intervals of all replicas overlap for each vclock component; in this case there will be no rebootstrap after restore. In case the condition is not met, the backup should be restarted (`box.backup.stop()` / `box.backup.start({mode = 'replicaset'})`).

Example:
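A sketch of the overlap check the backup agent could perform, given the `box.backup.info()` results collected from every replica (plain Lua; how missing vclock components are treated is an assumption):

```lua
-- infos: array of box.backup.info() results, one per replica.
-- The [vclock_start, vclock_end] intervals of all replicas must intersect
-- for every vclock component, i.e. max(starts) <= min(ends).
local function backups_overlap(infos)
    local components = {}
    for _, info in ipairs(infos) do
        for id in pairs(info.vclock_start) do components[id] = true end
        for id in pairs(info.vclock_end) do components[id] = true end
    end
    for id in pairs(components) do
        local max_start, min_end = 0, math.huge
        for _, info in ipairs(infos) do
            max_start = math.max(max_start, info.vclock_start[id] or 0)
            min_end = math.min(min_end, info.vclock_end[id] or math.huge)
        end
        if max_start > min_end then
            return false
        end
    end
    return true
end
```

If `backups_overlap()` returns `false`, the agent restarts the backup as described above.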
To restore such a replicaset we need the next steps:
Technical details
In backup mode `'replicaset'` we list all extra xlogs the other replicas need to connect without rebootstrap, besides the last snapshot and the xlogs after it.

There is still a chance that a rebootstrap will be required. This can happen due to a race. We make a backup of instance A, then we make a backup of instance B. Before that, B advances the gc vclock for A. So the backup of instance B can miss some statements required for A. We can deal with that by inspecting the vclock intervals present in the `box.backup.info()` output. We add `vclock_start` and `vclock_end` to the `box.backup.info()` output in `mode = 'replicaset'`. There will be no rebootstrap if the intervals of all replicas overlap for each vclock component. This check should be done by the backup agent.

Cluster
A mere backup of every replicaset in the cluster without extra coordination may be inconsistent for two reasons.
We can take a full cluster write lock during backup to exclude both cases, but this way the backup may impact cluster performance significantly. At the replicaset level the backup is lightweight.
Another approach can handle issue 1 but not issue 2. We can abort/finish in-progress data migrations and disable new ones before starting the replicasets' backup. After it is started, data migrations are enabled again. This can be done fast and does not reduce cluster performance. As to issue 2, in the latter approach we can only rely on the application, i.e. that the application can restore consistently by itself somehow.
vshard
In case of vshard we can use `vshard.router.map_callrw()` to start the backup on every shard. This way all in-progress rebalancing will be finished before starting the backup. vshard consists of synchronous replicasets, so we need a synchronous replicaset backup (as described in the section above) for every shard.
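A sketch of the router-side call, assuming a small wrapper function is defined on every storage (the wrapper name and the timeout are illustrative):

```lua
-- On every storage instance:
function backup_start_on_storage()
    return box.backup.start()
end

-- On the router: run the wrapper on the master of every replicaset.
local vshard = require('vshard')
local res, err = vshard.router.map_callrw('backup_start_on_storage', {},
                                          {timeout = 30})
assert(res ~= nil, err)
-- `res` maps each replicaset to the result of the call on its master.
```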
So to backup a vshard cluster we need the next steps:

- Call `vshard.router.map_callrw()` with the function `box.backup.start()`. Make each shard backup as described in the section for synchronous replicaset backup.

To restore the cluster we need the next steps: