Disk Manager Filestore metadata backup
This document describes the algorithm for backup and restore of the Filestore metadata and the storage schema for backups. Backups are expected to be performed as a Disk Manager dataplane persistent task, which can be interrupted and retried idempotently. Filestore backups are expected to use point-in-time checkpoints. Filestore does not have production-ready checkpoint support yet (see #1923). Until then, the backups are expected to be inconsistent.
Proposed solution
Backup:
The filestore backup creation should consist of the following steps:
- API request handling, controlplane task creation
- Controlplane task, which creates a record in the controlplane database, creates a checkpoint, and schedules the dataplane metadata backup task.
- Metadata backup task, which reads all the metadata from the checkpoint and writes it to the backup storage.
Restoration:
- API request handling, controlplane task creation
- Controlplane task, which creates a filestore and the corresponding record in the controlplane database, and schedules a dataplane restoration task.
- Dataplane task, which reads all the metadata from the backup storage and writes it to the filestore.
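As an illustration of the backup flow above, here is a minimal sketch in Go of the controlplane task. The storage, checkpointClient, and scheduler interfaces and all names below are hypothetical stand-ins for the Disk Manager internals, not the real task framework API.
// Sketch only: the controlplane part of the backup flow. All interfaces and
// names here are hypothetical stand-ins for the Disk Manager task framework.
package backup

import "context"

type storage interface {
	// CreateFilesystemBackup inserts the controlplane record in the "creating" state.
	CreateFilesystemBackup(ctx context.Context, filesystemID, backupID string) error
}

type checkpointClient interface {
	// CreateCheckpoint asks the filestore to create a point-in-time checkpoint.
	CreateCheckpoint(ctx context.Context, filesystemID, checkpointID string) error
}

type scheduler interface {
	// ScheduleTask idempotently schedules a dataplane task and returns its ID.
	ScheduleTask(ctx context.Context, taskType string, request interface{}) (string, error)
}

type createBackupTask struct {
	storage     storage
	checkpoints checkpointClient
	scheduler   scheduler

	filesystemID string
	backupID     string
}

// Run creates the controlplane record, the checkpoint and the dataplane task.
// Each step has to be idempotent so that the task can be interrupted and retried.
func (t *createBackupTask) Run(ctx context.Context) error {
	if err := t.storage.CreateFilesystemBackup(ctx, t.filesystemID, t.backupID); err != nil {
		return err
	}
	if err := t.checkpoints.CreateCheckpoint(ctx, t.filesystemID, t.backupID); err != nil {
		return err
	}
	_, err := t.scheduler.ScheduleTask(ctx, "dataplane.BackupMetadata", struct {
		FilesystemID string
		BackupID     string
	}{t.filesystemID, t.backupID})
	return err
}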
Filestore Metadata structure
Filestore metadata consists of inode attributes and the directory structure of the filesystem. A filestore tablet stores its state in several tables, which can be indexed only by their primary keys.
Node refs
Information about the directory structure is stored in the NodeRefs table
struct NodeRefs: TTableSchema<9>
{
struct NodeId : Column<1, NKikimr::NScheme::NTypeIds::Uint64> {};
struct CommitId : Column<2, NKikimr::NScheme::NTypeIds::Uint64> {};
struct Name : Column<3, NKikimr::NScheme::NTypeIds::String> {};
struct ChildId : Column<4, NKikimr::NScheme::NTypeIds::Uint64> {};
struct ShardId : Column<5, NKikimr::NScheme::NTypeIds::String> {};
struct ShardName : Column<6, NKikimr::NScheme::NTypeIds::String> {};
using TKey = TableKey<NodeId, Name>;
using TColumns = TableColumns<
NodeId,
CommitId,
Name,
ChildId,
ShardId,
ShardName
>;
using StoragePolicy = TStoragePolicy<IndexChannel>;
};
Note that NodeId here is the ID of a parent node. In the Disk Manager code, we will stick to the terms parent id and child id.
For sharded filesystems, ShardId is the ID of the shard, which is a separate filesystem that stores files in a flat structure. ShardName is the name of the file in this shard filesystem.
IMPORTANT: The YDB tablet API does not support querying a table simply by a limit and an offset. The NodeRefs table can be queried only by the ID of the parent and by the name of the file.
Information about inodes (atime, ctime, mtime, mode, uid, gid, etc.) is stored in the Nodes table
struct Nodes: TTableSchema<5>
{
struct NodeId : Column<1, NKikimr::NScheme::NTypeIds::Uint64> {};
struct CommitId : Column<2, NKikimr::NScheme::NTypeIds::Uint64> {};
struct Proto : ProtoColumn<3, NProto::TNode> {};
using TKey = TableKey<NodeId>;
using TColumns = TableColumns<
NodeId,
CommitId,
Proto
>;
using StoragePolicy = TStoragePolicy<IndexChannel>;
};
To support extended attributes backup, we will also need to back up the NodeAttrs table
struct NodeAttrs: TTableSchema<7>
{
struct NodeId : Column<1, NKikimr::NScheme::NTypeIds::Uint64> {};
struct CommitId : Column<2, NKikimr::NScheme::NTypeIds::Uint64> {};
struct Name : Column<3, NKikimr::NScheme::NTypeIds::String> {};
struct Value : Column<4, NKikimr::NScheme::NTypeIds::String> {};
struct Version : Column<5, NKikimr::NScheme::NTypeIds::Uint64> {};
using TKey = TableKey<NodeId, Name>;
using TColumns = TableColumns<
NodeId,
CommitId,
Name,
Value,
Version
>;
using StoragePolicy = TStoragePolicy<IndexChannel>;
};
Controlplane record creation
The controlplane database entry should have the following schema:
folder_id: Utf8
zone_id: Utf8
filesystem_id: Utf8
filesystem_backup_id: Utf8
creating_at: Timestamp
created_at: Timestamp
deleting_at: Timestamp
deleted_at: Timestamp
size: Uint64
storage_size: Uint64
status: Int64
After the controlplane record is created, an almost identical database entry is created for the dataplane.
Dataplane metadata backup
The proposed approach for the dataplane.BackupMetadata task is to perform a BFS-like traversal of the filesystem tree with several parallel workers, using a YDB table as a persistent queue.
NodeRefs
The NodeRefs table is indexed by the pair of (NodeId, Name). We will use the standard ListNodes API for directory listing. This way, we do not depend on the internal structure of the filestore.
For that, we will need the following filesystems/node_refs table:
filesystem_backup_id: Utf8
depth: Uint64
parent_node_id: Uint64
name: Utf8
child_node_id: Uint64
node_type: Uint32
The primary key should be (filesystem_backup_id, depth, parent_node_id, name).
Nodes
Node attributes should be stored in the filesystems/nodes table:
filesystem_backup_id: Utf8
node_id: Uint64
mode: Uint32
uid: Uint32
gid: Uint32
atime: Uint64
mtime: Uint64
ctime: Uint64
size: Uint64
symlink_target: Utf8
with primary key (filesystem_backup_id, node_id).
Directory listing queue
For the queue, we will use the following table filesystems/directory_listing_queue:
filesystem_backup_id: Utf8
node_id: Uint64
cookie: Bytes
depth: Uint64
The primary key should be (filesystem_backup_id, node_id, cookie).
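This queue drives the traversal: as described in the dataplane.BackupMetadata task section below, a single DirectoryListingScheduler goroutine reads pending rows from this table and feeds them to the DirectoryLister workers over a channel. Below is a minimal sketch in Go; queueReader, queueItem and their methods are hypothetical stand-ins for the YDB-backed table access code, not an existing API.
// Sketch only: the scheduler side of the persistent queue. queueReader and
// queueItem are hypothetical stand-ins for access to the
// filesystems/directory_listing_queue table; real names and signatures differ.
package backup

import "context"

type queueItem struct {
	NodeID uint64
	Depth  uint64
	Cookie []byte
}

type queueReader interface {
	// ReadItems returns up to limit queue rows for the given backup.
	ReadItems(ctx context.Context, backupID string, limit int) ([]queueItem, error)
}

// directoryListingScheduler feeds queued directories to the lister workers.
// A real implementation must also schedule the root node on the first run
// (SaveStateWithPreparation()) and must skip rows already handed to workers.
func directoryListingScheduler(
	ctx context.Context,
	reader queueReader,
	backupID string,
	items chan<- queueItem,
) error {

	defer close(items)

	for {
		batch, err := reader.ReadItems(ctx, backupID, 100)
		if err != nil {
			return err
		}
		// The queue becomes empty only when every directory has been fully
		// listed, since finished directories are removed from the queue.
		if len(batch) == 0 {
			return nil
		}
		for _, item := range batch {
			select {
			case items <- item:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
}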
Hard links:
Hard links are visible as files with Links > 1. They turn the filesystem tree into an acyclic graph and add some complexity. To tackle this, we will store such references in a separate table and restore them at the very end of the metadata restoration. The filesystems/hardlinks table will have the following schema:
filesystem_backup_id: Utf8
node_id: Uint64
parent_node_id: Uint64
name: Utf8
With primary key (filesystem_backup_id, node_id, parent_node_id, name).
dataplane.BackupMetadata task
- First, it places the root node into the queue table if one does not exist yet. Several DirectoryLister goroutines and a single DirectoryListingScheduler goroutine are spawned.
- DirectoryListingScheduler does the following:
  - Checks whether the root record was scheduled; if not, schedules the root node as pending and updates the task state in the same transaction (SaveStateWithPreparation()).
  - Fetches all inodes to list that are in the listing state and puts them into the channel.
- DirectoryLister does the following (see the worker sketch after this list):
  1. Reads a record from the channel.
  2. Performs a ListNodes API call for the node_id from the record, using the cookie from the record if it is not empty.
  3. Upserts all the listed nodes into the filesystems/node_refs table, with depth equal to the depth of the parent from the queue plus one. For this, we can use the BulkUpsert API call (saveNodeRefs()).
  4. For all the symlinks, performs a ReadLink API call to populate the node data.
  5. Saves the attributes of all the listed inodes to the filesystems/nodes table (saveNodes()).
  6. For all the files with Links > 1, saves them to the filesystems/hardlinks table (saveHardlinks()). Links is returned by the ListNodes API call. All the files with Links > 1 are also saved to the filesystems/node_refs table with type Link.
  7. Puts all the directories into the queue table with an empty cookie (enqueueDirectoriesToList()).
  8. In the same transaction as step 7, updates the cookie and removes the record from the queue if the inode listing has finished.
  9. Continues steps 2-8 with the new cookie until there are no more entries to process.
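A minimal sketch of a DirectoryLister worker in Go. The filestoreClient and backupStore interfaces, and all field and method names below, are hypothetical stand-ins for the real filestore client and the YDB-backed backup tables; they illustrate the control flow of steps 1-9, not an existing API.
// Sketch only: one DirectoryLister worker, following steps 1-9 above.
// All types and interfaces here are hypothetical stand-ins.
package backup

import "context"

type listedNode struct {
	Name          string
	NodeID        uint64
	IsDirectory   bool
	IsSymlink     bool
	Links         uint32
	SymlinkTarget string
}

type listResponse struct {
	Nodes  []listedNode
	Cookie []byte // empty when there are no more pages
}

type filestoreClient interface {
	ListNodes(ctx context.Context, nodeID uint64, cookie []byte) (listResponse, error)
	ReadLink(ctx context.Context, nodeID uint64) (string, error)
}

type backupStore interface {
	SaveNodeRefs(ctx context.Context, parentID, depth uint64, nodes []listedNode) error
	SaveNodes(ctx context.Context, nodes []listedNode) error
	SaveHardlinks(ctx context.Context, parentID uint64, nodes []listedNode) error
	// FinishPage atomically enqueues newly discovered directories and either
	// updates the cookie of the current queue item or removes it from the
	// queue when the listing is complete (steps 7-8).
	FinishPage(ctx context.Context, item queueItem, newCookie []byte, dirs []queueItem) error
}

func directoryLister(
	ctx context.Context,
	client filestoreClient,
	store backupStore,
	items <-chan queueItem,
) error {

	for item := range items {
		for {
			resp, err := client.ListNodes(ctx, item.NodeID, item.Cookie)
			if err != nil {
				return err
			}

			var dirs []queueItem
			var links []listedNode
			for i := range resp.Nodes {
				node := &resp.Nodes[i]
				if node.IsSymlink {
					// Populate symlink targets via ReadLink (step 4).
					if node.SymlinkTarget, err = client.ReadLink(ctx, node.NodeID); err != nil {
						return err
					}
				}
				if node.IsDirectory {
					dirs = append(dirs, queueItem{NodeID: node.NodeID, Depth: item.Depth + 1})
				}
				if node.Links > 1 {
					links = append(links, *node)
				}
			}

			// Children live one level below the listed directory (steps 3, 5, 6).
			if err := store.SaveNodeRefs(ctx, item.NodeID, item.Depth+1, resp.Nodes); err != nil {
				return err
			}
			if err := store.SaveNodes(ctx, resp.Nodes); err != nil {
				return err
			}
			if err := store.SaveHardlinks(ctx, item.NodeID, links); err != nil {
				return err
			}
			if err := store.FinishPage(ctx, item, resp.Cookie, dirs); err != nil {
				return err
			}

			if len(resp.Cookie) == 0 {
				break
			}
			item.Cookie = resp.Cookie
		}
	}
	return nil
}
The key property is that FinishPage couples the progress of a single directory listing (cookie update or removal) with the enqueueing of its child directories in one transaction, so an interrupted task can resume from the persisted cookies.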
Metadata restore
For the metadata restoration, we will restore the filesystem layer-by-layer, and within each layer, we will restore nodes in parallel using multiple workers and the channel-with-inflight-queue approach, incrementing the depth after each layer is restored.
For the mapping of source node IDs to destination node IDs, we will need the following filesystems/restore_mapping table:
source_filesystem_id: Utf8
destination_filesystem_id: Utf8
source_node_id: Uint64
destination_node_id: Uint64
with primary key (source_filesystem_id, destination_filesystem_id, source_node_id).
We will initially seed the table with the root node mapping (1 -> 1), since the root node id is always 1.
In the task state, we will need to store the current depth and the index of the first unprocessed fixed-size chunk; all chunks before it are already processed.
To split each layer into chunks, we need to implement a CountNodesByDepth() method.
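A minimal sketch of restoring a single layer in Go. restoreStore, restoreClient, and the chunk size are hypothetical stand-ins; in the real task the chunks of a layer are processed by parallel workers and the last fully processed chunk index is persisted in the task state.
// Sketch only: layer-by-layer metadata restoration. All interfaces and names
// here are hypothetical stand-ins for the backup tables and the destination
// filestore client; chunk handling is simplified to a sequential loop.
package restore

import "context"

type nodeRef struct {
	ParentNodeID uint64
	Name         string
	ChildNodeID  uint64
	NodeType     uint32
}

type restoreStore interface {
	// CountNodesByDepth returns the number of node refs at the given depth.
	CountNodesByDepth(ctx context.Context, backupID string, depth uint64) (uint64, error)
	// ReadChunk reads one fixed-size chunk of node refs at the given depth.
	ReadChunk(ctx context.Context, backupID string, depth, chunkIndex, chunkSize uint64) ([]nodeRef, error)
	// GetMapping maps a source node id to the destination node id.
	GetMapping(ctx context.Context, sourceNodeID uint64) (uint64, error)
	// SaveMapping records the source -> destination node id mapping.
	SaveMapping(ctx context.Context, sourceNodeID, destinationNodeID uint64) error
}

type restoreClient interface {
	// CreateNode creates a node (directory, file or symlink) and returns its id.
	CreateNode(ctx context.Context, parentNodeID uint64, ref nodeRef) (uint64, error)
}

const chunkSize uint64 = 1000 // assumed fixed chunk size

func restoreLayer(
	ctx context.Context,
	store restoreStore,
	client restoreClient,
	backupID string,
	depth uint64,
) error {

	count, err := store.CountNodesByDepth(ctx, backupID, depth)
	if err != nil {
		return err
	}
	chunks := (count + chunkSize - 1) / chunkSize

	for chunk := uint64(0); chunk < chunks; chunk++ {
		refs, err := store.ReadChunk(ctx, backupID, depth, chunk, chunkSize)
		if err != nil {
			return err
		}
		for _, ref := range refs {
			destParent, err := store.GetMapping(ctx, ref.ParentNodeID)
			if err != nil {
				return err
			}
			destNodeID, err := client.CreateNode(ctx, destParent, ref)
			if err != nil {
				return err
			}
			if err := store.SaveMapping(ctx, ref.ChildNodeID, destNodeID); err != nil {
				return err
			}
		}
	}
	return nil
}
restoreLayer would be invoked for increasing depth values, starting from the root mapping (1 -> 1), until CountNodesByDepth() returns zero.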
We do not expect a large number of hard links, so to simplify the implementation, hard links are restored separately after all the layers are restored.
We will fetch the data from the hardlinks table, create the first entry for each inode as a regular file, and create the remaining entries as hard links to it.
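A minimal sketch of this hard link restoration pass in Go, under the same caveat: hardlinkStore and hardlinkClient are hypothetical stand-ins for the backup tables and the destination filestore client.
// Sketch only: restoring hard links after all layers are restored. The first
// reference to a source inode is created as a regular file; subsequent
// references are created as hard links to it. Interfaces are hypothetical.
package restore

import "context"

type hardlinkRef struct {
	SourceNodeID       uint64 // inode id in the source filesystem
	SourceParentNodeID uint64
	Name               string
}

type hardlinkStore interface {
	// ReadHardlinks returns all rows of filesystems/hardlinks for the backup,
	// ordered by source node id.
	ReadHardlinks(ctx context.Context, backupID string) ([]hardlinkRef, error)
	// GetMapping maps a source node id to the destination node id.
	GetMapping(ctx context.Context, sourceNodeID uint64) (uint64, error)
}

type hardlinkClient interface {
	CreateFile(ctx context.Context, parentNodeID uint64, name string) (uint64, error)
	CreateHardlink(ctx context.Context, parentNodeID uint64, name string, targetNodeID uint64) error
}

func restoreHardlinks(
	ctx context.Context,
	store hardlinkStore,
	client hardlinkClient,
	backupID string,
) error {

	refs, err := store.ReadHardlinks(ctx, backupID)
	if err != nil {
		return err
	}

	// Destination node id of the first restored entry per source inode.
	created := make(map[uint64]uint64)

	for _, ref := range refs {
		destParent, err := store.GetMapping(ctx, ref.SourceParentNodeID)
		if err != nil {
			return err
		}
		if target, ok := created[ref.SourceNodeID]; ok {
			// Every subsequent reference becomes a hard link to the first one.
			if err := client.CreateHardlink(ctx, destParent, ref.Name, target); err != nil {
				return err
			}
			continue
		}
		// The first reference is created as a regular file.
		nodeID, err := client.CreateFile(ctx, destParent, ref.Name)
		if err != nil {
			return err
		}
		created[ref.SourceNodeID] = nodeID
	}
	return nil
}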