docs(master): add architectural reference README #750
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,294 @@ | ||
| # Master Server (`sfsmaster`) -- Architectural Reference | ||
|
|
||
| The master server is the central metadata authority in a SaunaFS cluster. It | ||
| maintains the entire filesystem namespace (inodes, directories, edges, xattrs, | ||
| ACLs, quotas, locks) in memory and coordinates chunk placement across | ||
| chunkservers. The binary produced from this directory is **`sfsmaster`**. | ||
|
|
||
| A single codebase supports two runtime **personalities**: | ||
|
|
||
| - **Master** -- the active metadata server that accepts client mutations. | ||
| - **Shadow** -- a hot-standby that replays changelogs from the master and can | ||
| be promoted at any time. | ||
|
|
||
| The personality is selected at startup via configuration and can only transition | ||
| from Shadow to Master (never the reverse). Conditional compilation guards | ||
| (`METARESTORE`, `METALOGGER`) also allow parts of this code to be linked into | ||
| the `sfsmetarestore` and `sfsmetalogger` binaries. | ||
|
|
||
| ## Code Organization | ||
|
|
||
| The directory is flat (no subdirectories). Files are grouped by **filename | ||
| prefix**, and each group corresponds to a logical subsystem: | ||
|
|
||
| | Prefix / group | Subsystem | | ||
| |-----------------------|--------------------------------------------------| | ||
| | `filesystem_*` | Core metadata model, operations, and maintenance | | ||
| | `matoclserv*` | Client (FUSE mount) protocol server | | ||
| | `matocsserv*` | Chunkserver protocol server | | ||
| | `matomlserv*` | Metalogger / shadow protocol server | | ||
| | `matontserv*` | Notifier (inotify-like) protocol server | | ||
| | `masterconn*` | Shadow-to-master replication connection | | ||
| | `metadata_backend_*` | Metadata persistence (load / save / dump) | | ||
| | `metadata_loader*` | Section-based async metadata loading | | ||
| | `metadata_dumper_*` | Periodic metadata dumping | | ||
| | `chunks*` | Chunk lifecycle and replication tracking | | ||
| | `changelog*` | Write-ahead log (WAL) | | ||
| | `restore*` | Changelog replay (apply entries to metadata) | | ||
| | `hstring*` | Efficient string (filename) storage | | ||
| | `kv_*` | KV store (FoundationDB) integration | | ||
| | `task_manager*` | Async task execution framework | | ||
| | `locks*` | POSIX and flock advisory file locking | | ||
| | `acl_storage*` | Deduplicated RichACL storage | | ||
| | `exports*` | Client mount permissions and IP-based rules | | ||
| | `topology*` | Network topology for data-locality placement | | ||
| | `personality*` | Master / Shadow personality management | | ||
| | `chartsdata*` | Monitoring / charting data collection | | ||
|
|
||
| A key organizational convention is the **interface / implementation split**: | ||
| public interfaces live in `*_interface.h` files as pure virtual classes (e.g. | ||
| `IFilesystemOperations`, `IFilesystemNodeOperations`, `IMetadataBackend`, | ||
| `IKVConnector`), while default in-memory implementations use a `*Base` suffix | ||
| or reside in the corresponding `.cc` file. | ||
|
|
||
| ## Core Data Model | ||
|
|
||
| Core namespace metadata lives in a **`FilesystemMetadata`** instance addressed | ||
| through the global pointer `gMetadata` (declared in | ||
| `filesystem_metadata.h`). Chunk metadata is managed separately in `chunks.cc` | ||
| (`gChunksMetadata`). Key `gMetadata` members are: | ||
|
|
||
| - **`nodeHash`** -- a fixed-size hash table (4M buckets) mapping inode IDs to | ||
| `FSNode*` pointers. This is the primary index for all filesystem objects. | ||
| - **`root`** -- pointer to the root `FSNodeDirectory`. | ||
| - **`trash` / `reserved`** -- containers for deleted files (awaiting trashtime | ||
| expiry) and files held open by clients after unlinking. | ||
| - **`inodePool`** (`IdPoolDetainer`) -- inode ID allocation with a reuse delay | ||
| (currently derived from compile-time constant `SFS_INODE_REUSE_DELAY`) to | ||
| prevent stale file handles (for example, NFS handles) from hitting recycled | ||
| inodes. | ||
| - **`aclStorage`** -- reference-counted, deduplicated RichACL store. | ||
| - **`xattrInodeHash` / `xattrDataHash`** -- extended attribute storage. | ||
| - **`quotaDatabase`** -- per-user/group/directory quota tracking. | ||
| - **`flockLocks` / `posixLocks`** -- advisory file lock state. | ||
| - **`taskManager`** -- the async task execution engine (see below). | ||
| - **`metadataVersion`** -- monotonically increasing version counter, bumped on | ||
| every metadata mutation. | ||
| - **Signals** -- `nodeChangedSignal`, `edgeChangedSignal`, `edgeRemovedSignal` | ||
| emitted on metadata changes; these are extension points for observers such as | ||
| KV connectors. | ||
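The reuse delay described for `inodePool` can be illustrated with a minimal sketch. `DelayedIdPool` and its methods are hypothetical names chosen for this example and are not the `IdPoolDetainer` API; the sketch only shows the general idea of parking a released ID until a delay has elapsed.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical illustration only; not the real IdPoolDetainer interface.
class DelayedIdPool {
public:
    explicit DelayedIdPool(uint32_t reuseDelaySeconds)
        : reuseDelay_(reuseDelaySeconds) {}

    // Reuse the oldest detained ID if its delay has elapsed; otherwise
    // hand out a fresh ID.
    uint32_t acquire(uint32_t now) {
        if (!detained_.empty() &&
            now - detained_.front().second >= reuseDelay_) {
            const uint32_t id = detained_.front().first;
            detained_.pop_front();
            return id;
        }
        return nextFresh_++;
    }

    // Park a released ID together with its release time.
    void release(uint32_t id, uint32_t now) {
        // Release times are expected to be non-decreasing in this sketch.
        assert(detained_.empty() || detained_.back().second <= now);
        detained_.emplace_back(id, now);
    }

private:
    uint32_t reuseDelay_;
    uint32_t nextFresh_ = 1;
    std::deque<std::pair<uint32_t, uint32_t>> detained_;  // (id, releasedAt)
};
```

With a delay of 10 seconds, an ID released at time 5 is not handed out again before time 15, so a stale handle that still names it cannot reach a recycled inode in the meantime.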
|
|
||
| ### Node Type Hierarchy | ||
|
|
||
| `FSNode` (64 bytes) is the base class for all filesystem objects. Concrete | ||
| subtypes: | ||
|
|
||
| | Class | Type | Extra fields | | ||
| |-------------------|---------------------|--------------------------------------| | ||
| `FSNodeFile` | file, trash, reserved | `length`, `chunks[]`, `sessionIds[]` | ||
| | `FSNodeDirectory` | directory | `entries` (SkipList), `stats`, `nlink`, `lowerCaseEntries` | | ||
| | `FSNodeSymlink` | symlink | `path`, `path_length` | | ||
| | `FSNodeDevice` | block/char device | `rdev` | | ||
|
|
||
| Every node stores a `parents` compact vector of `(inode_t, Handle*)` pairs, | ||
| providing reverse links from child to parent(s) (files can have multiple | ||
| parents via hard links). | ||
|
|
||
| Directory entries use a **SkipList** keyed by hashed name handles, with an | ||
| optional parallel `lowerCaseEntries` SkipList for case-insensitive | ||
| filesystems. | ||
|
|
||
| Each `FSNodeDirectory` maintains an aggregate `StatsRecord` (inodes, dirs, | ||
| files, links, chunks, length, size, realsize) that is incrementally propagated | ||
| up the tree on every mutation, enabling O(1) `dirinfo` queries at any level. | ||
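A minimal sketch of the incremental propagation idea follows. `Stats`, `Dir`, and `propagate` are hypothetical, heavily reduced stand-ins for `StatsRecord` and `FSNodeDirectory`; the real structures track more counters and support multiple parents.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reduced stand-in for StatsRecord.
struct Stats {
    uint64_t files = 0;
    uint64_t length = 0;
};

// Hypothetical reduced stand-in for FSNodeDirectory.
struct Dir {
    Dir *parent = nullptr;  // nullptr marks the root
    Stats stats;            // aggregate over the whole subtree
};

// Apply a signed delta to a directory and to every ancestor up to the root.
inline void propagate(Dir *dir, int64_t filesDelta, int64_t lengthDelta) {
    for (Dir *d = dir; d != nullptr; d = d->parent) {
        // A negative delta must never exceed the current aggregate.
        assert(filesDelta >= 0 ||
               d->stats.files >= static_cast<uint64_t>(-filesDelta));
        assert(lengthDelta >= 0 ||
               d->stats.length >= static_cast<uint64_t>(-lengthDelta));
        d->stats.files += static_cast<uint64_t>(filesDelta);
        d->stats.length += static_cast<uint64_t>(lengthDelta);
    }
}
```

Each mutation costs one walk up the ancestor chain, and in exchange a query reads the aggregate of any directory without visiting its subtree.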
|
|
||
| ## Filesystem Operations | ||
|
|
||
| Operations are organized in a two-layer interface hierarchy: | ||
|
|
||
| 1. **`IFilesystemOperations`** (`filesystem_operations_interface.h`) -- | ||
| high-level POSIX-like API: `lookup`, `mknod`, `mkdir`, `unlink`, `rmdir`, | ||
| `rename`, `link`, `symlink`, `setAttr`, `readdir`, `writeChunk`, | ||
| `readChunk`, quota management, lock operations, etc. The global instance | ||
| is `gFSOperations`. | ||
|
|
||
| 2. **`IFilesystemNodeOperations`** (`filesystem_node_operations_interface.h`) | ||
| -- lower-level node CRUD: `createNode`, `link`, `unlink`, `removeEdge`, | ||
| `purge`, `getPath`, stats propagation, ACL inheritance, etc. | ||
|
|
||
| `IFilesystemOperations` delegates to `IFilesystemNodeOperations` via its | ||
| `nodeOperations()` accessor. Both have in-memory default implementations | ||
| (`FilesystemOperationsBase`, `FilesystemNodeOperationsBase`) that operate | ||
| directly on `gMetadata`. | ||
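The delegation between the two layers can be sketched roughly as below. `INodeOps`, `IFsOps`, and the `InMemory*` classes are hypothetical reductions written for this example; the real interfaces have many more methods and different signatures.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Lower layer: node-level primitives (hypothetical, heavily reduced).
class INodeOps {
public:
    virtual ~INodeOps() = default;
    virtual uint32_t createNode(uint32_t parent, const std::string &name) = 0;
};

// Upper layer: POSIX-like entry points that delegate to the node layer.
class IFsOps {
public:
    virtual ~IFsOps() = default;
    virtual INodeOps &nodeOperations() = 0;
    virtual uint32_t mkdir(uint32_t parent, const std::string &name) = 0;
};

class InMemoryNodeOps : public INodeOps {
public:
    uint32_t createNode(uint32_t parent, const std::string &name) override {
        assert(!name.empty());
        const uint32_t inode = nextInode_++;
        edges_[{parent, name}] = inode;  // record the parent -> child edge
        return inode;
    }

private:
    uint32_t nextInode_ = 2;  // inode 1 plays the root in this sketch
    std::map<std::pair<uint32_t, std::string>, uint32_t> edges_;
};

class InMemoryFsOps : public IFsOps {
public:
    INodeOps &nodeOperations() override { return nodeOps_; }

    uint32_t mkdir(uint32_t parent, const std::string &name) override {
        // Permission and quota checks would sit here, above the node layer.
        return nodeOperations().createNode(parent, name);
    }

private:
    InMemoryNodeOps nodeOps_;
};
```

Because callers only see `IFsOps`, a different backend could replace both in-memory classes without changing call sites.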
|
|
||
| The **`FsContext`** (`fs_context.h`) carries per-operation state: personality, | ||
| session data, uid/gid, timestamps. The **`FilesystemOperationContext`** | ||
| (`filesystem_operation_context.h`) carries optional transaction handles; in the | ||
| current in-memory implementation these are unused (see the KV Store section | ||
| below for context). | ||
|
|
||
| ## Chunk Management | ||
|
|
||
| Chunk state is managed in `chunks.h` / `chunks.cc`. Key responsibilities: | ||
|
|
||
| - **Chunk lifecycle** -- creation, version bumps, deletion, truncation. | ||
| - **Location tracking** -- which chunkserver holds which chunk and version, | ||
| reported via `chunk_server_has_chunk()`. | ||
| - **Replication decisions** -- tracking under-goal / over-goal chunks and | ||
| issuing recover/remove operations. | ||
| - **Operation callbacks** -- `chunk_got_create_status`, | ||
| `chunk_got_replicate_status`, etc., process async results from chunkservers. | ||
| - **ID generation** -- `gChunkIdGenerator` (an `IIdGeneratorWithState`). | ||
| - **Server selection** -- `get_servers_for_new_chunk.*` implements the | ||
| algorithm for picking chunkservers considering labels, weights, disk usage, | ||
| and load balancing. | ||
|
|
||
| Files reference chunks through `FSNodeFile::chunks`, a vector of chunk IDs | ||
| indexed by chunk index (offset / chunk_size). | ||
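The index computation can be written out as a small sketch. The 64 MiB value of `kChunkSize` and the helper names are assumptions made for this example; the authoritative constant is defined in the SaunaFS sources.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed chunk size for this sketch (64 MiB).
constexpr uint64_t kChunkSize = 64ULL * 1024 * 1024;

// Index of the chunk that covers a given byte offset within a file.
constexpr uint64_t chunkIndexForOffset(uint64_t offset) {
    return offset / kChunkSize;
}

// Look up the chunk ID covering `offset` in a per-file chunk vector.
inline uint64_t chunkIdForOffset(const std::vector<uint64_t> &chunks,
                                 uint64_t offset) {
    const uint64_t index = chunkIndexForOffset(offset);
    assert(index < chunks.size() && "offset lies beyond the last chunk slot");
    return chunks[index];
}
```

Under that assumption, byte offset `kChunkSize - 1` still falls in chunk 0, and offset `kChunkSize` is the first byte of chunk 1.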
|
|
||
| The `gChunkChangedSignal` notifies observers when chunk metadata changes. | ||
|
|
||
| ## Network Servers | ||
|
|
||
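| The master communicates with five types of peers. The naming convention | ||
| `mato*serv` means "**ma**ster **to** *X* **serv**ice"; `masterconn`, the | ||
| shadow-side connection to the active master, is the one module below that | ||
| does not follow it: | ||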
|
|
||
| | Module | Peer | Role | | ||
| |----------------|-----------------|------| | ||
| | `matoclserv` | FUSE clients | Handles all client filesystem requests (the largest module). Manages delayed chunk operations and session state. | | ||
| | `matocsserv` | Chunkservers | Sends chunk create/delete/replicate/truncate/set-version commands. Tracks disk usage, labels, and connection state. | | ||
| | `matomlserv` | Metaloggers & shadows | Broadcasts changelog entries and metadata snapshots for replication. | | ||
| | `matontserv` | Notifier clients | Broadcasts changelog events for inotify-like functionality. | | ||
| | `masterconn` | Active master | Used only when running in **Shadow** personality. Receives changelogs and metadata from the active master. | | ||
|
|
||
| ## Metadata Persistence | ||
|
|
||
| Metadata durability is achieved through two complementary mechanisms: | ||
|
|
||
| ### Full Metadata Snapshots | ||
|
|
||
| - **`IMetadataBackend`** (`metadata_backend_interface.h`) -- interface for | ||
| loading and saving complete metadata. Methods: `init()`, `loadall()`, | ||
| `store_fd()`, `commit_metadata_dump()`, `emergency_saves()`. | ||
| - **`MetadataBackendFile`** (`metadata_backend_file.*`) -- the file-based | ||
| implementation. Stores metadata in `metadata.sfs` with section-based format. | ||
| Rotates previous copies on save. | ||
| - **`MetadataLoader`** (`metadata_loader.h`) -- reads metadata sections from | ||
| memory-mapped files, supporting async section loading. | ||
| - **`IMetadataDumper` / `MetadataDumperFile`** -- handles periodic metadata | ||
| dumps (foreground or background). The dumper can run in a forked child | ||
| process to avoid blocking the event loop. | ||
|
|
||
| ### Changelog (Write-Ahead Log) | ||
|
|
||
| - **`changelog.*`** -- incremental WAL. Each mutation is recorded as a text | ||
| entry in the format `<version>: <ts>|<COMMAND>(arg1,arg2,...)`. | ||
| - Changelogs are rotated (configurable `BACK_LOGS`), flushed after each write | ||
| (configurable), and broadcast to metaloggers/shadows via `matomlserv`. | ||
| - **`restore.*`** -- replays changelog entries to reconstruct metadata state. | ||
| Used by shadow masters during synchronization and by `sfsmetarestore` for | ||
| offline recovery. | ||
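To make the entry layout `<version>: <ts>|<COMMAND>(arg1,arg2,...)` concrete, here is a minimal parsing sketch. `ChangelogEntry`, `parseU64`, and `parseChangelogLine` are hypothetical names, not functions from `restore.*`. The argument list is left unsplit because quoting and escaping rules are not described here, and integer overflow is not checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical parsed form of one changelog line.
struct ChangelogEntry {
    uint64_t version = 0;
    uint64_t timestamp = 0;
    std::string command;
    std::string rawArgs;  // text between the outer parentheses, unsplit
};

// Strict decimal parse: digits only, no sign, no whitespace.
inline bool parseU64(const std::string &text, uint64_t &out) {
    if (text.empty()) return false;
    uint64_t value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return false;
        value = value * 10 + static_cast<uint64_t>(c - '0');
    }
    out = value;
    return true;
}

inline std::optional<ChangelogEntry> parseChangelogLine(const std::string &line) {
    const std::size_t colon = line.find(": ");
    if (colon == std::string::npos) return std::nullopt;
    const std::size_t bar = line.find('|', colon + 2);
    if (bar == std::string::npos) return std::nullopt;
    const std::size_t open = line.find('(', bar + 1);
    const std::size_t close = line.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open)
        return std::nullopt;

    ChangelogEntry entry;
    if (!parseU64(line.substr(0, colon), entry.version)) return std::nullopt;
    if (!parseU64(line.substr(colon + 2, bar - colon - 2), entry.timestamp))
        return std::nullopt;
    entry.command = line.substr(bar + 1, open - bar - 1);
    if (entry.command.empty()) return std::nullopt;
    entry.rawArgs = line.substr(open + 1, close - open - 1);
    return entry;
}
```

A made-up line such as `42: 1700000000|EXAMPLE(1,foo)` would yield version 42, timestamp 1700000000, command `EXAMPLE`, and raw arguments `1,foo`; `EXAMPLE` is a placeholder, not a real changelog command.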
|
|
||
| ## KV Store Backend (FoundationDB) | ||
|
|
||
| The KV store backend is an alternative to file-based metadata storage, | ||
| designed for distributed metadata: | ||
|
|
||
| - **`IKVConnector`** (`kv_connector_interface.h`) -- interface for KV store | ||
| operations and event handlers for metadata changes (`onNodeChanged`, | ||
| `onEdgeChanged`, `onEdgeRemoved`, `onDetainedAdded`, etc.). | ||
| - **`kv_connector_fdb.*`** -- FoundationDB concrete implementation | ||
| (conditionally compiled). This integration is currently not wired into the | ||
| default master initialization path in `init.h`. | ||
| - **`kv_common_keys.h`** -- defines key prefix conventions for all metadata | ||
| sections in the KV store: `NODE_`, `EDGE_`, `FREE_`, `CHNK_`, `XATR_`, | ||
| `ACLS_`, `QUOT_`, `FLCK_`, plus reverse indexes (`DIR_PARENT_`, `PARENT_`, | ||
| `DIR_NODES_COUNT_`, `DIR_STATS_`). | ||
| - `FilesystemOperationContext` carries optional transaction handles. In the | ||
| current in-memory master implementation these are empty, and they serve as an | ||
| extension point for transactional backends. | ||
|
|
||
| ## Task Manager and Async Tasks | ||
|
|
||
| Long-running or recursive operations are decomposed into small incremental | ||
| tasks to avoid blocking the single-threaded event loop: | ||
|
|
||
| - **`TaskManager`** (`task_manager.h`) -- generic job/task execution system. | ||
| A **Job** is a named unit of work containing an ordered list of **Tasks**. | ||
| The manager round-robins across jobs, executing a bounded batch of tasks per | ||
| event loop iteration. | ||
| - **`SnapshotTask`** (`snapshot_task.h`) -- splits filesystem snapshot | ||
| (clone) operations into per-inode tasks. | ||
| - **`RecursiveRemoveTask`** (`recursive_remove_task.h`) -- recursive directory | ||
| deletion. | ||
| - **`SetGoalTask`** / **`SetTrashtimeTask`** -- recursive goal or trashtime | ||
| changes across a subtree. | ||
| - **`DeferredMetadataDumpTask`** (`deferred_metadata_dump_task.h`) -- | ||
| post-failover metadata dump. | ||
|
|
||
| Jobs support cancellation and completion callbacks. | ||
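The round-robin batching described above might look roughly like the sketch below. `MiniTaskManager` is a hypothetical miniature and does not mirror the real `TaskManager` API; it only shows jobs taking turns one task at a time within a bounded per-iteration budget.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical miniature of a job/task scheduler.
class MiniTaskManager {
public:
    using Task = std::function<void()>;

    // Queue a named job made of ordered tasks, with an optional
    // completion callback.
    void submit(std::string name, std::deque<Task> tasks,
                std::function<void()> onDone = {}) {
        assert(!tasks.empty());
        jobs_.push_back(Job{std::move(name), std::move(tasks), std::move(onDone)});
    }

    // Run at most `budget` tasks, one task per job in turn.
    // Returns the number of tasks executed.
    std::size_t processBatch(std::size_t budget) {
        std::size_t executed = 0;
        while (executed < budget && !jobs_.empty()) {
            Job job = std::move(jobs_.front());
            jobs_.pop_front();
            Task task = std::move(job.tasks.front());
            job.tasks.pop_front();
            task();
            ++executed;
            if (job.tasks.empty()) {
                if (job.onDone) job.onDone();
            } else {
                jobs_.push_back(std::move(job));  // back of the line
            }
        }
        return executed;
    }

    bool idle() const { return jobs_.empty(); }

private:
    struct Job {
        std::string name;
        std::deque<Task> tasks;
        std::function<void()> onDone;
    };
    std::deque<Job> jobs_;
};
```

An event loop would call `processBatch` once per iteration with a small budget, so that no single job can monopolize the loop.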
|
|
||
| ## Periodic Operations | ||
|
|
||
| Several maintenance routines run on timer-driven or per-loop schedules, | ||
| registered in `fs_periodic_master_init()` (see `filesystem_periodic.cc`): | ||
|
|
||
| | Operation | Schedule | Purpose | | ||
| |-----------|----------|---------| | ||
| | **File integrity test** (`fs_periodic_file_test` / `fs_background_file_test`) | Every second (timer) + every loop (background) | Scans the entire node hash table in a configurable cycle time (`FILE_TEST_LOOP_MIN_TIME`, default 3600s). For each file node, checks chunk availability and copy counts. For each directory, validates parent-child pointer consistency. Builds a `gDefectiveNodes` map of inodes with unavailable chunks, under-goal chunks, or structural errors. | | ||
| | **Background task processing** (`fs_background_task_manager_work`) | Every loop | Drives the `TaskManager`, processing a batch of tasks (snapshots, recursive removes, goal/trashtime changes) per iteration. | | ||
| | **Checksum recalculation** (`fs_background_checksum_recalculation_a_bit`) | Every loop | Incrementally recalculates metadata checksums (nodes, xattrs, chunks) in the background, progressing through steps at a speed limit per iteration. | | ||
| **Trash cleanup** (`fs_periodic_emptytrash`) | Every 100ms | Purges trash entries whose scheduled deletion time has passed. | ||
| | **Reserved file cleanup** (`fs_periodic_emptyreserved`) | Configurable period | Releases reserved files (deleted-but-still-open files) whose sessions are no longer active. | | ||
| | **Chunk maintenance** (in `chunks.cc`) | Periodic | Handles chunk replication, deletion of excess copies, and rebalancing across chunkservers. | | ||
|
|
||
| ## Initialization Sequence | ||
|
|
||
| Startup is orchestrated by ordered `RunTab` arrays in `init.h`. The sequence | ||
| is dependency-ordered -- comments in the source mark critical orderings: | ||
|
|
||
| ``` | ||
| 1. prometheus_init -- Optional Prometheus metrics endpoint | ||
| 2. hstorage_init -- String storage backend (must be first) | ||
| 3. personality_init -- Set master/shadow personality (must be second) | ||
| 4. rnd_init -- Random number generator | ||
| 5. dcm_init -- Data cache manager (before fs_init and matoclserv) | ||
| 6. matoclserv_sessions_init -- Load persisted sessions (before fs_init) | ||
| 7. exports_init -- Client mount/export permission rules | ||
| 8. topology_init -- Network topology configuration | ||
| 9. metadata_backend_init -- Initialize MetadataBackendFile + inode ID generator | ||
| 10. fs_init -- Core filesystem: load metadata, register periodic ops | ||
| 11. chartsdata_init -- Monitoring charts data collection | ||
| 12. masterconn_init -- Shadow's connection to active master | ||
| 13. matomlserv_init -- Metalogger/shadow communication | ||
| 14. matocsserv_init -- Chunkserver communication | ||
| 15. matontserv_init -- Notifier communication | ||
| 16. matoclserv_network_init -- Client network init (last -- opens for business) | ||
| ``` | ||
|
|
||
| Client connections are accepted only after all other subsystems are ready. | ||
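A table-driven startup of this kind can be sketched as follows. `InitStep` and `runInitSteps` are hypothetical names, and the real `RunTab` layout in `init.h` may differ; the point is only that steps run strictly in table order and startup stops at the first failure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape of one startup table entry.
struct InitStep {
    int (*fn)();       // returns 0 on success, non-zero on failure
    const char *name;  // label for diagnostics
};

// Run steps in table order and stop at the first failure.
// Returns the number of steps that completed successfully.
inline std::size_t runInitSteps(const std::vector<InitStep> &steps) {
    std::size_t completed = 0;
    for (const InitStep &step : steps) {
        assert(step.fn != nullptr);
        if (step.fn() != 0) break;
        ++completed;
    }
    return completed;
}
```

Placing the client-facing step last in such a table means it only runs once every earlier step has returned success.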
|
|
||
| ## Key Design Patterns | ||
|
|
||
| - **Strategy / interface-based extensibility** -- all major subsystems are | ||
| behind pure virtual interfaces (`IFilesystemOperations`, | ||
| `IFilesystemNodeOperations`, `IMetadataBackend`, `IKVConnector`, | ||
| `hstorage::Storage`), allowing alternative implementations to be plugged in. | ||
| - **Observer / signal pattern** -- `Signal<>` objects on `FilesystemMetadata` | ||
| and in the chunk subsystem notify listeners about metadata changes without | ||
| coupling producers to consumers. | ||
| - **Global process state** -- core state is exposed through global variables: | ||
| `gMetadata` is a raw pointer (`FilesystemMetadata *`), while | ||
| `gFSOperations`, `gMetadataBackend`, `gInodeIdGenerator`, and | ||
| `gChunkIdGenerator` are global `std::unique_ptr`s. They are initialized | ||
| during startup and then used process-wide. | ||
| - **Conditional compilation** -- `METARESTORE` and `METALOGGER` preprocessor | ||
| guards exclude master-only or tool-only code paths, allowing the same source | ||
| files to be linked into different binaries. | ||
| - **Incremental stat propagation** -- directory `StatsRecord` values are | ||
| maintained incrementally on every mutation and propagated up to root, | ||
| avoiding expensive tree traversals for `dirinfo` queries. | ||
| - **Cooperative multitasking** -- the master runs a single-threaded event loop. | ||
| Long operations (snapshots, recursive removes, checksum recalculation) are | ||
| split into small batches via `TaskManager` and `eventloop_make_next_poll_nonblocking()` | ||
| to maintain responsiveness. | ||
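The observer / signal pattern from the list above can be illustrated with a minimal sketch. `MiniSignal` is a hypothetical stand-in; the real `Signal<>` template may offer a different interface, for example for disconnecting listeners.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical minimal signal: producers emit, listeners subscribe.
template <typename... Args>
class MiniSignal {
public:
    using Slot = std::function<void(Args...)>;

    // Register a listener; listeners are called in registration order.
    void connect(Slot slot) {
        assert(slot != nullptr);
        slots_.push_back(std::move(slot));
    }

    // Notify every registered listener with the same arguments.
    void emit(Args... args) const {
        for (const Slot &slot : slots_) slot(args...);
    }

private:
    std::vector<Slot> slots_;
};
```

A producer that owns a `MiniSignal<uint32_t>` can announce a changed inode without knowing whether any listener, such as a KV connector, is attached.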