|
| 1 | +# Master Server (`sfsmaster`) -- Architectural Reference |
| 2 | + |
| 3 | +The master server is the central metadata authority in a SaunaFS cluster. It |
| 4 | +maintains the entire filesystem namespace (inodes, directories, edges, xattrs, |
| 5 | +ACLs, quotas, locks) in memory and coordinates chunk placement across |
| 6 | +chunkservers. The binary produced from this directory is **`sfsmaster`**. |
| 7 | + |
| 8 | +A single codebase supports two runtime **personalities**: |
| 9 | + |
| 10 | +- **Master** -- the active metadata server that accepts client mutations. |
| 11 | +- **Shadow** -- a hot-standby that replays changelogs from the master and can |
| 12 | + be promoted at any time. |
| 13 | + |
| 14 | +The personality is selected at startup via configuration and can only transition |
| 15 | +from Shadow to Master (never the reverse). Conditional compilation guards |
| 16 | +(`METARESTORE`, `METALOGGER`) also allow parts of this code to be linked into |
| 17 | +the `sfsmetarestore` and `sfsmetalogger` binaries. |
| 18 | + |
| 19 | +## Code Organization |
| 20 | + |
| 21 | +The directory is flat (no subdirectories). Files are grouped by **filename |
| 22 | +prefix**, and each group corresponds to a logical subsystem: |
| 23 | + |
| 24 | +| Prefix / group | Subsystem | |
| 25 | +|-----------------------|--------------------------------------------------| |
| 26 | +| `filesystem_*` | Core metadata model, operations, and maintenance | |
| 27 | +| `matoclserv*` | Client (FUSE mount) protocol server | |
| 28 | +| `matocsserv*` | Chunkserver protocol server | |
| 29 | +| `matomlserv*` | Metalogger / shadow protocol server | |
| 30 | +| `matontserv*` | Notifier (inotify-like) protocol server | |
| 31 | +| `masterconn*` | Shadow-to-master replication connection | |
| 32 | +| `metadata_backend_*` | Metadata persistence (load / save / dump) | |
| 33 | +| `metadata_loader*` | Section-based async metadata loading | |
| 34 | +| `metadata_dumper_*` | Periodic metadata dumping | |
| 35 | +| `chunks*` | Chunk lifecycle and replication tracking | |
| 36 | +| `changelog*` | Write-ahead log (WAL) | |
| 37 | +| `restore*` | Changelog replay (apply entries to metadata) | |
| 38 | +| `hstring*` | Efficient string (filename) storage | |
| 39 | +| `kv_*` | KV store (FoundationDB) integration | |
| 40 | +| `task_manager*` | Async task execution framework | |
| 41 | +| `locks*` | POSIX and flock advisory file locking | |
| 42 | +| `acl_storage*` | Deduplicated RichACL storage | |
| 43 | +| `exports*` | Client mount permissions and IP-based rules | |
| 44 | +| `topology*` | Network topology for data-locality placement | |
| 45 | +| `personality*` | Master / Shadow personality management | |
| 46 | +| `chartsdata*` | Monitoring / charting data collection | |
| 47 | + |
| 48 | +A key organizational convention is the **interface / implementation split**: |
| 49 | +public interfaces live in `*_interface.h` files as pure virtual classes (e.g. |
| 50 | +`IFilesystemOperations`, `IFilesystemNodeOperations`, `IMetadataBackend`, |
| 51 | +`IKVConnector`), while default in-memory implementations use a `*Base` suffix |
| 52 | +or reside in the corresponding `.cc` file. |
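The interface / implementation split can be illustrated with a minimal sketch. The names `INameStore`, `NameStoreBase`, and `makeDefaultNameStore` are hypothetical and exist only for this example; they are not part of the codebase.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical interface, mirroring the *_interface.h convention:
// a pure virtual class with no state of its own.
class INameStore {
public:
    virtual ~INameStore() = default;
    virtual void put(uint32_t inode, const std::string &name) = 0;
    virtual bool get(uint32_t inode, std::string &name) const = 0;
};

// Hypothetical default in-memory implementation, following the *Base suffix.
class NameStoreBase : public INameStore {
public:
    void put(uint32_t inode, const std::string &name) override {
        names_[inode] = name;
    }

    bool get(uint32_t inode, std::string &name) const override {
        auto it = names_.find(inode);
        if (it == names_.end()) {
            return false;
        }
        name = it->second;
        return true;
    }

private:
    std::map<uint32_t, std::string> names_;
};

// Callers hold only the interface, so another backend could be substituted
// here without touching them.
std::unique_ptr<INameStore> makeDefaultNameStore() {
    return std::unique_ptr<INameStore>(new NameStoreBase());
}
```

Code written against `INameStore` does not need to know which implementation it received.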
| 53 | + |
| 54 | +## Core Data Model |
| 55 | + |
| 56 | +Core namespace metadata lives in a **`FilesystemMetadata`** instance addressed |
| 57 | +through the global pointer `gMetadata` (declared in |
| 58 | +`filesystem_metadata.h`). Chunk metadata is managed separately in `chunks.cc` |
| 59 | +(`gChunksMetadata`). Key `gMetadata` members are: |
| 60 | + |
| 61 | +- **`nodeHash`** -- a fixed-size hash table (4M buckets) mapping inode IDs to |
| 62 | + `FSNode*` pointers. This is the primary index for all filesystem objects. |
| 63 | +- **`root`** -- pointer to the root `FSNodeDirectory`. |
| 64 | +- **`trash` / `reserved`** -- containers for deleted files (awaiting trashtime |
| 65 | + expiry) and files held open by clients after unlinking. |
| 66 | +- **`inodePool`** (`IdPoolDetainer`) -- inode ID allocation with a reuse delay |
| 67 | + (currently derived from compile-time constant `SFS_INODE_REUSE_DELAY`) to |
| 68 | + prevent stale file handles (for example, NFS handles) from hitting recycled |
| 69 | + inodes. |
| 70 | +- **`aclStorage`** -- reference-counted, deduplicated RichACL store. |
| 71 | +- **`xattrInodeHash` / `xattrDataHash`** -- extended attribute storage. |
| 72 | +- **`quotaDatabase`** -- per-user/group/directory quota tracking. |
| 73 | +- **`flockLocks` / `posixLocks`** -- advisory file lock state. |
| 74 | +- **`taskManager`** -- the async task execution engine (see below). |
| 75 | +- **`metadataVersion`** -- monotonically increasing version counter, bumped on |
| 76 | + every metadata mutation. |
| 77 | +- **Signals** -- `nodeChangedSignal`, `edgeChangedSignal`, `edgeRemovedSignal` |
| 78 | + emitted on metadata changes; these are extension points for observers such as |
| 79 | + KV connectors. |
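The reuse delay on inode IDs can be sketched as follows. This is an illustration of the idea only, not the `IdPoolDetainer` API: the class `DelayedIdPool`, its methods, and the use of plain integer timestamps are all assumptions made for the example.

```cpp
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical sketch: released IDs are detained for a fixed delay before
// they may be handed out again.
class DelayedIdPool {
public:
    explicit DelayedIdPool(uint64_t reuseDelaySeconds)
        : reuseDelay_(reuseDelaySeconds) {}

    // Returns the oldest detained ID if its delay has elapsed, otherwise a
    // fresh ID. Assumes `now` never goes backwards.
    uint32_t acquire(uint64_t now) {
        if (!detained_.empty() &&
            now - detained_.front().second >= reuseDelay_) {
            uint32_t id = detained_.front().first;
            detained_.pop_front();
            return id;
        }
        return nextFresh_++;
    }

    // Records that `id` was released at time `now`.
    void release(uint32_t id, uint64_t now) {
        detained_.emplace_back(id, now);
    }

private:
    uint64_t reuseDelay_;
    uint32_t nextFresh_ = 1;
    // (id, release time) pairs, oldest first.
    std::deque<std::pair<uint32_t, uint64_t>> detained_;
};
```

With a delay of 10 seconds, an ID released at time 5 is skipped by `acquire(6)` and becomes eligible again at time 15, so a stale handle that still names the old inode has time to fail cleanly.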
| 80 | + |
| 81 | +### Node Type Hierarchy |
| 82 | + |
| 83 | +`FSNode` (64 bytes) is the base class for all filesystem objects. Concrete |
| 84 | +subtypes: |
| 85 | + |
| 86 | +| Class             | Type                  | Extra fields                                               | |
| 87 | +|-------------------|-----------------------|------------------------------------------------------------| |
| 88 | +| `FSNodeFile`      | file, trash, reserved | `length`, `chunks[]`, `sessionIds[]`                       | |
| 89 | +| `FSNodeDirectory` | directory             | `entries` (SkipList), `stats`, `nlink`, `lowerCaseEntries` | |
| 90 | +| `FSNodeSymlink`   | symlink               | `path`, `path_length`                                      | |
| 91 | +| `FSNodeDevice`    | block/char device     | `rdev`                                                     | |
| 92 | + |
| 93 | +Every node stores a `parents` compact vector of `(inode_t, Handle*)` pairs, |
| 94 | +providing reverse links from child to parent(s) (files can have multiple |
| 95 | +parents via hard links). |
| 96 | + |
| 97 | +Directory entries use a **SkipList** keyed by hashed name handles, with an |
| 98 | +optional parallel `lowerCaseEntries` SkipList for case-insensitive |
| 99 | +filesystems. |
| 100 | + |
| 101 | +Each `FSNodeDirectory` maintains an aggregate `StatsRecord` (inodes, dirs, |
| 102 | +files, links, chunks, length, size, realsize) that is incrementally propagated |
| 103 | +up the tree on every mutation, enabling O(1) `dirinfo` queries at any level. |
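The incremental propagation described above can be sketched like this. The types `StatsDelta` and `DirNode` and the function `propagateStats` are hypothetical, and the record is reduced to three counters for brevity; the real `StatsRecord` has more fields.

```cpp
#include <cstdint>

// Hypothetical, reduced stand-in for a stats record.
struct StatsDelta {
    int64_t inodes;
    int64_t files;
    int64_t length;
};

// Hypothetical directory node holding the aggregate for its whole subtree.
struct DirNode {
    DirNode *parent = nullptr;
    StatsDelta stats{};
};

// Applies a change to `dir` and to every ancestor up to the root.
// Cost is proportional to the depth of `dir`, paid once per mutation.
void propagateStats(DirNode *dir, const StatsDelta &delta) {
    for (DirNode *d = dir; d != nullptr; d = d->parent) {
        d->stats.inodes += delta.inodes;
        d->stats.files += delta.files;
        d->stats.length += delta.length;
    }
}
```

Because every directory already holds the totals for its subtree, answering a `dirinfo`-style query is a single read of `stats`, with no traversal.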
| 104 | + |
| 105 | +## Filesystem Operations |
| 106 | + |
| 107 | +Operations are organized in a two-layer interface hierarchy: |
| 108 | + |
| 109 | +1. **`IFilesystemOperations`** (`filesystem_operations_interface.h`) -- |
| 110 | + high-level POSIX-like API: `lookup`, `mknod`, `mkdir`, `unlink`, `rmdir`, |
| 111 | + `rename`, `link`, `symlink`, `setAttr`, `readdir`, `writeChunk`, |
| 112 | + `readChunk`, quota management, lock operations, etc. The global instance |
| 113 | + is `gFSOperations`. |
| 114 | + |
| 115 | +2. **`IFilesystemNodeOperations`** (`filesystem_node_operations_interface.h`) |
| 116 | + -- lower-level node CRUD: `createNode`, `link`, `unlink`, `removeEdge`, |
| 117 | + `purge`, `getPath`, stats propagation, ACL inheritance, etc. |
| 118 | + |
| 119 | +`IFilesystemOperations` delegates to `IFilesystemNodeOperations` via its |
| 120 | +`nodeOperations()` accessor. Both have in-memory default implementations |
| 121 | +(`FilesystemOperationsBase`, `FilesystemNodeOperationsBase`) that operate |
| 122 | +directly on `gMetadata`. |
| 123 | + |
| 124 | +The **`FsContext`** (`fs_context.h`) carries per-operation state: personality, |
| 125 | +session data, uid/gid, timestamps. The **`FilesystemOperationContext`** |
| 126 | +(`filesystem_operation_context.h`) carries optional transaction handles; in the |
| 127 | +current in-memory implementation these are unused (see the KV Store section |
| 128 | +below for context). |
| 129 | + |
| 130 | +## Chunk Management |
| 131 | + |
| 132 | +Chunk state is managed in `chunks.h` / `chunks.cc`. Key responsibilities: |
| 133 | + |
| 134 | +- **Chunk lifecycle** -- creation, version bumps, deletion, truncation. |
| 135 | +- **Location tracking** -- which chunkserver holds which chunk and version, |
| 136 | + reported via `chunk_server_has_chunk()`. |
| 137 | +- **Replication decisions** -- tracking under-goal / over-goal chunks and |
| 138 | + issuing recover/remove operations. |
| 139 | +- **Operation callbacks** -- `chunk_got_create_status`, |
| 140 | + `chunk_got_replicate_status`, etc., process async results from chunkservers. |
| 141 | +- **ID generation** -- `gChunkIdGenerator` (an `IIdGeneratorWithState`). |
| 142 | +- **Server selection** -- `get_servers_for_new_chunk.*` implements the |
| 143 | + algorithm for picking chunkservers considering labels, weights, disk usage, |
| 144 | + and load balancing. |
| 145 | + |
| 146 | +Files reference chunks through `FSNodeFile::chunks`, a vector of chunk IDs |
| 147 | +indexed by chunk index (offset / chunk_size). |
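As a worked example of the index arithmetic, the sketch below assumes a 64 MiB chunk size. The constant `kChunkSize` and both helper functions are hypothetical names for this example; check the actual chunk size constant in the source before relying on the numbers.

```cpp
#include <cstdint>

// Assumed chunk size for this example: 64 MiB.
constexpr uint64_t kChunkSize = 64ULL * 1024 * 1024;

// Index into the per-file chunk vector for a given byte offset.
constexpr uint32_t chunkIndexForOffset(uint64_t offset) {
    return static_cast<uint32_t>(offset / kChunkSize);
}

// Position of that byte inside its chunk.
constexpr uint64_t offsetWithinChunk(uint64_t offset) {
    return offset % kChunkSize;
}

// Byte 200 MiB lies in the fourth chunk (index 3), 8 MiB into it.
static_assert(chunkIndexForOffset(200ULL * 1024 * 1024) == 3,
              "200 MiB / 64 MiB rounds down to 3");
static_assert(offsetWithinChunk(200ULL * 1024 * 1024) == 8ULL * 1024 * 1024,
              "200 MiB - 3 * 64 MiB = 8 MiB");
```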
| 148 | + |
| 149 | +The `gChunkChangedSignal` notifies observers when chunk metadata changes. |
| 150 | + |
| 151 | +## Network Servers |
| 152 | + |
| 153 | +The master communicates with five types of peers. Modules named `mato*serv` |
| 154 | +follow the convention "**ma**ster **to** *X* **serv**ice"; `masterconn` is the reverse (shadow-to-master) side: |
| 155 | + |
| 156 | +| Module | Peer | Role | |
| 157 | +|----------------|-----------------|------| |
| 158 | +| `matoclserv` | FUSE clients | Handles all client filesystem requests (the largest module). Manages delayed chunk operations and session state. | |
| 159 | +| `matocsserv` | Chunkservers | Sends chunk create/delete/replicate/truncate/set-version commands. Tracks disk usage, labels, and connection state. | |
| 160 | +| `matomlserv` | Metaloggers & shadows | Broadcasts changelog entries and metadata snapshots for replication. | |
| 161 | +| `matontserv` | Notifier clients | Broadcasts changelog events for inotify-like functionality. | |
| 162 | +| `masterconn` | Active master | Used only when running in **Shadow** personality. Receives changelogs and metadata from the active master. | |
| 163 | + |
| 164 | +## Metadata Persistence |
| 165 | + |
| 166 | +Metadata durability is achieved through two complementary mechanisms: |
| 167 | + |
| 168 | +### Full Metadata Snapshots |
| 169 | + |
| 170 | +- **`IMetadataBackend`** (`metadata_backend_interface.h`) -- interface for |
| 171 | + loading and saving complete metadata. Methods: `init()`, `loadall()`, |
| 172 | + `store_fd()`, `commit_metadata_dump()`, `emergency_saves()`. |
| 173 | +- **`MetadataBackendFile`** (`metadata_backend_file.*`) -- the file-based |
| 174 | + implementation. Stores metadata in `metadata.sfs` with section-based format. |
| 175 | + Rotates previous copies on save. |
| 176 | +- **`MetadataLoader`** (`metadata_loader.h`) -- reads metadata sections from |
| 177 | + memory-mapped files, supporting async section loading. |
| 178 | +- **`IMetadataDumper` / `MetadataDumperFile`** -- handles periodic metadata |
| 179 | + dumps (foreground or background). The dumper can run in a forked child |
| 180 | + process to avoid blocking the event loop. |
| 181 | + |
| 182 | +### Changelog (Write-Ahead Log) |
| 183 | + |
| 184 | +- **`changelog.*`** -- incremental WAL. Each mutation is recorded as a text |
| 185 | + entry in the format `<version>: <ts>|<COMMAND>(arg1,arg2,...)`. |
| 186 | +- Changelogs are rotated (retention is set by `BACK_LOGS`), flushed after each |
| 187 | + write (this behavior is configurable), and broadcast to metaloggers/shadows via `matomlserv`. |
| 188 | +- **`restore.*`** -- replays changelog entries to reconstruct metadata state. |
| 189 | + Used by shadow masters during synchronization and by `sfsmetarestore` for |
| 190 | + offline recovery. |
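To make the entry format concrete, here is a minimal parsing sketch for the `<version>: <ts>|<COMMAND>(arg1,arg2,...)` layout. It is not the parser in `restore.*`: the struct `ChangelogEntry`, the functions `parseU64` and `parseChangelogLine`, and the `EXAMPLE` command used below are hypothetical, and the arguments are left as one unsplit string.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>

// Hypothetical holder for the fields of one changelog line.
struct ChangelogEntry {
    uint64_t version;
    uint64_t timestamp;
    std::string command;
    std::string args;  // raw text between the parentheses
};

// Parses a string consisting only of decimal digits.
static bool parseU64(const std::string &text, uint64_t &out) {
    if (text.empty()) {
        return false;
    }
    for (char c : text) {
        if (c < '0' || c > '9') {
            return false;
        }
    }
    try {
        out = std::stoull(text);
    } catch (const std::exception &) {
        return false;  // value does not fit in 64 bits
    }
    return true;
}

// Returns true and fills `entry` when `line` matches the expected layout.
bool parseChangelogLine(const std::string &line, ChangelogEntry &entry) {
    const std::size_t npos = std::string::npos;
    const std::size_t colon = line.find(": ");
    if (colon == npos) return false;
    const std::size_t bar = line.find('|', colon + 2);
    if (bar == npos) return false;
    const std::size_t open = line.find('(', bar + 1);
    const std::size_t close = line.rfind(')');
    if (open == npos || close == npos || close < open) return false;

    if (!parseU64(line.substr(0, colon), entry.version)) return false;
    if (!parseU64(line.substr(colon + 2, bar - colon - 2), entry.timestamp)) {
        return false;
    }
    entry.command = line.substr(bar + 1, open - bar - 1);
    entry.args = line.substr(open + 1, close - open - 1);
    return !entry.command.empty();
}
```

For the made-up line `12: 1700000000|EXAMPLE(1,foo)` this yields version 12, timestamp 1700000000, command `EXAMPLE`, and argument text `1,foo`.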
| 191 | + |
| 192 | +## KV Store Backend (FoundationDB) |
| 193 | + |
| 194 | +An alternative to file-based metadata storage, designed for distributed |
| 195 | +metadata: |
| 196 | + |
| 197 | +- **`IKVConnector`** (`kv_connector_interface.h`) -- interface for KV store |
| 198 | + operations and event handlers for metadata changes (`onNodeChanged`, |
| 199 | + `onEdgeChanged`, `onEdgeRemoved`, `onDetainedAdded`, etc.). |
| 200 | +- **`kv_connector_fdb.*`** -- FoundationDB concrete implementation |
| 201 | + (conditionally compiled). This integration is currently not wired into the |
| 202 | + default master initialization path in `init.h`. |
| 203 | +- **`kv_common_keys.h`** -- defines key prefix conventions for all metadata |
| 204 | + sections in the KV store: `NODE_`, `EDGE_`, `FREE_`, `CHNK_`, `XATR_`, |
| 205 | + `ACLS_`, `QUOT_`, `FLCK_`, plus reverse indexes (`DIR_PARENT_`, `PARENT_`, |
| 206 | + `DIR_NODES_COUNT_`, `DIR_STATS_`). |
| 207 | +- `FilesystemOperationContext` carries optional transaction handles. In the |
| 208 | + current in-memory master implementation these are empty, and they serve as an |
| 209 | + extension point for transactional backends. |
| 210 | + |
| 211 | +## Task Manager and Async Tasks |
| 212 | + |
| 213 | +Long-running or recursive operations are decomposed into small incremental |
| 214 | +tasks to avoid blocking the single-threaded event loop: |
| 215 | + |
| 216 | +- **`TaskManager`** (`task_manager.h`) -- generic job/task execution system. |
| 217 | + A **Job** is a named unit of work containing an ordered list of **Tasks**. |
| 218 | + The manager round-robins across jobs, executing a bounded batch of tasks per |
| 219 | + event loop iteration. |
| 220 | +- **`SnapshotTask`** (`snapshot_task.h`) -- splits filesystem snapshot |
| 221 | + (clone) operations into per-inode tasks. |
| 222 | +- **`RecursiveRemoveTask`** (`recursive_remove_task.h`) -- recursive directory |
| 223 | + deletion. |
| 224 | +- **`SetGoalTask`** / **`SetTrashtimeTask`** -- recursive goal or trashtime |
| 225 | + changes across a subtree. |
| 226 | +- **`DeferredMetadataDumpTask`** (`deferred_metadata_dump_task.h`) -- |
| 227 | + post-failover metadata dump. |
| 228 | + |
| 229 | +Jobs support cancellation and completion callbacks. |
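The round-robin batching can be sketched as below. This is a simplified model, not the `TaskManager` interface: `MiniTaskManager`, `submitJob`, and `processBatch` are hypothetical names, and cancellation and completion callbacks are omitted.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of a job queue that interleaves tasks across jobs.
class MiniTaskManager {
public:
    using Task = std::function<void()>;

    // Registers a named job made of an ordered list of tasks.
    void submitJob(std::string name, std::deque<Task> tasks) {
        if (!tasks.empty()) {
            jobs_.push_back(Job{std::move(name), std::move(tasks)});
        }
    }

    // Runs at most `maxTasks` tasks, taking one task from each job in turn.
    // Intended to be called once per event loop iteration.
    std::size_t processBatch(std::size_t maxTasks) {
        std::size_t executed = 0;
        while (executed < maxTasks && !jobs_.empty()) {
            Job job = std::move(jobs_.front());
            jobs_.pop_front();
            Task task = std::move(job.tasks.front());
            job.tasks.pop_front();
            task();
            ++executed;
            if (!job.tasks.empty()) {
                jobs_.push_back(std::move(job));  // back of the line
            }
        }
        return executed;
    }

    bool idle() const { return jobs_.empty(); }

private:
    struct Job {
        std::string name;
        std::deque<Task> tasks;
    };
    std::deque<Job> jobs_;
};
```

Bounding each batch keeps a large job, such as a snapshot of a big subtree, from starving smaller jobs or delaying network handling in the same loop.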
| 230 | + |
| 231 | +## Periodic Operations |
| 232 | + |
| 233 | +Several maintenance routines run on timer-driven or per-loop schedules, |
| 234 | +registered in `fs_periodic_master_init()` (see `filesystem_periodic.cc`): |
| 235 | + |
| 236 | +| Operation | Schedule | Purpose | |
| 237 | +|-----------|----------|---------| |
| 238 | +| **File integrity test** (`fs_periodic_file_test` / `fs_background_file_test`) | Every second (timer) + every loop (background) | Scans the entire node hash table, spreading one full pass over a configurable minimum cycle time (`FILE_TEST_LOOP_MIN_TIME`, default 3600 s). For each file node, checks chunk availability and copy counts. For each directory, validates parent-child pointer consistency. Builds a `gDefectiveNodes` map of inodes with unavailable chunks, under-goal chunks, or structural errors. | |
| 239 | +| **Background task processing** (`fs_background_task_manager_work`) | Every loop | Drives the `TaskManager`, processing a batch of tasks (snapshots, recursive removes, goal/trashtime changes) per iteration. | |
| 240 | +| **Checksum recalculation** (`fs_background_checksum_recalculation_a_bit`) | Every loop | Incrementally recalculates metadata checksums (nodes, xattrs, chunks) in the background, progressing through steps at a speed limit per iteration. | |
| 241 | +| **Trash cleanup** (`fs_periodic_emptytrash`) | Every 100ms | Purges trash entries whose trashtime has expired. | |
| 242 | +| **Reserved file cleanup** (`fs_periodic_emptyreserved`) | Configurable period | Releases reserved files (deleted-but-still-open files) whose sessions are no longer active. | |
| 243 | +| **Chunk maintenance** (in `chunks.cc`) | Periodic | Handles chunk replication, deletion of excess copies, and rebalancing across chunkservers. | |
| 244 | + |
| 245 | +## Initialization Sequence |
| 246 | + |
| 247 | +Startup is orchestrated by ordered `RunTab` arrays in `init.h`. The sequence |
| 248 | +is dependency-ordered -- comments in the source mark critical orderings: |
| 249 | + |
| 250 | +``` |
| 251 | +1. prometheus_init -- Optional Prometheus metrics endpoint |
| 252 | +2. hstorage_init -- String storage backend (must be first) |
| 253 | +3. personality_init -- Set master/shadow personality (must be second) |
| 254 | +4. rnd_init -- Random number generator |
| 255 | +5. dcm_init -- Data cache manager (before fs_init and matoclserv) |
| 256 | +6. matoclserv_sessions_init -- Load persisted sessions (before fs_init) |
| 257 | +7. exports_init -- Client mount/export permission rules |
| 258 | +8. topology_init -- Network topology configuration |
| 259 | +9. metadata_backend_init -- Initialize MetadataBackendFile + inode ID generator |
| 260 | +10. fs_init -- Core filesystem: load metadata, register periodic ops |
| 261 | +11. chartsdata_init -- Monitoring charts data collection |
| 262 | +12. masterconn_init -- Shadow's connection to active master |
| 263 | +13. matomlserv_init -- Metalogger/shadow communication |
| 264 | +14. matocsserv_init -- Chunkserver communication |
| 265 | +15. matontserv_init -- Notifier communication |
| 266 | +16. matoclserv_network_init -- Client network init (last -- opens for business) |
| 267 | +``` |
| 268 | + |
| 269 | +Client connections are accepted only after all other subsystems are ready. |
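An ordered initialization table of this kind can be sketched as follows. The layout of the real `RunTab` entries is not shown here, so `InitStep`, `runInitSteps`, the two sample step functions, and the "0 means success" convention are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical table entry: an init function plus a human-readable name.
struct InitStep {
    int (*fn)();       // assumed to return 0 on success
    const char *name;
};

// Sample steps used only to exercise the runner.
int okStep() { return 0; }
int failingStep() { return -1; }

// Runs the steps strictly in table order and stops at the first failure.
// Returns the index of the failing step, or steps.size() if all succeeded.
// Names of completed steps are appended to `completed`.
std::size_t runInitSteps(const std::vector<InitStep> &steps,
                         std::vector<std::string> &completed) {
    for (std::size_t i = 0; i < steps.size(); ++i) {
        if (steps[i].fn() != 0) {
            return i;
        }
        completed.push_back(steps[i].name);
    }
    return steps.size();
}
```

Keeping the order in a single table makes dependencies explicit: placing the client network step last is what guarantees that nothing is served before its prerequisites exist.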
| 270 | + |
| 271 | +## Key Design Patterns |
| 272 | + |
| 273 | +- **Strategy / interface-based extensibility** -- all major subsystems are |
| 274 | + behind pure virtual interfaces (`IFilesystemOperations`, |
| 275 | + `IFilesystemNodeOperations`, `IMetadataBackend`, `IKVConnector`, |
| 276 | + `hstorage::Storage`), allowing alternative implementations to be plugged in. |
| 277 | +- **Observer / signal pattern** -- `Signal<>` objects on `FilesystemMetadata` |
| 278 | + and in the chunk subsystem notify listeners about metadata changes without |
| 279 | + coupling producers to consumers. |
| 280 | +- **Global process state** -- core state is exposed through global variables: |
| 281 | + `gMetadata` is a raw pointer (`FilesystemMetadata *`), while |
| 282 | + `gFSOperations`, `gMetadataBackend`, `gInodeIdGenerator`, and |
| 283 | + `gChunkIdGenerator` are global `std::unique_ptr`s. They are initialized |
| 284 | + during startup and then used process-wide. |
| 285 | +- **Conditional compilation** -- `METARESTORE` and `METALOGGER` preprocessor |
| 286 | + guards exclude master-only or tool-only code paths, allowing the same source |
| 287 | + files to be linked into different binaries. |
| 288 | +- **Incremental stat propagation** -- directory `StatsRecord` values are |
| 289 | + maintained incrementally on every mutation and propagated up to root, |
| 290 | + avoiding expensive tree traversals for `dirinfo` queries. |
| 291 | +- **Cooperative multitasking** -- the master runs a single-threaded event loop. |
| 292 | + Long operations (snapshots, recursive removes, checksum recalculation) are |
| 293 | + split into small batches via `TaskManager` and `eventloop_make_next_poll_nonblocking()` |
| 294 | + to maintain responsiveness. |
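The observer / signal pattern from the list above can be sketched in a few lines. `MiniSignal` is a hypothetical stand-in and is not the codebase's `Signal<>` template; its `connect` and `emit` methods are assumptions chosen for the example.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical minimal signal: producers call emit(), observers register
// callbacks with connect(). Neither side knows the other's type.
template <typename... Args>
class MiniSignal {
public:
    using Slot = std::function<void(Args...)>;

    void connect(Slot slot) { slots_.push_back(std::move(slot)); }

    // Invokes every registered slot in registration order.
    void emit(Args... args) const {
        for (const auto &slot : slots_) {
            slot(args...);
        }
    }

private:
    std::vector<Slot> slots_;
};
```

An observer such as a KV connector would register a callback once at startup; the metadata code then only emits, for example with the inode ID that changed, and stays unaware of who is listening.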