Commit f8975ad

Browse files
committed
docs(master): add architectural reference README
The master server is one of the largest components in the codebase. A high-level architectural reference helps new contributors understand how the major subsystems relate to each other before diving into the code.

Signed-off-by: guillex <guillex@leil.io>
1 parent 2d2713b commit f8975ad

1 file changed: +294 -0 lines changed
src/master/README.md

Lines changed: 294 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,294 @@

# Master Server (`sfsmaster`) -- Architectural Reference

The master server is the central metadata authority in a SaunaFS cluster. It
maintains the entire filesystem namespace (inodes, directories, edges, xattrs,
ACLs, quotas, locks) in memory and coordinates chunk placement across
chunkservers. The binary produced from this directory is **`sfsmaster`**.

A single codebase supports two runtime **personalities**:

- **Master** -- the active metadata server that accepts client mutations.
- **Shadow** -- a hot standby that replays changelogs from the master and can
  be promoted at any time.

The personality is selected at startup via configuration and can only transition
from Shadow to Master (never the reverse). Conditional compilation guards
(`METARESTORE`, `METALOGGER`) also allow parts of this code to be linked into
the `sfsmetarestore` and `sfsmetalogger` binaries.

## Code Organization

The directory is flat (no subdirectories). Files are grouped by **filename
prefix**, and each group corresponds to a logical subsystem:

| Prefix / group        | Subsystem                                        |
|-----------------------|--------------------------------------------------|
| `filesystem_*`        | Core metadata model, operations, and maintenance |
| `matoclserv*`         | Client (FUSE mount) protocol server              |
| `matocsserv*`         | Chunkserver protocol server                      |
| `matomlserv*`         | Metalogger / shadow protocol server              |
| `matontserv*`         | Notifier (inotify-like) protocol server          |
| `masterconn*`         | Shadow-to-master replication connection          |
| `metadata_backend_*`  | Metadata persistence (load / save / dump)        |
| `metadata_loader*`    | Section-based async metadata loading             |
| `metadata_dumper_*`   | Periodic metadata dumping                        |
| `chunks*`             | Chunk lifecycle and replication tracking         |
| `changelog*`          | Write-ahead log (WAL)                            |
| `restore*`            | Changelog replay (apply entries to metadata)     |
| `hstring*`            | Efficient string (filename) storage              |
| `kv_*`                | KV store (FoundationDB) integration              |
| `task_manager*`       | Async task execution framework                   |
| `locks*`              | POSIX and flock advisory file locking            |
| `acl_storage*`        | Deduplicated RichACL storage                     |
| `exports*`            | Client mount permissions and IP-based rules      |
| `topology*`           | Network topology for data-locality placement     |
| `personality*`        | Master / Shadow personality management           |
| `chartsdata*`         | Monitoring / charting data collection            |

A key organizational convention is the **interface / implementation split**:
public interfaces live in `*_interface.h` files as pure virtual classes (e.g.
`IFilesystemOperations`, `IFilesystemNodeOperations`, `IMetadataBackend`,
`IKVConnector`), while default in-memory implementations use a `*Base` suffix
or reside in the corresponding `.cc` file.

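The convention can be sketched as follows. The method set is heavily trimmed and the stub bodies are invented for illustration; only the class and global names come from the text above:

```cpp
#include <cstdint>
#include <memory>

// Sketch of the *_interface.h convention: a pure virtual interface...
class IMetadataBackend {
public:
    virtual ~IMetadataBackend() = default;
    virtual int loadall() = 0;               // load a complete metadata image
    virtual uint64_t version() const = 0;    // metadata version after loading
};

// ...with a concrete default implementation in a separate translation unit.
class MetadataBackendFile : public IMetadataBackend {
public:
    int loadall() override { return 0; }     // stub: would read metadata.sfs
    uint64_t version() const override { return version_; }
private:
    uint64_t version_ = 0;
};

// Globals such as gMetadataBackend are std::unique_ptr to the interface,
// so an alternative backend can be swapped in at startup.
std::unique_ptr<IMetadataBackend> gMetadataBackend =
    std::make_unique<MetadataBackendFile>();
```

Code that consumes the subsystem only sees the interface, which is what makes the KV-backed and file-backed variants interchangeable.
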
## Core Data Model

Core namespace metadata lives in a **`FilesystemMetadata`** instance addressed
through the global pointer `gMetadata` (declared in
`filesystem_metadata.h`). Chunk metadata is managed separately in `chunks.cc`
(`gChunksMetadata`). Key `gMetadata` members are:

- **`nodeHash`** -- a fixed-size hash table (4M buckets) mapping inode IDs to
  `FSNode*` pointers. This is the primary index for all filesystem objects.
- **`root`** -- pointer to the root `FSNodeDirectory`.
- **`trash` / `reserved`** -- containers for deleted files (awaiting trashtime
  expiry) and for files held open by clients after unlinking.
- **`inodePool`** (`IdPoolDetainer`) -- inode ID allocation with a reuse delay
  (currently derived from the compile-time constant `SFS_INODE_REUSE_DELAY`) to
  prevent stale file handles (for example, NFS handles) from hitting recycled
  inodes.
- **`aclStorage`** -- reference-counted, deduplicated RichACL store.
- **`xattrInodeHash` / `xattrDataHash`** -- extended attribute storage.
- **`quotaDatabase`** -- per-user/group/directory quota tracking.
- **`flockLocks` / `posixLocks`** -- advisory file lock state.
- **`taskManager`** -- the async task execution engine (see below).
- **`metadataVersion`** -- monotonically increasing version counter, bumped on
  every metadata mutation.
- **Signals** -- `nodeChangedSignal`, `edgeChangedSignal`, `edgeRemovedSignal`,
  emitted on metadata changes; these are extension points for observers such as
  KV connectors.

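A fixed-size, chained hash table like `nodeHash` can be sketched as below. The toy `Node`, the bucket count, and the modulo hash are illustrative stand-ins; the real table holds `FSNode*` in ~4M buckets and its hash function is not shown here:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for FSNode: just an inode id plus an intrusive chain pointer.
struct Node {
    uint32_t id;
    Node *next = nullptr;   // chaining within a bucket
};

class NodeHash {
public:
    explicit NodeHash(std::size_t buckets) : table_(buckets, nullptr) {}

    // Insert at the head of the node's bucket chain.
    void insert(Node *n) {
        const auto b = n->id % table_.size();
        n->next = table_[b];
        table_[b] = n;
    }

    // Walk the bucket chain; the primary index for inode lookups.
    Node *find(uint32_t id) const {
        for (Node *n = table_[id % table_.size()]; n != nullptr; n = n->next)
            if (n->id == id) return n;
        return nullptr;
    }

private:
    std::vector<Node *> table_;   // fixed size, never rehashed
};
```

The fixed bucket count trades memory for predictable O(1) lookups with no rehash pauses, which matters in a single-threaded event loop.
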
### Node Type Hierarchy

`FSNode` (64 bytes) is the base class for all filesystem objects. Concrete
subtypes:

| Class             | Type                  | Extra fields                                               |
|-------------------|-----------------------|------------------------------------------------------------|
| `FSNodeFile`      | file, trash, reserved | `length`, `chunks[]`, `sessionIds[]`                       |
| `FSNodeDirectory` | directory             | `entries` (SkipList), `stats`, `nlink`, `lowerCaseEntries` |
| `FSNodeSymlink`   | symlink               | `path`, `path_length`                                      |
| `FSNodeDevice`    | block/char device     | `rdev`                                                     |

Every node stores a compact `parents` vector of `(inode_t, Handle*)` pairs,
providing reverse links from child to parent(s) (files can have multiple
parents via hard links).

Directory entries use a **SkipList** keyed by hashed name handles, with an
optional parallel `lowerCaseEntries` SkipList for case-insensitive
filesystems.

Each `FSNodeDirectory` maintains an aggregate `StatsRecord` (inodes, dirs,
files, links, chunks, length, size, realsize) that is incrementally propagated
up the tree on every mutation, enabling O(1) `dirinfo` queries at any level.

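The propagation idea can be sketched as follows. The field set is trimmed to two of the `StatsRecord` members named above, and the `Dir` struct and function name are invented for illustration:

```cpp
#include <cstdint>

// Trimmed StatsRecord: the real one also tracks dirs, files, links, chunks,
// size, and realsize.
struct StatsRecord {
    int64_t inodes = 0;
    int64_t length = 0;
};

struct Dir {
    StatsRecord stats;
    Dir *parent = nullptr;   // root has parent == nullptr
};

// Apply a (possibly negative) delta at a directory and walk up to the root.
// Cost is O(depth) per mutation, which is what makes dirinfo an O(1) read
// at any level of the tree.
void propagateDelta(Dir *dir, const StatsRecord &delta) {
    for (Dir *d = dir; d != nullptr; d = d->parent) {
        d->stats.inodes += delta.inodes;
        d->stats.length += delta.length;
    }
}
```
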
## Filesystem Operations

Operations are organized in a two-layer interface hierarchy:

1. **`IFilesystemOperations`** (`filesystem_operations_interface.h`) --
   high-level POSIX-like API: `lookup`, `mknod`, `mkdir`, `unlink`, `rmdir`,
   `rename`, `link`, `symlink`, `setAttr`, `readdir`, `writeChunk`,
   `readChunk`, quota management, lock operations, etc. The global instance
   is `gFSOperations`.

2. **`IFilesystemNodeOperations`** (`filesystem_node_operations_interface.h`)
   -- lower-level node CRUD: `createNode`, `link`, `unlink`, `removeEdge`,
   `purge`, `getPath`, stats propagation, ACL inheritance, etc.

`IFilesystemOperations` delegates to `IFilesystemNodeOperations` via its
`nodeOperations()` accessor. Both have in-memory default implementations
(`FilesystemOperationsBase`, `FilesystemNodeOperationsBase`) that operate
directly on `gMetadata`.

The **`FsContext`** (`fs_context.h`) carries per-operation state: personality,
session data, uid/gid, timestamps. The **`FilesystemOperationContext`**
(`filesystem_operation_context.h`) carries optional transaction handles; in the
current in-memory implementation these are unused (see the KV Store section
below for context).

## Chunk Management

Chunk state is managed in `chunks.h` / `chunks.cc`. Key responsibilities:

- **Chunk lifecycle** -- creation, version bumps, deletion, truncation.
- **Location tracking** -- which chunkserver holds which chunk and version,
  reported via `chunk_server_has_chunk()`.
- **Replication decisions** -- tracking under-goal / over-goal chunks and
  issuing recover/remove operations.
- **Operation callbacks** -- `chunk_got_create_status`,
  `chunk_got_replicate_status`, etc., process async results from chunkservers.
- **ID generation** -- `gChunkIdGenerator` (an `IIdGeneratorWithState`).
- **Server selection** -- `get_servers_for_new_chunk.*` implements the
  algorithm for picking chunkservers, considering labels, weights, disk usage,
  and load balancing.

Files reference chunks through `FSNodeFile::chunks`, a vector of chunk IDs
indexed by chunk index (offset / chunk_size).

The `gChunkChangedSignal` notifies observers when chunk metadata changes.

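The offset-to-chunk mapping can be sketched as below. The 64 MiB chunk size and both function names are assumptions for illustration, not taken from the source:

```cpp
#include <cstdint>
#include <vector>

// Assumed chunk size; the real constant lives elsewhere in the codebase.
constexpr uint64_t kChunkSize = 64ULL * 1024 * 1024;

// A file offset maps to an index into FSNodeFile::chunks.
uint64_t chunkIndexForOffset(uint64_t offset) {
    return offset / kChunkSize;
}

// Look up the chunk id covering `offset`; 0 stands in for "no chunk"
// (an index past EOF, or a hole) in this sketch.
uint64_t chunkIdForOffset(const std::vector<uint64_t> &chunks,
                          uint64_t offset) {
    const uint64_t idx = chunkIndexForOffset(offset);
    return idx < chunks.size() ? chunks[idx] : 0;
}
```
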
## Network Servers

The master communicates with five types of peers. The naming convention
`mato*serv` means "**ma**ster **to** *X* **serv**ice":

| Module       | Peer                  | Role |
|--------------|-----------------------|------|
| `matoclserv` | FUSE clients          | Handles all client filesystem requests (the largest module). Manages delayed chunk operations and session state. |
| `matocsserv` | Chunkservers          | Sends chunk create/delete/replicate/truncate/set-version commands. Tracks disk usage, labels, and connection state. |
| `matomlserv` | Metaloggers & shadows | Broadcasts changelog entries and metadata snapshots for replication. |
| `matontserv` | Notifier clients      | Broadcasts changelog events for inotify-like functionality. |
| `masterconn` | Active master         | Used only when running in **Shadow** personality. Receives changelogs and metadata from the active master. |

## Metadata Persistence

Metadata durability is achieved through two complementary mechanisms:

### Full Metadata Snapshots

- **`IMetadataBackend`** (`metadata_backend_interface.h`) -- interface for
  loading and saving complete metadata. Methods: `init()`, `loadall()`,
  `store_fd()`, `commit_metadata_dump()`, `emergency_saves()`.
- **`MetadataBackendFile`** (`metadata_backend_file.*`) -- the file-based
  implementation. Stores metadata in `metadata.sfs` using a section-based
  format and rotates previous copies on save.
- **`MetadataLoader`** (`metadata_loader.h`) -- reads metadata sections from
  memory-mapped files, supporting async section loading.
- **`IMetadataDumper` / `MetadataDumperFile`** -- handle periodic metadata
  dumps (foreground or background). The dumper can run in a forked child
  process to avoid blocking the event loop.

### Changelog (Write-Ahead Log)

- **`changelog.*`** -- incremental WAL. Each mutation is recorded as a text
  entry in the format `<version>: <ts>|<COMMAND>(arg1,arg2,...)`.
- Changelogs are rotated (configurable `BACK_LOGS`), flushed after each write
  (configurable), and broadcast to metaloggers/shadows via `matomlserv`.
- **`restore.*`** -- replays changelog entries to reconstruct metadata state.
  Used by shadow masters during synchronization and by `sfsmetarestore` for
  offline recovery.

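A formatter for the entry layout shown above might look like this. The function name is invented, and any quoting or escaping rules the real format applies to arguments are omitted:

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Render one changelog entry: <version>: <ts>|<COMMAND>(arg1,arg2,...)
std::string formatChangelogEntry(uint64_t version, uint64_t ts,
                                 const std::string &command,
                                 const std::vector<std::string> &args) {
    std::ostringstream out;
    out << version << ": " << ts << "|" << command << "(";
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (i != 0) out << ",";
        out << args[i];
    }
    out << ")";
    return out.str();
}
```

Because the version number leads each line, a shadow or `sfsmetarestore` can verify that entries apply in strictly increasing order during replay.
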
## KV Store Backend (FoundationDB)

An alternative to file-based metadata storage, designed for distributed
metadata:

- **`IKVConnector`** (`kv_connector_interface.h`) -- interface for KV store
  operations and event handlers for metadata changes (`onNodeChanged`,
  `onEdgeChanged`, `onEdgeRemoved`, `onDetainedAdded`, etc.).
- **`kv_connector_fdb.*`** -- the concrete FoundationDB implementation
  (conditionally compiled). This integration is currently not wired into the
  default master initialization path in `init.h`.
- **`kv_common_keys.h`** -- defines key prefix conventions for all metadata
  sections in the KV store: `NODE_`, `EDGE_`, `FREE_`, `CHNK_`, `XATR_`,
  `ACLS_`, `QUOT_`, `FLCK_`, plus reverse indexes (`DIR_PARENT_`, `PARENT_`,
  `DIR_NODES_COUNT_`, `DIR_STATS_`).
- `FilesystemOperationContext` carries optional transaction handles. In the
  current in-memory master implementation these are empty; they serve as an
  extension point for transactional backends.

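Key construction from these prefixes can be sketched as follows. The textual id encoding and both helper names are purely illustrative; the real encoding in `kv_common_keys.h` (byte order, fixed-width ids, separators) is not shown here:

```cpp
#include <cstdint>
#include <string>

// Node records are keyed by the NODE_ prefix plus the inode id.
std::string makeNodeKey(uint32_t inode) {
    return "NODE_" + std::to_string(inode);
}

// Edge records are keyed by the EDGE_ prefix, the parent inode, and the
// entry name, which allows range scans over one directory's entries.
std::string makeEdgeKey(uint32_t parentInode, const std::string &name) {
    return "EDGE_" + std::to_string(parentInode) + "_" + name;
}
```

Prefix-per-section keys let each metadata section (nodes, edges, xattrs, quotas, locks) be scanned or migrated independently.
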
## Task Manager and Async Tasks

Long-running or recursive operations are decomposed into small incremental
tasks to avoid blocking the single-threaded event loop:

- **`TaskManager`** (`task_manager.h`) -- generic job/task execution system.
  A **Job** is a named unit of work containing an ordered list of **Tasks**.
  The manager round-robins across jobs, executing a bounded batch of tasks per
  event-loop iteration.
- **`SnapshotTask`** (`snapshot_task.h`) -- splits filesystem snapshot
  (clone) operations into per-inode tasks.
- **`RecursiveRemoveTask`** (`recursive_remove_task.h`) -- recursive directory
  deletion.
- **`SetGoalTask`** / **`SetTrashtimeTask`** -- recursive goal or trashtime
  changes across a subtree.
- **`DeferredMetadataDumpTask`** (`deferred_metadata_dump_task.h`) --
  post-failover metadata dump.

Jobs support cancellation and completion callbacks.

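The round-robin batching idea can be sketched as follows. The class and method names mirror the description above but are not the real API, and the batching policy (one task per job per turn) is an assumption:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>

// A Job is a named, ordered queue of tasks.
struct Job {
    std::string name;
    std::deque<std::function<void()>> tasks;
};

class TaskManager {
public:
    void addJob(Job job) { jobs_.push_back(std::move(job)); }

    // Run at most `batch` tasks this event-loop iteration, taking one task
    // from each live job in turn. Returns the number of tasks executed.
    std::size_t workABit(std::size_t batch) {
        std::size_t done = 0;
        while (done < batch && !jobs_.empty()) {
            Job &job = jobs_.front();
            if (job.tasks.empty()) {      // finished job: retire it
                jobs_.pop_front();
                continue;
            }
            job.tasks.front()();
            job.tasks.pop_front();
            ++done;
            // Rotate so the next job gets the next slot (round-robin).
            jobs_.push_back(std::move(jobs_.front()));
            jobs_.pop_front();
        }
        return done;
    }

private:
    std::deque<Job> jobs_;
};
```

Bounding the batch per iteration is what keeps a multi-million-inode snapshot or recursive remove from starving client requests.
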
## Periodic Operations

Several maintenance routines run on timer-driven or per-loop schedules,
registered in `fs_periodic_master_init()` (see `filesystem_periodic.cc`):

| Operation | Schedule | Purpose |
|-----------|----------|---------|
| **File integrity test** (`fs_periodic_file_test` / `fs_background_file_test`) | Every second (timer) + every loop (background) | Scans the entire node hash table over a configurable cycle time (`FILE_TEST_LOOP_MIN_TIME`, default 3600 s). For each file node, checks chunk availability and copy counts. For each directory, validates parent-child pointer consistency. Builds a `gDefectiveNodes` map of inodes with unavailable chunks, under-goal chunks, or structural errors. |
| **Background task processing** (`fs_background_task_manager_work`) | Every loop | Drives the `TaskManager`, processing a batch of tasks (snapshots, recursive removes, goal/trashtime changes) per iteration. |
| **Checksum recalculation** (`fs_background_checksum_recalculation_a_bit`) | Every loop | Incrementally recalculates metadata checksums (nodes, xattrs, chunks) in the background, progressing a bounded number of steps per iteration. |
| **Trash cleanup** (`fs_periodic_emptytrash`) | Every 100 ms | Purges expired trash entries whose deletion timestamp has passed. |
| **Reserved file cleanup** (`fs_periodic_emptyreserved`) | Configurable period | Releases reserved files (deleted-but-still-open files) whose sessions are no longer active. |
| **Chunk maintenance** (in `chunks.cc`) | Periodic | Handles chunk replication, deletion of excess copies, and rebalancing across chunkservers. |

## Initialization Sequence

Startup is orchestrated by ordered `RunTab` arrays in `init.h`. The sequence
is dependency-ordered -- comments in the source mark critical orderings:

```
 1. prometheus_init          -- Optional Prometheus metrics endpoint
 2. hstorage_init            -- String storage backend (must be first)
 3. personality_init         -- Set master/shadow personality (must be second)
 4. rnd_init                 -- Random number generator
 5. dcm_init                 -- Data cache manager (before fs_init and matoclserv)
 6. matoclserv_sessions_init -- Load persisted sessions (before fs_init)
 7. exports_init             -- Client mount/export permission rules
 8. topology_init            -- Network topology configuration
 9. metadata_backend_init    -- Initialize MetadataBackendFile + inode ID generator
10. fs_init                  -- Core filesystem: load metadata, register periodic ops
11. chartsdata_init          -- Monitoring charts data collection
12. masterconn_init          -- Shadow's connection to active master
13. matomlserv_init          -- Metalogger/shadow communication
14. matocsserv_init          -- Chunkserver communication
15. matontserv_init          -- Notifier communication
16. matoclserv_network_init  -- Client network init (last -- opens for business)
```

Client connections are accepted only after all other subsystems are ready.

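The `RunTab` idea — an ordered array of initializer entries executed until one fails — can be sketched as below. The struct layout, stub initializers, and driver function are assumptions; the real `RunTab` in `init.h` may differ:

```cpp
#include <cstdio>

// One entry per subsystem: an init function plus a name for diagnostics.
struct RunTab {
    int (*fn)();
    const char *name;
};

// Stubs standing in for the real subsystem initializers.
static int hstorage_init()    { return 0; }
static int personality_init() { return 0; }
static int fs_init()          { return 0; }

// Array order encodes the dependency order; comments mark critical spots.
static const RunTab kRunTab[] = {
    {hstorage_init,    "hstorage"},     // must be first
    {personality_init, "personality"},  // must be second
    {fs_init,          "fs"},
};

// Execute entries in order, stopping at the first failure.
int runInitSequence() {
    for (const RunTab &entry : kRunTab) {
        if (entry.fn() != 0) {
            std::fprintf(stderr, "init failed: %s\n", entry.name);
            return -1;
        }
    }
    return 0;
}
```

Keeping the order in one table makes the dependency constraints auditable in a single place instead of being scattered across call sites.
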
## Key Design Patterns

- **Strategy / interface-based extensibility** -- all major subsystems are
  behind pure virtual interfaces (`IFilesystemOperations`,
  `IFilesystemNodeOperations`, `IMetadataBackend`, `IKVConnector`,
  `hstorage::Storage`), allowing alternative implementations to be plugged in.
- **Observer / signal pattern** -- `Signal<>` objects on `FilesystemMetadata`
  and in the chunk subsystem notify listeners about metadata changes without
  coupling producers to consumers.
- **Global process state** -- core state is exposed through global variables:
  `gMetadata` is a raw pointer (`FilesystemMetadata *`), while
  `gFSOperations`, `gMetadataBackend`, `gInodeIdGenerator`, and
  `gChunkIdGenerator` are global `std::unique_ptr`s. They are initialized
  during startup and then used process-wide.
- **Conditional compilation** -- `METARESTORE` and `METALOGGER` preprocessor
  guards exclude master-only or tool-only code paths, allowing the same source
  files to be linked into different binaries.
- **Incremental stat propagation** -- directory `StatsRecord` values are
  maintained incrementally on every mutation and propagated up to the root,
  avoiding expensive tree traversals for `dirinfo` queries.
- **Cooperative multitasking** -- the master runs a single-threaded event loop.
  Long operations (snapshots, recursive removes, checksum recalculation) are
  split into small batches via `TaskManager` and
  `eventloop_make_next_poll_nonblocking()` to maintain responsiveness.
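
The observer/signal pattern can be sketched as below. This is a minimal stand-in for the project's `Signal<>` type, not its actual implementation; the signal name mirrors `nodeChangedSignal` from the data model above:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Minimal multicast signal: producers emit, observers connect callbacks,
// and neither side knows about the other.
template <typename... Args>
class Signal {
public:
    void connect(std::function<void(Args...)> slot) {
        slots_.push_back(std::move(slot));
    }
    void emit(Args... args) const {
        for (const auto &slot : slots_) slot(args...);
    }
private:
    std::vector<std::function<void(Args...)>> slots_;
};

// Something like nodeChangedSignal, keyed here by inode id; an observer
// such as a KV connector would connect its onNodeChanged handler to it.
Signal<uint32_t> nodeChangedSignal;
```
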
