Our implementation centers on several key components, defined below.
- StorageNode - This is the top-level entity and it represents the entry point to a fully featured storage node that clients can connect to in order to retrieve file data and file metadata. It acts as a dependency injection entity by instantiating and wiring together all other components: StorageRaftMember, StorageNetwork, FileStore, MetadataStore, StorageWatchdog, SnapshotStore, TransactionLogStore, FileSystemService, StorageEndpoint, and MetricService.
- StorageRaftMember - This component houses all of our Raft consensus logic for electing leaders and admitting changes to file data and metadata. Only one StorageRaftMember can be the leader at any given moment, with all other members acting as followers. All metadata write transactions are proposed by the leader, and once a quorum of followers has voted to approve the transaction(s) they are committed. The leader node is also responsible for triggering metadata snapshots on all StorageNodes via special signaling proposals that require all StorageNodes to produce a transactionally consistent snapshot, which then allows them to trim the transaction log. StorageRaftMember makes use of StorageNetwork to interact with other StorageRaftMembers. It handles requests that are routed to it via StorageEndpoint or from other StorageRaftMembers, and it delegates some operations to MetadataStore, FileStore, TransactionLogStore, and SnapshotStore.
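The quorum rule above can be sketched in a few lines. This is a hypothetical illustration only (the real logic lives in OpenRaft); the `Proposal` type and its fields are invented for the example:

```rust
// Hypothetical sketch: a proposal commits once a strict majority of
// StorageRaftMembers (including the leader itself) has approved it.
struct Proposal {
    approvals: usize,    // votes received so far; the leader counts as one
    cluster_size: usize, // total number of StorageRaftMembers
}

impl Proposal {
    fn quorum(&self) -> usize {
        self.cluster_size / 2 + 1 // strict majority
    }
    fn record_approval(&mut self) {
        self.approvals += 1;
    }
    fn is_committed(&self) -> bool {
        self.approvals >= self.quorum()
    }
}

fn main() {
    // 5-member cluster: the leader's own vote plus one follower is not enough.
    let mut p = Proposal { approvals: 1, cluster_size: 5 };
    p.record_approval(); // follower 1 approves
    assert!(!p.is_committed()); // 2 of 5: no quorum yet
    p.record_approval(); // follower 2 approves
    assert!(p.is_committed()); // 3 of 5: majority reached, commit
}
```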
- StorageNetwork - This component provides peer-to-peer connectivity using libp2p for topic-based messaging (gossipsub), direct RPC, and peer management. It uses a Factory+Inner+Clone pattern: StorageNetworkFactory creates instances, StorageNetworkInner contains the actual swarm and state (wrapped in Arc), and StorageNetwork is a lightweight cloneable handle with a command channel. This architecture allows multiple components to hold StorageNetwork instances without ownership conflicts (critical for OpenRaft compatibility). Components join topics to get sender/receiver channels for pub/sub communication. The network event loop runs independently, processing both libp2p events and command channel messages. Peer validation supports two modes: explicit peer IDs (reject mismatches) or auto-ID mode (learn and store peer IDs on first connection). Learned peer IDs are persisted to disk to ensure consistency across restarts.
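The Factory+Inner+Clone pattern can be condensed into a std-only sketch. The `Command` enum, counter, and channel wiring here are stand-ins (the real inner holds a libp2p swarm and the event loop runs async):

```rust
use std::sync::{mpsc, Arc, Mutex};

// Stand-in for the commands the event loop would process.
enum Command {
    Publish { topic: String, payload: Vec<u8> },
}

struct StorageNetworkInner {
    // The real component holds the libp2p swarm here; a counter stands in.
    published: Mutex<usize>,
}

// Lightweight cloneable handle: shared state behind Arc plus a command sender.
#[derive(Clone)]
struct StorageNetwork {
    inner: Arc<StorageNetworkInner>,
    commands: mpsc::Sender<Command>,
}

struct StorageNetworkFactory;

impl StorageNetworkFactory {
    fn create() -> (StorageNetwork, mpsc::Receiver<Command>) {
        let (tx, rx) = mpsc::channel();
        let inner = Arc::new(StorageNetworkInner { published: Mutex::new(0) });
        (StorageNetwork { inner, commands: tx }, rx)
    }
}

impl StorageNetwork {
    fn publish(&self, topic: &str, payload: Vec<u8>) {
        *self.inner.published.lock().unwrap() += 1;
        self.commands
            .send(Command::Publish { topic: topic.to_string(), payload })
            .unwrap();
    }
}

fn main() {
    let (net, rx) = StorageNetworkFactory::create();
    let handle = net.clone(); // any number of components can hold handles
    handle.publish("metadata", b"hello".to_vec());
    // Shared state is visible through every clone of the handle.
    assert_eq!(*net.inner.published.lock().unwrap(), 1);
    // The event loop (simulated by rx here) receives the command.
    assert!(matches!(rx.recv().unwrap(), Command::Publish { .. }));
}
```

Because the handle is just an `Arc` plus a `Sender`, cloning is cheap and no component needs exclusive ownership of the swarm.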
- FileStore - This component is responsible for applying erasure encoding to file data and then assigning the resulting Chunks to disks across different StorageNodes in the StorageNetwork so as to achieve the required level of durability and availability per the file's storage configuration. FileStore must also handle Chunk read and verification requests routed to it via StorageEndpoint. FileStore will only receive mutating operations from StorageRaftMember.
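As an illustration of the encode/recover idea, here is a deliberately simplified single-parity XOR code. A real FileStore would use a Reed-Solomon-style code with configurable data/parity counts; the function names are invented:

```rust
// Split data into k data chunks and derive one XOR parity chunk.
fn encode(data: &[u8], k: usize) -> Vec<Vec<u8>> {
    let chunk_len = (data.len() + k - 1) / k;
    let mut chunks: Vec<Vec<u8>> = data
        .chunks(chunk_len)
        .map(|c| {
            let mut v = c.to_vec();
            v.resize(chunk_len, 0); // zero-pad the final chunk
            v
        })
        .collect();
    // Parity chunk: byte-wise XOR of all data chunks.
    let mut parity = vec![0u8; chunk_len];
    for chunk in &chunks {
        for (p, b) in parity.iter_mut().zip(chunk) {
            *p ^= b;
        }
    }
    chunks.push(parity);
    chunks
}

// Recover a single missing chunk by XOR-ing all surviving chunks.
fn recover(chunks: &[Option<Vec<u8>>]) -> Vec<u8> {
    let len = chunks.iter().flatten().next().unwrap().len();
    let mut out = vec![0u8; len];
    for c in chunks.iter().flatten() {
        for (o, b) in out.iter_mut().zip(c) {
            *o ^= b;
        }
    }
    out
}

fn main() {
    let data = b"wormfs data!".to_vec(); // 12 bytes splits evenly into k = 3
    let chunks = encode(&data, 3); // 3 data chunks + 1 parity chunk
    // Simulate losing data chunk 1 and rebuild it from the survivors.
    let mut surviving: Vec<Option<Vec<u8>>> =
        chunks.iter().cloned().map(Some).collect();
    let lost = surviving[1].take().unwrap();
    assert_eq!(recover(&surviving), lost);
}
```

A single XOR parity tolerates exactly one lost chunk per Stripe; the production code trades more parity chunks for higher durability per the file's storage configuration.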
- MetadataStore - This component is responsible for housing all File, Stripe, Chunk, and related metadata. Only StorageRaftMember may send it mutating operations, but it must also handle read-only operations routed to it via StorageEndpoint.
- SnapshotStore - This component is responsible for ingesting (storing) metadata snapshots that StorageRaftMember triggered against the MetadataStore. After StorageRaftMember tells MetadataStore to generate a transactionally consistent backup of the metadata store, StorageRaftMember will then make a mutating call to SnapshotStore to ingest that snapshot file, which mostly amounts to updating its internal state of which snapshots are available as well as their transaction log details. StorageRaftMember may also trigger a "prune" operation, whereby SnapshotStore will be expected to delete, from disk, any snapshots that are too old and thus no longer needed. SnapshotStore must also handle read-only requests routed to it via StorageEndpoint from other StorageRaftMembers that need to resync to the current Raft consensus but have fallen too far behind to catch up using only transaction log replays.
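The prune operation's bookkeeping might look like the following sketch. The retention policy (keep the newest N) and the `SnapshotMeta` shape are assumptions; the real store would also unlink snapshot files on disk:

```rust
#[derive(Debug, PartialEq)]
struct SnapshotMeta {
    last_tx: u64, // highest transaction id covered by this snapshot
}

struct SnapshotStore {
    snapshots: Vec<SnapshotMeta>, // kept sorted ascending by last_tx
}

impl SnapshotStore {
    // Drop the oldest snapshots so at most `keep` remain; returns how
    // many were removed (the real implementation would delete the files).
    fn prune(&mut self, keep: usize) -> usize {
        let excess = self.snapshots.len().saturating_sub(keep);
        self.snapshots.drain(..excess).count()
    }
}

fn main() {
    let mut store = SnapshotStore {
        snapshots: vec![
            SnapshotMeta { last_tx: 100 },
            SnapshotMeta { last_tx: 200 },
            SnapshotMeta { last_tx: 300 },
        ],
    };
    assert_eq!(store.prune(1), 2); // the two oldest snapshots are removed
    assert_eq!(store.snapshots, vec![SnapshotMeta { last_tx: 300 }]);
}
```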
- TransactionLogStore - This component is responsible for durably storing the transaction log. It only receives write requests from StorageRaftMember, in the form of append or trim operations. It must also handle read-only operations routed to it via StorageEndpoint from other StorageRaftMembers that require a replay of the transaction log beginning at some transaction until "now".
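A minimal in-memory sketch of the append/trim/replay surface, assuming monotonically increasing transaction ids (durability and the on-disk format are omitted):

```rust
struct TransactionLogStore {
    first_tx: u64,          // id of the first retained entry
    entries: Vec<Vec<u8>>,  // serialized transactions, contiguous ids
}

impl TransactionLogStore {
    // Append a transaction; returns the id it was assigned.
    fn append(&mut self, entry: Vec<u8>) -> u64 {
        self.entries.push(entry);
        self.first_tx + self.entries.len() as u64 - 1
    }

    // Drop everything below `up_to` (exclusive), e.g. after a snapshot
    // has made those entries redundant.
    fn trim(&mut self, up_to: u64) {
        let n = up_to.saturating_sub(self.first_tx) as usize;
        let n = n.min(self.entries.len());
        self.entries.drain(..n);
        self.first_tx = self.first_tx.max(up_to);
    }

    // Replay entries from `from` until "now".
    fn replay(&self, from: u64) -> &[Vec<u8>] {
        let start = from.saturating_sub(self.first_tx) as usize;
        &self.entries[start.min(self.entries.len())..]
    }
}

fn main() {
    let mut log = TransactionLogStore { first_tx: 0, entries: Vec::new() };
    assert_eq!(log.append(b"tx0".to_vec()), 0);
    assert_eq!(log.append(b"tx1".to_vec()), 1);
    assert_eq!(log.append(b"tx2".to_vec()), 2);
    log.trim(1); // a snapshot now covers tx 0
    assert_eq!(log.replay(1).len(), 2); // tx1 and tx2 remain replayable
}
```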
- StorageEndpoint - This component offers a gRPC endpoint that fulfills three responsibilities. First, some of the gRPC APIs are used by other StorageNodes to request chunk data. Second, some of the gRPC APIs are used to request metadata snapshots. Lastly, the remaining gRPC APIs are used by client FUSE filesystems to perform read and write operations against our distributed filesystem, translating into metadata interactions (routed to StorageRaftMember) and data interactions (routed to FileStore). The FUSE gRPC APIs are routed to the FileSystemService component, which handles both read and write operations via its connections to StorageRaftMember, MetadataStore, and FileStore.
- StorageWatchdog - This component continuously monitors the availability and durability of all Files stored in the system by walking the MetadataStore and validating that all Files, Stripes, and Chunks are present and valid on the StorageNodes and Disks they have been assigned to. StorageWatchdog runs two kinds of checks: a shallow check, which is fast but only verifies that chunks are present, without validating their contents or the ability to reassemble the Stripe they belong to; and a deep check, which is more expensive because it uses chunk data to reassemble, read, and validate entire Stripes. StorageWatchdog needs read-only access to StorageEndpoint via special process-local APIs that avoid the need to make actual gRPC network calls. If and when StorageWatchdog finds an issue, either with the local StorageNode or a remote StorageNode, it submits a StorageConsistencyEvent to StorageRaftMember so it can decide how to handle recovering from the event (e.g. rebuild the Stripe, migrate Chunks, etc.).
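The shallow/deep distinction can be sketched per chunk. The `Chunk` shape and the toy checksum are invented for illustration; the real deep check reassembles whole Stripes rather than verifying chunks in isolation:

```rust
#[derive(Clone, Copy)]
enum Check {
    Shallow, // presence only: fast
    Deep,    // contents validated: expensive
}

struct Chunk {
    present: bool,
    data: Vec<u8>,
    checksum: u32,
}

// Toy rolling checksum standing in for a real content hash.
fn crc(data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |a, &b| a.wrapping_mul(31).wrapping_add(b as u32))
}

// Returns Some(description) when an inconsistency is found; the watchdog
// would wrap this in a StorageConsistencyEvent for StorageRaftMember.
fn check_chunk(chunk: &Chunk, kind: Check) -> Option<String> {
    if !chunk.present {
        return Some("chunk missing".into());
    }
    match kind {
        Check::Shallow => None, // presence is all a shallow check verifies
        Check::Deep => {
            (crc(&chunk.data) != chunk.checksum).then(|| "checksum mismatch".into())
        }
    }
}

fn main() {
    let good = Chunk { present: true, data: b"abc".to_vec(), checksum: crc(b"abc") };
    let corrupt = Chunk { present: true, data: b"abc".to_vec(), checksum: 0 };
    assert!(check_chunk(&good, Check::Deep).is_none());
    assert!(check_chunk(&corrupt, Check::Shallow).is_none()); // shallow misses it
    assert!(check_chunk(&corrupt, Check::Deep).is_some());    // deep catches it
}
```

The example shows why both tiers exist: a corrupt-but-present chunk passes the cheap shallow scan and is only caught by the deep scan.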
- MetricService - This component is used by all other components to publish metrics about their performance. For example, every time StorageRaftMember proposes a transaction it might publish a "TransactionProposal" event, which MetricService will aggregate into a "rate" metric (e.g. n per second, and a total since start-up). The types and units of metrics supported by MetricService are defined by the MetricType and MetricUnit enums respectively. The publish_metric(...) method on MetricService has several variants, but they all take three parameters: publish_metric(value: u64/f64, type: MetricType, unit: MetricUnit).
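A minimal sketch of the u64 variant of `publish_metric`, assuming hypothetical enum variants and tracking only the total-since-startup aggregate (a windowed per-second rate is omitted for brevity):

```rust
use std::collections::HashMap;

// Hypothetical variants; the real enums enumerate every published metric.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum MetricType {
    TransactionProposal,
    ChunkRead,
}

#[derive(Clone, Copy, Debug)]
enum MetricUnit {
    Count,
    Bytes,
}

#[derive(Default)]
struct Aggregate {
    total: u64, // total since startup
}

struct MetricService {
    metrics: HashMap<MetricType, Aggregate>,
}

impl MetricService {
    fn new() -> Self {
        Self { metrics: HashMap::new() }
    }

    // The u64 variant; an f64 variant would mirror this signature.
    fn publish_metric(&mut self, value: u64, ty: MetricType, _unit: MetricUnit) {
        self.metrics.entry(ty).or_default().total += value;
        // A real service would also update a windowed per-second rate here.
    }

    fn total(&self, ty: MetricType) -> u64 {
        self.metrics.get(&ty).map(|a| a.total).unwrap_or(0)
    }
}

fn main() {
    let mut metrics = MetricService::new();
    metrics.publish_metric(1, MetricType::TransactionProposal, MetricUnit::Count);
    metrics.publish_metric(1, MetricType::TransactionProposal, MetricUnit::Count);
    assert_eq!(metrics.total(MetricType::TransactionProposal), 2);
    assert_eq!(metrics.total(MetricType::ChunkRead), 0);
}
```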
- FileSystemService - This component exposes FUSE-compatible APIs that are required by clients running our FUSE filesystem client. FileSystemService interacts with StorageRaftMember to execute metadata write operations, MetadataStore directly for read operations, and FileStore for actual Chunk data read operations.
- WormValidator - This is a standalone binary that boots up an embedded single-node storage cluster and exercises its capabilities as if it were a wormfs FUSE client. We plan to use this heavily for integration testing, both during initial development and after. It focuses on simplicity, making it easy to reproducibly test features of the system without having to frequently reinstall multiple nodes in test VMs. After each implementation task, the validator should be updated to cover new/additional features. This is in addition to traditional Rust unit and integration tests.
- BufferedFileHandle - This is a helper component that is instantiated per FileHandle when a file is opened for writing. It buffers writes to data and metadata in order to (a) reduce I/O amplification via write coalescing and (b) provide read-your-own-writes consistency.
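Both properties can be shown in a small sketch. The in-memory `committed` vector stands in for data already persisted via FileStore, and the flat pending-write list is an assumption (a real buffer would merge overlapping ranges before flushing):

```rust
struct BufferedFileHandle {
    committed: Vec<u8>,            // stand-in for data already on the FileStore
    buffer: Vec<(usize, Vec<u8>)>, // pending (offset, bytes) writes
}

impl BufferedFileHandle {
    // Writes land in the buffer; flushing later coalesces the I/O.
    fn write(&mut self, offset: usize, bytes: &[u8]) {
        self.buffer.push((offset, bytes.to_vec()));
    }

    // Reads overlay pending writes on top of committed data, so the
    // handle always sees its own writes before they are flushed.
    fn read(&self, offset: usize, len: usize) -> Vec<u8> {
        let mut out = self.committed[offset..offset + len].to_vec();
        for (off, bytes) in &self.buffer {
            for (i, b) in bytes.iter().enumerate() {
                let pos = off + i;
                if pos >= offset && pos < offset + len {
                    out[pos - offset] = *b;
                }
            }
        }
        out
    }

    // Apply all buffered writes to the backing store in one pass.
    fn flush(&mut self) {
        for (off, bytes) in self.buffer.drain(..) {
            self.committed[off..off + bytes.len()].copy_from_slice(&bytes);
        }
    }
}

fn main() {
    let mut handle = BufferedFileHandle {
        committed: b"hello world".to_vec(),
        buffer: Vec::new(),
    };
    handle.write(6, b"WORLD");
    // Read-your-own-writes: visible before any flush happens.
    assert_eq!(handle.read(6, 5), b"WORLD".to_vec());
    assert_eq!(handle.committed, b"hello world".to_vec()); // not yet flushed
    handle.flush();
    assert_eq!(handle.committed, b"hello WORLD".to_vec());
}
```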