Commit 4ad2c1c

adds some details to storage docs (#103)
Signed-off-by: Ashraf Fouda <ashraf.m.fouda@gmail.com>
1 parent 21fab39

2 files changed: +192 -22 lines changed

Lines changed: 92 additions & 0 deletions
# Storage Light Module

## ZBus

The storage light module is available on zbus over the same channel as the full storage module:

| module | object | version |
|--------|--------|---------|
| storage | [storage](#interface) | 0.0.1 |
## Introduction

`storage_light` is a lightweight variant of the [storage module](../storage/readme.md). It implements the same `StorageModule` interface and provides identical functionality to consumers, but has enhanced device initialization logic designed for nodes with pre-partitioned disks.

Both modules are interchangeable at the zbus level; other modules access storage via the same `StorageModuleStub` regardless of which variant is running.
## Differences from Storage

The key difference is in the **device initialization** phase during boot. The standard storage module treats each whole disk as a single btrfs pool. The light variant adds:
### 1. Partition-Aware Initialization

Instead of requiring whole disks, `storage_light` can work with individual partitions:

- Detects if a disk is already partitioned (has child partitions)
- Scans for unallocated space on partitioned disks using `parted`
- Creates new partitions in free space (minimum 5 GiB) for btrfs pools
- Refreshes device info after partition table changes

This allows ZOS to coexist with other operating systems or PXE boot partitions on the same disk.
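The free-space scan above can be illustrated by parsing `parted`'s machine-readable output (`parted -m <device> unit B print free`). This is a sketch under stated assumptions: `FreeRegion` and `parseFreeRegions` are hypothetical names, not the module's actual API, and real `parted` output fields can vary between versions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// FreeRegion is a hypothetical representation of an unallocated disk region.
type FreeRegion struct {
	StartBytes, SizeBytes uint64
}

// minPartitionBytes is the 5 GiB minimum described above.
const minPartitionBytes = 5 << 30

// parseFreeRegions parses the machine-readable output of
// `parted -m <device> unit B print free` and keeps only free
// regions large enough to host a new btrfs partition.
func parseFreeRegions(out string) []FreeRegion {
	var regions []FreeRegion
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSuffix(strings.TrimSpace(line), ";")
		fields := strings.Split(line, ":")
		// free-space rows look like: 1:537919488B:107911200767B:107373281280B:free;
		if len(fields) < 5 || fields[4] != "free" {
			continue
		}
		start, err1 := strconv.ParseUint(strings.TrimSuffix(fields[1], "B"), 10, 64)
		size, err2 := strconv.ParseUint(strings.TrimSuffix(fields[3], "B"), 10, 64)
		if err1 != nil || err2 != nil || size < minPartitionBytes {
			continue
		}
		regions = append(regions, FreeRegion{StartBytes: start, SizeBytes: size})
	}
	return regions
}

func main() {
	sample := `BYT;
/dev/sda:500107862016B:scsi:512:4096:gpt:Example Disk;
1:1048576B:537919487B:536870912B:fat32:EFI:boot,esp;
1:537919488B:107911200767B:107373281280B:free;
2:107911200768B:108911200767B:1000000000B:free;`
	// only the ~100 GiB region survives the 5 GiB filter
	for _, r := range parseFreeRegions(sample) {
		fmt.Printf("free region at %d, size %d bytes\n", r.StartBytes, r.SizeBytes)
	}
}
```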
### 2. PXE Partition Detection

Partitions labeled `ZOSPXE` are automatically skipped during initialization. This prevents the storage module from claiming boot partitions used for PXE network booting.
### 3. Enhanced Device Manager

The filesystem subpackage in `storage_light` extends the device manager with:

- `Children []DeviceInfo` field on `DeviceInfo` to track child partitions
- `UUID` field for btrfs filesystem identification
- `IsPartitioned()` method to check if a disk has child partitions
- `IsPXEPartition()` method to detect PXE boot partitions
- `GetUnallocatedSpaces()` method using `parted` to find free disk space
- `AllocateEmptySpace()` method to create partitions in free space
- `RefreshDeviceInfo()` method to reload device info after changes
- `ClearCache()` on the device manager interface for refreshing the device list
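A minimal sketch of how these fields and methods could fit together. The struct shape is illustrative and far smaller than the real `DeviceInfo`; only the names follow the list above.

```go
package main

import "fmt"

// DeviceInfo is an illustrative subset of the device manager's
// view of a block device.
type DeviceInfo struct {
	Path     string
	Label    string
	UUID     string       // btrfs filesystem UUID, if formatted
	Children []DeviceInfo // child partitions, if any
}

// IsPartitioned reports whether the disk carries child partitions.
func (d *DeviceInfo) IsPartitioned() bool {
	return len(d.Children) > 0
}

// IsPXEPartition reports whether this partition is reserved for
// PXE booting and must be skipped during initialization.
func (d *DeviceInfo) IsPXEPartition() bool {
	return d.Label == "ZOSPXE"
}

func main() {
	disk := DeviceInfo{
		Path: "/dev/sda",
		Children: []DeviceInfo{
			{Path: "/dev/sda1", Label: "ZOSPXE"},
			{Path: "/dev/sda2", Label: "zos-pool"},
		},
	}
	fmt.Println(disk.IsPartitioned())              // true
	fmt.Println(disk.Children[0].IsPXEPartition()) // true
	fmt.Println(disk.Children[1].IsPXEPartition()) // false
}
```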
## Initialization Flow

The boot process is similar to the standard storage module but with added partition handling:

1. Load kernel parameters (detect VM, check MissingSSD)
2. Scan devices via the DeviceManager
3. For each device:
   - **If a whole disk (not partitioned)**: Create a btrfs pool on the entire device (same as standard)
   - **If partitioned**:
     - Skip partitions labeled `ZOSPXE`
     - Process existing partitions that have btrfs filesystems
     - Scan for unallocated space using `parted`
     - Create new partitions in free space >= 5 GiB
     - Create btrfs pools on new partitions
   - Mount the pool, detect the device type (SSD/HDD)
   - Add it to the SSD or HDD pool arrays
4. Ensure the cache exists (create if needed, start monitoring)
5. Shut down unused HDD pools
6. Start periodic disk power management
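The per-device branching in step 3 can be sketched as a pure planning function. The `partition` type and `planDevice` helper are hypothetical; the real module acts on its device manager's structures rather than returning action strings.

```go
package main

import "fmt"

// partition is a hypothetical view of one child partition.
type partition struct {
	Label    string
	HasBtrfs bool
}

// planDevice returns, for illustration, the actions step 3 above
// would take for one device: a whole disk becomes a single pool,
// while a partitioned disk is processed partition by partition.
func planDevice(partitions []partition) []string {
	if len(partitions) == 0 {
		return []string{"create btrfs pool on whole device"}
	}
	var plan []string
	for _, p := range partitions {
		switch {
		case p.Label == "ZOSPXE":
			plan = append(plan, "skip PXE partition")
		case p.HasBtrfs:
			plan = append(plan, "use existing btrfs partition")
		}
	}
	plan = append(plan, "scan free space; partition regions >= 5 GiB")
	return plan
}

func main() {
	fmt.Println(planDevice(nil))
	fmt.Println(planDevice([]partition{
		{Label: "ZOSPXE"},
		{Label: "zos-pool", HasBtrfs: true},
	}))
}
```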
## When to Use Storage Light

Use `storage_light` instead of `storage` when:

- The node has disks with existing partition tables that must be preserved
- PXE boot partitions exist on the same disks
- The node dual-boots or shares disks with other systems
- Disks have been partially allocated and have free space that should be used
## Architecture

The overall architecture (pool types, mount points, cache management, volume/disk/device operations) is identical to the [standard storage module](../storage/readme.md). Refer to that document for details on:

- Pool organization (SSD vs HDD)
- Storage primitives (subvolumes, vdisks, devices)
- Cache management and auto-sizing
- Pool selection policies
- Error handling and broken device tracking
- Thread safety
- The `StorageModule` interface definition
## Interface

Same as the [standard storage module](../storage/readme.md#interface). Both variants implement the same `StorageModule` interface defined in `pkg/storage.go`.

docs/internals/storage/readme.md

Lines changed: 100 additions & 22 deletions

## Introduction

This module is responsible for managing everything related to storage. On start, storaged takes ownership of all node disks and separates them into two sets:
- **SSD pools**: One btrfs pool per SSD disk. Used for subvolumes (read-write layers), virtual disks (VM storage), and system cache.
- **HDD pools**: One btrfs pool per HDD disk. Used exclusively for 0-DB device allocation.

The module provides three storage primitives:

- **Subvolume** (with quota): A btrfs subvolume used by `flistd` to support read-write operations on flists. Used as rootfs for containers and VMs. Only created on SSD pools.
  - On boot, a permanent subvolume `zos-cache` is always created (starting at 5 GiB) and bind-mounted at `/var/cache`. This volume holds system state and downloaded file caches.
- **VDisk** (virtual disk): A sparse file with Copy-on-Write disabled (`FS_NOCOW_FL`), used as block storage for virtual machines. Only created on SSD pools inside a `vdisks` subvolume.
- **Device**: A btrfs subvolume named `zdb` inside an HDD pool, allocated to a single 0-DB service. One 0-DB instance can serve multiple namespaces for multiple users. Only created on HDD pools.

ZOS can operate without HDDs (it will not serve ZDB workloads), but not without SSDs. A node with no SSD will never register on the grid.
## Architecture

### Pool Organization

```
Physical Disk (SSD)                  Physical Disk (HDD)
        |                                    |
        v                                    v
btrfs pool (mounted at               btrfs pool (mounted at
  /mnt/<label>)                        /mnt/<label>)
        |                                    |
        +-- zos-cache (subvolume)            +-- zdb (subvolume -> 0-DB device)
        +-- <workload> (subvolume)
        +-- vdisks/ (subvolume)
              +-- <vm-disk> (sparse file)
```
### Device Type Detection

The module determines whether a disk is SSD or HDD using:

1. A `.seektime` file persisted at the pool root (survives reboots)
2. A fallback to the `seektime` tool or the device's rotational flag from `lsblk`
### Mount Points

| Resource | Path |
|----------|------|
| Pools | `/mnt/<pool-label>` |
| Cache | `/var/cache` (bind mount to the `zos-cache` subvolume) |
| Volumes | `/mnt/<pool-label>/<volume-name>` |
| VDisks | `/mnt/<pool-label>/vdisks/<disk-id>` |
| Devices (0-DB) | `/mnt/<pool-label>/zdb` |
## On Node Booting

When the module boots, it:

1. Scans all available block devices using `lsblk`
2. For each device not already used by a pool, creates a new btrfs filesystem (all pools use the `RaidSingle` policy)
3. Mounts all available pools
4. Detects the device type (SSD/HDD) for each pool
5. Ensures a cache subvolume exists. If none is found, creates one on an SSD pool and bind-mounts it at `/var/cache`. Falls back to tmpfs if no SSD is available (sets the `LimitedCache` flag)
6. Starts the cache monitoring goroutine (checks every 5 minutes, auto-grows at 60% utilization, shrinks below 20%)
7. Shuts down and spins down unused HDD pools to save power
8. Starts periodic disk power management
4173
### zinit unit
4274

43-
The zinit unit file of the module specify the command line, test command, and the order where the services need to be booted.
75+
The zinit unit file specifies the command line, test command, and boot ordering.
4476

45-
Storage module is a dependency for almost all other system modules, hence it has high boot presidency (calculated on boot) by zinit based on the configuration.
77+
Storage module is a dependency for almost all other system modules, hence it has high boot precedence (calculated on boot) by zinit based on the configuration.
4678

47-
The storage module is only considered running, if (and only if) the /var/cache is ready
79+
The storage module is only considered running if (and only if) `/var/cache` is ready:
4880

4981
```yaml
5082
exec: storaged
5183
test: mountpoint /var/cache
5284
```
5385
## Cache Management
The system cache is a special btrfs subvolume (`zos-cache`) that stores persistent system state and downloaded files.

| Parameter | Value |
|-----------|-------|
| Initial size | 5 GiB |
| Check interval | 5 minutes |
| Grow threshold | 60% utilization |
| Shrink threshold | 20% utilization |
| Fallback | tmpfs (if no SSD available) |
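The grow/shrink rule can be expressed as a small decision function. The thresholds come from the table above; `nextCacheAction` and its types are illustrative names, not the monitor's actual API.

```go
package main

import "fmt"

// cacheAction is an illustrative decision for the cache monitor loop.
type cacheAction int

const (
	cacheNoop cacheAction = iota
	cacheGrow
	cacheShrink
)

// nextCacheAction applies the thresholds from the table above:
// grow at 60% utilization or more, shrink below 20%.
func nextCacheAction(usedBytes, sizeBytes uint64) cacheAction {
	if sizeBytes == 0 {
		return cacheNoop
	}
	util := float64(usedBytes) / float64(sizeBytes)
	switch {
	case util >= 0.60:
		return cacheGrow
	case util < 0.20:
		return cacheShrink
	default:
		return cacheNoop
	}
}

func main() {
	const gib = 1 << 30
	fmt.Println(nextCacheAction(4*gib, 5*gib) == cacheGrow)   // 80% utilization -> grow
	fmt.Println(nextCacheAction(1*gib, 5*gib) == cacheNoop)   // exactly 20% -> keep as-is
	fmt.Println(nextCacheAction(gib/2, 5*gib) == cacheShrink) // 10% -> shrink
}
```

In the real module this check runs on the 5-minute monitoring tick; the sketch only captures the threshold logic.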
## Pool Selection Policies

When creating volumes or disks, the module selects a pool using one of these policies:

- **SSD Only**: Only considers SSD pools (used for volumes and vdisks)
- **HDD Only**: Only considers HDD pools (used for 0-DB device allocation)
- **SSD First**: Prefers SSD pools, falls back to HDD

Mounted pools are always prioritized over unmounted ones to avoid unnecessary spin-ups.
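One plausible ordering under the SSD First policy, sketched with illustrative types (the real module selects among btrfs pool objects, not this toy struct):

```go
package main

import (
	"fmt"
	"sort"
)

type poolType int

const (
	ssd poolType = iota
	hdd
)

// pool is an illustrative stand-in for a btrfs pool.
type pool struct {
	Name    string
	Type    poolType
	Mounted bool
}

// pickPool sketches the "SSD First" policy described above:
// mounted pools are preferred over unmounted ones (to avoid
// unnecessary spin-ups), and SSD pools over HDD pools.
func pickPool(pools []pool) (pool, bool) {
	if len(pools) == 0 {
		return pool{}, false
	}
	sorted := append([]pool(nil), pools...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Mounted != sorted[j].Mounted {
			return sorted[i].Mounted // mounted first
		}
		return sorted[i].Type < sorted[j].Type // SSD before HDD
	})
	return sorted[0], true
}

func main() {
	p, _ := pickPool([]pool{
		{Name: "hdd-1", Type: hdd, Mounted: true},
		{Name: "ssd-1", Type: ssd, Mounted: false},
		{Name: "ssd-2", Type: ssd, Mounted: true},
	})
	fmt.Println(p.Name) // ssd-2
}
```

How mounted state and pool type trade off when they conflict is an assumption of this sketch; the doc only states both preferences.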
## Error Handling

The module tracks two categories of failures:

- **Broken Pools**: Pools that fail to mount. Tracked and reported via `BrokenPools()`.
- **Broken Devices**: Devices that fail formatting, mounting, or type detection. Tracked and reported via `BrokenDevices()`.

These are exposed through the interface for monitoring and diagnostics.
## Thread Safety

All pool and volume operations are protected by a `sync.RWMutex`. Concurrent reads (lookups, listings) are allowed, while writes (create, delete, resize) are serialized.
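The locking pattern can be sketched with a toy store guarded the same way (illustrative types; not the module's actual struct):

```go
package main

import (
	"fmt"
	"sync"
)

// volumeStore illustrates the RWMutex pattern: concurrent lookups,
// serialized mutations.
type volumeStore struct {
	mu      sync.RWMutex
	volumes map[string]uint64 // name -> size in bytes
}

func newVolumeStore() *volumeStore {
	return &volumeStore{volumes: make(map[string]uint64)}
}

// Lookup takes the read lock, so many readers may proceed in parallel.
func (s *volumeStore) Lookup(name string) (uint64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	size, ok := s.volumes[name]
	return size, ok
}

// Create takes the write lock, excluding all readers and other writers.
func (s *volumeStore) Create(name string, size uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.volumes[name] = size
}

func main() {
	s := newVolumeStore()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Create(fmt.Sprintf("vol-%d", i), 1<<30)
		}(i)
	}
	wg.Wait()
	size, ok := s.Lookup("vol-3")
	fmt.Println(ok, size)
}
```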
## Consumers

Other modules access storage via zbus stubs:

| Consumer | Operations Used |
|----------|----------------|
| VM provisioner (`pkg/primitives/vm/`) | DiskCreate, DiskFormat, DiskWrite, DiskDelete |
| Volume provisioner (`pkg/primitives/volume/`) | VolumeCreate, VolumeDelete, VolumeLookup |
| ZMount provisioner (`pkg/primitives/zmount/`) | VolumeCreate, VolumeUpdate, VolumeDelete |
| ZDB provisioner (`pkg/primitives/zdb/`) | DeviceAllocate, DeviceLookup |
| Capacity oracle (`pkg/capacity/`) | Total, Metrics |
## Interface

```go
// StorageModule is the storage subsystem interface
// this should allow you to work with the following types of storage medium
// - full disks (device) (these are used by zdb)
```
