| 1 | +# Repository storage internals |
| 2 | + |
| 3 | +This guide explains how `StructureData` stores and loads array properties, which |
| 4 | +optimisations are already available to developers, and what limitations the |
| 5 | +current format has, together with suggested future improvements. |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Storage layout |
| 10 | + |
| 11 | +When a `StructureData` node is stored, its data is split across two backends: |
| 12 | + |
| 13 | +| Data | Backend | Queryable via `QueryBuilder`? | |
| 14 | +| --- | --- | --- | |
| 15 | +| Scalars and summary statistics (`cell`, `formula`, `n_sites`, …) | **Database attributes** | ✅ Yes | |
| 16 | +| Array properties (`positions`, `charges`, `symbols`, …) | **AiiDA repository** (`.npz` file) | ❌ No | |
| 17 | + |
| 18 | +The single repository file is called `properties.npz` (controlled by |
| 19 | +`StructureData._properties_filename`). It is a standard ZIP archive in which |
| 20 | +every property is stored as an independent `.npy` entry, one entry per |
| 21 | +property name (`site_indices` is the exception and is split into two entries). |
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +## The `.npz` format in detail |
| 26 | + |
| 27 | +A `.npz` file created by `numpy.savez_compressed` is a **ZIP archive** with |
| 28 | +`zipfile.ZIP_DEFLATED` (DEFLATE) compression. Each member of the archive is |
| 29 | +an independent `.npy` binary file containing exactly **one** numpy array. |
| 30 | + |
| 31 | +```text |
| 32 | +properties.npz (ZIP archive) |
| 33 | +├── charges.npy |
| 34 | +├── positions.npy |
| 35 | +├── site_indices_flat.npy ← CSR part 1 |
| 36 | +├── site_indices_offsets.npy ← CSR part 2 |
| 37 | +└── symbols.npy |
| 38 | +``` |
| 39 | + |
| 40 | +Keys are stored in **sorted alphabetical order** to ensure a deterministic |
| 41 | +binary output, which is required for AiiDA's content-addressable file hashing. |
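The layout described above can be inspected with the standard library alone. The following is a minimal sketch, not the `StructureData` implementation: the property names and values are made up for illustration, and the archive is written to an in-memory buffer instead of the AiiDA repository.

```python
import io
import zipfile

import numpy as np

# Hypothetical example properties (not taken from a real node).
properties = {
    'symbols': np.array(['Si', 'Si', 'O']),
    'positions': np.zeros((3, 3)),
    'charges': np.array([0.0, 0.0, -1.0]),
}

# Write the entries in sorted key order, as described in the text.
buffer = io.BytesIO()
np.savez_compressed(buffer, **{key: properties[key] for key in sorted(properties)})

# Re-open the result as an ordinary ZIP archive and look at its members.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    member_names = archive.namelist()
    compression_types = {info.compress_type for info in archive.infolist()}

print(member_names)
print(compression_types == {zipfile.ZIP_DEFLATED})
```

Each key becomes one `<key>.npy` member, and every member reports DEFLATE as its compression method.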
| 42 | + |
| 43 | +### Special case: `site_indices` (CSR encoding) |
| 44 | + |
| 45 | +`site_indices` is a ragged list-of-lists (one sub-list per kind, with a |
| 46 | +variable number of site indices). A homogeneous numpy array cannot represent |
| 47 | +it directly. It is therefore encoded using |
| 48 | +[CSR (Compressed Sparse Row)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)) |
| 49 | +format as two flat 1-D `int64` arrays: |
| 50 | + |
| 51 | +| Key in `.npz` | Shape | Content | |
| 52 | +| --- | --- | --- | |
| 53 | +| `site_indices_flat` | `(total_sites,)` | All indices concatenated | |
| 54 | +| `site_indices_offsets` | `(n_kinds + 1,)` | Start offset of each kind's slice in `site_indices_flat`, plus a final entry equal to `total_sites` | |
| 55 | + |
| 56 | +**Example** — 3 kinds with 2, 1, and 3 sites respectively: |
| 57 | + |
| 58 | +```text |
| 59 | +site_indices = [[0, 1], [2], [3, 4, 5]] |
| 60 | + |
| 61 | +→ site_indices_flat = [0, 1, 2, 3, 4, 5] |
| 62 | +→ site_indices_offsets = [0, 2, 3, 6] |
| 63 | +``` |
| 64 | + |
| 65 | +Decoding: `site_indices[i] = flat[offsets[i] : offsets[i+1]]` |
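The encoding and decoding rules above can be sketched in a few lines of numpy. The helper names `encode_site_indices` and `decode_site_indices` are hypothetical and chosen for this illustration only; the actual `StructureData` code may be organised differently.

```python
import numpy as np


def encode_site_indices(site_indices):
    """Encode a ragged list of lists as two flat int64 arrays (flat, offsets)."""
    lengths = np.asarray([len(sub) for sub in site_indices], dtype=np.int64)
    flat = np.array([index for sub in site_indices for index in sub], dtype=np.int64)
    # offsets[0] is 0; offsets[i + 1] is the running total of sub-list lengths.
    offsets = np.zeros(len(site_indices) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum(lengths)
    return flat, offsets


def decode_site_indices(flat, offsets):
    """Rebuild the ragged list: site_indices[i] = flat[offsets[i]:offsets[i + 1]]."""
    return [flat[offsets[i]:offsets[i + 1]].tolist() for i in range(len(offsets) - 1)]


site_indices = [[0, 1], [2], [3, 4, 5]]
flat, offsets = encode_site_indices(site_indices)
decoded = decode_site_indices(flat, offsets)
```

Because both arrays are homogeneous 1-D `int64` arrays, they can be written to the `.npz` archive like any other property.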
| 66 | + |
| 67 | +--- |
| 68 | + |
| 69 | +## Selective (per-property) loading |
| 70 | + |
| 71 | +`_load_properties_from_npz` accepts an optional `keys` parameter that limits |
| 72 | +which properties are decompressed: |
| 73 | + |
| 74 | +```python |
| 75 | +# Load only positions — charges, symbols, … are never touched |
| 76 | +positions = node._load_properties_from_npz(keys=['positions'])['positions'] |
| 77 | + |
| 78 | +# Load two properties at once |
| 79 | +data = node._load_properties_from_npz(keys=['positions', 'symbols']) |
| 80 | + |
| 81 | +# site_indices — the CSR pair is expanded automatically |
| 82 | +data = node._load_properties_from_npz(keys=['site_indices']) |
| 83 | +``` |
| 84 | + |
| 85 | +**Why this is efficient:** indexing the `NpzFile` returned by `numpy.load` |
| 86 | +opens only the requested member (`<key>.npy`) via `zipfile.ZipFile.open`, which |
| 87 | +locates the entry through the ZIP central directory. It does **not** decompress any other entry. |
| 88 | +Accessing `positions` in a file that also contains `charges`, `magmoms`, and |
| 89 | +`symbols` only decompresses `positions.npy`. |
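The per-key behaviour can be illustrated outside AiiDA with plain numpy. This is a simplified sketch under stated assumptions: `load_selected` is a hypothetical stand-in for `_load_properties_from_npz`, it reads from an in-memory buffer, and it does not handle the `site_indices` CSR pair.

```python
import io

import numpy as np


def load_selected(fileobj, keys=None):
    """Return {key: array} for the requested keys (all keys when keys is None)."""
    with np.load(fileobj) as archive:
        wanted = archive.files if keys is None else keys
        # Each archive[key] access reads and decompresses only that member.
        return {key: archive[key] for key in wanted}


# Build a small example archive with made-up data.
buffer = io.BytesIO()
np.savez_compressed(
    buffer,
    charges=np.array([0.0, 0.0, -1.0]),
    positions=np.arange(9, dtype=np.float64).reshape(3, 3),
)

buffer.seek(0)
only_positions = load_selected(buffer, keys=['positions'])

buffer.seek(0)
everything = load_selected(buffer)
```

Passing a list of keys keeps the returned dictionary limited to those entries, so unrelated arrays are never materialised in memory.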
| 90 | + |
| 91 | +:::{note} |
| 92 | +The full load (no `keys` argument) is cached on the node object after the |
| 93 | +first call, so repeated access to `.properties` does not re-read the file. |
| 94 | +::: |
| 95 | + |
| 96 | +--- |
| 97 | + |
| 98 | +## Known limitation — no sliced or partial row access |
| 99 | + |
| 100 | +It is **not** currently possible to read a subset of rows from a property |
| 101 | +array (e.g. the positions of only the first 100 atoms out of 1 000 000). |
| 102 | + |
| 103 | +**Root cause:** DEFLATE is a streaming compression codec. To decompress byte |
| 104 | +offset *N* inside a compressed stream, every byte from offset 0 to *N* must |
| 105 | +be processed first. There is no random-access point in the middle of a |
| 106 | +compressed ZIP entry. |
| 107 | + |
| 108 | +`numpy.load(..., mmap_mode='r')` does **not** help here — `mmap_mode` is |
| 109 | +silently ignored for `.npz` files. From the numpy source (`npyio.py`): |
| 110 | + |
| 111 | +```python |
| 112 | +if magic.startswith(_ZIP_PREFIX) or magic.startswith(_ZIP_SUFFIX): |
| 113 | + # zip-file (assume .npz) |
| 114 | + ret = NpzFile(fid, ...) |
| 115 | + return ret # mmap_mode is never consulted |
| 116 | +elif magic == format.MAGIC_PREFIX: |
| 117 | + # .npy file |
| 118 | + if mmap_mode: # only reached for bare .npy files |
| 119 | + return format.open_memmap(file, mode=mmap_mode, ...) |
| 120 | +``` |
| 121 | + |
| 122 | +`mmap_mode` only works when loading a **bare `.npy` file from disk** (not from |
| 123 | +a stream and not from inside a ZIP). |
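The difference can be checked directly. The sketch below writes the same made-up array once as a compressed `.npz` and once as a bare `.npy` in a temporary directory, then loads both with `mmap_mode='r'`; the file names are chosen for illustration only.

```python
import os
import tempfile

import numpy as np

positions = np.arange(30, dtype=np.float64).reshape(10, 3)

with tempfile.TemporaryDirectory() as tmpdir:
    npz_path = os.path.join(tmpdir, 'properties.npz')
    npy_path = os.path.join(tmpdir, 'positions.npy')
    np.savez_compressed(npz_path, positions=positions)
    np.save(npy_path, positions)

    # .npz: mmap_mode is accepted but has no effect; the entry is fully
    # decompressed into an ordinary in-memory ndarray.
    with np.load(npz_path, mmap_mode='r') as archive:
        from_npz = archive['positions']
    npz_is_memmap = isinstance(from_npz, np.memmap)

    # Bare .npy on disk: the array is memory-mapped, so slicing reads only
    # the bytes that back the requested rows.
    from_npy = np.load(npy_path, mmap_mode='r')
    npy_is_memmap = isinstance(from_npy, np.memmap)
    first_rows = np.array(from_npy[:2])  # copy the slice out of the mapping
    del from_npy  # release the mapping before the directory is removed

print(npz_is_memmap, npy_is_memmap)
```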
| 124 | + |
| 125 | +--- |
| 126 | + |
| 127 | +## Future improvements |
| 128 | + |
| 129 | +Two approaches would enable true random / sliced access: |
| 130 | + |
| 131 | +### Option A — One `.npy` object per property |
| 132 | + |
| 133 | +Store each array as a separate AiiDA repository object, e.g.: |
| 134 | + |
| 135 | +```python |
| 136 | +node.base.repository.put_object_from_filelike(positions_buf, 'positions.npy') |
| 137 | +node.base.repository.put_object_from_filelike(charges_buf, 'charges.npy') |
| 138 | +# … one call per property; each buffer holds one serialised array |
| 139 | +``` |
| 140 | + |
| 141 | +A `.npy` file holds exactly **one** array, so N properties → N repository |
| 142 | +objects (instead of the current single `properties.npz`). |
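One way the per-property buffers could be produced is sketched below. It is an illustration under assumptions, not a proposed implementation: `serialise_npy` is a hypothetical helper, the property values are made up, and the AiiDA call itself is left out so that the snippet runs with numpy alone.

```python
import io

import numpy as np


def serialise_npy(array):
    """Return a BytesIO, rewound to position 0, holding one array in .npy format."""
    buffer = io.BytesIO()
    np.save(buffer, array, allow_pickle=False)
    buffer.seek(0)
    return buffer


# Hypothetical example properties (not taken from a real node).
properties = {
    'positions': np.arange(9, dtype=np.float64).reshape(3, 3),
    'charges': np.array([0.0, 0.0, -1.0]),
}

# One independent buffer per property, keyed by its repository object name.
npy_objects = {f'{name}.npy': serialise_npy(array) for name, array in sorted(properties.items())}
```

Each value in `npy_objects` could then be handed to `put_object_from_filelike` together with its key as the object name.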
| 143 | + |
| 144 | +**Advantages:** |
| 145 | + |
| 146 | +- Uncompressed `.npy` files that are reachable through a real file-system path |
| 147 | + support `np.load(..., mmap_mode='r')`, which memory-maps the array and allows |
| 148 | + zero-copy slicing of any row range without loading the full file into RAM. |
| 149 | + |
| 150 | +**Disadvantages:** |
| 151 | + |
| 152 | +- No compression → larger on-disk footprint. |
| 153 | +- N repository objects instead of 1 → more file-system entries and more AiiDA |
| 154 | + metadata overhead. |
| 155 | + |
| 156 | +### Option B — Zarr or HDF5 (recommended for large structures) |
| 157 | + |
| 158 | +Replace `properties.npz` with a **Zarr** store or an **HDF5** file (via |
| 159 | +`h5py`). Both formats: |
| 160 | + |
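| 161 | +- Pack **multiple arrays into one store** (HDF5 is a single file; Zarr is a directory of chunk files by default and needs a ZIP-backed store to be a single file). |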
| 162 | +- Use **chunked storage** with selectable compression (gzip, Blosc, …). |
| 163 | +- Support **direct chunk-level random access** — reading row range |
| 164 | + `[i:j]` only decompresses the chunks that overlap that range. |
| 165 | + |
| 166 | +This gives the best of both worlds: a single repository object, compression, |
| 167 | +and efficient partial reads. |
| 168 | + |
| 169 | +| Format | Single file | Compressed | Sliced access | |
| 170 | +| --- | --- | --- | --- | |
| 171 | +| Current `.npz` | ✅ | ✅ DEFLATE | ❌ streaming only | |
| 172 | +| Per-property `.npy` | ❌ (N files) | ❌ | ✅ via `mmap_mode` | |
| 173 | +| Zarr / HDF5 | ✅ (Zarr: with a ZIP-backed store) | ✅ chunked | ✅ chunk-level | |
| 174 | + |
| 175 | +--- |