Skip to content

Commit 39107fb

Browse files
authored
adding slicing of structures and allowing single property loading from repo (#88)
1 parent ac0e115 commit 39107fb

File tree

5 files changed

+643
-13
lines changed

5 files changed

+643
-13
lines changed
Lines changed: 175 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,175 @@
1+
# Repository storage internals
2+
3+
This guide explains how `StructureData` stores and loads array properties, what
4+
optimisations are already available to developers, and what the current format
5+
limitations are together with suggested future improvements.
6+
7+
---
8+
9+
## Storage layout
10+
11+
When a `StructureData` node is stored, its data is split across two backends:
12+
13+
| Data | Backend | Queryable via `QueryBuilder`? |
14+
| --- | --- | --- |
15+
| Scalars and summary statistics (`cell`, `formula`, `n_sites`, …) | **Database attributes** | ✅ Yes |
16+
| Array properties (`positions`, `charges`, `symbols`, …) | **AiiDA repository** (`.npz` file) | ❌ No |
17+
18+
The single repository file is called `properties.npz` (controlled by
19+
`StructureData._properties_filename`). It is a standard ZIP archive where
20+
every property is stored as an independent `.npy` entry — one entry per
21+
property name.
22+
23+
---
24+
25+
## The `.npz` format in detail
26+
27+
A `.npz` file created by `numpy.savez_compressed` is a **ZIP archive** with
28+
`zipfile.ZIP_DEFLATED` (DEFLATE) compression. Each member of the archive is
29+
an independent `.npy` binary file containing exactly **one** numpy array.
30+
31+
```text
32+
properties.npz (ZIP archive)
33+
├── charges.npy
34+
├── positions.npy
35+
├── site_indices_flat.npy ← CSR part 1
36+
├── site_indices_offsets.npy ← CSR part 2
37+
└── symbols.npy
38+
```
39+
40+
Keys are stored in **sorted alphabetical order** to ensure a deterministic
41+
binary output, which is required for AiiDA's content-addressable file hashing.
42+
43+
### Special case: `site_indices` (CSR encoding)
44+
45+
`site_indices` is a ragged list-of-lists (one sub-list per kind, with a
46+
variable number of site indices). A homogeneous numpy array cannot represent
47+
it directly. It is therefore encoded using
48+
[CSR (Compressed Sparse Row)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format))
49+
format as two flat 1-D `int64` arrays:
50+
51+
| Key in `.npz` | Shape | Content |
52+
| --- | --- | --- |
53+
| `site_indices_flat` | `(total_sites,)` | All indices concatenated |
54+
| `site_indices_offsets` | `(n_kinds + 1,)` | Cumulative start positions |
55+
56+
**Example** — 3 kinds with 2, 1, and 3 sites respectively:
57+
58+
```text
59+
site_indices = [[0, 1], [2], [3, 4, 5]]
60+
61+
→ site_indices_flat = [0, 1, 2, 3, 4, 5]
62+
→ site_indices_offsets = [0, 2, 3, 6]
63+
```
64+
65+
Decoding: `site_indices[i] = flat[offsets[i] : offsets[i+1]]`
66+
67+
---
68+
69+
## Selective (per-property) loading
70+
71+
`_load_properties_from_npz` accepts an optional `keys` parameter that limits
72+
which properties are decompressed:
73+
74+
```python
75+
# Load only positions — charges, symbols, … are never touched
76+
positions = node._load_properties_from_npz(keys=['positions'])['positions']
77+
78+
# Load two properties at once
79+
data = node._load_properties_from_npz(keys=['positions', 'symbols'])
80+
81+
# site_indices — the CSR pair is expanded automatically
82+
data = node._load_properties_from_npz(keys=['site_indices'])
83+
```
84+
85+
**Why this is efficient:** `numpy.NpzFile.__getitem__` calls
86+
`zipfile.ZipFile.open(key)` which seeks directly to the requested entry using
87+
the ZIP central directory. It does **not** decompress any other entry.
88+
Accessing `positions` in a file that also contains `charges`, `magmoms`, and
89+
`symbols` only decompresses `positions.npy`.
90+
91+
:::{note}
92+
The full load (no `keys` argument) is cached on the node object after the
93+
first call, so repeated access to `.properties` does not re-read the file.
94+
:::
95+
96+
---
97+
98+
## Known limitation — no sliced or partial row access
99+
100+
It is **not** currently possible to read a subset of rows from a property
101+
array (e.g. the positions of only the first 100 atoms out of 1 000 000).
102+
103+
**Root cause:** DEFLATE is a streaming compression codec. To decompress byte
104+
offset *N* inside a compressed stream, every byte from offset 0 to *N* must
105+
be processed first. There is no random-access point in the middle of a
106+
compressed ZIP entry.
107+
108+
`numpy.load(..., mmap_mode='r')` does **not** help here — `mmap_mode` is
109+
silently ignored for `.npz` files. From the numpy source (`npyio.py`):
110+
111+
```python
112+
if magic.startswith(_ZIP_PREFIX) or magic.startswith(_ZIP_SUFFIX):
113+
# zip-file (assume .npz)
114+
ret = NpzFile(fid, ...)
115+
return ret # mmap_mode is never consulted
116+
elif magic == format.MAGIC_PREFIX:
117+
# .npy file
118+
if mmap_mode: # only reached for bare .npy files
119+
return format.open_memmap(file, mode=mmap_mode, ...)
120+
```
121+
122+
`mmap_mode` only works when loading a **bare `.npy` file from disk** (not from
123+
a stream and not from inside a ZIP).
124+
125+
---
126+
127+
## Future improvements
128+
129+
Two approaches would enable true random / sliced access:
130+
131+
### Option A — One `.npy` object per property
132+
133+
Store each array as a separate AiiDA repository object, e.g.:
134+
135+
```python
136+
node.base.repository.put_object_from_filelike(buf, 'positions.npy')
137+
node.base.repository.put_object_from_filelike(buf, 'charges.npy')
138+
# … one call per property
139+
```
140+
141+
A `.npy` file holds exactly **one** array, so N properties → N repository
142+
objects (instead of the current single `properties.npz`).
143+
144+
**Advantages:**
145+
146+
- Files stored directly on disk support `np.load(..., mmap_mode='r')`, which
147+
memory-maps the array and allows zero-copy slicing of any row range without
148+
loading the full file into RAM.
149+
150+
**Disadvantages:**
151+
152+
- No compression → larger on-disk footprint.
153+
- N repository objects instead of 1 → more file-system entries and more AiiDA
154+
metadata overhead.
155+
156+
### Option B — Zarr or HDF5 (recommended for large structures)
157+
158+
Replace `properties.npz` with a **Zarr** store or an **HDF5** file (via
159+
`h5py`). Both formats:
160+
161+
- Pack **multiple arrays into one file**.
162+
- Use **chunked storage** with selectable compression (gzip, Blosc, …).
163+
- Support **direct chunk-level random access** — reading row range
164+
`[i:j]` only decompresses the chunks that overlap that range.
165+
166+
This gives the best of both worlds: a single repository object, compression,
167+
and efficient partial reads.
168+
169+
| Format | Single file | Compressed | Sliced access |
170+
| --- | --- | --- | --- |
171+
| Current `.npz` || ✅ DEFLATE | ❌ streaming only |
172+
| Per-property `.npy` | ❌ (N files) || ✅ via `mmap_mode` |
173+
| Zarr / HDF5 || ✅ chunked | ✅ chunk-level |
174+
175+
---

docs/source/dev_guides/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,4 +7,5 @@ Guides for developers who want to contribute to `aiida-atomistic` code or to mod
77
88
dev_adding_properties
99
dev_plugin_migration
10+
dev_repository_storage
1011
```

docs/source/how_to/creation_mutation.md

Lines changed: 93 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -527,6 +527,96 @@ structure.to_file('my_structure.cif')
527527
structure.to_file('my_structure.xyz')
528528
```
529529

530+
## Slicing structures
531+
532+
Both `StructureData` and `StructureBuilder` support Python-style indexing to
533+
extract a subset of sites into a new structure.
534+
535+
:::{important}
536+
The return type depends on the class you slice:
537+
538+
- **`StructureBuilder[…]`** → a new `StructureBuilder` (mutable, can be edited
539+
further before storing).
540+
- **`StructureData[…]`** → a new, **unstored** `StructureData` node. Call
541+
`.store()` when you are ready to persist the result.
542+
:::
543+
544+
All global properties — `cell`, `pbc`, `tot_charge`, `tot_magnetization`,
545+
`custom`, etc. — are **preserved** in the result unchanged. Derived per-site
546+
quantities such as `formula` and `n_sites` are recomputed automatically from the
547+
new site list.
548+
549+
### Supported index types
550+
551+
| Index type | Example | Description |
552+
| --- | --- | --- |
553+
| `int` | `s[0]`, `s[-1]` | Single site (negative indices supported) |
554+
| `slice` | `s[10:20]`, `s[::2]` | Contiguous or strided range |
555+
| `list` / `tuple` of ints | `s[[0, 5, 12]]` | Arbitrary, possibly non-contiguous selection |
556+
| 1-D numpy integer array | `s[np.array([0, 5, 12])]` | Same as list |
557+
558+
### Examples
559+
560+
```python
561+
from aiida_atomistic.data.structure import StructureData, StructureBuilder
562+
import numpy as np
563+
564+
# --- build a small 6-site structure ---
565+
sites = [
566+
{"symbol": "Fe", "position": [0.0, 0.0, float(i)], "charge": float(i)}
567+
for i in range(6)
568+
]
569+
builder = StructureBuilder(
570+
cell=np.eye(3) * 10.0,
571+
pbc=[True, True, True],
572+
sites=sites,
573+
)
574+
575+
# int — one site
576+
one = builder[0]
577+
print(one.properties.formula) # Fe
578+
579+
# slice — first three sites
580+
first3 = builder[0:3]
581+
print(first3.properties.formula) # Fe3
582+
583+
# slice with stride — every other site
584+
even = builder[::2]
585+
print(len(even)) # 3
586+
587+
# list — arbitrary selection
588+
subset = builder[[0, 2, 5]]
589+
print(subset.properties.formula) # Fe3
590+
591+
# negative index
592+
last = builder[-1]
593+
print(last.properties.sites[0].position) # [0. 0. 5.]
594+
```
595+
596+
### Slicing a stored `StructureData` node
597+
598+
`StructureData` supports exactly the same indexing syntax and returns a new,
599+
**unstored** `StructureData` node each time. The original node is never modified.
600+
601+
```python
602+
# Assume `node` is a stored StructureData retrieved from the database
603+
node = load_node(pk=42)
604+
605+
# Extract the first 100 sites and store as a new node
606+
sub = node[0:100]
607+
sub.store()
608+
609+
# Arbitrary selection from a mask
610+
mask = np.where(np.array(node.properties.charges) > 0.5)[0]
611+
charged = node[mask]
612+
charged.store()
613+
```
614+
615+
:::{note}
616+
`len(node)` and `len(builder)` return the number of sites, consistent with the
617+
indexing behaviour.
618+
:::
619+
530620
## Complete Example: Building a Complex Structure
531621

532622
Here's a complete workflow showing structure creation and modification:
@@ -572,7 +662,8 @@ print(f"Total charge: {np.sum(final_structure.properties.charges)}")
572662
```
573663

574664
**Output:**
575-
```
665+
666+
```text
576667
Structure: Fe2O2
577668
Kinds: {'Fe1', 'Fe2', 'O1'}
578669
Cell volume: 125.00 ų
@@ -581,3 +672,4 @@ Final structure ready for calculations!
581672
Is alloy: False
582673
Total charge: -1.0
583674
```
675+

0 commit comments

Comments
 (0)