Commit aafe19f

This fixes #89 (#91)
1 parent 39107fb commit aafe19f

9 files changed: +302 -219 lines changed

docs/source/how_to/creation_mutation.md

Lines changed: 2 additions & 2 deletions
@@ -78,7 +78,7 @@ sites = [
         "symbol": ["Cu", "Zn"],
         "position": [0.0, 0.0, 0.0],
         "mass": 1.008,
-        "weight": (0.6, 0.6)
+        "weight": (0.6, 0.4)
     }
 ]
 ```
@@ -445,7 +445,7 @@ structuredata = StructureData.from_builder(builder)

 # Immutable → Mutable (for editing)
 structurebuilder = structuredata.to_builder()
-structurebuilder = StructureBuilder.from_aiida(builder)
+structurebuilder = StructureBuilder.from_aiida(structuredata)

 ```
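
The first hunk in this file changes the alloy site weights from `(0.6, 0.6)` to `(0.6, 0.4)`. The commit does not state the reason; a plausible reading is that the weights are site occupancies whose total should not exceed 1. A minimal sketch of such a check, under that assumption (`weights_are_valid` is a hypothetical helper, not part of the library):

```python
import math


def weights_are_valid(weights, tol=1e-9):
    """Hypothetical check: every weight lies in [0, 1] and the total is at most 1."""
    total = math.fsum(weights)
    return all(0.0 <= w <= 1.0 for w in weights) and total <= 1.0 + tol


old_ok = weights_are_valid((0.6, 0.6))  # total 1.2, rejected under the assumed rule
new_ok = weights_are_valid((0.6, 0.4))  # total 1.0, accepted
print(old_ok, new_ok)
```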

docs/source/how_to/query.md

Lines changed: 55 additions & 179 deletions
@@ -11,8 +11,11 @@ This page only concerns the `StructureData` object, as the `StructureBuilder` is

 **Database (queryable via QueryBuilder):**
 - Global properties: `pbc`, `cell`, `periodicity`, `tot_magnetization`, `tot_charge`, `hubbard`, `custom` and so on. You can see the whole set accessing `StructureData.get_supported_properties()['global']`
-- Computed properties: `formula`, `cell_volume`, `dimensionality`, `is_alloy`, `has_vacancies`, `symbols`, `kind_names`, `n_sites` and so on. You can see the whole set accessing `StructureData.get_computed_properties()['global']`
+- Computed properties: `composition`, `cell_volume`, `dimensionality`, `is_alloy`, `has_vacancies`, `symbols`, `kind_names`, `n_sites` and so on. You can see the whole set accessing `StructureData.get_computed_properties()['global']`

+**Not stored (computed on-the-fly only):**
+
+- `formula` — use `structure.properties.formula` to access it, but it **cannot** be queried. Use `composition` for database queries instead.

 **Repository (not queryable, loaded on access):**
 - Per-site arrays: `positions`, `masses`, `charges`, `magmoms`, `magnetizations`, `weights`
@@ -36,11 +39,11 @@ StructureData.get_queryable_properties()

 **Queryable properties include:**
 - **Global**: `pbc`, `cell`, `periodicity`, `tot_magnetization`, `tot_charge`, `hubbard`, `custom`
-- **Computed**: `formula`, `cell_volume`, `dimensionality`, `is_alloy`, `has_vacancies`, `symbols`, `kind_names`, `n_sites`
+- **Computed**: `composition`, `cell_volume`, `dimensionality`, `is_alloy`, `has_vacancies`, `symbols`, `kind_names`, `n_sites`
 - **Statistics**: `max_charge`, `min_charge`, `max_magmom`, `min_magmom`, `max_magnetization`, `min_magnetization`

 :::{note}
-**Per-site arrays like `positions`, `masses`, `charges`, `magmoms`, `magnetizations`, and `weights` are stored in the repository and cannot be queried directly.** Instead, use the statistical properties (`max_charge`, `min_charge`, etc.) to filter structures by value ranges.
+**`formula` is no longer stored in the database** and therefore cannot be queried. Use `composition` instead — it is a `dict` mapping element symbols to their count, e.g. `{"Fe": 2, "O": 3}`, and is fully queryable.
 :::

 ## Examples of simple queries
@@ -142,13 +145,15 @@ print(f"PK: {result[3]}")

 ### Structures by Number of Atoms

+Use `attributes.n_sites` for total atom count:
+
 ```python
 # Less than 6 atoms
 nr_atoms = 6
 qb = QueryBuilder()
 qb.append(
     StructureData,
-    filters={'attributes.symbols': {'shorter': nr_atoms}}
+    filters={'attributes.n_sites': {'<': nr_atoms}}
 )
 print(f"Structures with < {nr_atoms} atoms: {len(qb.all())}")

@@ -157,7 +162,7 @@ nr_atoms = 5
 qb = QueryBuilder()
 qb.append(
     StructureData,
-    filters={'attributes.symbols': {'longer': nr_atoms}}
+    filters={'attributes.n_sites': {'>': nr_atoms}}
 )
 print(f"Structures with > {nr_atoms} atoms: {len(qb.all())}")

@@ -166,10 +171,7 @@ nr_atoms = 2
 qb = QueryBuilder()
 qb.append(
     StructureData,
-    filters={'attributes.symbols': {'and': [
-        {'shorter': nr_atoms + 1},
-        {'longer': nr_atoms - 1}
-    ]}}
+    filters={'attributes.n_sites': nr_atoms}
 )
 print(f"Structures with exactly {nr_atoms} atoms: {len(qb.all())}")
 ```
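
The three hunks above replace list-length operators on `attributes.symbols` with plain comparisons on `attributes.n_sites`. As a rough sketch of how the same filter syntax could cover a range of atom counts, the helper below only builds the filter dictionary. `n_sites_filter` is a hypothetical name, and the sketch assumes the `>=`, `<=` and `and` operators combine the same way as the `and` list in the removed lines:

```python
def n_sites_filter(min_atoms=None, max_atoms=None):
    """Build a QueryBuilder-style filter dict on attributes.n_sites (hypothetical helper)."""
    key = 'attributes.n_sites'
    # Equal bounds collapse to the exact-match form used in the last hunk.
    if min_atoms is not None and min_atoms == max_atoms:
        return {key: min_atoms}
    bounds = []
    if min_atoms is not None:
        bounds.append({'>=': min_atoms})
    if max_atoms is not None:
        bounds.append({'<=': max_atoms})
    if not bounds:
        return {}  # no constraint on the number of sites
    return {key: bounds[0] if len(bounds) == 1 else {'and': bounds}}


range_filter = n_sites_filter(min_atoms=2, max_atoms=5)
print(range_filter)
```

The resulting dictionary would be passed as `filters=` to `qb.append(StructureData, ...)`, in the same position as the filters in the hunks above.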
@@ -237,211 +239,85 @@ Statistical properties enable efficient filtering without loading large arrays:

 ### Specific Chemical Formula

+`formula` is no longer stored in the database. Use `composition` to search by element
+content instead. `composition` is a `dict` like `{"Fe": 2, "O": 3}`, stored as a
+JSON attribute, so all standard QueryBuilder dict/key filters apply.
+
 ```python
-formula = 'HO'
+# Structures containing iron
 qb = QueryBuilder()
 qb.append(
     StructureData,
-    filters={'attributes.formula': formula} # or {'==': formula}
+    filters={'attributes.composition': {'has_key': 'Fe'}}
 )
-print(f"Structures with formula {formula}: {len(qb.all())}")
-```
-
-### Specific Number of Atoms of an Element
-
-For multiple atoms of the same element:
+print(f"Structures containing Fe: {len(qb.all())}")

-```python
-element = 'H'
-nr_atoms = 2
+# Structures containing both Fe and O
 qb = QueryBuilder()
 qb.append(
     StructureData,
-    filters={
-        'attributes.formula': {'like': f'%{element}{nr_atoms}%'}
-    },
-    project=['attributes.formula', 'id']
+    filters={'attributes.composition': {'and': [
+        {'has_key': 'Fe'},
+        {'has_key': 'O'},
+    ]}}
 )
-print(f"Structures with {nr_atoms} {element} atoms: {len(qb.all())}")
+print(f"Fe-O structures: {len(qb.all())}")
 ```

-:::{warning}
-This approach may match unintended formulas (e.g., searching for `Mn2` might also match `Mn20`). Use regex post-processing for precise matches.
-:::
-
-### Exactly One Atom of an Element
+### Specific Number of Atoms of an Element

-For a single atom, use regex to ensure no digits follow:
+Because `composition` is a queryable dict, you can filter on the **count** of an element
+directly — no regex needed:

 ```python
-import re
-
-element = 'H'
+# Exactly 2 Fe atoms
 qb = QueryBuilder()
 qb.append(
     StructureData,
-    filters={'attributes.formula': {'like': f'%{element}%'}},
-    project=['attributes.formula', 'id']
+    filters={'attributes.composition.Fe': 2},
+    project=['attributes.composition', 'id']
 )
+print(f"Structures with exactly 2 Fe: {len(qb.all())}")

-res = []
-for struct in qb.iterall():
-    formula = struct[0]
-    # Match H not followed by any digit
-    if formula and re.search(f'{element}(?![0-9])', formula):
-        res.append(struct)
-
-print(f"Structures with exactly one {element}: {len(res)}")
-```
-
-**Regex explanation:**
-- `H` - matches the element symbol
-- `(?![0-9])` - negative lookahead: ensures H is NOT followed by a digit
-- This matches formulas where H appears alone (exactly 1 atom)
-
-### Exactly N Atoms of an Element
-
-For precise matching of specific atom counts:
-
-```python
-element = 'Mn'
-nr_atoms = 2
+# At least 3 H atoms
 qb = QueryBuilder()
 qb.append(
     StructureData,
-    filters={'attributes.formula': {'like': f'%{element}{nr_atoms}%'}},
-    project=['attributes.formula', 'id']
+    filters={'attributes.composition.H': {'>': 2}},
+    project=['attributes.composition', 'id']
 )
-
-res = []
-for struct in qb.iterall():
-    formula = struct[0]
-    # Match element followed by the number, but not by another digit
-    if formula and re.search(f'{element}{nr_atoms}(?![0-9])', formula):
-        res.append(struct)
-
-print(f"Structures with exactly {nr_atoms} {element}: {len(res)}")
-print(f"Formulas: {[s[0] for s in res]}")
+print(f"Structures with ≥ 3 H: {len(qb.all())}")
 ```

 ### Binaries and Ternaries

-Find structures with specific numbers of elements using regex:
+`composition` stores one key per distinct element, so the number of keys equals the
+number of distinct elements. Use `has_key` / `!has_key` to check for the presence of
+elements, or load the node and check `len(structure.properties.composition)`:

 ```python
-import re
-
-# Binary compounds (2 elements)
-number_of_elements = 2
-qb = QueryBuilder()
-qb.append(
-    StructureData,
-    filters={'attributes.symbols': {'longer': number_of_elements - 1}},
-    project=['attributes.formula', 'id']
-)
-
-res = []
-for struct in qb.iterall():
-    formula = struct[0]
-    # Pattern: exactly 2 occurrences of [Capital][lowercase]*[digits]*
-    pattern = '^' + '[A-Z][a-z]*[0-9]*' * number_of_elements + '$'
-    if formula and re.search(pattern, formula):
-        res.append(struct)
-
-print(f"Binary compounds: {len(res)}")
-print(f"Examples: {[s[0] for s in res[:5]]}")
-
-# Ternary compounds (3 elements)
-number_of_elements = 3
+# Binary compounds (exactly 2 distinct elements) — database-side pre-filter
+# then Python-side length check
 qb = QueryBuilder()
-qb.append(
-    StructureData,
-    filters={'attributes.symbols': {'longer': number_of_elements - 1}},
-    project=['attributes.formula', 'id']
-)
-
-res = []
-for struct in qb.iterall():
-    formula = struct[0]
-    pattern = '^' + '[A-Z][a-z]*[0-9]*' * number_of_elements + '$'
-    if formula and re.search(pattern, formula):
-        res.append(struct)
-
-print(f"Ternary compounds: {len(res)}")
-```
-
-**Regex pattern explanation:**
-- `^` - start of string
-- `[A-Z]` - capital letter (element symbol start)
-- `[a-z]*` - zero or more lowercase letters (element symbol continuation)
-- `[0-9]*` - zero or more digits (stoichiometry)
-- Repeated `number_of_elements` times
-- `$` - end of string
+qb.append(StructureData, project=['*'])

-This ensures the formula has exactly the specified number of element symbols.
+binaries = [
+    s for (s,) in qb.iterall()
+    if len(s.properties.composition) == 2
+]
+print(f"Binary compounds: {len(binaries)}")
+print(f"Examples: {[s.properties.formula for s in binaries[:5]]}")

-## Best Practices
-
-1. **Filter early**: Use QueryBuilder filters to reduce the result set before post-processing
-2. **Project efficiently**: Only retrieve the attributes you need
-3. **Use statistical properties**: Query `max_charge`, `min_charge`, etc. instead of loading full arrays
-4. **Use regex carefully**: Regex post-processing is powerful but slower than database filters
-5. **Check for None**: Always validate that projected values exist before using them in regex
-6. **Combine filters**: Use `and`, `or`, and negation (`!`) to build complex queries
-7. **Understand storage locations**: Database properties are fast to query; repository properties require loading the node
-
-:::{note}
-**Storage Model Impact on Queries**
-
-- **Fast queries**: Properties in the database (`formula`, `symbols`, `n_sites`, statistics)
-- **Requires loading**: Per-site arrays in the repository (`positions`, `charges`, `magmoms`)
-- **Best practice**: Filter using database properties first, then load nodes to access repository arrays
-
-Example efficient workflow:
-```python
-# First: Filter in database by statistics
+# If your database is large, pre-filter with a known element to reduce the scan:
 qb = QueryBuilder()
 qb.append(
     StructureData,
-    filters={'attributes': {'and': [
-        {'max_charge': {'>': 1.0}},
-        {'formula': {'like': '%Fe%'}}
-    ]}}
+    filters={'attributes.composition': {'has_key': 'Fe'}},
+    project=['*']
 )
-
-# Then: Load only matching nodes to access full charge arrays
-for (structure,) in qb.iterall():
-    charges = structure.properties.charges # Loads from repository
-    # Process individual charge values...
-```
-:::
-
-## Performance Tips
-
-- **Use `qb.iterall()`** instead of `qb.all()` for large result sets to avoid loading everything into memory
-- **Filter at database level**: Apply as many filters as possible using QueryBuilder before loading nodes
-- **Use statistical properties**: Query `max_charge`, `min_charge`, etc. to avoid loading repository arrays
-- **Use `project`** to retrieve only needed database attributes
-- **Load repository data last**: Access `positions`, `charges`, `magmoms` only after filtering
-- **For very large databases**: Consider adding pagination with `limit` and `offset`
-
-:::{important}
-**Performance Comparison**
-
-**Fast** (database query only):
-```python
-qb = QueryBuilder()
-qb.append(StructureData, filters={'attributes.max_charge': {'>': 1.0}})
-results = qb.all() # Fast - no repository access
-```
-
-**Slow** (loading all arrays):
-```python
-qb = QueryBuilder()
-qb.append(StructureData)
-for (s,) in qb.iterall():
-    if "charges" in s.get_defined_properties():
-        if max(s.properties.charges) > 1.0: # Slow - loads from repository for every structure with charges
-            results.append(s)
+fe_binaries = [
+    s for (s,) in qb.iterall()
+    if len(s.properties.composition) == 2
+]
+print(f"Fe-containing binaries: {len(fe_binaries)}")
 ```
-:::
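
The new `composition` filters in this hunk need a configured AiiDA profile to run. As an offline illustration of what each one selects, the sketch below applies the same conditions to plain dictionaries. The sample compositions and the helper names (`has_elements`, `count_matches`, `is_binary`) are made up for this example and are not part of the library:

```python
# Made-up sample compositions (element symbol -> atom count), for illustration only.
compositions = {
    "hematite": {"Fe": 2, "O": 3},
    "water": {"H": 2, "O": 1},
    "iron": {"Fe": 1},
    "methane": {"C": 1, "H": 4},
}


def has_elements(comp, *symbols):
    """Mirror of combining several `has_key` conditions with `and`."""
    return all(symbol in comp for symbol in symbols)


def count_matches(comp, symbol, predicate):
    """Mirror of a filter on `attributes.composition.<symbol>`; a missing element never matches."""
    return symbol in comp and predicate(comp[symbol])


def is_binary(comp):
    """One key per distinct element, so two keys means a binary compound."""
    return len(comp) == 2


def select(condition):
    """Return the sorted names of the sample entries that satisfy `condition`."""
    return sorted(name for name, comp in compositions.items() if condition(comp))


with_fe = select(lambda c: has_elements(c, "Fe"))
fe_and_o = select(lambda c: has_elements(c, "Fe", "O"))
exactly_two_fe = select(lambda c: count_matches(c, "Fe", lambda n: n == 2))
at_least_three_h = select(lambda c: count_matches(c, "H", lambda n: n > 2))
binaries = select(is_binary)
fe_binaries = select(lambda c: has_elements(c, "Fe") and is_binary(c))

print(with_fe, fe_and_o, exactly_two_fe, at_least_three_h, binaries, fe_binaries)
```

Treating a missing element as a non-match is an assumption of this sketch; the source does not say how a query on `attributes.composition.Fe` treats nodes without that key.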
