@@ -11,8 +11,11 @@ This page only concerns the `StructureData` object, as the `StructureBuilder` is
1111
1212** Database (queryable via QueryBuilder):**
1313- Global properties: ` pbc ` , ` cell ` , ` periodicity ` , ` tot_magnetization ` , ` tot_charge ` , ` hubbard ` , ` custom ` and so on. You can see the whole set accessing ` StructureData.get_supported_properties()['global'] `
14- - Computed properties: ` formula ` , ` cell_volume ` , ` dimensionality ` , ` is_alloy ` , ` has_vacancies ` , ` symbols ` , ` kind_names ` , ` n_sites ` and so on. You can see the whole set accessing ` StructureData.get_computed_properties()['global'] `
14+ - Computed properties: ` composition ` , ` cell_volume ` , ` dimensionality ` , ` is_alloy ` , ` has_vacancies ` , ` symbols ` , ` kind_names ` , ` n_sites ` and so on. You can see the whole set accessing ` StructureData.get_computed_properties()['global'] `
1515
16+ **Not stored (computed on-the-fly only):**
17+
18+ - `formula`: access it via `structure.properties.formula`, but it **cannot** be queried. Use `composition` for database queries instead.
1619
1720** Repository (not queryable, loaded on access):**
1821- Per-site arrays: ` positions ` , ` masses ` , ` charges ` , ` magmoms ` , ` magnetizations ` , ` weights `
@@ -36,11 +39,11 @@ StructureData.get_queryable_properties()
3639
3740** Queryable properties include:**
3841- ** Global** : ` pbc ` , ` cell ` , ` periodicity ` , ` tot_magnetization ` , ` tot_charge ` , ` hubbard ` , ` custom `
39- - ** Computed** : ` formula ` , ` cell_volume ` , ` dimensionality ` , ` is_alloy ` , ` has_vacancies ` , ` symbols ` , ` kind_names ` , ` n_sites `
42+ - ** Computed** : ` composition ` , ` cell_volume ` , ` dimensionality ` , ` is_alloy ` , ` has_vacancies ` , ` symbols ` , ` kind_names ` , ` n_sites `
4043- ** Statistics** : ` max_charge ` , ` min_charge ` , ` max_magmom ` , ` min_magmom ` , ` max_magnetization ` , ` min_magnetization `
4144
4245:::{note}
43- ** Per-site arrays like ` positions ` , ` masses ` , ` charges ` , ` magmoms ` , ` magnetizations ` , and ` weights ` are stored in the repository and cannot be queried directly. ** Instead, use the statistical properties ( ` max_charge ` , ` min_charge ` , etc.) to filter structures by value ranges .
46+ **`formula` is no longer stored in the database** and therefore cannot be queried. Use `composition` instead: it is a `dict` mapping element symbols to their counts, e.g. `{"Fe": 2, "O": 3}`, and is fully queryable.
4447:::
4548
4649## Examples of simple queries
@@ -142,13 +145,15 @@ print(f"PK: {result[3]}")
142145
143146### Structures by Number of Atoms
144147
148+ Use ` attributes.n_sites ` for total atom count:
149+
145150``` python
146151# Less than 6 atoms
147152nr_atoms = 6
148153qb = QueryBuilder()
149154qb.append(
150155 StructureData,
151- filters = {'attributes.symbols': {'shorter': nr_atoms}}
156+ filters = {'attributes.n_sites': {'<': nr_atoms}}
152157)
153158print(f"Structures with < {nr_atoms} atoms: {len(qb.all())}")
154159
@@ -157,7 +162,7 @@ nr_atoms = 5
157162qb = QueryBuilder()
158163qb.append(
159164 StructureData,
160- filters = {'attributes.symbols': {'longer': nr_atoms}}
165+ filters = {'attributes.n_sites': {'>': nr_atoms}}
161166)
162167print(f"Structures with > {nr_atoms} atoms: {len(qb.all())}")
163168
@@ -166,10 +171,7 @@ nr_atoms = 2
166171qb = QueryBuilder()
167172qb.append(
168173 StructureData,
169- filters = {'attributes.symbols': {'and': [
170- {'shorter': nr_atoms + 1},
171- {'longer': nr_atoms - 1}
172- ]}}
174+ filters = {'attributes.n_sites': nr_atoms}
173175)
174176print(f"Structures with exactly {nr_atoms} atoms: {len(qb.all())}")
175177```
@@ -237,211 +239,85 @@ Statistical properties enable efficient filtering without loading large arrays:
237239
238240### Specific Chemical Formula
239241
242+ ` formula ` is no longer stored in the database. Use ` composition ` to search by element
243+ content instead. ` composition ` is a ` dict ` like ` {"Fe": 2, "O": 3} ` , stored as a
244+ JSON attribute, so all standard QueryBuilder dict/key filters apply.
245+
240246``` python
241- formula = ' HO '
247+ # Structures containing iron
242248qb = QueryBuilder()
243249qb.append(
244250 StructureData,
245- filters = {'attributes.formula': formula}  # or {'==': formula}
251+ filters = {'attributes.composition': {'has_key': 'Fe'}}
246252)
247- print (f " Structures with formula { formula} : { len (qb.all())} " )
248- ```
249-
250- ### Specific Number of Atoms of an Element
251-
252- For multiple atoms of the same element:
253+ print(f"Structures containing Fe: {len(qb.all())}")
253254
254- ``` python
255- element = ' H'
256- nr_atoms = 2
255+ # Structures containing both Fe and O
257256qb = QueryBuilder()
258257qb.append(
259258 StructureData,
260- filters = {
261- ' attributes.formula ' : { ' like ' : f ' % { element }{ nr_atoms } % ' }
262- },
263- project = [ ' attributes.formula ' , ' id ' ]
259+ filters = {'attributes.composition': {'and': [
260+ {'has_key': 'Fe'},
261+ {'has_key': 'O'},
262+ ]}}
264263)
265- print(f"Structures with {nr_atoms} {element} atoms: {len(qb.all())}")
264+ print(f"Fe-O structures: {len(qb.all())}")
266265```
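
When the list of required elements is only known at run time, the `and` / `has_key` filter shown above can be assembled programmatically. The sketch below is a minimal illustration: `composition_has_all` is a hypothetical helper name (it is not part of AiiDA), and it only builds the filter dictionary in the same shape as the example above, without touching the database.

```python
from typing import Any, Dict, Iterable


def composition_has_all(elements: Iterable[str]) -> Dict[str, Any]:
    """Build a filter requiring every symbol in `elements` to be a key of `composition`.

    Hypothetical helper for illustration; not provided by AiiDA.
    """
    # Drop duplicate symbols while keeping the order in which they were given.
    symbols = list(dict.fromkeys(elements))
    if not symbols:
        raise ValueError("at least one element symbol is required")
    if len(symbols) == 1:
        # A single element needs no 'and' wrapper.
        return {'attributes.composition': {'has_key': symbols[0]}}
    return {'attributes.composition': {'and': [{'has_key': s} for s in symbols]}}


fe_o_filter = composition_has_all(['Fe', 'O'])
print(fe_o_filter)
```

The resulting dictionary is meant to be passed as `filters=fe_o_filter` to `qb.append(StructureData, ...)`, exactly like the hand-written filters above.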
267266
268- :::{warning}
269- This approach may match unintended formulas (e.g., searching for ` Mn2 ` might also match ` Mn20 ` ). Use regex post-processing for precise matches.
270- :::
271-
272- ### Exactly One Atom of an Element
267+ ### Specific Number of Atoms of an Element
273268
274- For a single atom, use regex to ensure no digits follow:
269+ Because `composition` is a queryable dict, you can filter on the **count** of an element
270+ directly, with no regex needed:
275271
276272``` python
277- import re
278-
279- element = ' H'
273+ # Exactly 2 Fe atoms
280274qb = QueryBuilder()
281275qb.append(
282276 StructureData,
283- filters = {' attributes.formula ' : { ' like ' : f ' % { element } % ' } },
284- project = [' attributes.formula ' , ' id' ]
277+ filters = {'attributes.composition.Fe': 2},
278+ project = ['attributes.composition', 'id']
284279)
280+ print(f"Structures with exactly 2 Fe: {len(qb.all())}")
286281
287- res = []
288- for struct in qb.iterall():
289- formula = struct[0 ]
290- # Match H not followed by any digit
291- if formula and re.search(f ' { element} (?![0-9]) ' , formula):
292- res.append(struct)
293-
294- print (f " Structures with exactly one { element} : { len (res)} " )
295- ```
296-
297- ** Regex explanation:**
298- - ` H ` - matches the element symbol
299- - ` (?![0-9]) ` - negative lookahead: ensures H is NOT followed by a digit
300- - This matches formulas where H appears alone (exactly 1 atom)
301-
302- ### Exactly N Atoms of an Element
303-
304- For precise matching of specific atom counts:
305-
306- ``` python
307- element = ' Mn'
308- nr_atoms = 2
282+ # At least 3 H atoms
309283qb = QueryBuilder()
310284qb.append(
311285 StructureData,
312- filters = {' attributes.formula ' : {' like ' : f ' % { element }{ nr_atoms } % ' }},
313- project = [' attributes.formula ' , ' id' ]
286+ filters = {'attributes.composition.H': {'>': 2}},
287+ project = ['attributes.composition', 'id']
314288)
315-
316- res = []
317- for struct in qb.iterall():
318- formula = struct[0 ]
319- # Match element followed by the number, but not by another digit
320- if formula and re.search(f ' { element}{ nr_atoms} (?![0-9]) ' , formula):
321- res.append(struct)
322-
323- print (f " Structures with exactly { nr_atoms} { element} : { len (res)} " )
324- print (f " Formulas: { [s[0 ] for s in res]} " )
289+ print(f"Structures with ≥ 3 H: {len(qb.all())}")
325290```
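
Per-element count conditions can also be combined in a single query. The sketch below assumes, as in standard QueryBuilder usage, that several keys in one `filters` dictionary must all match; `element_count_filters` is a hypothetical helper name used only for illustration, and it reuses the `attributes.composition.<symbol>` key pattern from the examples above.

```python
from typing import Any, Dict, Mapping, Union

# A condition is either an exact count or an operator dict such as {'>': 2}.
Condition = Union[int, Mapping[str, Any]]


def element_count_filters(counts: Mapping[str, Condition]) -> Dict[str, Any]:
    """Map element symbols to `attributes.composition.<symbol>` filter entries.

    Hypothetical helper for illustration; not provided by AiiDA.
    """
    return {
        f'attributes.composition.{symbol}': condition
        for symbol, condition in counts.items()
    }


# Exactly 2 Fe atoms and more than 2 H atoms in the same structure.
count_filters = element_count_filters({'Fe': 2, 'H': {'>': 2}})
print(count_filters)
```

As with the hand-written filters, the dictionary would be passed as `filters=count_filters` to `qb.append(StructureData, ...)`.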
326291
327292### Binaries and Ternaries
328293
329- Find structures with specific numbers of elements using regex:
294+ `composition` stores one key per distinct element, so the number of keys equals the
295+ number of distinct elements. Use `has_key` / `!has_key` in the query to test whether individual
296+ elements are present or absent; to count distinct elements, load the node and check `len(structure.properties.composition)`:
330297
331298``` python
332- import re
333-
334- # Binary compounds (2 elements)
335- number_of_elements = 2
336- qb = QueryBuilder()
337- qb.append(
338- StructureData,
339- filters = {' attributes.symbols' : {' longer' : number_of_elements - 1 }},
340- project = [' attributes.formula' , ' id' ]
341- )
342-
343- res = []
344- for struct in qb.iterall():
345- formula = struct[0 ]
346- # Pattern: exactly 2 occurrences of [Capital][lowercase]*[digits]*
347- pattern = ' ^' + ' [A-Z][a-z]*[0-9]*' * number_of_elements + ' $'
348- if formula and re.search(pattern, formula):
349- res.append(struct)
350-
351- print (f " Binary compounds: { len (res)} " )
352- print (f " Examples: { [s[0 ] for s in res[:5 ]]} " )
353-
354- # Ternary compounds (3 elements)
355- number_of_elements = 3
299+ # Binary compounds (exactly 2 distinct elements): this query applies no database-side filter,
300+ # so every StructureData node is loaded and the length check runs in Python
356301qb = QueryBuilder()
357- qb.append(
358- StructureData,
359- filters = {' attributes.symbols' : {' longer' : number_of_elements - 1 }},
360- project = [' attributes.formula' , ' id' ]
361- )
362-
363- res = []
364- for struct in qb.iterall():
365- formula = struct[0 ]
366- pattern = ' ^' + ' [A-Z][a-z]*[0-9]*' * number_of_elements + ' $'
367- if formula and re.search(pattern, formula):
368- res.append(struct)
369-
370- print (f " Ternary compounds: { len (res)} " )
371- ```
372-
373- ** Regex pattern explanation:**
374- - ` ^ ` - start of string
375- - ` [A-Z] ` - capital letter (element symbol start)
376- - ` [a-z]* ` - zero or more lowercase letters (element symbol continuation)
377- - ` [0-9]* ` - zero or more digits (stoichiometry)
378- - Repeated ` number_of_elements ` times
379- - ` $ ` - end of string
302+ qb.append(StructureData, project=['*'])
380303
381- This ensures the formula has exactly the specified number of element symbols.
304+ binaries = [
305+ s for (s,) in qb.iterall()
306+ if len(s.properties.composition) == 2
307+ ]
308+ print(f"Binary compounds: {len(binaries)}")
309+ print(f"Examples: {[s.properties.formula for s in binaries[:5]]}")
382310
383- ## Best Practices
384-
385- 1 . ** Filter early** : Use QueryBuilder filters to reduce the result set before post-processing
386- 2 . ** Project efficiently** : Only retrieve the attributes you need
387- 3 . ** Use statistical properties** : Query ` max_charge ` , ` min_charge ` , etc. instead of loading full arrays
388- 4 . ** Use regex carefully** : Regex post-processing is powerful but slower than database filters
389- 5 . ** Check for None** : Always validate that projected values exist before using them in regex
390- 6 . ** Combine filters** : Use ` and ` , ` or ` , and negation (` ! ` ) to build complex queries
391- 7 . ** Understand storage locations** : Database properties are fast to query; repository properties require loading the node
392-
393- :::{note}
394- ** Storage Model Impact on Queries**
395-
396- - ** Fast queries** : Properties in the database (` formula ` , ` symbols ` , ` n_sites ` , statistics)
397- - ** Requires loading** : Per-site arrays in the repository (` positions ` , ` charges ` , ` magmoms ` )
398- - ** Best practice** : Filter using database properties first, then load nodes to access repository arrays
399-
400- Example efficient workflow:
401- ``` python
402- # First: Filter in database by statistics
311+ # If your database is large, pre-filter with a known element to reduce the scan:
403312qb = QueryBuilder()
404313qb.append(
405314 StructureData,
406- filters = {' attributes' : {' and' : [
407- {' max_charge' : {' >' : 1.0 }},
408- {' formula' : {' like' : ' %F e%' }}
409- ]}}
315+ filters = {'attributes.composition': {'has_key': 'Fe'}},
316+ project = ['*']
410317)
411-
412- # Then: Load only matching nodes to access full charge arrays
413- for (structure,) in qb.iterall():
414- charges = structure.properties.charges # Loads from repository
415- # Process individual charge values...
416- ```
417- :::
418-
419- ## Performance Tips
420-
421- - ** Use ` qb.iterall() ` ** instead of ` qb.all() ` for large result sets to avoid loading everything into memory
422- - ** Filter at database level** : Apply as many filters as possible using QueryBuilder before loading nodes
423- - ** Use statistical properties** : Query ` max_charge ` , ` min_charge ` , etc. to avoid loading repository arrays
424- - ** Use ` project ` ** to retrieve only needed database attributes
425- - ** Load repository data last** : Access ` positions ` , ` charges ` , ` magmoms ` only after filtering
426- - ** For very large databases** : Consider adding pagination with ` limit ` and ` offset `
427-
428- :::{important}
429- ** Performance Comparison**
430-
431- ** Fast** (database query only):
432- ``` python
433- qb = QueryBuilder()
434- qb.append(StructureData, filters = {' attributes.max_charge' : {' >' : 1.0 }})
435- results = qb.all() # Fast - no repository access
436- ```
437-
438- ** Slow** (loading all arrays):
439- ``` python
440- qb = QueryBuilder()
441- qb.append(StructureData)
442- for (s,) in qb.iterall():
443- if " charges" in s.get_defined_properties():
444- if max (s.properties.charges) > 1.0 : # Slow - loads from repository for every structure with charges
445- results.append(s)
318+ fe_binaries = [
319+ s for (s,) in qb.iterall()
320+ if len(s.properties.composition) == 2
321+ ]
322+ print(f"Fe-containing binaries: {len(fe_binaries)}")
446323```
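
The Python-side length check does not depend on AiiDA itself, so it can be sketched on plain dictionaries. In the example below the `sample` list is made-up data standing in for `composition` values, and `count_by_n_elements` is a hypothetical helper name. It might be combined with projecting `attributes.composition` rather than `*`, on the assumption that this returns the stored dict without loading the full node.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_n_elements(compositions: Iterable[Mapping[str, float]]) -> Counter:
    """Tally compositions by their number of distinct elements (number of keys)."""
    return Counter(len(composition) for composition in compositions)


# Made-up sample data in the {"symbol": count} shape described above.
sample = [
    {'Fe': 2, 'O': 3},
    {'H': 2, 'O': 1},
    {'Si': 2},
    {'Ba': 1, 'Ti': 1, 'O': 3},
]

tally = count_by_n_elements(sample)
print(f"Binaries: {tally[2]}, ternaries: {tally[3]}")
```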
447- :::