Commit 51b46b6

Merge pull request #30 from linkml/claude/issue-29-20251124-2117
Add comprehensive documentation for dynamic value set validation
2 parents d919b98 + 8a8a373 commit 51b46b6

File tree

2 files changed

+386
-0
lines changed

Lines changed: 380 additions & 0 deletions
@@ -0,0 +1,380 @@
# Dynamic Value Sets and Validation

Dynamic value sets are a LinkML feature that lets an enum be populated from an ontology query instead of a hardcoded list of permissible values. This makes it possible to validate against large, evolving controlled vocabularies without maintaining enum lists by hand.

## What are Dynamic Value Sets?

Dynamic value sets use the `reachable_from` specification to define enums whose members are ontology terms. Instead of listing every possible value, you specify:

- **Source ontology** (`source_ontology`): The ontology to query
- **Source nodes** (`source_nodes`): Root terms to start from
- **Relationship types** (`relationship_types`): How to traverse the ontology (e.g., `rdfs:subClassOf`)
- **Include self** (`include_self`): Whether to include the root terms themselves
- **Is direct** (`is_direct`): Whether to include only directly related terms (`true`) or also transitively related ones (`false`)

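To make these settings concrete, here is a minimal sketch of how a `reachable_from` query might be expanded over a tiny in-memory hierarchy. The edge list, the `TOY:` identifiers, and the `expand_reachable_from` helper are all hypothetical; a real expansion would query the ontology itself, for example through OAK.

```python
from collections import defaultdict

# Hypothetical (child, predicate, parent) edges; the TOY: identifiers are made up.
EDGES = [
    ("TOY:2", "rdfs:subClassOf", "TOY:1"),
    ("TOY:3", "rdfs:subClassOf", "TOY:2"),
    ("TOY:4", "rdfs:subClassOf", "TOY:1"),
    ("TOY:5", "BFO:0000050", "TOY:1"),  # part_of edge, only followed if requested
]


def expand_reachable_from(source_nodes, relationship_types,
                          include_self=False, is_direct=False, edges=EDGES):
    """Return the set of terms reachable from source_nodes by walking down edges."""
    # Index children by parent, keeping only the requested relationship types
    children = defaultdict(set)
    for child, predicate, parent in edges:
        if predicate in relationship_types:
            children[parent].add(child)

    reached = set()
    frontier = list(source_nodes)
    while frontier:
        node = frontier.pop()
        for child in children[node]:
            if child not in reached:
                reached.add(child)
                if not is_direct:  # keep walking only for transitive queries
                    frontier.append(child)

    # include_self decides whether the root terms belong to the value set
    if include_self:
        reached.update(source_nodes)
    else:
        reached.difference_update(source_nodes)
    return reached


print(sorted(expand_reachable_from(["TOY:1"], ["rdfs:subClassOf"], include_self=True)))
```

With `is_direct=True` the same call would stop after the immediate children of `TOY:1`, and adding `BFO:0000050` to `relationship_types` would also pull in `TOY:5`.
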
## Available Dynamic Value Sets

The `valuesets` repository contains numerous dynamic value sets across different domains:

### Biological Entities (`bio/bio_entities.yaml`)

#### Cell Types
```yaml
CellType:
  description: Any cell type from the Cell Ontology (CL)
  reachable_from:
    source_ontology: obo:cl
    source_nodes:
      - CL:0000000  # cell
    include_self: true
    relationship_types:
      - rdfs:subClassOf
```

#### Diseases
```yaml
Disease:
  description: Human diseases from the Mondo Disease Ontology
  reachable_from:
    source_ontology: obo:mondo
    source_nodes:
      - MONDO:0000001  # disease
    include_self: true
    relationship_types:
      - rdfs:subClassOf
```

#### Chemical Entities
```yaml
ChemicalEntity:
  description: Any chemical entity from ChEBI ontology
  reachable_from:
    source_ontology: obo:chebi
    source_nodes:
      - CHEBI:24431  # chemical entity
    include_self: true
    relationship_types:
      - rdfs:subClassOf
```

### Anatomical Structures
```yaml
MetazoanAnatomicalStructure:
  description: Any anatomical structure found in metazoan organisms
  reachable_from:
    source_ontology: obo:uberon
    source_nodes:
      - UBERON:0000061  # anatomical structure
    include_self: true
    relationship_types:
      - rdfs:subClassOf
```

### Taxonomy (`bio/taxonomy.yaml`)
```yaml
OrganismTaxonEnum:
  description: All organism taxa from NCBI Taxonomy
  reachable_from:
    source_nodes:
      - NCBITaxon:1  # root
    is_direct: false
    relationship_types:
      - rdfs:subClassOf
```

### Investigation Protocols (`investigation.yaml`)
```yaml
StudyDesignEnum:
  description: Study design classifications from OBI
  reachable_from:
    source_nodes:
      - OBI:0500000  # study design
    is_direct: false
    relationship_types:
      - rdfs:subClassOf
```

## Using Dynamic Value Sets in Schemas

### Basic Usage
```yaml
# In your schema file
slots:
  cell_type:
    description: Type of cell being studied
    range: CellType  # References the dynamic enum

  disease:
    description: Disease under investigation
    range: Disease  # References the dynamic enum
```

### Instance Data Validation
```yaml
# Example instance data
person:
  cell_type: CL:0000540  # neuron
  disease: MONDO:0005148  # type 2 diabetes mellitus
```

## Validation Approaches

### 1. Static Validation
Standard LinkML validators check the structure and types of instance data. Unless the enum has been materialized, they do not query the ontology, so this step cannot confirm that a CURIE is actually reachable from the root term:

```python
from linkml.validator import validate

# Example instance; slot names follow the schema above
instance_data = {"cell_type": "CL:0000540", "disease": "MONDO:0005148"}

# Structural validation only; dynamic enums are not expanded here
report = validate(instance_data, "path/to/schema.yaml")
for result in report.results:
    print(result.message)
```

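A purely syntactic prefix check is cheap and needs no ontology access, so it can run before any ontology lookup. The sketch below is not part of the LinkML API; the `EXPECTED_PREFIXES` mapping, the simplified CURIE pattern, and the `check_prefixes` helper are assumptions made for illustration.

```python
import re

# Hypothetical mapping from slot name to the CURIE prefix its values should carry
EXPECTED_PREFIXES = {"cell_type": "CL", "disease": "MONDO"}

# Simplified CURIE shape: PREFIX:LOCAL_ID with no whitespace
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][\w.]*):(?P<local>\S+)$")


def check_prefixes(record: dict) -> dict:
    """Return {slot: error message} for values with a missing or unexpected prefix."""
    errors = {}
    for slot, expected in EXPECTED_PREFIXES.items():
        value = record.get(slot)
        if value is None:
            continue  # absence is a matter for required-slot checks
        match = CURIE_PATTERN.match(str(value))
        if match is None:
            errors[slot] = f"{value!r} is not a CURIE"
        elif match.group("prefix") != expected:
            errors[slot] = f"{value!r} should use the {expected}: prefix"
    return errors


print(check_prefixes({"cell_type": "CL:0000540", "disease": "MONDO:0005148"}))  # {}
```

A check like this only catches malformed or misplaced identifiers. A well-formed `CL:` CURIE that does not exist in the ontology would still pass.
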
### 2. Ontology-based Validation

For full dynamic validation, you can use ontology access tools:

```python
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

# Load ontology adapter (the sqlite:obo: selector fetches a prebuilt SQLite copy)
cl_adapter = get_adapter("sqlite:obo:cl")

# Check if a term is a valid cell type
def validate_cell_type(term_id: str) -> bool:
    """Validate that term_id is a subclass of cell (CL:0000000)"""
    # ancestors() is reflexive by default, so the root term itself also passes
    return "CL:0000000" in cl_adapter.ancestors(term_id, predicates=[IS_A])

# Example usage
is_valid = validate_cell_type("CL:0000540")  # True - neuron is a cell
```

### 3. Batch Validation with OAK

```python
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

def validate_disease_terms(term_ids: list[str]) -> dict[str, bool]:
    """Validate multiple disease terms against MONDO"""
    # Create the adapter once and reuse it for every term
    mondo_adapter = get_adapter("sqlite:obo:mondo")
    results = {}

    for term_id in term_ids:
        try:
            # A term is valid if disease (MONDO:0000001) is among its is_a ancestors
            ancestors = mondo_adapter.ancestors(term_id, predicates=[IS_A])
            results[term_id] = "MONDO:0000001" in ancestors
        except Exception:
            # Unknown or malformed identifiers count as invalid
            results[term_id] = False

    return results

# Example usage
disease_terms = ["MONDO:0005148", "MONDO:0004992", "INVALID:123"]
validation_results = validate_disease_terms(disease_terms)
```

## Practical Examples

### Example 1: Cell Biology Study

```yaml
# Schema definition
classes:
  CellExperiment:
    attributes:
      cell_type:
        range: CellType
        required: true
      treatment_compound:
        range: ChemicalEntity
        required: false

# Instance data
experiment_1:
  cell_type: CL:0000540  # neuron
  treatment_compound: CHEBI:15377  # water

experiment_2:
  cell_type: CL:0000136  # fat cell
  treatment_compound: CHEBI:27732  # caffeine
```

### Example 2: Disease Research

```yaml
# Schema definition
classes:
  DiseaseStudy:
    attributes:
      primary_disease:
        range: Disease
        required: true
      comorbidities:
        range: Disease
        multivalued: true
      affected_anatomy:
        range: MetazoanAnatomicalStructure
        multivalued: true

# Instance data
diabetes_study:
  primary_disease: MONDO:0005148  # type 2 diabetes
  comorbidities:
    - MONDO:0005267  # heart disease
    - MONDO:0005147  # type 1 diabetes
  affected_anatomy:
    - UBERON:0001264  # pancreas
    - UBERON:0004535  # cardiovascular system
```

### Example 3: Taxonomic Classification

```yaml
# Schema definition
classes:
  OrganismSample:
    attributes:
      species:
        range: OrganismTaxonEnum
        required: true
      genus:
        range: OrganismTaxonEnum
        required: false

# Instance data
mouse_sample:
  species: NCBITaxon:10090  # Mus musculus (house mouse)
  genus: NCBITaxon:10088  # Mus (mouse genus)

human_sample:
  species: NCBITaxon:9606  # Homo sapiens
  genus: NCBITaxon:9605  # Homo
```

## Validation Tools and Libraries

### OAK (Ontology Access Kit)
The primary tool for working with ontologies in the LinkML ecosystem:

```bash
# Install OAK
pip install oaklib

# Basic ontology queries (sqlite:obo: selectors fetch a prebuilt SQLite copy)
runoak -i sqlite:obo:cl descendants -p i CL:0000000  # All cell types (is_a only)
runoak -i sqlite:obo:mondo info MONDO:0005148        # Diabetes info
runoak -i sqlite:obo:chebi ancestors CHEBI:15377     # Water ancestors
```

### Custom Validation Functions

```python
from oaklib import get_adapter
from oaklib.datamodels.vocabulary import IS_A

class DynamicEnumValidator:
    """Validator for dynamic enums using ontology lookup"""

    def __init__(self):
        # One adapter per ontology, keyed by lowercase CURIE prefix
        self.adapters = {
            'cl': get_adapter('sqlite:obo:cl'),
            'mondo': get_adapter('sqlite:obo:mondo'),
            'chebi': get_adapter('sqlite:obo:chebi'),
            'uberon': get_adapter('sqlite:obo:uberon'),
            'ncbitaxon': get_adapter('sqlite:obo:ncbitaxon')
        }

    def validate_term(self, term_id: str, root_term: str) -> bool:
        """Validate that term_id is reachable from root_term via is_a"""
        prefix = term_id.split(':')[0].lower()
        if prefix not in self.adapters:
            return False

        adapter = self.adapters[prefix]
        try:
            # ancestors() is reflexive by default, so the root term itself passes
            return root_term in adapter.ancestors(term_id, predicates=[IS_A])
        except Exception:
            return False

    def validate_cell_type(self, term_id: str) -> bool:
        """Validate cell type against CL:0000000"""
        return self.validate_term(term_id, "CL:0000000")

    def validate_disease(self, term_id: str) -> bool:
        """Validate disease against MONDO:0000001"""
        return self.validate_term(term_id, "MONDO:0000001")

    def validate_chemical(self, term_id: str) -> bool:
        """Validate chemical against CHEBI:24431"""
        return self.validate_term(term_id, "CHEBI:24431")

# Usage example
validator = DynamicEnumValidator()
print(validator.validate_cell_type("CL:0000540"))  # True
print(validator.validate_disease("MONDO:0005148"))  # True
print(validator.validate_chemical("CHEBI:15377"))  # True
```

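Per-term checks like these can be lifted to whole records. The following is a minimal sketch, assuming a hypothetical `SLOT_ROOTS` mapping from slot name to root term and a hypothetical `validate_record` helper. The reachability check is passed in as a function, so the sketch runs without any ontology installed.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from slot name to the root term of its dynamic enum
SLOT_ROOTS: Dict[str, str] = {
    "cell_type": "CL:0000000",
    "treatment_compound": "CHEBI:24431",
}


def validate_record(record: dict,
                    is_reachable: Callable[[str, str], bool],
                    slot_roots: Dict[str, str] = SLOT_ROOTS) -> List[str]:
    """Return one error message per value that is not reachable from its slot's root."""
    errors = []
    for slot, root in slot_roots.items():
        value = record.get(slot)
        if value is None:
            continue  # missing values are left to required-slot checks
        # Multivalued slots arrive as lists; normalize single values to a list
        values = value if isinstance(value, list) else [value]
        for term_id in values:
            if not is_reachable(term_id, root):
                errors.append(f"{slot}: {term_id} is not reachable from {root}")
    return errors


# Stand-in check for demonstration: compares CURIE prefixes only
def same_prefix(term_id: str, root: str) -> bool:
    return term_id.split(":")[0] == root.split(":")[0]


print(validate_record({"cell_type": "MONDO:0005148"}, same_prefix))
```

In practice the stand-in would be replaced by an ontology-backed check with the same `(term_id, root_term)` signature, such as the `validate_term` method of the class above.
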
## Best Practices

### 1. Choose Appropriate Root Terms
- Use specific enough root terms to avoid overly broad value sets
- For cell types, consider using specific cell lineages rather than the root "cell" term
- For diseases, use disease categories (infectious, genetic, etc.) when appropriate

### 2. Include Ontology Prefixes in Schema
```yaml
prefixes:
  CL: http://purl.obolibrary.org/obo/CL_
  MONDO: http://purl.obolibrary.org/obo/MONDO_
  CHEBI: http://purl.obolibrary.org/obo/CHEBI_
  UBERON: http://purl.obolibrary.org/obo/UBERON_
```

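Declared prefixes are what allow a CURIE such as `CL:0000540` to be expanded to a full IRI. As a small illustration, the sketch below expands CURIEs using the prefix map from the snippet above. The `expand_curie` helper is hypothetical and only stands in for the CURIE handling that LinkML tooling provides.

```python
# Prefix map copied from the schema snippet above
PREFIXES = {
    "CL": "http://purl.obolibrary.org/obo/CL_",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}


def expand_curie(curie: str, prefixes: dict = PREFIXES) -> str:
    """Expand PREFIX:LOCAL_ID to a full IRI; raise ValueError for undeclared prefixes."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"Cannot expand {curie!r}: undeclared prefix")
    return prefixes[prefix] + local_id


print(expand_curie("CL:0000540"))  # http://purl.obolibrary.org/obo/CL_0000540
```

A CURIE whose prefix is missing from the map, such as `NCBITaxon:9606` here, raises an error instead of silently producing a wrong IRI.
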
### 3. Validate During Development
- Test dynamic enums with representative data during schema development
- Use OAK to explore ontology hierarchies before choosing root terms
- Document expected term formats and validation requirements

### 4. Handle Validation Errors Gracefully
```python
from typing import Callable, Optional

def safe_validate_term(term_id: str, validator_func: Callable[[str], bool]) -> Optional[bool]:
    """Safely validate a term; return None if the lookup itself failed"""
    try:
        return validator_func(term_id)
    except Exception as e:
        print(f"Validation error for {term_id}: {e}")
        return None
```

## Limitations and Considerations

### Current Limitations
- Runtime enum expansion is still under development
- Some ontology adapters require internet connectivity, at least for the initial download
- Validation against large ontologies can be slow
- Some ontologies may not be available through OAK

### Performance Considerations
- Create each ontology adapter once and reuse it when validating multiple terms
- Consider using local ontology files for better performance
- Batch validation calls when possible

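One way to reuse adapters is to memoize the function that creates them. The sketch below wraps any loader in `functools.lru_cache`. The `make_cached_loader` helper and the stand-in `slow_load` function are hypothetical; with OAK installed, the loader passed in would presumably be `get_adapter`.

```python
from functools import lru_cache
from typing import Callable


def make_cached_loader(load: Callable[[str], object], maxsize: int = 16):
    """Wrap an adapter-loading function so each selector is loaded at most once."""
    @lru_cache(maxsize=maxsize)
    def cached(selector: str):
        return load(selector)
    return cached


# Stand-in loader that records how often it is called
calls = []

def slow_load(selector: str) -> dict:
    calls.append(selector)
    return {"selector": selector}


get_cached = make_cached_loader(slow_load)
first = get_cached("sqlite:obo:cl")   # example selector string; loads once
second = get_cached("sqlite:obo:cl")  # served from the cache
print(len(calls))  # 1
```

Because the cache is keyed on the selector string, two different spellings of the same ontology source would still be loaded separately.
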
### Future Developments
- Automated enum materialization from ontologies
- Better integration with LinkML validators
- Support for more relationship types and boolean combinations
- Subset filtering capabilities

## Additional Resources

- [LinkML Dynamic Enums Documentation](https://linkml.io/linkml/schemas/enums.html#dynamic-enums)
- [OAK (Ontology Access Kit) Documentation](https://incatools.github.io/ontology-access-kit/)
- [LinkML GitHub Discussion on Dynamic Enums](https://github.com/orgs/linkml/discussions/2300)
- [BioPortal Ontology Repository](https://bioportal.bioontology.org/)
- [OBO Foundry Ontologies](http://www.obofoundry.org/)

---

*This documentation covers the current state of dynamic value set validation in LinkML. As the framework continues to evolve, some features may become available that aren't yet implemented.*

docs/index.md

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,12 @@ A collection of commonly used value sets
 
 - Auto-generated [schema documentation](elements/#enumerations)
 
+## How-to Guides
+
+- [Dynamic Value Sets and Validation](how-to-guides/dynamic-valuesets-validation.md) - Guide to using and validating dynamic value sets with ontology integration
+- [Agentic IDE Support](how-to-guides/agentic-ide-support.md)
+- [Sync UniProt Species](how-to-guides/sync-uniprot-species.md)
+
 Note: this schema consists ONLY of enums, so it is normal
 that classes and slots are empty.
 
