| 1 | +# Dynamic Value Sets and Validation |
| 2 | + |
| 3 | +Dynamic value sets are a LinkML feature that lets an enum be defined by a query over an ontology instead of a hardcoded list of permissible values. This makes it possible to validate against large, evolving controlled vocabularies without maintaining enum lists by hand. |
| 4 | + |
| 5 | +## What are Dynamic Value Sets? |
| 6 | + |
| 7 | +Dynamic value sets use the `reachable_from` specification to define enums that are populated from ontology terms. Instead of listing every possible value, you specify: |
| 8 | + |
| 9 | +- **Source ontology** (`source_ontology`): The ontology to query |
| 10 | +- **Source nodes** (`source_nodes`): Root terms to start from |
| 11 | +- **Relationship types** (`relationship_types`): Which relationships to traverse (e.g., `rdfs:subClassOf`) |
| 12 | +- **Include self** (`include_self`): Whether to include the root terms themselves |
| 13 | + |
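To make these four parameters concrete, here is a minimal, self-contained sketch of how such a query could be expanded. It walks a tiny hand-written hierarchy rather than a real ontology: `TOY_EDGES`, `expand_reachable_from`, and the `EX:` identifiers are hypothetical names introduced only for illustration and are not part of LinkML or OAK.

```python
from collections import deque

# Hypothetical toy hierarchy: (child, relationship, parent) triples.
TOY_EDGES = [
    ("EX:neuron", "rdfs:subClassOf", "EX:cell"),
    ("EX:motor_neuron", "rdfs:subClassOf", "EX:neuron"),
    ("EX:axon", "BFO:0000050", "EX:neuron"),  # part_of, not subClassOf
]


def expand_reachable_from(source_nodes, relationship_types, include_self=True, edges=TOY_EDGES):
    """Return every term that reaches a source node via the allowed relationships."""
    allowed = set(relationship_types)
    # Index children by parent so the hierarchy can be walked downwards.
    children = {}
    for child, rel, parent in edges:
        if rel in allowed:
            children.setdefault(parent, set()).add(child)

    found = set()
    queue = deque(source_nodes)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in found:
                found.add(child)
                queue.append(child)

    # include_self decides whether the root terms belong to the value set.
    if include_self:
        found.update(source_nodes)
    else:
        found.difference_update(source_nodes)
    return found
```

With `source_nodes=["EX:cell"]` and only `rdfs:subClassOf` allowed, the toy expansion contains `EX:neuron` and `EX:motor_neuron` but not `EX:axon`, because the part-of edge is not followed.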
| 14 | +## Available Dynamic Value Sets |
| 15 | + |
| 16 | +The `valuesets` repository contains numerous dynamic value sets across different domains: |
| 17 | + |
| 18 | +### Biological Entities (`bio/bio_entities.yaml`) |
| 19 | + |
| 20 | +#### Cell Types |
| 21 | +```yaml |
| 22 | +CellType: |
| 23 | +  description: Any cell type from the Cell Ontology (CL) |
| 24 | +  reachable_from: |
| 25 | +    source_ontology: obo:cl |
| 26 | +    source_nodes: |
| 27 | +      - CL:0000000  # cell |
| 28 | +    include_self: true |
| 29 | +    relationship_types: |
| 30 | +      - rdfs:subClassOf |
| 31 | +``` |
| 32 | + |
| 33 | +#### Diseases |
| 34 | +```yaml |
| 35 | +Disease: |
| 36 | +  description: Human diseases from the Mondo Disease Ontology |
| 37 | +  reachable_from: |
| 38 | +    source_ontology: obo:mondo |
| 39 | +    source_nodes: |
| 40 | +      - MONDO:0000001  # disease |
| 41 | +    include_self: true |
| 42 | +    relationship_types: |
| 43 | +      - rdfs:subClassOf |
| 44 | +``` |
| 45 | + |
| 46 | +#### Chemical Entities |
| 47 | +```yaml |
| 48 | +ChemicalEntity: |
| 49 | +  description: Any chemical entity from the ChEBI ontology |
| 50 | +  reachable_from: |
| 51 | +    source_ontology: obo:chebi |
| 52 | +    source_nodes: |
| 53 | +      - CHEBI:24431  # chemical entity |
| 54 | +    include_self: true |
| 55 | +    relationship_types: |
| 56 | +      - rdfs:subClassOf |
| 57 | +``` |
| 58 | + |
| 59 | +### Anatomical Structures |
| 60 | +```yaml |
| 61 | +MetazoanAnatomicalStructure: |
| 62 | +  description: Any anatomical structure found in metazoan organisms |
| 63 | +  reachable_from: |
| 64 | +    source_ontology: obo:uberon |
| 65 | +    source_nodes: |
| 66 | +      - UBERON:0000061  # anatomical structure |
| 67 | +    include_self: true |
| 68 | +    relationship_types: |
| 69 | +      - rdfs:subClassOf |
| 70 | +``` |
| 71 | + |
| 72 | +### Taxonomy (`bio/taxonomy.yaml`) |
| 73 | +```yaml |
| 74 | +OrganismTaxonEnum: |
| 75 | +  description: All organism taxa from NCBI Taxonomy |
| 76 | +  reachable_from: |
| 77 | +    source_nodes: |
| 78 | +      - NCBITaxon:1  # root |
| 79 | +    is_direct: false |
| 80 | +    relationship_types: |
| 81 | +      - rdfs:subClassOf |
| 82 | +``` |
| 83 | + |
| 84 | +### Investigation Protocols (`investigation.yaml`) |
| 85 | +```yaml |
| 86 | +StudyDesignEnum: |
| 87 | +  description: Study design classifications from OBI |
| 88 | +  reachable_from: |
| 89 | +    source_nodes: |
| 90 | +      - OBI:0500000  # study design |
| 91 | +    is_direct: false |
| 92 | +    relationship_types: |
| 93 | +      - rdfs:subClassOf |
| 94 | +``` |
| 95 | + |
| 96 | +## Using Dynamic Value Sets in Schemas |
| 97 | + |
| 98 | +### Basic Usage |
| 99 | +```yaml |
| 100 | +# In your schema file |
| 101 | +slots: |
| 102 | +  cell_type: |
| 103 | +    description: Type of cell being studied |
| 104 | +    range: CellType  # References the dynamic enum |
| 105 | + |
| 106 | +  disease: |
| 107 | +    description: Disease under investigation |
| 108 | +    range: Disease  # References the dynamic enum |
| 109 | +``` |
| 110 | + |
| 111 | +### Instance Data Validation |
| 112 | +```yaml |
| 113 | +# Example instance data |
| 114 | +person: |
| 115 | +  cell_type: CL:0000540  # neuron |
| 116 | +  disease: MONDO:0005148  # type 2 diabetes mellitus |
| 117 | +``` |
| 118 | + |
| 119 | +## Validation Approaches |
| 120 | + |
| 121 | +### 1. Static Validation |
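| 122 | +Standard LinkML validation checks instance data against the schema's structure and types. For a dynamic enum this is at most a syntactic check (for example, that a value is a well-formed CURIE); membership in the ontology is not verified, because runtime enum expansion is still under development: |
| 123 | + |
| 124 | +```python |
| 125 | +from linkml.validator import validate |
| 126 | + |
| 127 | +# "Sample" is a placeholder for the target class defined in your schema |
| 128 | +report = validate(instance_data, "path/to/schema.yaml", "Sample") |
| 129 | +print([result.message for result in report.results])  # messages for any problems found |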
| 130 | +``` |
| 131 | + |
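A syntactic prefix check can be sketched in a few lines. The helper below is a hypothetical illustration, not a LinkML API: it only tests that a value is shaped like a CURIE with an expected prefix and says nothing about whether the term exists in the ontology.

```python
import re

# Hypothetical helper: CURIE shape check only (prefix, colon, non-empty local ID).
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z0-9_.]*):(?P<local_id>\S+)$")


def has_expected_prefix(value: str, allowed_prefixes: set[str]) -> bool:
    """Return True if value is shaped like a CURIE whose prefix is allowed."""
    match = CURIE_PATTERN.match(value)
    return bool(match) and match.group("prefix") in allowed_prefixes
```

A value such as `CL:9999999` would pass this check even if no such term exists, which is why the ontology-based approaches in this document are needed for real membership validation.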
| 132 | +### 2. Ontology-based Validation |
| 133 | + |
| 134 | +For full dynamic validation, you can use ontology access tools: |
| 135 | + |
| 136 | +```python |
| 137 | +from oaklib import get_adapter |
| 138 | +from oaklib.datamodels.vocabulary import IS_A |
| 139 | + |
| 140 | +# Load the Cell Ontology (the sqlite:obo: selector downloads a prebuilt database on first use) |
| 141 | +cl_adapter = get_adapter("sqlite:obo:cl") |
| 142 | + |
| 143 | +# Check whether a term is a valid cell type |
| 144 | +def validate_cell_type(term_id: str) -> bool: |
| 145 | +    """Return True if term_id is cell (CL:0000000) or one of its subclasses.""" |
| 146 | +    return "CL:0000000" in cl_adapter.ancestors(term_id, predicates=[IS_A], reflexive=True) |
| 147 | + |
| 148 | +# Example usage |
| 149 | +is_valid = validate_cell_type("CL:0000540")  # True - neuron is a cell |
| 150 | +``` |
| 151 | + |
| 152 | +### 3. Batch Validation with OAK |
| 153 | + |
| 154 | +```python |
| 155 | +from oaklib import get_adapter |
| 156 | +from oaklib.datamodels.vocabulary import IS_A |
| 157 | + |
| 158 | +def validate_disease_terms(term_ids: list[str]) -> dict[str, bool]: |
| 159 | +    """Validate multiple disease terms against MONDO.""" |
| 160 | +    mondo_adapter = get_adapter("sqlite:obo:mondo") |
| 161 | +    results = {} |
| 162 | +    for term_id in term_ids: |
| 163 | +        try: |
| 164 | +            # Valid if the term is MONDO:0000001 or one of its subclasses |
| 165 | +            ancestors = mondo_adapter.ancestors(term_id, predicates=[IS_A], reflexive=True) |
| 166 | +            results[term_id] = "MONDO:0000001" in ancestors |
| 167 | +        except Exception: |
| 168 | +            results[term_id] = False |
| 169 | + |
| 170 | +    return results |
| 171 | + |
| 172 | +# Example usage |
| 173 | +disease_terms = ["MONDO:0005148", "MONDO:0004992", "INVALID:123"] |
| 174 | +validation_results = validate_disease_terms(disease_terms) |
| 175 | +``` |
| 176 | + |
| 177 | +## Practical Examples |
| 178 | + |
| 179 | +### Example 1: Cell Biology Study |
| 180 | + |
| 181 | +```yaml |
| 182 | +# Schema definition |
| 183 | +classes: |
| 184 | +  CellExperiment: |
| 185 | +    attributes: |
| 186 | +      cell_type: |
| 187 | +        range: CellType |
| 188 | +        required: true |
| 189 | +      treatment_compound: |
| 190 | +        range: ChemicalEntity |
| 191 | +        required: false |
| 192 | + |
| 193 | +# Instance data |
| 194 | +experiment_1: |
| 195 | +  cell_type: CL:0000540  # neuron |
| 196 | +  treatment_compound: CHEBI:15377  # water |
| 197 | + |
| 198 | +experiment_2: |
| 199 | +  cell_type: CL:0000136  # fat cell |
| 200 | +  treatment_compound: CHEBI:27732  # caffeine |
| 201 | +``` |
| 202 | + |
| 203 | +### Example 2: Disease Research |
| 204 | + |
| 205 | +```yaml |
| 206 | +# Schema definition |
| 207 | +classes: |
| 208 | +  DiseaseStudy: |
| 209 | +    attributes: |
| 210 | +      primary_disease: |
| 211 | +        range: Disease |
| 212 | +        required: true |
| 213 | +      comorbidities: |
| 214 | +        range: Disease |
| 215 | +        multivalued: true |
| 216 | +      affected_anatomy: |
| 217 | +        range: MetazoanAnatomicalStructure |
| 218 | +        multivalued: true |
| 219 | + |
| 220 | +# Instance data |
| 221 | +diabetes_study: |
| 222 | +  primary_disease: MONDO:0005148  # type 2 diabetes |
| 223 | +  comorbidities: |
| 224 | +    - MONDO:0005267  # heart disease |
| 225 | +    - MONDO:0005147  # type 1 diabetes |
| 226 | +  affected_anatomy: |
| 227 | +    - UBERON:0001264  # pancreas |
| 228 | +    - UBERON:0004535  # cardiovascular system |
| 229 | +``` |
| 230 | + |
| 231 | +### Example 3: Taxonomic Classification |
| 232 | + |
| 233 | +```yaml |
| 234 | +# Schema definition |
| 235 | +classes: |
| 236 | +  OrganismSample: |
| 237 | +    attributes: |
| 238 | +      species: |
| 239 | +        range: OrganismTaxonEnum |
| 240 | +        required: true |
| 241 | +      genus: |
| 242 | +        range: OrganismTaxonEnum |
| 243 | +        required: false |
| 244 | + |
| 245 | +# Instance data |
| 246 | +mouse_sample: |
| 247 | +  species: NCBITaxon:10090  # Mus musculus (house mouse) |
| 248 | +  genus: NCBITaxon:10088  # Mus (mouse genus) |
| 249 | + |
| 250 | +human_sample: |
| 251 | +  species: NCBITaxon:9606  # Homo sapiens |
| 252 | +  genus: NCBITaxon:9605  # Homo |
| 253 | +``` |
| 254 | + |
| 255 | +## Validation Tools and Libraries |
| 256 | + |
| 257 | +### OAK (Ontology Access Kit) |
| 258 | +The primary tool for working with ontologies in the LinkML ecosystem: |
| 259 | + |
| 260 | +```bash |
| 261 | +# Install OAK |
| 262 | +pip install oaklib |
| 263 | + |
| 264 | +# Basic ontology queries (the sqlite:obo: selector downloads a prebuilt database on first use) |
| 265 | +runoak -i sqlite:obo:cl descendants -p i CL:0000000  # All cell types (is_a descendants) |
| 266 | +runoak -i sqlite:obo:mondo info MONDO:0005148  # Diabetes info |
| 267 | +runoak -i sqlite:obo:chebi ancestors CHEBI:15377  # Water ancestors |
| 268 | +``` |
| 269 | + |
| 270 | +### Custom Validation Functions |
| 271 | + |
| 272 | +```python |
| 273 | +from oaklib import get_adapter |
| 274 | +from oaklib.datamodels.vocabulary import IS_A |
| 275 | +from typing import Dict, List, Optional |
| 276 | + |
| 277 | +class DynamicEnumValidator: |
| 278 | +    """Validator for dynamic enums using ontology lookup""" |
| 279 | +    def __init__(self): |
| 280 | +        self.adapters = { |
| 281 | +            'cl': get_adapter('sqlite:obo:cl'), |
| 282 | +            'mondo': get_adapter('sqlite:obo:mondo'), |
| 283 | +            'chebi': get_adapter('sqlite:obo:chebi'), |
| 284 | +            'uberon': get_adapter('sqlite:obo:uberon'), |
| 285 | +            'ncbitaxon': get_adapter('sqlite:obo:ncbitaxon') |
| 286 | +        } |
| 287 | + |
| 288 | +    def validate_term(self, term_id: str, root_term: str) -> bool: |
| 289 | +        """Validate that term_id is root_term or one of its subclasses""" |
| 290 | +        prefix = term_id.split(':')[0].lower() |
| 291 | +        if prefix not in self.adapters: |
| 292 | +            return False |
| 293 | + |
| 294 | +        adapter = self.adapters[prefix] |
| 295 | +        try: |
| 296 | +            return root_term in adapter.ancestors(term_id, predicates=[IS_A], reflexive=True) |
| 297 | +        except Exception: |
| 298 | +            return False |
| 299 | + |
| 300 | +    def validate_cell_type(self, term_id: str) -> bool: |
| 301 | +        """Validate cell type against CL:0000000""" |
| 302 | +        return self.validate_term(term_id, "CL:0000000") |
| 303 | + |
| 304 | +    def validate_disease(self, term_id: str) -> bool: |
| 305 | +        """Validate disease against MONDO:0000001""" |
| 306 | +        return self.validate_term(term_id, "MONDO:0000001") |
| 307 | + |
| 308 | +    def validate_chemical(self, term_id: str) -> bool: |
| 309 | +        """Validate chemical against CHEBI:24431""" |
| 310 | +        return self.validate_term(term_id, "CHEBI:24431") |
| 311 | + |
| 312 | +# Usage example |
| 313 | +validator = DynamicEnumValidator() |
| 314 | +print(validator.validate_cell_type("CL:0000540"))  # True |
| 315 | +print(validator.validate_disease("MONDO:0005148"))  # True |
| 316 | +print(validator.validate_chemical("CHEBI:15377"))  # True |
| 317 | +``` |
| 318 | + |
| 319 | +## Best Practices |
| 320 | + |
| 321 | +### 1. Choose Appropriate Root Terms |
| 322 | +- Choose root terms that are specific enough to keep the value set from becoming overly broad |
| 323 | +- For cell types, consider using specific cell lineages rather than the root "cell" term |
| 324 | +- For diseases, use disease categories (infectious, genetic, etc.) when appropriate |
| 325 | + |
| 326 | +### 2. Include Ontology Prefixes in Schema |
| 327 | +```yaml |
| 328 | +prefixes: |
| 329 | +  CL: http://purl.obolibrary.org/obo/CL_ |
| 330 | +  MONDO: http://purl.obolibrary.org/obo/MONDO_ |
| 331 | +  CHEBI: http://purl.obolibrary.org/obo/CHEBI_ |
| 332 | +  UBERON: http://purl.obolibrary.org/obo/UBERON_ |
| 333 | +``` |
| 334 | + |
| 335 | +### 3. Validate During Development |
| 336 | +- Test dynamic enums with representative data during schema development |
| 337 | +- Use OAK to explore ontology hierarchies before choosing root terms |
| 338 | +- Document expected term formats and validation requirements |
| 339 | + |
| 340 | +### 4. Handle Validation Errors Gracefully |
| 341 | +```python |
| 342 | +def safe_validate_term(term_id: str, validator_func) -> Optional[bool]: |
| 343 | +    """Safely validate a term with error handling""" |
| 344 | +    try: |
| 345 | +        return validator_func(term_id) |
| 346 | +    except Exception as e: |
| 347 | +        print(f"Validation error for {term_id}: {e}") |
| 348 | +        return None |
| 349 | +``` |
| 350 | + |
| 351 | +## Limitations and Considerations |
| 352 | + |
| 353 | +### Current Limitations |
| 354 | +- Runtime enum expansion is still under development |
| 355 | +- Some ontology adapters may require internet connectivity |
| 356 | +- Large ontologies can make validation slow |
| 357 | +- Not all ontologies may be available through OAK |
| 358 | + |
| 359 | +### Performance Considerations |
| 360 | +- Cache ontology adapters when validating multiple terms |
| 361 | +- Consider using local ontology files for better performance |
| 362 | +- Batch validation calls when possible |
| 363 | + |
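One way to combine these three points is to expand each root term once and reuse the resulting set for every lookup. The sketch below is an illustration under stated assumptions: `CachedValueSet` is a hypothetical class, and `fetch_descendants` stands in for whatever call retrieves descendants from your ontology backend (for example, a thin wrapper around an OAK adapter).

```python
from typing import Callable, Iterable


class CachedValueSet:
    """Expand each root term once, then answer membership checks from memory."""

    def __init__(self, fetch_descendants: Callable[[str], Iterable[str]]):
        # fetch_descendants is a hypothetical callback supplied by the caller.
        self._fetch = fetch_descendants
        self._cache: dict[str, frozenset[str]] = {}

    def members(self, root_term: str) -> frozenset[str]:
        """Return the root term plus its descendants, fetching at most once."""
        if root_term not in self._cache:
            self._cache[root_term] = frozenset(self._fetch(root_term)) | {root_term}
        return self._cache[root_term]

    def validate_batch(self, term_ids: Iterable[str], root_term: str) -> dict[str, bool]:
        """Check many terms against one root with a single expansion."""
        allowed = self.members(root_term)
        return {term_id: term_id in allowed for term_id in term_ids}
```

The trade-off is memory for speed: the full descendant set of a large root (such as the NCBI Taxonomy root) may be too big to hold comfortably, in which case per-term ancestor lookups can remain the better choice.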
| 364 | +### Future Developments |
| 365 | +- Automated enum materialization from ontologies |
| 366 | +- Better integration with LinkML validators |
| 367 | +- Support for more relationship types and boolean combinations |
| 368 | +- Subset filtering capabilities |
| 369 | + |
| 370 | +## Additional Resources |
| 371 | + |
| 372 | +- [LinkML Dynamic Enums Documentation](https://linkml.io/linkml/schemas/enums.html#dynamic-enums) |
| 373 | +- [OAK (Ontology Access Kit) Documentation](https://incatools.github.io/ontology-access-kit/) |
| 374 | +- [LinkML GitHub Discussion on Dynamic Enums](https://github.com/orgs/linkml/discussions/2300) |
| 375 | +- [BioPortal Ontology Repository](https://bioportal.bioontology.org/) |
| 376 | +- [OBO Foundry Ontologies](http://www.obofoundry.org/) |
| 377 | + |
| 378 | +--- |
| 379 | + |
| 380 | +*This documentation covers the current state of dynamic value set validation in LinkML. As the framework continues to evolve, some features may become available that aren't yet implemented.* |