# feat: add configurable cache strategies for dynamic enums #7
## Conversation
Implements two caching strategies for dynamic enum validation:

- **Progressive (default)**: Validates terms lazily - checks the cache first, then queries the ontology directly for closure membership. Valid terms are added to the cache incrementally as they are encountered. Scales well for large ontologies (SNOMED with 100k+ terms).
- **Greedy**: Expands the entire enum upfront and caches all terms. Good for smaller, frequently-validated enums and for CI/CD, where determinism matters.

Changes:

- Add `CacheStrategy` enum (`PROGRESSIVE`, `GREEDY`) to models.py
- Change enum cache format from JSON to CSV for git-friendliness
- Add `is_value_in_enum()` for progressive validation via closure lookup
- Add `_add_to_enum_cache()` for append-only progressive caching
- Add `--cache-strategy` CLI option to the `validate-data` command
- Support `cache_strategy` in `oak_config.yaml` and the Python API

Documentation:

- Update docs/caching.md with an enum caching strategies section
- Update docs/configuration.md with `cache_strategy` options
- Update docs/cli-reference.md with the `--cache-strategy` option
- Add a caching strategies tutorial to notebooks/02_advanced_usage.ipynb

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
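As a rough illustration of the two modes, the toy sketch below contrasts greedy upfront expansion with progressive lazy lookup. This is not the library's actual implementation: `EnumCache` and `ONTOLOGY_CLOSURE` are invented for the example, and a real closure query would go through the ontology adapter.

```python
from enum import Enum


class CacheStrategy(str, Enum):
    PROGRESSIVE = "progressive"
    GREEDY = "greedy"


# Invented stand-in for a real ontology closure query (e.g. via OAK)
ONTOLOGY_CLOSURE = {"GO:0008150", "GO:0007049", "GO:0006915"}


class EnumCache:
    def __init__(self, strategy: CacheStrategy):
        self.strategy = strategy
        self.cache: set[str] = set()
        if strategy is CacheStrategy.GREEDY:
            # Greedy: expand the whole enum upfront
            self.cache = set(ONTOLOGY_CLOSURE)

    def is_valid(self, curie: str) -> bool:
        # Check the cache first
        if curie in self.cache:
            return True
        # Progressive: fall back to the ontology, caching hits incrementally
        if self.strategy is CacheStrategy.PROGRESSIVE and curie in ONTOLOGY_CLOSURE:
            self.cache.add(curie)
            return True
        return False
```

With greedy, the cache is fully populated at construction time; with progressive, it starts empty and grows only with terms that are actually validated.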
## Pull request overview
This PR adds configurable caching strategies for dynamic enum validation, addressing scalability concerns for large ontologies. The implementation introduces progressive (default, lazy validation) and greedy (upfront expansion) modes, with progressive scaling well for large ontologies like SNOMED (100k+ terms) while greedy provides deterministic behavior suitable for CI/CD. The cache format was also changed from JSON to CSV for better git-friendliness and human readability.
Key changes:
- Added `CacheStrategy` enum with `PROGRESSIVE` and `GREEDY` modes
- Implemented lazy validation in progressive mode (checks cache → ontology → adds to cache)
- Added strict mode validation to catch fabricated/non-existent term IDs
- Changed enum cache format from JSON to CSV
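The strict-mode idea in the list above can be sketched as a small helper: unknown term IDs are escalated to errors in strict mode and downgraded to warnings otherwise. This is a hypothetical illustration; `KNOWN_TERMS` and `check_term` are invented names, and the real plugin queries the ontology rather than a dict.

```python
# Invented stand-in for an ontology lookup of existing term IDs
KNOWN_TERMS = {"GO:0008150": "biological_process"}


def check_term(curie: str, strict: bool) -> list[str]:
    """Return validation messages; strict mode escalates unknown IDs to errors."""
    if curie in KNOWN_TERMS:
        return []
    severity = "ERROR" if strict else "WARNING"
    return [f"{severity}: {curie} does not exist in the ontology"]
```

This is what lets strict mode catch fabricated term IDs that would otherwise pass silently.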
### Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_plugins.py | Extensive test coverage for strict/lenient modes, cache strategies, and dynamic enum validation with bindings |
| tests/test_cli.py | Added test verifying --lenient option appears in CLI help |
| src/linkml_term_validator/plugins/dynamic_enum_plugin.py | Refactored to support both progressive and greedy caching strategies for dynamic enum validation |
| src/linkml_term_validator/plugins/binding_plugin.py | Added strict mode, dynamic enum closure validation, and term existence checking |
| src/linkml_term_validator/plugins/base.py | Added enum caching infrastructure, progressive validation, and dynamic enum expansion methods |
| src/linkml_term_validator/models.py | Added CacheStrategy enum and cache_strategy field to ValidationConfig |
| src/linkml_term_validator/cli.py | Added --cache-strategy and --lenient CLI options |
| notebooks/02_advanced_usage.ipynb | Added tutorial section demonstrating cache strategies |
| mkdocs.yml | Added new how-to guide for binding validation |
| docs/howto-bindings.md | New comprehensive guide for validating bindings with dynamic enums |
| docs/configuration.md | Updated with cache strategy configuration examples |
| docs/cli-reference.md | Added documentation for new CLI options |
| docs/caching.md | Expanded with detailed cache strategy explanations and recommendations |
| docs/binding-validation.md | Updated with strict mode documentation |
| docs/anti-hallucination.md | Minor formatting improvements |
Co-authored-by: Copilot <[email protected]>
Claude finished @cmungall's task - View job

## Pull Request Code Review

I've completed a comprehensive analysis of this PR that adds configurable caching strategies for dynamic enum validation. Here's my detailed review:

### 🔍 Overall Assessment

This is a well-architected and thoughtfully implemented feature that addresses important performance and usability concerns. The code quality is high, with excellent separation of concerns, comprehensive testing, and thorough documentation.

### ✅ Strengths

- Architecture & Design
- Code Quality
- Testing
### 🚨 Issues Found

#### 1. Security - File Path Validation (Medium Priority)

**Location:** The cache file path generation uses user input without sufficient validation:

```python
prefix_dir = self.config.cache_dir / prefix.lower()       # base.py:140
safe_name = re.sub(r"[^\w\-]", "_", enum_name.lower())    # base.py:338
```

**Issue:** No validation that the prefix resolves to a safe directory name.

**Recommendation:** Add path traversal protection:

```python
def _get_cache_file(self, prefix: str) -> Path:
    # Sanitize prefix to prevent path traversal
    safe_prefix = re.sub(r"[^\w\-]", "_", prefix.lower())
    if not safe_prefix or safe_prefix in (".", ".."):
        raise ValueError(f"Invalid prefix for caching: {prefix}")
    prefix_dir = self.config.cache_dir / safe_prefix
```

#### 2. Bug - Cache File Corruption Risk (Medium Priority)

**Location:** The progressive caching uses append mode without file locking:

```python
with open(cache_file, "a", newline="") as f:  # base.py:407
    writer = csv.DictWriter(f, fieldnames=["curie"])
    if not file_exists:
        writer.writeheader()
    writer.writerow({"curie": value})
```

**Issue:** Concurrent processes could corrupt the CSV file or write duplicate headers.

**Recommendation:** Use file locking or atomic writes:

```python
import fcntl  # Unix only; use an appropriate cross-platform solution

with open(cache_file, "a", newline="") as f:
    fcntl.flock(f.fileno(), fcntl.LOCK_EX)
    # ... write operations
```

#### 3. Performance - Inefficient Cache Loading (Low Priority)

**Location:** The label caching reloads and rewrites the entire cache file for single additions:

```python
def _save_to_cache(self, prefix: str, curie: str, label: str) -> None:
    existing = self._load_cache(prefix)  # Loads entire file
    existing[curie] = label
    # Writes entire cache back
```

**Recommendation:** For progressive mode, consider append-only operations or batch updates.

#### 4. Minor - Misleading Comment (Low Priority)

**Location:** As noted by copilot-pull-request-reviewer, this comment is misleading:

```python
return False  # Term doesn't exist
```

**Should be:** a comment that reflects the actual condition being checked.

### 📊 Performance Analysis

- Progressive Mode ✅
- Greedy Mode
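The locking recommendation in issue 2 can be fleshed out into a complete helper. The sketch below is illustrative, not the library's code: `append_to_enum_cache` is an invented name, `fcntl.flock` is Unix-only, and the empty-file check is done under the lock so only one process writes the header.

```python
import csv
import fcntl
import os
from pathlib import Path


def append_to_enum_cache(cache_file: Path, curie: str) -> None:
    """Append one CURIE to a CSV cache, guarded against concurrent writers."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "a", newline="") as f:
        # Exclusive advisory lock, released when the file is closed
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        writer = csv.writer(f)
        # Check emptiness under the lock so only one writer emits the header
        if os.fstat(f.fileno()).st_size == 0:
            writer.writerow(["curie"])
        writer.writerow([curie])
```

Because the size check happens after the lock is acquired, two processes racing to create the file cannot both write the `curie` header.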
## Summary

- `--cache-strategy` CLI option and `cache_strategy` config support

## Details

- Progressive Caching (Default)
- Greedy Caching

### Cache Format

CSV files with a `curie` header - pandas-compatible, grep-able, git-diff-friendly:

```
curie
GO:0008150
GO:0007049
```

## Test plan
🤖 Generated with Claude Code
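Given the CSV cache format shown above, a cache file can be loaded with nothing but the standard library. In this sketch the inline string stands in for a real cache file path, which would vary by prefix and enum name.

```python
import csv
from io import StringIO

# Stand-in for opening a real cache file; content matches the format above
cache_text = "curie\nGO:0008150\nGO:0007049\n"

with StringIO(cache_text) as f:
    cached = {row["curie"] for row in csv.DictReader(f)}

print("GO:0008150" in cached)  # → True
```

The same one-column layout also loads directly into pandas or survives `grep` and `git diff` cleanly, which is the point of the JSON-to-CSV switch.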