Add Entrez-based reference sources (GEO, BioProject, BioSample) #15
Introduces a shared EntrezSummarySource base class for NCBI Entrez E-utilities summary endpoints, with three concrete implementations:
- GEOSource: GSE/GDS series from the gds database
- BioProjectSource: PRJNA/PRJEB/PRJDB from the bioproject database
- BioSampleSource: SAMN/SAME/SAMD from the biosample database

Key features:
- Automatic detection of prefixed and bare accessions
- Reference prefix map for normalizing alternate prefixes (e.g., NCBIGeo → GEO)
- Shared validation config loading from .linkml-reference-validator.yaml
- Full test coverage with mocked Entrez calls
- Documentation with usage examples and prefix aliasing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Claude finished @cmungall's task

Code Review Complete ✅

I've conducted a comprehensive code review of this PR. Here's my analysis:

Review Checklist:

Summary

This is a well-designed, high-quality PR that extends the linkml-reference-validator with NCBI Entrez support. The architecture is sound, code quality is excellent, and test coverage is comprehensive.

✅ Strengths

Excellent Architecture

Robust Implementation

Comprehensive Test Coverage

Good Documentation

🔍 Minor Areas for Consideration

Regex Pattern Safety

The ID patterns use escaped backslashes in raw strings:

    ID_PATTERNS = (r"^GSE\\d+$", r"^GDS\\d+$")  # entrez.py:199

While functionally correct, this could be simplified to:

    ID_PATTERNS = (r"^GSE\d+$", r"^GDS\d+$")

Error Recovery

The implementation handles API failures gracefully by returning

Configuration Complexity

The prefix normalization logic in

🛡️ Security Assessment

⚡ Performance Considerations

🧪 Test Quality

The test suite is exemplary:

📚 Code Quality

🎯 Recommendation

APPROVE - This PR is ready for merge. It demonstrates excellent software engineering practices, comprehensive testing, and thoughtful design. The implementation is robust, secure, and well-integrated with the existing architecture. The additions enable powerful new validation capabilities for genomics/omics datasets while maintaining the high code quality standards of the project.
Pull request overview
This PR adds support for validating supporting text against three NCBI Entrez-based reference sources (GEO, BioProject, BioSample) through a shared EntrezSummarySource base class. It also introduces configurable prefix mapping to normalize alternate prefix styles (e.g., geo: → GEO, NCBIGeo: → GEO) and enhances CLI commands with config file support for both validation and repair settings.
Key changes:
- New EntrezSummarySource base class that standardizes fetching from NCBI E-utilities esummary endpoints
- Three concrete source implementations (GEOSource, BioProjectSource, BioSampleSource) with pattern-based accession recognition
- reference_prefix_map configuration option for normalizing alternate prefixes across all reference sources
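
To make the prefix normalization concrete, here is a minimal sketch of how a reference_prefix_map could be applied to an incoming reference. The function names `split_reference` and `normalize_reference` are hypothetical and are not taken from the PR; the case-insensitive fallback is also an assumption about the intended behavior, not a description of reference_fetcher.py.

```python
from typing import Dict, Tuple


def split_reference(reference: str) -> Tuple[str, str]:
    """Split 'prefix:local_id' into its parts; prefix is '' for bare IDs."""
    prefix, sep, local_id = reference.partition(":")
    if not sep:
        return "", reference.strip()
    return prefix.strip(), local_id.strip()


def normalize_reference(reference: str, prefix_map: Dict[str, str]) -> str:
    """Rewrite an alternate prefix to its canonical form using prefix_map.

    Lookup tries an exact match first, then a case-insensitive match
    (an assumption for this sketch). Unmapped prefixes and bare
    accessions are returned unchanged.
    """
    prefix, local_id = split_reference(reference)
    if not prefix:
        return local_id
    lowered = {key.lower(): value for key, value in prefix_map.items()}
    canonical = prefix_map.get(prefix) or lowered.get(prefix.lower(), prefix)
    return f"{canonical}:{local_id}"


# Example map mirroring the aliases named in the overview above.
PREFIX_MAP = {"geo": "GEO", "NCBIGeo": "GEO"}

print(normalize_reference("NCBIGeo:GSE12345", PREFIX_MAP))  # GEO:GSE12345
```

Normalizing before any source's can_handle() is consulted would let each source check a single canonical prefix instead of every alias.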
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| src/linkml_reference_validator/etl/sources/entrez.py | Implements new EntrezSummarySource base class and three concrete Entrez-backed sources with configurable field mappings |
| src/linkml_reference_validator/models.py | Adds reference_prefix_map field to ReferenceValidationConfig for prefix normalization |
| src/linkml_reference_validator/etl/reference_fetcher.py | Implements prefix normalization and mapping logic for reference ID parsing |
| src/linkml_reference_validator/etl/sources/__init__.py | Exports new Entrez source classes |
| src/linkml_reference_validator/cli/shared.py | Adds config file loading with validation section extraction |
| src/linkml_reference_validator/cli/validate.py | Updates validate commands to accept config file option |
| src/linkml_reference_validator/cli/repair.py | Refactors repair config loading and adds config file support to text command |
| src/linkml_reference_validator/cli/cache.py | Adds config file option to cache reference command |
| tests/test_validation_config.py | New tests for validation config loading from YAML files |
| tests/test_sources.py | Tests for new Entrez sources including fetch and can_handle behavior |
| tests/test_reference_fetcher.py | Tests for prefix mapping functionality |
| docs/how-to/validate-entrez.md | New documentation guide for using Entrez-based sources |
| docs/how-to/add-reference-source.md | Updated with EntrezSummarySource example for custom sources |
| docs/how-to/repair-validation-errors.md | Updated to show validation and repair config sections |
| docs/reference/cli.md | Documents new --config option across commands and configuration structure |
| mkdocs.yml | Adds validate-entrez guide to navigation |
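
The "configurable field mappings" noted for entrez.py show up in the reviewed snippets as TITLE_FIELDS and CONTENT_FIELDS tuples. One plausible reading is that each tuple is an ordered fallback list: the first field with a non-empty value wins. The sketch below illustrates that reading only; the helper `first_present` and the sample record are hypothetical, and the actual selection logic in the PR may differ.

```python
from typing import Any, Mapping, Optional, Sequence

# Field orderings copied from the GEO snippet under review.
TITLE_FIELDS = ("title", "description", "summary")
CONTENT_FIELDS = ("summary", "description", "title")


def first_present(record: Mapping[str, Any], fields: Sequence[str]) -> Optional[str]:
    """Return the first non-empty field value, in priority order, or None."""
    for name in fields:
        value = record.get(name)
        if value is None:
            continue
        text = str(value).strip()
        if text:
            return text
    return None


# Hypothetical summary record: the title is empty, so the title lookup
# falls back to the description, while the content lookup finds a summary.
record = {"title": "", "description": "example description", "summary": "example summary"}

print(first_present(record, TITLE_FIELDS))    # example description
print(first_present(record, CONTENT_FIELDS))  # example summary
```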

    yaml = YAML(typ="safe")
    with open(config_file) as f:
        config_data = yaml.load(f)
Copilot
AI
Dec 23, 2025
Missing error handling for file I/O operations. If the config file cannot be read (e.g., due to permissions issues, corrupted file, or invalid YAML), the function will raise an unhandled exception. Consider wrapping the file operations in a try-except block to catch FileNotFoundError, PermissionError, and YAML parsing errors, then either log a warning and return defaults or raise a more informative error message.
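
A minimal sketch of the handling suggested here, written so that it runs with the standard library alone. The names `ConfigLoadError` and `read_config_data` are hypothetical. `json.load` stands in for the YAML loader used in the PR, and the parser and its error types are passed in as parameters, so the real loader and its parse-error class could be supplied instead.

```python
import json
from pathlib import Path
from typing import Any, Callable, TextIO, Tuple, Type


class ConfigLoadError(Exception):
    """Hypothetical error: the config file could not be read or parsed."""


def read_config_data(
    config_file: Path,
    parse: Callable[[TextIO], Any] = json.load,  # stand-in for the YAML loader
    parse_errors: Tuple[Type[Exception], ...] = (ValueError,),
) -> dict:
    """Read a config file and return its top-level mapping.

    I/O failures and parse failures are re-raised as ConfigLoadError with
    the file path in the message, so callers see one clear error type.
    """
    try:
        with open(config_file, encoding="utf-8") as f:
            data = parse(f)
    except OSError as exc:  # covers FileNotFoundError and PermissionError
        raise ConfigLoadError(f"Cannot read config file {config_file}: {exc}") from exc
    except parse_errors as exc:
        raise ConfigLoadError(f"Cannot parse config file {config_file}: {exc}") from exc

    if data is None:  # an empty document parses to None in many loaders
        return {}
    if not isinstance(data, dict):
        raise ConfigLoadError(
            f"Config file {config_file} must contain a mapping at the top level"
        )
    return data
```

Raising a single informative error, rather than silently returning defaults, keeps a typo in the config path from going unnoticed; logging a warning and falling back to defaults is the other option the comment mentions.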

    if validation_data is None:
        return ReferenceValidationConfig()

    return ReferenceValidationConfig(**validation_data)
Copilot
AI
Dec 23, 2025
Missing error handling for invalid configuration values. If the validation_data dict contains invalid values for ReferenceValidationConfig fields (e.g., wrong type, out of range values), the Pydantic model initialization will raise a ValidationError. This should be caught and handled gracefully, either by logging a warning and returning defaults or providing a clear error message to the user about which configuration value is invalid.
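
One way to act on this comment is a small helper that builds the config object and converts construction failures into a message naming the offending section. The sketch below uses a dataclass as a stand-in for the Pydantic model so that it runs without third-party packages. `ConfigValueError`, `StandInValidationConfig`, and `build_config` are hypothetical names. It catches `ValueError` on the assumption that Pydantic's `ValidationError` derives from it, which is worth confirming against the installed version.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional, Type, TypeVar

T = TypeVar("T")


class ConfigValueError(Exception):
    """Hypothetical error: a config section holds an invalid value."""


@dataclass
class StandInValidationConfig:
    """Stand-in for ReferenceValidationConfig; the check is illustrative only."""

    reference_prefix_map: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not isinstance(self.reference_prefix_map, dict):
            raise ValueError("reference_prefix_map must be a mapping")


def build_config(model_cls: Type[T], data: Optional[Mapping[str, Any]], section: str) -> T:
    """Build model_cls from a config section, reporting bad values clearly."""
    if data is None:
        return model_cls()  # missing section: fall back to defaults
    if not isinstance(data, Mapping):
        raise ConfigValueError(
            f"'{section}' section must be a mapping, got {type(data).__name__}"
        )
    try:
        return model_cls(**data)
    except (TypeError, ValueError) as exc:  # unknown key or invalid value
        raise ConfigValueError(f"Invalid value in '{section}' section: {exc}") from exc
```

The same helper could serve both the validation section and the repair section, since only the model class and the section name differ.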

    if "repair" in config_data:
        repair_data = config_data.get("repair")
        if isinstance(repair_data, dict):
            return RepairConfig(**repair_data)
Copilot
AI
Dec 23, 2025
Missing error handling for invalid configuration values. If the repair_data dict contains invalid values for RepairConfig fields, the Pydantic model initialization will raise a ValidationError. This should be caught and handled gracefully, either by logging a warning and returning defaults or providing a clear error message to the user about which configuration value is invalid.

    repair_keys = set(RepairConfig.model_fields.keys())
    if repair_keys.intersection(config_data.keys()):
        return RepairConfig(**config_data)
Copilot
AI
Dec 23, 2025
Missing error handling for invalid configuration values. If the config_data dict contains invalid values for RepairConfig fields (at line 296), the Pydantic model initialization will raise a ValidationError. This should be caught and handled gracefully, either by logging a warning and returning defaults or providing a clear error message to the user about which configuration value is invalid.

    ENTREZ_DB = "biosample"
    TITLE_FIELDS = ("Title", "title", "Description")
    CONTENT_FIELDS = ("Description", "Title", "title")
    ID_PATTERNS = (r"^SAM[END]\\d+$",)
Copilot
AI
Dec 23, 2025
Incorrect regex pattern: the pattern uses \\d in a raw string, which will match a literal backslash followed by 'd', not a digit character class. In raw strings (r"..."), you should use \d (single backslash) to match digits. The pattern should be r"^SAM[END]\d+$" instead of r"^SAM[END]\\d+$". This will cause the can_handle method to fail to recognize valid BioSample accession patterns.
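
The difference between the two spellings can be checked directly with the standard `re` module. The snippet below compares the pattern as written in the diff with the corrected form; the helper `matches` is introduced here only for illustration.

```python
import re

BUGGY = r"^SAM[END]\\d+$"  # as in the diff: literal backslash, then one or more 'd'
FIXED = r"^SAM[END]\d+$"   # corrected: one or more digits


def matches(pattern: str, accession: str) -> bool:
    """Return True if the accession matches the pattern from its start."""
    return re.match(pattern, accession) is not None


print(matches(BUGGY, "SAMN12345"))  # False: no literal backslash in the input
print(matches(FIXED, "SAMN12345"))  # True
print(matches(BUGGY, "SAMN\\ddd"))  # True: input is SAMN, a backslash, then 'ddd'
```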

    def test_can_handle_entrez_sources(self, source, valid_id, invalid_id):
        """Should handle prefixed Entrez references and reject others."""
        assert source.can_handle(valid_id)
        assert not source.can_handle(invalid_id)
Copilot
AI
Dec 23, 2025
Missing test coverage for bare accession patterns. The test only validates prefixed references (e.g., "geo:GSE12345") but does not test the bare accession pattern matching (e.g., "GSE12345", "GDS12345") that uses ID_PATTERNS. Add test cases for bare accessions to ensure the regex patterns in ID_PATTERNS work correctly, especially given the regex bug in the ID_PATTERNS themselves.
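
A table-driven check along these lines could cover the bare accessions. Because the real source classes are not reproduced in this thread, the sketch uses a hypothetical `StandInSource` whose can_handle() accepts either a matching prefix or a bare accession matching one of its patterns; that behavior is an assumption. The patterns are the corrected single-backslash forms, and the accession numbers are made-up examples. In the project's test suite the same cases would more naturally be expressed with pytest parametrization.

```python
import re
from typing import List, Sequence, Tuple


class StandInSource:
    """Hypothetical stand-in for an Entrez source's can_handle() logic."""

    def __init__(self, prefix: str, id_patterns: Sequence[str]) -> None:
        self.prefix = prefix.lower()
        self.id_patterns = [re.compile(p) for p in id_patterns]

    def can_handle(self, reference: str) -> bool:
        prefix, sep, _local_id = reference.partition(":")
        if sep:  # prefixed reference: compare the prefix only
            return prefix.strip().lower() == self.prefix
        return any(p.match(reference.strip()) for p in self.id_patterns)


SOURCES = {
    "geo": StandInSource("geo", (r"^GSE\d+$", r"^GDS\d+$")),
    "bioproject": StandInSource("bioproject", (r"^PRJ[EDN][A-Z]?\d+$",)),
    "biosample": StandInSource("biosample", (r"^SAM[END]\d+$",)),
}

# (source name, reference, expected can_handle result)
CASES: List[Tuple[str, str, bool]] = [
    ("geo", "geo:GSE12345", True),
    ("geo", "GSE12345", True),
    ("geo", "GDS12345", True),
    ("geo", "GSM12345", False),
    ("bioproject", "PRJNA12345", True),
    ("bioproject", "PRJEB12345", True),
    ("bioproject", "PRJDB12345", True),
    ("bioproject", "PRJ12345", False),
    ("biosample", "SAMN12345", True),
    ("biosample", "SAME12345", True),
    ("biosample", "SAMD12345", True),
    ("biosample", "SAMX12345", False),
]


def run_cases() -> List[Tuple[str, str, bool]]:
    """Return the cases whose actual result differs from the expected one."""
    return [
        (name, ref, expected)
        for name, ref, expected in CASES
        if SOURCES[name].can_handle(ref) != expected
    ]
```

With the double-backslash patterns from the diff substituted in, every bare-accession case expected to be True would be reported as a failure, which is the regression such a test is meant to catch.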

    ENTREZ_DB = "example_db"
    TITLE_FIELDS = ("title", "name")
    CONTENT_FIELDS = ("summary", "description")
    ID_PATTERNS = (r"^EX\\d+$",)
Copilot
AI
Dec 23, 2025
Documentation contains incorrect regex pattern example. The pattern r"^EX\\d+$" uses double backslash in a raw string, which will match a literal backslash followed by 'd', not a digit. The correct pattern should be r"^EX\d+$" (single backslash in raw string). This documentation error mirrors the bug found in the actual implementation code and should be corrected to avoid misleading developers.
Suggested change:

    - ID_PATTERNS = (r"^EX\\d+$",)
    + ID_PATTERNS = (r"^EX\d+$",)

    ...     ENTREZ_DB = "example_db"
    ...     TITLE_FIELDS = ("title",)
    ...     CONTENT_FIELDS = ("summary",)
    ...     ID_PATTERNS = (r"^EX\\d+$",)
Copilot
AI
Dec 23, 2025
Documentation example contains incorrect regex pattern. The pattern r"^EX\\d+$" uses double backslash in a raw string, which will match a literal backslash followed by 'd', not a digit. The correct pattern should be r"^EX\d+$". This documentation error mirrors the bug in the ID_PATTERNS used by the concrete implementations and should be corrected.
Suggested change:

    - ...     ID_PATTERNS = (r"^EX\\d+$",)
    + ...     ID_PATTERNS = (r"^EX\d+$",)

    ENTREZ_DB = "gds"
    TITLE_FIELDS = ("title", "description", "summary")
    CONTENT_FIELDS = ("summary", "description", "title")
    ID_PATTERNS = (r"^GSE\\d+$", r"^GDS\\d+$")
Copilot
AI
Dec 23, 2025
Incorrect regex pattern: the pattern uses \\d in a raw string, which will match a literal backslash followed by 'd', not a digit character class. In raw strings (r"..."), you should use \d (single backslash) to match digits. The pattern should be r"^GSE\d+$" instead of r"^GSE\\d+$". This will cause the can_handle method to fail to recognize valid GEO accession patterns.

    ENTREZ_DB = "bioproject"
    TITLE_FIELDS = ("Project_Title", "Project_Name", "title")
    CONTENT_FIELDS = ("Project_Description", "Description", "title")
    ID_PATTERNS = (r"^PRJ[EDN][A-Z]?\\d+$",)
Copilot
AI
Dec 23, 2025
Incorrect regex pattern: the pattern uses \\d in a raw string, which will match a literal backslash followed by 'd', not a digit character class. In raw strings (r"..."), you should use \d (single backslash) to match digits. The pattern should be r"^PRJ[EDN][A-Z]?\d+$" instead of r"^PRJ[EDN][A-Z]?\\d+$". This will cause the can_handle method to fail to recognize valid BioProject accession patterns.

Summary
- EntrezSummarySource base class for NCBI Entrez E-utilities esummary endpoints
- GEOSource (geo:GSE12345, GDS*) - fetches from gds database
- BioProjectSource (bioproject:PRJNA*, PRJEB*, PRJDB*) - fetches from bioproject database
- BioSampleSource (biosample:SAMN*, SAME*, SAMD*) - fetches from biosample database
- reference_prefix_map config option for normalizing alternate prefixes (e.g., NCBIGeo → GEO)
- Shared validation config loading from .linkml-reference-validator.yaml

Motivation

GEO, BioProject, and SRA all use the same NCBI Entrez API infrastructure as PMIDSource. This provides a natural extension for validating supporting text against genomics/omics dataset metadata.

Test plan
- can_handle() with prefixed and bare accessions

Documentation
- docs/how-to/validate-entrez.md with usage examples
- docs/how-to/add-reference-source.md with Entrez example

🤖 Generated with Claude Code