Add Entrez-based reference sources (GEO, BioProject, BioSample) #15
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,116 @@ | ||
| # Validating Entrez Accessions | ||
|
|
||
| This guide shows how to validate supporting text against NCBI Entrez records for GEO, BioProject, and BioSample. | ||
|
|
||
| ## Overview | ||
|
|
||
| These sources use the NCBI Entrez E-utilities `esummary` endpoint: | ||
|
|
||
| - **GEO** (GSE/GDS): summaries from the `gds` database | ||
| - **BioProject** (PRJNA/PRJEB/PRJDB): summaries from the `bioproject` database | ||
| - **BioSample** (SAMN/SAME/SAMD): summaries from the `biosample` database | ||
|
|
||
| The validator uses the returned summary/description fields as the content for matching. | ||
|
|
||
| ## Basic Usage | ||
|
|
||
| ### GEO (GSE or GDS) | ||
|
|
||
| ```bash | ||
| linkml-reference-validator validate text \ | ||
| "RNA-seq analysis of cardiac tissue" \ | ||
| GEO:GSE12345 | ||
| ``` | ||
|
|
||
| ### BioProject | ||
|
|
||
| ```bash | ||
| linkml-reference-validator validate text \ | ||
| "Whole genome sequencing project for strain X" \ | ||
| BioProject:PRJNA12345 | ||
| ``` | ||
|
|
||
| ### BioSample | ||
|
|
||
| ```bash | ||
| linkml-reference-validator validate text \ | ||
| "Human liver biopsy sample description" \ | ||
| BioSample:SAMN12345678 | ||
| ``` | ||
|
|
||
| ## Accepted Identifier Formats | ||
|
|
||
| You can use either prefixed or bare accessions: | ||
|
|
||
| ``` | ||
| GEO:GSE12345 | ||
| GDS12345 | ||
| BioProject:PRJNA12345 | ||
| PRJEB12345 | ||
| BioSample:SAMN12345678 | ||
| SAME1234567 | ||
| ``` | ||
|
|
||
| ## Prefix Aliases and Normalization | ||
|
|
||
| Prefixes are case-insensitive and can be normalized with a configuration map. This | ||
| is useful when data uses alternate prefix styles such as `geo:` or `NCBIGeo:`. | ||
|
|
||
| Create `.linkml-reference-validator.yaml` with a `validation` section: | ||
|
|
||
| ```yaml | ||
| validation: | ||
| reference_prefix_map: | ||
| geo: GEO | ||
| NCBIGeo: GEO | ||
| NCBIBioProject: BIOPROJECT | ||
| NCBIBioSample: BIOSAMPLE | ||
| ``` | ||
|
|
||
| You can also configure this programmatically: | ||
|
|
||
| ```python | ||
| from linkml_reference_validator.models import ReferenceValidationConfig | ||
|
|
||
| config = ReferenceValidationConfig( | ||
| reference_prefix_map={"geo": "GEO", "NCBIGeo": "GEO"} | ||
| ) | ||
| ``` | ||
|
|
||
| Pass the config file to CLI commands with `--config .linkml-reference-validator.yaml`. | ||
|
|
||
| ## Pre-caching Entrez Records | ||
|
|
||
| For offline validation or to speed up repeated validations: | ||
|
|
||
| ```bash | ||
| linkml-reference-validator cache reference GEO:GSE12345 | ||
| linkml-reference-validator cache reference BioProject:PRJNA12345 | ||
| linkml-reference-validator cache reference BioSample:SAMN12345678 | ||
| ``` | ||
|
|
||
| Cached references are stored in `references_cache/` as markdown files with YAML frontmatter. | ||
|
|
||
| ## Rate Limiting and Email | ||
|
|
||
| NCBI requires a valid contact email for Entrez API usage. Configure it in your settings: | ||
|
|
||
| ```python | ||
| from linkml_reference_validator.models import ReferenceValidationConfig | ||
|
|
||
| config = ReferenceValidationConfig( | ||
| email="[email protected]", | ||
| rate_limit_delay=0.5, | ||
| ) | ||
| ``` | ||
|
|
||
| ## Content Availability | ||
|
|
||
| Entrez summaries vary by record. If a summary field is missing, the validator will return | ||
| `content_type: unavailable` and matching may fail. | ||
|
|
||
| ## See Also | ||
|
|
||
| - [Adding a New Reference Source](add-reference-source.md) | ||
| - [Quickstart](../quickstart.md) | ||
| - [CLI Reference](../reference/cli.md) |
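
The accession families and prefix aliases described in the guide above can be sketched as follows. This is a minimal illustration, not the code added by this PR: `normalize_prefix`, `entrez_database`, and `ACCESSION_DATABASES` are hypothetical names, and the regular expressions are an assumption based only on the accession prefixes listed in the Overview.

```python
import re

# Accession families from the guide's Overview, mapped to Entrez databases.
# Assumption: any run of digits is accepted; real accessions may be stricter.
ACCESSION_DATABASES = [
    (re.compile(r"^(GSE|GDS)\d+$"), "gds"),
    (re.compile(r"^PRJ(NA|EB|DB)\d+$"), "bioproject"),
    (re.compile(r"^SAM[NED][A-Z]?\d+$"), "biosample"),
]


def normalize_prefix(reference, prefix_map):
    """Split 'prefix:accession' and canonicalize the prefix.

    Lookup in prefix_map is case-insensitive; unmapped prefixes are
    upper-cased. Bare accessions return None as the prefix.
    """
    if ":" not in reference:
        return None, reference.strip()
    prefix, accession = reference.split(":", 1)
    lowered = {key.lower(): value for key, value in prefix_map.items()}
    canonical = lowered.get(prefix.strip().lower(), prefix.strip().upper())
    return canonical, accession.strip()


def entrez_database(reference, prefix_map=None):
    """Return the Entrez database for a prefixed or bare accession, or None."""
    _, accession = normalize_prefix(reference, prefix_map or {})
    accession = accession.upper()
    for pattern, database in ACCESSION_DATABASES:
        if pattern.match(accession):
            return database
    return None


aliases = {"geo": "GEO", "NCBIGeo": "GEO"}
print(normalize_prefix("NCBIGeo:GSE12345", aliases))
print(entrez_database("PRJEB12345"))
```

Resolving the database from the accession itself, rather than from the prefix, is one way to let bare accessions such as `GDS12345` work without a prefix; whether the PR takes this route is not visible in the diff shown here.
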
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,13 +13,16 @@ | |
| from ruamel.yaml import YAML | ||
| from typing_extensions import Annotated | ||
|
|
||
| - from linkml_reference_validator.models import ( | ||
| - ReferenceValidationConfig, | ||
| - RepairConfig, | ||
| - ) | ||
| + from linkml_reference_validator.models import RepairConfig | ||
| from linkml_reference_validator.validation.repairer import SupportingTextRepairer | ||
|
|
||
| - from .shared import CacheDirOption, VerboseOption, setup_logging | ||
| + from .shared import ( | ||
| + CacheDirOption, | ||
| + VerboseOption, | ||
| + ConfigFileOption, | ||
| + setup_logging, | ||
| + load_validation_config, | ||
| + ) | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
@@ -75,13 +78,7 @@ def data_command( | |
| output: OutputOption = None, | ||
| cache_dir: CacheDirOption = None, | ||
| verbose: VerboseOption = False, | ||
| - config_file: Annotated[ | ||
| - Optional[Path], | ||
| - typer.Option( | ||
| - "--config", | ||
| - help="Path to repair configuration file (.yaml)", | ||
| - ), | ||
| - ] = None, | ||
| + config_file: ConfigFileOption = None, | ||
| ): | ||
| """Repair supporting text in a data file. | ||
|
|
||
|
|
@@ -120,7 +117,7 @@ def data_command( | |
| repair_config.dry_run = dry_run | ||
|
|
||
| # Set up validation config | ||
| - val_config = ReferenceValidationConfig() | ||
| + val_config = load_validation_config(config_file) | ||
| if cache_dir: | ||
| val_config.cache_dir = cache_dir | ||
|
|
||
|
|
@@ -198,6 +195,7 @@ def text_command( | |
| cache_dir: CacheDirOption = None, | ||
| verbose: VerboseOption = False, | ||
| auto_fix_threshold: AutoFixThresholdOption = 0.95, | ||
| + config_file: ConfigFileOption = None, | ||
| ): | ||
| """Attempt to repair a single supporting text quote. | ||
|
|
||
|
|
@@ -214,7 +212,7 @@ def text_command( | |
| """ | ||
| setup_logging(verbose) | ||
|
|
||
| - val_config = ReferenceValidationConfig() | ||
| + val_config = load_validation_config(config_file) | ||
| if cache_dir: | ||
| val_config.cache_dir = cache_dir | ||
|
|
||
|
|
@@ -283,10 +281,21 @@ def _load_repair_config(config_file: Optional[Path]) -> RepairConfig: | |
| if config_data is None: | ||
| return RepairConfig() | ||
|
|
||
| + if not isinstance(config_data, dict): | ||
| + return RepairConfig() | ||
|
|
||
| # Extract repair section if present | ||
| - repair_data = config_data.get("repair", config_data) | ||
| + if "repair" in config_data: | ||
| + repair_data = config_data.get("repair") | ||
| + if isinstance(repair_data, dict): | ||
| + return RepairConfig(**repair_data) | ||
|
||
| + return RepairConfig() | ||
|
|
||
| + repair_keys = set(RepairConfig.model_fields.keys()) | ||
| + if repair_keys.intersection(config_data.keys()): | ||
| + return RepairConfig(**config_data) | ||
|
||
|
|
||
| - return RepairConfig(**repair_data) | ||
| + return RepairConfig() | ||
|
|
||
|
|
||
| def _extract_evidence_items( | ||
|
|
||
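
The selection logic that the last hunk introduces in `_load_repair_config` can be illustrated with a self-contained sketch. `RepairConfigSketch` is a hypothetical dataclass standing in for the project's `RepairConfig`; its two fields and their defaults are assumptions chosen for illustration, and `dataclasses.fields` replaces the `model_fields` lookup used in the diff.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class RepairConfigSketch:
    """Hypothetical stand-in for RepairConfig; the fields are illustrative only."""

    dry_run: bool = False
    auto_fix_threshold: float = 0.95


def select_repair_config(config_data: Any) -> RepairConfigSketch:
    """Choose repair settings from parsed YAML, mirroring the hunk's branches."""
    # Anything that is not a mapping (None, list, scalar) falls back to defaults.
    if not isinstance(config_data, dict):
        return RepairConfigSketch()

    # An explicit "repair" section wins, but only if it is itself a mapping.
    if "repair" in config_data:
        repair_data = config_data["repair"]
        if isinstance(repair_data, dict):
            return RepairConfigSketch(**repair_data)
        return RepairConfigSketch()

    # A flat file is treated as repair settings only if it names a known field.
    known_fields = {field.name for field in fields(RepairConfigSketch)}
    if known_fields.intersection(config_data.keys()):
        return RepairConfigSketch(**config_data)

    # For example, a file holding only a "validation" section: use defaults.
    return RepairConfigSketch()


print(select_repair_config({"validation": {"reference_prefix_map": {"geo": "GEO"}}}))
print(select_repair_config({"repair": {"dry_run": True}}))
```

The practical effect is that a shared config file containing only a `validation` section is no longer unpacked into the repair configuration. One difference to keep in mind: in the flat case the dataclass raises `TypeError` on unknown keys, whereas the real model's handling of extra keys depends on how it is configured.
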
Documentation contains an incorrect regex pattern example. The pattern `r"^EX\\d+$"` uses a double backslash in a raw string, which matches a literal backslash followed by 'd', not a digit. The correct pattern is `r"^EX\d+$"` (single backslash in a raw string). This documentation error mirrors the bug found in the implementation code and should be corrected to avoid misleading developers.
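
The difference between the two patterns is easy to check with Python's standard `re` module. The snippet below is a small demonstration of the point made in the comment; the sample strings are made up for illustration.

```python
import re

# Double backslash in a raw string: the regex sees an escaped backslash,
# then the letter "d" repeated one or more times.
buggy = re.compile(r"^EX\\d+$")

# Single backslash in a raw string: the regex sees \d, the digit class.
fixed = re.compile(r"^EX\d+$")

# The intended identifier matches only the fixed pattern.
assert fixed.match("EX123") is not None
assert buggy.match("EX123") is None

# The buggy pattern matches "EX", a literal backslash, then one or more "d".
assert buggy.match("EX\\ddd") is not None
assert fixed.match("EX\\ddd") is None
```

In a non-raw string literal the double backslash would be correct (`"^EX\\d+$"`), which is likely how the mistake crept in when the `r` prefix was added.
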