Merged
27 changes: 27 additions & 0 deletions docs/how-to/add-reference-source.md
@@ -10,6 +10,33 @@ Each reference source is a Python class that:
2. Implements `prefix()` and `fetch()` methods
3. Registers itself with the `ReferenceSourceRegistry`

## Entrez Summary Sources (Recommended for NCBI IDs)

If your source is backed by NCBI Entrez, prefer the built-in `EntrezSummarySource`
base class. It provides shared rate limiting, email configuration, and summary parsing.

```python
# src/linkml_reference_validator/etl/sources/my_entrez.py
"""Entrez summary source example."""

from linkml_reference_validator.etl.sources.entrez import EntrezSummarySource
from linkml_reference_validator.etl.sources.base import ReferenceSourceRegistry


@ReferenceSourceRegistry.register
class ExampleEntrezSource(EntrezSummarySource):
    """Fetch summaries from an Entrez database."""

    PREFIX = "EXAMPLE"
    ENTREZ_DB = "example_db"
    TITLE_FIELDS = ("title", "name")
    CONTENT_FIELDS = ("summary", "description")
    ID_PATTERNS = (r"^EX\\d+$",)
Copilot AI Dec 23, 2025

Documentation contains incorrect regex pattern example. The pattern r"^EX\\d+$" uses double backslash in a raw string, which will match a literal backslash followed by 'd', not a digit. The correct pattern should be r"^EX\d+$" (single backslash in raw string). This documentation error mirrors the bug found in the actual implementation code and should be corrected to avoid misleading developers.

Suggested change
-    ID_PATTERNS = (r"^EX\\d+$",)
+    ID_PATTERNS = (r"^EX\d+$",)

```
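
The review comment's point about the doubled backslash can be checked with the standard `re` module. This is a small sketch that only compares the two patterns quoted in the comment; the sample strings are made up for illustration.

```python
import re

# Pattern as written in the documentation example. In a raw string, "\\" is two
# backslash characters, which the regex engine reads as one escaped literal
# backslash. The "d+" after it then matches one or more letter "d" characters.
doubled = re.compile(r"^EX\\d+$")

# Pattern from the suggested change: "\d" is the digit character class.
single = re.compile(r"^EX\d+$")

print(bool(doubled.match("EX123")))     # False: no literal backslash in the input
print(bool(single.match("EX123")))      # True: "EX" followed by digits
print(bool(doubled.match("EX\\ddd")))   # True: "EX", a backslash, then "ddd"
```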

`TITLE_FIELDS` and `CONTENT_FIELDS` are checked in order, and the first non-empty value
is used for the `ReferenceContent`.
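
The "first non-empty value" rule can be sketched as below. `first_non_empty` and the sample record are hypothetical and only illustrate the lookup order; the actual logic inside `EntrezSummarySource` is not shown in this diff and may differ.

```python
from typing import Mapping, Optional, Sequence


def first_non_empty(record: Mapping[str, object], fields: Sequence[str]) -> Optional[str]:
    """Return the first non-blank string found under `fields`, checked in order."""
    for field in fields:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


# Hypothetical summary record: "title" is empty, so "name" is used instead.
summary = {"title": "", "name": "Example record", "summary": "A short description."}

title = first_non_empty(summary, ("title", "name"))
content = first_non_empty(summary, ("summary", "description"))
missing = first_non_empty(summary, ("abstract",))
```

With this record, `title` falls back to the `name` field, `content` comes from `summary`, and `missing` is `None` because no listed field is present.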

## Step 1: Create the Source Class

Create a new file in `src/linkml_reference_validator/etl/sources/`:
8 changes: 7 additions & 1 deletion docs/how-to/repair-validation-errors.md
@@ -121,9 +121,15 @@ RECOMMENDED REMOVALS:

## Configuration File

-Create `.linkml-reference-validator.yaml` for project-specific settings:
+Create `.linkml-reference-validator.yaml` for project-specific settings. You can
+include both validation and repair settings:

```yaml
+validation:
+  reference_prefix_map:
+    geo: GEO
+    NCBIGeo: GEO
+
repair:
  # Confidence thresholds
  auto_fix_threshold: 0.95
116 changes: 116 additions & 0 deletions docs/how-to/validate-entrez.md
@@ -0,0 +1,116 @@
# Validating Entrez Accessions

This guide shows how to validate supporting text against NCBI Entrez records for GEO, BioProject, and BioSample.

## Overview

These sources use the NCBI Entrez E-utilities `esummary` endpoint:

- **GEO** (GSE/GDS): summaries from the `gds` database
- **BioProject** (PRJNA/PRJEB/PRJDB): summaries from the `bioproject` database
- **BioSample** (SAMN/SAME/SAMD): summaries from the `biosample` database

The validator uses the returned summary/description fields as the content for matching.

## Basic Usage

### GEO (GSE or GDS)

```bash
linkml-reference-validator validate text \
"RNA-seq analysis of cardiac tissue" \
GEO:GSE12345
```

### BioProject

```bash
linkml-reference-validator validate text \
"Whole genome sequencing project for strain X" \
BioProject:PRJNA12345
```

### BioSample

```bash
linkml-reference-validator validate text \
"Human liver biopsy sample description" \
BioSample:SAMN12345678
```

## Accepted Identifier Formats

You can use either prefixed or bare accessions:

```
GEO:GSE12345
GDS12345
BioProject:PRJNA12345
PRJEB12345
BioSample:SAMN12345678
SAME1234567
```
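
One way to recognise the bare forms listed above is a small table of regular expressions keyed by source. The patterns and the `infer_source` helper below are assumptions inferred from the accession families named in this guide; they are not the `ID_PATTERNS` the library actually ships.

```python
import re
from typing import Optional

# Hypothetical patterns, based only on the accession families in this guide.
# The optional letter in the BioSample pattern is an assumption.
BARE_ACCESSION_PATTERNS = {
    "GEO": re.compile(r"^(GSE|GDS)\d+$"),
    "BIOPROJECT": re.compile(r"^PRJ(NA|EB|DB)\d+$"),
    "BIOSAMPLE": re.compile(r"^SAM(N|E|D)[A-Z]?\d+$"),
}


def infer_source(reference_id: str) -> Optional[str]:
    """Guess the source from the accession part of a prefixed or bare ID."""
    _prefix, sep, local_id = reference_id.partition(":")
    accession = (local_id if sep else reference_id).strip().upper()
    for source, pattern in BARE_ACCESSION_PATTERNS.items():
        if pattern.match(accession):
            return source
    return None


for ref in ("GEO:GSE12345", "GDS12345", "PRJEB12345", "SAME1234567", "PMID:123"):
    print(ref, "->", infer_source(ref))
```

The helper looks only at the accession, so a prefixed ID and its bare form resolve to the same source, and an unrecognised ID yields `None`.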

## Prefix Aliases and Normalization

Prefixes are case-insensitive and can be normalized with a configuration map. This
is useful when data uses alternate prefix styles such as `geo:` or `NCBIGeo:`.

Create `.linkml-reference-validator.yaml` with a `validation` section:

```yaml
validation:
  reference_prefix_map:
    geo: GEO
    NCBIGeo: GEO
    NCBIBioProject: BIOPROJECT
    NCBIBioSample: BIOSAMPLE
```

You can also configure this programmatically:

```python
from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig(
    reference_prefix_map={"geo": "GEO", "NCBIGeo": "GEO"}
)
```

Pass the config file to CLI commands with `--config .linkml-reference-validator.yaml`.
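
The normalization described in this section can be sketched as a case-insensitive lookup in the prefix map. `normalize_reference_id` is a hypothetical helper, and upper-casing unmapped prefixes is an assumption based on the canonical names used in the map above; the validator's own code is not shown in this diff.

```python
from typing import Mapping


def normalize_reference_id(reference_id: str, prefix_map: Mapping[str, str]) -> str:
    """Rewrite the prefix of `reference_id` using a case-insensitive alias map."""
    prefix, sep, local_id = reference_id.partition(":")
    if not sep:
        # Bare accession such as "GDS12345": there is no prefix to normalize.
        return reference_id
    aliases = {alias.lower(): canonical for alias, canonical in prefix_map.items()}
    # Assumption: prefixes without an alias entry are simply upper-cased.
    canonical = aliases.get(prefix.lower(), prefix.upper())
    return f"{canonical}:{local_id}"


prefix_map = {"geo": "GEO", "NCBIGeo": "GEO"}
for ref in ("geo:GSE12345", "ncbigeo:GSE12345", "BioProject:PRJNA12345", "GDS12345"):
    print(ref, "->", normalize_reference_id(ref, prefix_map))
```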

## Pre-caching Entrez Records

For offline validation or to speed up repeated validations:

```bash
linkml-reference-validator cache reference GEO:GSE12345
linkml-reference-validator cache reference BioProject:PRJNA12345
linkml-reference-validator cache reference BioSample:SAMN12345678
```

Cached references are stored in `references_cache/` as markdown files with YAML frontmatter.

## Rate Limiting and Email

NCBI requires a valid contact email for Entrez API usage. Configure it in your settings:

```python
from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig(
    email="user@example.com",
    rate_limit_delay=0.5,
)
```
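
As an illustration of what a setting like `rate_limit_delay` typically controls, the sketch below enforces a minimum gap between consecutive requests. `RateLimiter` is a hypothetical class written for this example; how `EntrezSummarySource` implements its shared rate limiting is not shown in this diff.

```python
import time
from typing import Optional


class RateLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough that calls are at least `delay` seconds apart."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.delay - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


limiter = RateLimiter(delay=0.5)
for accession in ("GSE12345", "PRJNA12345"):
    limiter.wait()  # the second iteration blocks until 0.5 s have passed
    print(f"would fetch {accession}")
```

Using `time.monotonic()` rather than wall-clock time keeps the spacing correct even if the system clock is adjusted while the validator runs.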

## Content Availability

Entrez summaries vary by record. If a summary field is missing, the validator will return
`content_type: unavailable` and matching may fail.

## See Also

- [Adding a New Reference Source](add-reference-source.md)
- [Quickstart](../quickstart.md)
- [CLI Reference](../reference/cli.md)
17 changes: 14 additions & 3 deletions docs/reference/cli.md
@@ -51,6 +51,7 @@ linkml-reference-validator validate text [OPTIONS] TEXT REFERENCE_ID
### Options

- `--cache-dir PATH` - Directory for caching references (default: `references_cache`)
- `--config PATH` - Path to validation configuration file (.yaml)
- `--verbose, -v` - Verbose output with detailed logging
- `--help` - Show help message

@@ -138,6 +139,7 @@ linkml-reference-validator validate data [OPTIONS] DATA_FILE
- `--schema PATH, -s PATH` (required) - Path to LinkML schema file
- `--target-class TEXT, -t TEXT` - Target class to validate (optional)
- `--cache-dir PATH, -c PATH` - Directory for caching references (default: `references_cache`)
- `--config PATH` - Path to validation configuration file (.yaml)
- `--verbose, -v` - Verbose output with detailed logging
- `--help` - Show help message

@@ -240,6 +242,7 @@ linkml-reference-validator repair text [OPTIONS] TEXT REFERENCE_ID
### Options

- `--cache-dir PATH, -c PATH` - Directory for caching references
- `--config PATH` - Path to configuration file (.yaml)
- `--verbose, -v` - Verbose output with detailed logging
- `--auto-fix-threshold FLOAT, -a FLOAT` - Minimum similarity for auto-fixes (default: 0.95)
- `--help` - Show help message
@@ -318,7 +321,7 @@ linkml-reference-validator repair data [OPTIONS] DATA_FILE
- `--dry-run / --no-dry-run, -n / -N` - Show changes without applying (default: dry-run)
- `--auto-fix-threshold FLOAT, -a FLOAT` - Minimum similarity for auto-fixes (default: 0.95)
- `--output PATH, -o PATH` - Output file path (default: overwrite with backup)
-- `--config PATH` - Path to repair configuration file
+- `--config PATH` - Path to configuration file (.yaml)
- `--cache-dir PATH, -c PATH` - Directory for caching references
- `--verbose, -v` - Verbose output with detailed logging
- `--help` - Show help message
@@ -412,11 +415,18 @@ Summary:

---

-## Repair Configuration File
+## Configuration File

-Create `.linkml-reference-validator.yaml` for project-specific settings:
+Create `.linkml-reference-validator.yaml` for project-specific settings. Use
+the `validation` section for reference fetching behavior and `repair` for
+auto-fix settings.

```yaml
+validation:
+  reference_prefix_map:
+    geo: GEO
+    NCBIGeo: GEO
+
repair:
  # Confidence thresholds
  auto_fix_threshold: 0.95
@@ -471,6 +481,7 @@ linkml-reference-validator cache reference [OPTIONS] REFERENCE_ID
### Options

- `--cache-dir PATH, -c PATH` - Directory for caching references (default: `references_cache`)
- `--config PATH` - Path to validation configuration file (.yaml)
- `--force, -f` - Force re-fetch even if cached
- `--verbose, -v` - Verbose output with detailed logging
- `--help` - Show help message
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -32,6 +32,7 @@ nav:
    - Python API: notebooks/03_python_api.ipynb
  - How-To Guides:
    - Validating OBO Files: how-to/validate-obo-files.md
    - Validating Entrez Accessions: how-to/validate-entrez.md
    - Validating DOIs: how-to/validate-dois.md
    - Validating URLs: how-to/validate-urls.md
    - Using Local Files and URLs: how-to/use-local-files-and-urls.md
14 changes: 10 additions & 4 deletions src/linkml_reference_validator/cli/cache.py
@@ -6,9 +6,14 @@
 from typing_extensions import Annotated

 from linkml_reference_validator.etl.reference_fetcher import ReferenceFetcher
-from linkml_reference_validator.models import ReferenceValidationConfig
-
-from .shared import CacheDirOption, VerboseOption, ForceOption, setup_logging
+from .shared import (
+    CacheDirOption,
+    VerboseOption,
+    ForceOption,
+    ConfigFileOption,
+    setup_logging,
+    load_validation_config,
+)

logger = logging.getLogger(__name__)

@@ -22,6 +27,7 @@
@cache_app.command(name="reference")
def reference_command(
    reference_id: Annotated[str, typer.Argument(help="Reference ID (e.g., PMID:12345678 or DOI:10.1234/example)")],
    config_file: ConfigFileOption = None,
    cache_dir: CacheDirOption = None,
    force: ForceOption = False,
    verbose: VerboseOption = False,
@@ -41,7 +47,7 @@ def reference_command(
     """
     setup_logging(verbose)

-    config = ReferenceValidationConfig()
+    config = load_validation_config(config_file)
     if cache_dir:
         config.cache_dir = cache_dir

41 changes: 25 additions & 16 deletions src/linkml_reference_validator/cli/repair.py
@@ -13,13 +13,16 @@
 from ruamel.yaml import YAML
 from typing_extensions import Annotated

-from linkml_reference_validator.models import (
-    ReferenceValidationConfig,
-    RepairConfig,
-)
+from linkml_reference_validator.models import RepairConfig
 from linkml_reference_validator.validation.repairer import SupportingTextRepairer

-from .shared import CacheDirOption, VerboseOption, setup_logging
+from .shared import (
+    CacheDirOption,
+    VerboseOption,
+    ConfigFileOption,
+    setup_logging,
+    load_validation_config,
+)

logger = logging.getLogger(__name__)

@@ -75,13 +78,7 @@ def data_command(
     output: OutputOption = None,
     cache_dir: CacheDirOption = None,
     verbose: VerboseOption = False,
-    config_file: Annotated[
-        Optional[Path],
-        typer.Option(
-            "--config",
-            help="Path to repair configuration file (.yaml)",
-        ),
-    ] = None,
+    config_file: ConfigFileOption = None,
 ):
     """Repair supporting text in a data file.

@@ -120,7 +117,7 @@ def data_command(
     repair_config.dry_run = dry_run

     # Set up validation config
-    val_config = ReferenceValidationConfig()
+    val_config = load_validation_config(config_file)
     if cache_dir:
         val_config.cache_dir = cache_dir

@@ -198,6 +195,7 @@ def text_command(
    cache_dir: CacheDirOption = None,
    verbose: VerboseOption = False,
    auto_fix_threshold: AutoFixThresholdOption = 0.95,
    config_file: ConfigFileOption = None,
):
    """Attempt to repair a single supporting text quote.

@@ -214,7 +212,7 @@ def text_command(
     """
     setup_logging(verbose)

-    val_config = ReferenceValidationConfig()
+    val_config = load_validation_config(config_file)
     if cache_dir:
         val_config.cache_dir = cache_dir

@@ -283,10 +281,21 @@ def _load_repair_config(config_file: Optional[Path]) -> RepairConfig:
     if config_data is None:
         return RepairConfig()

+    if not isinstance(config_data, dict):
+        return RepairConfig()
+
     # Extract repair section if present
-    repair_data = config_data.get("repair", config_data)
+    if "repair" in config_data:
+        repair_data = config_data.get("repair")
+        if isinstance(repair_data, dict):
+            return RepairConfig(**repair_data)
Copilot AI Dec 23, 2025

Missing error handling for invalid configuration values. If the repair_data dict contains invalid values for RepairConfig fields, the Pydantic model initialization will raise a ValidationError. This should be caught and handled gracefully, either by logging a warning and returning defaults or providing a clear error message to the user about which configuration value is invalid.

+        return RepairConfig()
+
+    repair_keys = set(RepairConfig.model_fields.keys())
+    if repair_keys.intersection(config_data.keys()):
+        return RepairConfig(**config_data)
Copilot AI Dec 23, 2025

Missing error handling for invalid configuration values. If the config_data dict contains invalid values for RepairConfig fields (at line 296), the Pydantic model initialization will raise a ValidationError. This should be caught and handled gracefully, either by logging a warning and returning defaults or providing a clear error message to the user about which configuration value is invalid.


-    return RepairConfig(**repair_data)
+    return RepairConfig()
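
The graceful handling that both review comments ask for could look roughly like the sketch below. To keep it standard-library only, `RepairConfig` here is a hypothetical dataclass stand-in with an assumed range check; the project's real model is a Pydantic class, where the error to catch would be `pydantic.ValidationError` rather than `TypeError`/`ValueError`. `build_repair_config` is likewise an illustrative name, not the function in this PR.

```python
import logging
from dataclasses import dataclass, fields
from typing import Any

logger = logging.getLogger(__name__)


@dataclass
class RepairConfig:
    """Stand-in for the project's Pydantic model (hypothetical, stdlib only)."""

    auto_fix_threshold: float = 0.95
    dry_run: bool = True

    def __post_init__(self) -> None:
        value = self.auto_fix_threshold
        # Assumed validation rule, for illustration: a number between 0 and 1.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("auto_fix_threshold must be a number")
        if not 0.0 <= value <= 1.0:
            raise ValueError("auto_fix_threshold must be between 0 and 1")


def build_repair_config(config_data: Any) -> RepairConfig:
    """Mirror the selection logic in the diff, but fall back to defaults on bad values."""
    if not isinstance(config_data, dict):
        return RepairConfig()
    if "repair" in config_data:
        repair_data = config_data["repair"]
    else:
        known = {f.name for f in fields(RepairConfig)}
        repair_data = config_data if known.intersection(config_data) else None
    if not isinstance(repair_data, dict):
        return RepairConfig()
    try:
        return RepairConfig(**repair_data)
    except (TypeError, ValueError) as exc:
        # TypeError covers unknown keys in this stand-in; ValueError covers bad values.
        logger.warning("Invalid repair configuration, using defaults: %s", exc)
        return RepairConfig()
```

Logging a warning and returning defaults is one of the two options the comments mention; raising a clear, user-facing error instead would be the stricter alternative.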


def _extract_evidence_items(