diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md index c65d51a..8575c7f 100644 --- a/docs/concepts/how-it-works.md +++ b/docs/concepts/how-it-works.md @@ -46,6 +46,33 @@ Parts: ["MUC1 oncoprotein", "nuclear targeting"] # Both parts must exist in the reference ``` +### 4. Title Validation + +In addition to excerpt/quote validation, the validator can verify reference titles using **exact matching** (not substring). Titles are validated when: + +- A slot implements `dcterms:title` or has `slot_uri: dcterms:title` +- A slot is named `title` (fallback) + +**Example:** +```yaml +reference_title: "MUC1 oncoprotein blocks nuclear targeting of c-Abl" +``` + +Title matching uses the same normalization as excerpts (case, whitespace, punctuation, Greek letters) but requires the **entire title to match**, not just a substring. + +```python +# These match after normalization: +expected = "Role of JAK1 in Cell-Signaling" +actual = "Role of JAK1 in Cell Signaling" +# Both normalize to: "role of jak1 in cell signaling" + +# These do NOT match (partial title): +expected = "Role of JAK1" # Missing "in Cell Signaling" +actual = "Role of JAK1 in Cell Signaling" +``` + +See [Validating Reference Titles](../how-to/validate-titles.md) for detailed usage. + ## Why Deterministic Matching? ### Not Fuzzy Matching @@ -171,10 +198,17 @@ classes: slot_uri: linkml:excerpt # Marks as quoted text reference: slot_uri: linkml:authoritative_reference # Marks as reference ID + reference_title: + slot_uri: dcterms:title # Marks as reference title (optional) ``` When LinkML validates data, it calls our plugin for fields marked with these URIs. 
+The plugin discovers fields via: +- `implements` attribute (e.g., `implements: [dcterms:title]`) +- `slot_uri` attribute (e.g., `slot_uri: dcterms:title`) +- Fallback slot names (`reference`, `supporting_text`, `title`) + ## Editorial Conventions ### Square Brackets `[...]` diff --git a/docs/how-to/validate-titles.md b/docs/how-to/validate-titles.md new file mode 100644 index 0000000..51afe5b --- /dev/null +++ b/docs/how-to/validate-titles.md @@ -0,0 +1,146 @@ +# Validating Reference Titles + +This guide explains how to validate that reference titles in your data match the actual titles from the source publications. + +## Overview + +Title validation ensures that when you cite a reference with a title, that title matches what the publication actually has. Unlike excerpt validation (which uses substring matching), title validation uses **exact matching after normalization**. + +## When to Use Title Validation + +Title validation is useful when: + +- Your data includes reference titles that should match the source +- You want to catch typos or outdated titles +- You need to verify metadata accuracy in curated datasets + +## Schema Setup + +Mark title fields in your LinkML schema using `dcterms:title`: + +### Using `implements` + +```yaml +id: https://example.org/my-schema +name: my-schema + +prefixes: + linkml: https://w3id.org/linkml/ + dcterms: http://purl.org/dc/terms/ + +classes: + Evidence: + attributes: + reference: + implements: + - linkml:authoritative_reference + reference_title: + implements: + - dcterms:title + supporting_text: + implements: + - linkml:excerpt +``` + +### Using `slot_uri` + +```yaml +classes: + Evidence: + attributes: + reference: + slot_uri: linkml:authoritative_reference + title: + slot_uri: dcterms:title + supporting_text: + slot_uri: linkml:excerpt +``` + +## Example Data + +**data.yaml:** +```yaml +- reference: PMID:16888623 + reference_title: "MUC1 oncoprotein blocks nuclear targeting of c-Abl" + supporting_text: "MUC1 oncoprotein 
blocks nuclear targeting" +``` + +**Validate:** +```bash +linkml-reference-validator validate data \ + data.yaml \ + --schema schema.yaml \ + --target-class Evidence +``` + +## What Gets Normalized + +Title matching allows for minor orthographic variations: + +| Variation | Example | +|-----------|---------| +| **Case** | `"JAK1 Protein"` matches `"jak1 protein"` | +| **Whitespace** | `"Cell  Signaling"` matches `"Cell Signaling"` | +| **Punctuation** | `"T-Cell Receptor"` matches `"T Cell Receptor"` | +| **Greek letters** | `"α-catenin"` matches `"alpha-catenin"` | +| **Trailing periods** | `"Study Title."` matches `"Study Title"` | + +## Title-Only Validation + +You can validate titles without excerpts. If your data has reference and title fields but no excerpt field, the validator will validate the title alone: + +```yaml +classes: + Reference: + attributes: + id: + implements: + - linkml:authoritative_reference + title: + implements: + - dcterms:title +``` + +```yaml +- id: PMID:16888623 + title: "MUC1 oncoprotein blocks nuclear targeting of c-Abl" +``` + +## Combined Validation + +When both title and excerpt fields are present, both are validated together: + +1. The excerpt is checked for substring match in the reference content +2. The title is checked for exact match (after normalization) against the reference title + +If either fails, validation fails with a specific error message. 
+ +## Error Messages + +### Title Mismatch + +``` +Title mismatch for PMID:16888623: expected 'Wrong Title' but got 'MUC1 oncoprotein blocks nuclear targeting of c-Abl' +``` + +### Reference Has No Title + +``` +Reference PMID:99999999 has no title to validate against +``` + +## Differences from Excerpt Validation + +| Aspect | Title Validation | Excerpt Validation | +|--------|------------------|-------------------| +| **Matching** | Exact (after normalization) | Substring | +| **Partial matches** | Not allowed | Allowed with `...` | +| **Editorial notes** | Not supported | `[brackets]` removed | +| **Use case** | Metadata accuracy | Quote verification | + +## Best Practices + +1. **Use exact titles**: Copy the title exactly from the source +2. **Don't abbreviate**: Title must match completely +3. **Check special characters**: Greek letters, subscripts, etc. +4. **Verify after fetching**: The cached reference shows the actual title diff --git a/docs/index.md b/docs/index.md index 7308c79..a699af1 100644 --- a/docs/index.md +++ b/docs/index.md @@ -9,6 +9,7 @@ linkml-reference-validator ensures that text excerpts in your data accurately ma - **Deterministic validation** - No fuzzy matching or AI hallucinations - **Multiple reference sources** - PubMed, DOIs, local files, and URLs - **Editorial convention support** - Handles `[clarifications]` and `...` ellipsis +- **Title validation** - Verify reference titles with `dcterms:title` - **Multiple interfaces** - CLI for quick checks, Python API for integration - **LinkML integration** - Validates data files with `linkml:excerpt` annotations - **Smart caching** - Stores references locally to avoid repeated API calls diff --git a/docs/quickstart.md b/docs/quickstart.md index 24d5d26..56557ac 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -111,6 +111,7 @@ linkml-reference-validator validate text \ - **Automatic Caching**: References cached locally after first fetch - **Editorial Notes**: Use `[...]` for 
clarifications: `"MUC1 [mucin 1] oncoprotein"` - **Ellipsis**: Use `...` for omitted text: `"MUC1 ... nuclear targeting"` +- **Title Validation**: Verify reference titles with `dcterms:title` - **Deterministic Matching**: Substring-based (not AI/fuzzy matching) - **PubMed & PMC**: Fetches from NCBI automatically - **DOI Support**: Fetches metadata from Crossref API @@ -121,5 +122,6 @@ linkml-reference-validator validate text \ - **[Tutorial 1: Getting Started](notebooks/01_getting_started.ipynb)** - CLI basics with real examples - **[Tutorial 2: Advanced Usage](notebooks/02_advanced_usage.ipynb)** - Data validation with LinkML schemas +- **[Validating Reference Titles](how-to/validate-titles.md)** - Verify titles with `dcterms:title` - **[Concepts](concepts/how-it-works.md)** - Understanding the validation process - **[CLI Reference](reference/cli.md)** - Complete command documentation diff --git a/mkdocs.yml b/mkdocs.yml index 06c70ab..4103757 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -35,6 +35,7 @@ nav: - Validating Entrez Accessions: how-to/validate-entrez.md - Validating DOIs: how-to/validate-dois.md - Validating URLs: how-to/validate-urls.md + - Validating Reference Titles: how-to/validate-titles.md - Using Local Files and URLs: how-to/use-local-files-and-urls.md - Adding a New Reference Source: how-to/add-reference-source.md - Concepts: diff --git a/src/linkml_reference_validator/plugins/reference_validation_plugin.py b/src/linkml_reference_validator/plugins/reference_validation_plugin.py index 3998ba6..5c20f8d 100644 --- a/src/linkml_reference_validator/plugins/reference_validation_plugin.py +++ b/src/linkml_reference_validator/plugins/reference_validation_plugin.py @@ -2,338 +2,495 @@ import logging from collections.abc import Iterator +from importlib.util import find_spec from pathlib import Path from typing import Any, Optional -from linkml.validator.plugins import ValidationPlugin # type: ignore -from linkml.validator.report import ValidationResult as 
LinkMLValidationResult # type: ignore -from linkml.validator.report import Severity # type: ignore -from linkml.validator.validation_context import ValidationContext # type: ignore -from linkml_runtime.utils.schemaview import SchemaView # type: ignore - from linkml_reference_validator.models import ReferenceValidationConfig from linkml_reference_validator.validation.supporting_text_validator import ( SupportingTextValidator, ) -logger = logging.getLogger(__name__) - - -class ReferenceValidationPlugin(ValidationPlugin): - """LinkML validation plugin for supporting text validation. - - This plugin integrates with the LinkML validation framework to validate - that supporting text quotes actually appear in their referenced publications. - - The plugin discovers reference and excerpt fields using LinkML's interface - mechanism. It looks for: - - Slots implementing linkml:authoritative_reference - - Slots implementing linkml:excerpt - - Examples: - >>> config = ReferenceValidationConfig() - >>> plugin = ReferenceValidationPlugin(config=config) - >>> plugin.config.cache_dir - PosixPath('references_cache') - """ +_LINKML_AVAILABLE = ( + find_spec("linkml") is not None and find_spec("linkml.validator") is not None +) - def __init__( - self, - config: Optional[ReferenceValidationConfig] = None, - cache_dir: Optional[str] = None, - ): - """Initialize the validation plugin. +logger = logging.getLogger(__name__) - Args: - config: Full configuration object (if provided, other args ignored) - cache_dir: Directory for caching references - Examples: - >>> plugin = ReferenceValidationPlugin(cache_dir="/tmp/cache") - >>> plugin.config.cache_dir - PosixPath('/tmp/cache') - """ - if config is None: - config = ReferenceValidationConfig() - if cache_dir is not None: - config.cache_dir = Path(cache_dir) +if _LINKML_AVAILABLE: + # NOTE: `linkml` is optional. We only import LinkML modules when available. + # Ruff's E402 doesn't like imports in blocks, so we silence it per-import. 
+ from linkml.validator.plugins import ValidationPlugin # type: ignore # noqa: E402 + from linkml.validator.report import ( # type: ignore # noqa: E402 + Severity, + ValidationResult as LinkMLValidationResult, + ) + from linkml.validator.validation_context import ( # type: ignore # noqa: E402 + ValidationContext, + ) + from linkml_runtime.utils.schemaview import SchemaView # type: ignore # noqa: E402 - self.config = config - self.validator = SupportingTextValidator(config) - self.schema_view: Optional[SchemaView] = None + class ReferenceValidationPlugin(ValidationPlugin): + """LinkML validation plugin for supporting text validation. - def pre_process(self, context: ValidationContext) -> None: - """Pre-process hook called before validation. + This plugin integrates with the LinkML validation framework to validate + that supporting text quotes actually appear in their referenced publications. - Args: - context: Validation context from LinkML + The plugin discovers reference and excerpt fields using LinkML's interface + mechanism. It looks for: + - Slots implementing linkml:authoritative_reference + - Slots implementing linkml:excerpt Examples: - >>> from linkml.validator.validation_context import ValidationContext >>> config = ReferenceValidationConfig() >>> plugin = ReferenceValidationPlugin(config=config) - >>> # Would be called by LinkML validator + >>> plugin.config.cache_dir + PosixPath('references_cache') """ - if hasattr(context, "schema_view") and context.schema_view: - self.schema_view = context.schema_view - logger.info("ReferenceValidationPlugin initialized") - - def process( - self, - instance: dict[str, Any], - context: ValidationContext, - ) -> Iterator[LinkMLValidationResult]: - """Validate an instance. 
- - Args: - instance: Data instance to validate - context: Validation context - - Yields: - ValidationResult objects for any issues found - Examples: - >>> from linkml.validator.validation_context import ValidationContext - >>> config = ReferenceValidationConfig() - >>> plugin = ReferenceValidationPlugin(config=config) - >>> # Would be called by LinkML validator: - >>> # results = list(plugin.process(instance, context)) - """ - if not self.schema_view: - logger.warning("No schema view available for validation") - return - - target_class = context.target_class if hasattr(context, "target_class") else None - if not target_class: - logger.warning("No target class specified") - return - - yield from self._validate_instance(instance, target_class, path="") - - def _validate_instance( # type: ignore - self, - instance: dict[str, Any], - class_name: str, - path: str, - ) -> Iterator[LinkMLValidationResult]: - """Recursively validate an instance and nested objects. - - Args: - instance: Instance data - class_name: Class name from schema - path: Current path in data structure - - Yields: - ValidationResult objects - """ - if not self.schema_view: - return - class_def = self.schema_view.get_class(class_name) - if not class_def: - return - - reference_fields = self._find_reference_fields(class_name) - excerpt_fields = self._find_excerpt_fields(class_name) - - for excerpt_field in excerpt_fields: - excerpt_value = instance.get(excerpt_field) - if not excerpt_value: - continue - - for ref_field in reference_fields: - ref_value = instance.get(ref_field) - if ref_value: - reference_id = self._extract_reference_id(ref_value) - expected_title = self._extract_title(ref_value) - if reference_id: - yield from self._validate_excerpt( - excerpt_value, - reference_id, - expected_title, - f"{path}.{excerpt_field}" if path else excerpt_field, - ) - - for slot_name, value in instance.items(): - if value is None: - continue - - slot = self.schema_view.induced_slot(slot_name, class_name) - if 
not slot: - continue - - slot_path = f"{path}.{slot_name}" if path else slot_name - - if isinstance(value, dict): - range_class = slot.range - if range_class and self.schema_view.get_class(range_class): - yield from self._validate_instance(value, range_class, slot_path) - - elif isinstance(value, list): - for i, item in enumerate(value): - item_path = f"{slot_path}[{i}]" - if isinstance(item, dict): - range_class = slot.range - if range_class and self.schema_view.get_class(range_class): - yield from self._validate_instance(item, range_class, item_path) - - def _find_reference_fields(self, class_name: str) -> list[str]: # type: ignore - """Find slots that implement linkml:authoritative_reference. - - Args: - class_name: Class to search - - Returns: - List of slot names + def __init__( + self, + config: Optional[ReferenceValidationConfig] = None, + cache_dir: Optional[str] = None, + ): + """Initialize the validation plugin. + + Args: + config: Full configuration object (if provided, other args ignored) + cache_dir: Directory for caching references + + Examples: + >>> plugin = ReferenceValidationPlugin(cache_dir="/tmp/cache") + >>> plugin.config.cache_dir + PosixPath('/tmp/cache') + """ + if config is None: + config = ReferenceValidationConfig() + if cache_dir is not None: + config.cache_dir = Path(cache_dir) + + self.config = config + self.validator = SupportingTextValidator(config) + self.schema_view: Optional[SchemaView] = None + + def pre_process(self, context: ValidationContext) -> None: + """Pre-process hook called before validation. 
+ + Args: + context: Validation context from LinkML + + Examples: + >>> from linkml.validator.validation_context import ValidationContext + >>> config = ReferenceValidationConfig() + >>> plugin = ReferenceValidationPlugin(config=config) + >>> # Would be called by LinkML validator + """ + if hasattr(context, "schema_view") and context.schema_view: + self.schema_view = context.schema_view + logger.info("ReferenceValidationPlugin initialized") + + def process( + self, + instance: dict[str, Any], + context: ValidationContext, + ) -> Iterator[LinkMLValidationResult]: + """Validate an instance. + + Args: + instance: Data instance to validate + context: Validation context + + Yields: + ValidationResult objects for any issues found + + Examples: + >>> from linkml.validator.validation_context import ValidationContext + >>> config = ReferenceValidationConfig() + >>> plugin = ReferenceValidationPlugin(config=config) + >>> # Would be called by LinkML validator: + >>> # results = list(plugin.process(instance, context)) + """ + if not self.schema_view: + logger.warning("No schema view available for validation") + return + + target_class = ( + context.target_class if hasattr(context, "target_class") else None + ) + if not target_class: + logger.warning("No target class specified") + return + + yield from self._validate_instance(instance, target_class, path="") + + def _validate_instance( # type: ignore + self, + instance: dict[str, Any], + class_name: str, + path: str, + ) -> Iterator[LinkMLValidationResult]: + """Recursively validate an instance and nested objects. 
+ + Args: + instance: Instance data + class_name: Class name from schema + path: Current path in data structure + + Yields: + ValidationResult objects + """ + if not self.schema_view: + return + class_def = self.schema_view.get_class(class_name) + if not class_def: + return + + reference_fields = self._find_reference_fields(class_name) + excerpt_fields = self._find_excerpt_fields(class_name) + title_fields = self._find_title_fields(class_name) + + # Track whether we've validated with excerpt (which includes title validation) + validated_with_excerpt = False + + for excerpt_field in excerpt_fields: + excerpt_value = instance.get(excerpt_field) + if not excerpt_value: + continue + + for ref_field in reference_fields: + ref_value = instance.get(ref_field) + if ref_value: + reference_id = self._extract_reference_id(ref_value) + # Get title from title field or from reference dict + expected_title = None + for title_field in title_fields: + title_value = instance.get(title_field) + if title_value: + expected_title = title_value + break + if not expected_title: + expected_title = self._extract_title(ref_value) + if reference_id: + validated_with_excerpt = True + yield from self._validate_excerpt( + excerpt_value, + reference_id, + expected_title, + f"{path}.{excerpt_field}" if path else excerpt_field, + ) + + # If no excerpt validation was done, validate title independently + if not validated_with_excerpt and title_fields: + for title_field in title_fields: + title_value = instance.get(title_field) + if not title_value: + continue + + for ref_field in reference_fields: + ref_value = instance.get(ref_field) + if ref_value: + reference_id = self._extract_reference_id(ref_value) + if reference_id: + yield from self._validate_title( + title_value, + reference_id, + f"{path}.{title_field}" if path else title_field, + ) + + for slot_name, value in instance.items(): + if value is None: + continue + + slot = self.schema_view.induced_slot(slot_name, class_name) + if not slot: + 
continue + + slot_path = f"{path}.{slot_name}" if path else slot_name + + if isinstance(value, dict): + range_class = slot.range + if range_class and self.schema_view.get_class(range_class): + yield from self._validate_instance(value, range_class, slot_path) + + elif isinstance(value, list): + for i, item in enumerate(value): + item_path = f"{slot_path}[{i}]" + if isinstance(item, dict): + range_class = slot.range + if range_class and self.schema_view.get_class(range_class): + yield from self._validate_instance( + item, range_class, item_path + ) + + def _find_reference_fields(self, class_name: str) -> list[str]: # type: ignore + """Find slots that implement linkml:authoritative_reference. + + Args: + class_name: Class to search + + Returns: + List of slot names + + Examples: + >>> config = ReferenceValidationConfig() + >>> plugin = ReferenceValidationPlugin(config=config) + >>> # Would need schema_view to actually work + """ + fields: list[str] = [] + if not self.schema_view: + return fields + class_def = self.schema_view.get_class(class_name) + if not class_def: + return fields + + for slot_name in self.schema_view.class_slots(class_name): + slot = self.schema_view.induced_slot(slot_name, class_name) + if slot and slot.implements: + for interface in slot.implements: + if ( + "authoritative_reference" in interface + or "reference" in interface.lower() + ): + fields.append(slot_name) + break + + if "reference" in [s for s in self.schema_view.class_slots(class_name)]: + if "reference" not in fields: + fields.append("reference") + if "reference_id" in [s for s in self.schema_view.class_slots(class_name)]: + if "reference_id" not in fields: + fields.append("reference_id") - Examples: - >>> config = ReferenceValidationConfig() - >>> plugin = ReferenceValidationPlugin(config=config) - >>> # Would need schema_view to actually work - """ - fields: list[str] = [] - if not self.schema_view: return fields - class_def = self.schema_view.get_class(class_name) - if not 
class_def: - return fields - - for slot_name in self.schema_view.class_slots(class_name): - slot = self.schema_view.induced_slot(slot_name, class_name) - if slot and slot.implements: - for interface in slot.implements: - if "authoritative_reference" in interface or "reference" in interface.lower(): - fields.append(slot_name) - break - - if "reference" in [s for s in self.schema_view.class_slots(class_name)]: - if "reference" not in fields: - fields.append("reference") - if "reference_id" in [s for s in self.schema_view.class_slots(class_name)]: - if "reference_id" not in fields: - fields.append("reference_id") - - return fields - - def _find_excerpt_fields(self, class_name: str) -> list[str]: # type: ignore - """Find slots that implement linkml:excerpt. - Args: - class_name: Class to search + def _find_excerpt_fields(self, class_name: str) -> list[str]: # type: ignore + """Find slots that implement linkml:excerpt. + + Args: + class_name: Class to search + + Returns: + List of slot names + + Examples: + >>> config = ReferenceValidationConfig() + >>> plugin = ReferenceValidationPlugin(config=config) + >>> # Would need schema_view to actually work + """ + fields: list[str] = [] + if not self.schema_view: + return fields + class_def = self.schema_view.get_class(class_name) + if not class_def: + return fields + + for slot_name in self.schema_view.class_slots(class_name): + slot = self.schema_view.induced_slot(slot_name, class_name) + if slot and slot.implements: + for interface in slot.implements: + if ( + "excerpt" in interface + or "supporting_text" in interface.lower() + ): + fields.append(slot_name) + break + + if "supporting_text" in [s for s in self.schema_view.class_slots(class_name)]: + if "supporting_text" not in fields: + fields.append("supporting_text") - Returns: - List of slot names - - Examples: - >>> config = ReferenceValidationConfig() - >>> plugin = ReferenceValidationPlugin(config=config) - >>> # Would need schema_view to actually work - """ - fields: 
list[str] = [] - if not self.schema_view: - return fields - class_def = self.schema_view.get_class(class_name) - if not class_def: return fields - for slot_name in self.schema_view.class_slots(class_name): - slot = self.schema_view.induced_slot(slot_name, class_name) - if slot and slot.implements: - for interface in slot.implements: - if "excerpt" in interface or "supporting_text" in interface.lower(): - fields.append(slot_name) - break - - if "supporting_text" in [s for s in self.schema_view.class_slots(class_name)]: - if "supporting_text" not in fields: - fields.append("supporting_text") - - return fields + def _find_title_fields(self, class_name: str) -> list[str]: # type: ignore + """Find slots that implement dcterms:title or have slot_uri dcterms:title. + + Args: + class_name: Class to search + + Returns: + List of slot names + + Examples: + >>> config = ReferenceValidationConfig() + >>> plugin = ReferenceValidationPlugin(config=config) + >>> # Would need schema_view to actually work + """ + fields: list[str] = [] + if not self.schema_view: + return fields + class_def = self.schema_view.get_class(class_name) + if not class_def: + return fields + + for slot_name in self.schema_view.class_slots(class_name): + slot = self.schema_view.induced_slot(slot_name, class_name) + if not slot: + continue + + # Check implements for dcterms:title + if slot.implements: + for interface in slot.implements: + if "dcterms:title" in interface or "title" in interface.lower(): + fields.append(slot_name) + break + if slot_name in fields: + continue + + # Check slot_uri for dcterms:title + if slot.slot_uri and "dcterms:title" in slot.slot_uri: + fields.append(slot_name) + + # Fallback: check for common title slot names + if "title" in [s for s in self.schema_view.class_slots(class_name)]: + if "title" not in fields: + fields.append("title") - def _extract_reference_id(self, reference_value: Any) -> Optional[str]: - """Extract reference ID from various value formats. 
- - Supports: - - String: "PMID:12345678" - - Dict with 'id': {"id": "PMID:12345678", "title": "..."} + return fields - Args: - reference_value: Reference value from data + def _extract_reference_id(self, reference_value: Any) -> Optional[str]: + """Extract reference ID from various value formats. + + Supports: + - String: "PMID:12345678" + - Dict with 'id': {"id": "PMID:12345678", "title": "..."} + + Args: + reference_value: Reference value from data + + Returns: + Reference ID string or None + + Examples: + >>> config = ReferenceValidationConfig() + >>> plugin = ReferenceValidationPlugin(config=config) + >>> plugin._extract_reference_id("PMID:12345678") + 'PMID:12345678' + >>> plugin._extract_reference_id({"id": "PMID:12345678"}) + 'PMID:12345678' + """ + if isinstance(reference_value, str): + return reference_value + elif isinstance(reference_value, dict): + return reference_value.get("id") or reference_value.get("reference_id") + return None + + def _extract_title(self, reference_value: Any) -> Optional[str]: + """Extract title from reference value. + + Args: + reference_value: Reference value from data + + Returns: + Title string or None + + Examples: + >>> config = ReferenceValidationConfig() + >>> plugin = ReferenceValidationPlugin(config=config) + >>> plugin._extract_title({"id": "PMID:12345678", "title": "Test"}) + 'Test' + """ + if isinstance(reference_value, dict): + return reference_value.get("title") or reference_value.get( + "reference_title" + ) + return None + + def _validate_excerpt( + self, + excerpt: str, + reference_id: str, + expected_title: Optional[str], + path: str, + ) -> Iterator[LinkMLValidationResult]: + """Validate an excerpt against a reference. 
+ + Args: + excerpt: Supporting text to validate + reference_id: Reference identifier + expected_title: Optional expected title + path: Path in data structure + + Yields: + ValidationResult if validation fails + """ + result = self.validator.validate( + excerpt, reference_id, expected_title=expected_title, path=path + ) - Returns: - Reference ID string or None + if not result.is_valid: + yield LinkMLValidationResult( + type="reference_validation", + severity=Severity.ERROR + if result.severity.value == "ERROR" + else Severity.WARNING, + message=result.message or "Supporting text validation failed", + instance={"supporting_text": excerpt, "reference_id": reference_id}, + instantiates=path, + ) + + def _validate_title( + self, + title: str, + reference_id: str, + path: str, + ) -> Iterator[LinkMLValidationResult]: + """Validate a title against a reference. + + Uses exact matching after normalization (case, whitespace, punctuation). + + Args: + title: Expected title to validate + reference_id: Reference identifier + path: Path in data structure + + Yields: + ValidationResult if validation fails + """ + result = self.validator.validate_title( + reference_id, expected_title=title, path=path + ) - Examples: - >>> config = ReferenceValidationConfig() - >>> plugin = ReferenceValidationPlugin(config=config) - >>> plugin._extract_reference_id("PMID:12345678") - 'PMID:12345678' - >>> plugin._extract_reference_id({"id": "PMID:12345678"}) - 'PMID:12345678' + if not result.is_valid: + yield LinkMLValidationResult( + type="reference_validation", + severity=Severity.ERROR + if result.severity.value == "ERROR" + else Severity.WARNING, + message=result.message or "Title validation failed", + instance={"title": title, "reference_id": reference_id}, + instantiates=path, + ) + + def post_process(self, context: ValidationContext) -> None: + """Post-process hook called after validation. 
+ + Args: + context: Validation context + + Examples: + >>> from linkml.validator.validation_context import ValidationContext + >>> config = ReferenceValidationConfig() + >>> plugin = ReferenceValidationPlugin(config=config) + >>> # Would be called by LinkML validator + """ + logger.info("ReferenceValidationPlugin validation complete") + +else: + + class ReferenceValidationPlugin: # type: ignore[no-redef] + """Placeholder when `linkml` is not installed. + + This module is intentionally importable without LinkML installed (so plugin + discovery / module scanning won't crash). Attempting to *use* this plugin + without LinkML will fail fast. """ - if isinstance(reference_value, str): - return reference_value - elif isinstance(reference_value, dict): - return reference_value.get("id") or reference_value.get("reference_id") - return None - def _extract_title(self, reference_value: Any) -> Optional[str]: - """Extract title from reference value. - - Args: - reference_value: Reference value from data - - Returns: - Title string or None - - Examples: - >>> config = ReferenceValidationConfig() - >>> plugin = ReferenceValidationPlugin(config=config) - >>> plugin._extract_title({"id": "PMID:12345678", "title": "Test"}) - 'Test' - """ - if isinstance(reference_value, dict): - return reference_value.get("title") or reference_value.get("reference_title") - return None - - def _validate_excerpt( - self, - excerpt: str, - reference_id: str, - expected_title: Optional[str], - path: str, - ) -> Iterator[LinkMLValidationResult]: - """Validate an excerpt against a reference. 
- - Args: - excerpt: Supporting text to validate - reference_id: Reference identifier - expected_title: Optional expected title - path: Path in data structure - - Yields: - ValidationResult if validation fails - """ - result = self.validator.validate(excerpt, reference_id, expected_title=expected_title, path=path) - - if not result.is_valid: - yield LinkMLValidationResult( - type="reference_validation", - severity=Severity.ERROR if result.severity.value == "ERROR" else Severity.WARNING, - message=result.message or "Supporting text validation failed", - instance={"supporting_text": excerpt, "reference_id": reference_id}, - instantiates=path, + def __init__( + self, + config: Optional[ReferenceValidationConfig] = None, + cache_dir: Optional[str] = None, + ): + raise ImportError( + "`linkml` is not installed; `ReferenceValidationPlugin` is unavailable." ) - - def post_process(self, context: ValidationContext) -> None: - """Post-process hook called after validation. - - Args: - context: Validation context - - Examples: - >>> from linkml.validator.validation_context import ValidationContext - >>> config = ReferenceValidationConfig() - >>> plugin = ReferenceValidationPlugin(config=config) - >>> # Would be called by LinkML validator - """ - logger.info("ReferenceValidationPlugin validation complete") diff --git a/src/linkml_reference_validator/validation/supporting_text_validator.py b/src/linkml_reference_validator/validation/supporting_text_validator.py index f475299..8f4f34f 100644 --- a/src/linkml_reference_validator/validation/supporting_text_validator.py +++ b/src/linkml_reference_validator/validation/supporting_text_validator.py @@ -47,6 +47,78 @@ def __init__(self, config: ReferenceValidationConfig): self.config = config self.fetcher = ReferenceFetcher(config) + def validate_title( + self, + reference_id: str, + expected_title: str, + path: Optional[str] = None, + ) -> ValidationResult: + """Validate title against a reference. 
+ + Performs exact matching after normalization (case, whitespace, punctuation). + Unlike excerpt validation, this is NOT substring matching. + + Args: + reference_id: The reference identifier (e.g., "PMID:12345678") + expected_title: The title to validate against reference + path: Optional path in data structure for error reporting + + Returns: + ValidationResult with match details + + Examples: + >>> config = ReferenceValidationConfig() + >>> validator = SupportingTextValidator(config) + >>> # Would validate in real usage: + >>> # result = validator.validate_title("PMID:12345678", "Study Title") + """ + reference = self.fetcher.fetch(reference_id) + + if not reference: + return ValidationResult( + is_valid=False, + reference_id=reference_id, + supporting_text="", + severity=ValidationSeverity.ERROR, + message=f"Could not fetch reference: {reference_id}", + path=path, + ) + + if not reference.title: + return ValidationResult( + is_valid=False, + reference_id=reference_id, + supporting_text="", + severity=ValidationSeverity.ERROR, + message=f"Reference {reference_id} has no title to validate against", + path=path, + ) + + normalized_expected = self.normalize_text(expected_title) + normalized_actual = self.normalize_text(reference.title) + + if normalized_expected == normalized_actual: + return ValidationResult( + is_valid=True, + reference_id=reference_id, + supporting_text="", + severity=ValidationSeverity.INFO, + message=f"Title validated successfully for {reference_id}", + path=path, + ) + else: + return ValidationResult( + is_valid=False, + reference_id=reference_id, + supporting_text="", + severity=ValidationSeverity.ERROR, + message=( + f"Title mismatch for {reference_id}: " + f"expected '{expected_title}' but got '{reference.title}'" + ), + path=path, + ) + def validate( self, supporting_text: str, diff --git a/tests/test_title_validation.py b/tests/test_title_validation.py new file mode 100644 index 0000000..1339240 --- /dev/null +++ 
b/tests/test_title_validation.py @@ -0,0 +1,386 @@ +"""Tests for title validation against dcterms:title.""" + +import pytest +from linkml_reference_validator.models import ( + ReferenceContent, + ReferenceValidationConfig, + ValidationSeverity, +) +from linkml_reference_validator.plugins.reference_validation_plugin import ( + ReferenceValidationPlugin, +) +from linkml_reference_validator.validation.supporting_text_validator import ( + SupportingTextValidator, +) + + +@pytest.fixture +def config(tmp_path): + """Create a test configuration.""" + return ReferenceValidationConfig( + cache_dir=tmp_path / "cache", + rate_limit_delay=0.0, + ) + + +@pytest.fixture +def validator(config): + """Create a validator.""" + return SupportingTextValidator(config) + + +@pytest.fixture +def plugin(config): + """Create a validation plugin.""" + return ReferenceValidationPlugin(config=config) + + +class TestTitleValidation: + """Tests for title validation in SupportingTextValidator.""" + + def test_validate_title_exact_match(self, validator, mocker): + """Test title validation with exact match.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate( + "protein functions", + "PMID:123", + expected_title="Role of JAK1 in Cell Signaling", + ) + + assert result.is_valid is True + assert result.severity == ValidationSeverity.INFO + + def test_validate_title_case_insensitive(self, validator, mocker): + """Test title validation is case insensitive.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate( + "protein functions", + "PMID:123", + 
expected_title="role of jak1 in cell signaling", + ) + + assert result.is_valid is True + + def test_validate_title_whitespace_normalization(self, validator, mocker): + """Test title validation normalizes whitespace.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate( + "protein functions", + "PMID:123", + expected_title="Role of JAK1 in Cell Signaling", + ) + + assert result.is_valid is True + + def test_validate_title_punctuation_normalization(self, validator, mocker): + """Test title validation normalizes punctuation.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate( + "protein functions", + "PMID:123", + expected_title="Role of JAK1 in Cell-Signaling", + ) + + assert result.is_valid is True + + def test_validate_title_mismatch(self, validator, mocker): + """Test title validation fails on mismatch.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate( + "protein functions", + "PMID:123", + expected_title="A Completely Different Title", + ) + + assert result.is_valid is False + assert result.severity == ValidationSeverity.ERROR + assert "Title mismatch" in result.message + + def test_validate_title_greek_letter_normalization(self, validator, mocker): + """Test title validation handles Greek letters.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = 
ReferenceContent( + reference_id="PMID:123", + title="α-catenin Function in Cells", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate( + "protein functions", + "PMID:123", + expected_title="alpha-catenin Function in Cells", + ) + + assert result.is_valid is True + + def test_validate_title_with_trailing_period(self, validator, mocker): + """Test title validation handles trailing punctuation.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling.", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate( + "protein functions", + "PMID:123", + expected_title="Role of JAK1 in Cell Signaling", + ) + + assert result.is_valid is True + + def test_validate_title_not_substring(self, validator, mocker): + """Test title validation is exact, not substring matching.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling", + content="The protein functions in cell cycle regulation.", + ) + + # A partial title should NOT match + result = validator.validate( + "protein functions", + "PMID:123", + expected_title="Role of JAK1", # Missing "in Cell Signaling" + ) + + assert result.is_valid is False + assert "Title mismatch" in result.message + + +class TestTitleValidationStandalone: + """Tests for standalone title validation without excerpt.""" + + def test_validate_title_only(self, validator, mocker): + """Test validating title alone without supporting text.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate_title( + "PMID:123", 
+ expected_title="Role of JAK1 in Cell Signaling", + ) + + assert result.is_valid is True + assert result.severity == ValidationSeverity.INFO + + def test_validate_title_only_mismatch(self, validator, mocker): + """Test title-only validation fails on mismatch.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title="Role of JAK1 in Cell Signaling", + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate_title( + "PMID:123", + expected_title="Wrong Title", + ) + + assert result.is_valid is False + assert "Title mismatch" in result.message + + def test_validate_title_only_no_reference_title(self, validator, mocker): + """Test title validation when reference has no title.""" + mock_fetch = mocker.patch.object(validator.fetcher, "fetch") + mock_fetch.return_value = ReferenceContent( + reference_id="PMID:123", + title=None, + content="The protein functions in cell cycle regulation.", + ) + + result = validator.validate_title( + "PMID:123", + expected_title="Some Title", + ) + + assert result.is_valid is False + assert "no title" in result.message.lower() + + +class TestPluginTitleFieldDiscovery: + """Tests for title field discovery in the plugin.""" + + def test_find_title_fields_dcterms(self, plugin, mocker): + """Test finding title fields implementing dcterms:title.""" + mock_class_def = mocker.MagicMock() + mock_slot = mocker.MagicMock() + mock_slot.implements = ["dcterms:title"] + + plugin.schema_view = mocker.MagicMock() + plugin.schema_view.get_class.return_value = mock_class_def + plugin.schema_view.class_slots.return_value = ["reference_title", "other_field"] + plugin.schema_view.induced_slot.side_effect = lambda name, cls: ( + mock_slot if name == "reference_title" else None + ) + + fields = plugin._find_title_fields("Evidence") + + assert "reference_title" in fields + + def test_find_title_fields_slot_uri(self, plugin, mocker): + 
"""Test finding title fields via slot_uri dcterms:title.""" + mock_class_def = mocker.MagicMock() + mock_slot = mocker.MagicMock() + mock_slot.implements = None + mock_slot.slot_uri = "dcterms:title" + + plugin.schema_view = mocker.MagicMock() + plugin.schema_view.get_class.return_value = mock_class_def + plugin.schema_view.class_slots.return_value = ["title", "other_field"] + plugin.schema_view.induced_slot.side_effect = lambda name, cls: ( + mock_slot if name == "title" else mocker.MagicMock(implements=None, slot_uri=None) + ) + + fields = plugin._find_title_fields("Evidence") + + assert "title" in fields + + def test_find_title_fields_fallback(self, plugin, mocker): + """Test fallback to 'title' slot name.""" + mock_class_def = mocker.MagicMock() + mock_slot = mocker.MagicMock() + mock_slot.implements = None + mock_slot.slot_uri = None + + plugin.schema_view = mocker.MagicMock() + plugin.schema_view.get_class.return_value = mock_class_def + plugin.schema_view.class_slots.return_value = ["title", "other_field"] + plugin.schema_view.induced_slot.return_value = mock_slot + + fields = plugin._find_title_fields("Evidence") + + assert "title" in fields + + +class TestPluginTitleValidation: + """Tests for title validation in the plugin process flow.""" + + def test_validate_with_title_field(self, plugin, mocker): + """Test validation includes title field from data.""" + mock_validate = mocker.patch.object(plugin.validator, "validate") + mock_result = mocker.MagicMock() + mock_result.is_valid = True + mock_result.severity.value = "INFO" + mock_validate.return_value = mock_result + + # Setup schema view + mock_slot_ref = mocker.MagicMock() + mock_slot_ref.implements = ["linkml:authoritative_reference"] + mock_slot_ref.range = None + + mock_slot_excerpt = mocker.MagicMock() + mock_slot_excerpt.implements = ["linkml:excerpt"] + mock_slot_excerpt.range = None + + mock_slot_title = mocker.MagicMock() + mock_slot_title.implements = ["dcterms:title"] + mock_slot_title.slot_uri 
= None + mock_slot_title.range = None + + plugin.schema_view = mocker.MagicMock() + plugin.schema_view.get_class.return_value = mocker.MagicMock() + plugin.schema_view.class_slots.return_value = [ + "reference", + "supporting_text", + "reference_title", + ] + plugin.schema_view.induced_slot.side_effect = lambda name, cls: { + "reference": mock_slot_ref, + "supporting_text": mock_slot_excerpt, + "reference_title": mock_slot_title, + }.get(name) + + instance = { + "reference": "PMID:12345678", + "supporting_text": "test quote", + "reference_title": "Test Article Title", + } + + list(plugin._validate_instance(instance, "Evidence", "")) + + # Verify validate was called with expected_title + mock_validate.assert_called_once() + call_kwargs = mock_validate.call_args + # The title should have been passed as expected_title + assert call_kwargs[1].get("expected_title") == "Test Article Title" or \ + (len(call_kwargs[0]) >= 3 and call_kwargs[0][2] == "Test Article Title") + + def test_validate_title_only_field(self, plugin, mocker): + """Test validation of title-only (no excerpt).""" + mock_validate_title = mocker.patch.object(plugin.validator, "validate_title") + mock_result = mocker.MagicMock() + mock_result.is_valid = False + mock_result.message = "Title mismatch" + mock_result.severity.value = "ERROR" + mock_validate_title.return_value = mock_result + + # Setup schema view - only reference and title, no excerpt + mock_slot_ref = mocker.MagicMock() + mock_slot_ref.implements = ["linkml:authoritative_reference"] + mock_slot_ref.range = None + + mock_slot_title = mocker.MagicMock() + mock_slot_title.implements = ["dcterms:title"] + mock_slot_title.slot_uri = None + mock_slot_title.range = None + + plugin.schema_view = mocker.MagicMock() + plugin.schema_view.get_class.return_value = mocker.MagicMock() + plugin.schema_view.class_slots.return_value = [ + "reference", + "reference_title", + ] + plugin.schema_view.induced_slot.side_effect = lambda name, cls: { + "reference": 
mock_slot_ref, + "reference_title": mock_slot_title, + }.get(name) + + instance = { + "reference": "PMID:12345678", + "reference_title": "Expected Title", + } + + results = list(plugin._validate_instance(instance, "Evidence", "")) + + # Should have validated title + mock_validate_title.assert_called_once() + assert len(results) == 1 + assert results[0].type == "reference_validation"
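
The tests above pin down the behaviour expected of title matching: case, whitespace, punctuation, trailing periods, and Greek letters are normalized away, yet a partial title must still fail. The sketch below illustrates that contract in isolation. It is not the project's `normalize_text` implementation, which this diff does not show; the `GREEK` table, `normalize_title`, and `titles_match` are hypothetical names, and the normalization rules are assumptions inferred from the test cases.

```python
import re

# Hypothetical Greek-letter table; the validator's real mapping is not shown in this diff.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def normalize_title(text: str) -> str:
    """Lowercase, spell out Greek letters, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    for symbol, name in GREEK.items():
        text = text.replace(symbol, name)
    # Assumption: punctuation becomes a space, so "Cell-Signaling" equals "Cell Signaling".
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def titles_match(expected: str, actual: str) -> bool:
    """Exact equality after normalization; deliberately not a substring test."""
    return normalize_title(expected) == normalize_title(actual)


if __name__ == "__main__":
    full = "Role of JAK1 in Cell Signaling"
    print(titles_match("Role of JAK1 in Cell-Signaling", full))  # punctuation variant
    print(titles_match("Role of JAK1", full))                    # partial title
```

Comparing with `==` rather than `in` is what separates title validation from excerpt validation: a truncated title is a substring of the full title, but it is not equal to it after normalization.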