diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
index edff926..c65d51a 100644
--- a/docs/concepts/how-it-works.md
+++ b/docs/concepts/how-it-works.md
@@ -68,7 +68,7 @@ We don't use LLMs or semantic similarity because:
## Reference Fetching
-The validator supports multiple reference types:
+The validator uses a **plugin architecture** to support multiple reference sources. Each source type is handled by a dedicated plugin that knows how to fetch and parse content from that source.
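+
+For example, the fetcher picks a plugin by prefix via the registry (a sketch using the public registry API):
+
+```python
+from linkml_reference_validator.etl.sources import ReferenceSourceRegistry
+
+source_class = ReferenceSourceRegistry.get_source("PMID:12345678")
+print(source_class.prefix())  # 'PMID'
+```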
### PubMed (PMID)
@@ -76,49 +76,47 @@ For `PMID:12345678`:
1. Queries NCBI E-utilities API
2. Fetches abstract and metadata
-3. Attempts to retrieve full-text from PMC if available
-4. Parses XML response with BeautifulSoup
-5. Caches as markdown with YAML frontmatter
+3. Parses XML response
+4. Caches as markdown with YAML frontmatter
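+
+A cached entry looks roughly like this (placeholder values):
+
+```markdown
+---
+reference_id: PMID:12345678
+title: "Example Article Title"
+content_type: abstract_only
+---
+
+# Example Article Title
+
+## Content
+
+Abstract text...
+```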
-### DOI (Digital Object Identifier)
+### PubMed Central (PMC)
-For `DOI:10.1234/journal.article`:
+For `PMC:12345`:
-1. Queries Crossref API for metadata
-2. Fetches abstract and bibliographic information
-3. Extracts title, authors, journal, year
-4. Caches abstract and metadata as markdown
+1. Queries PMC API for full-text XML
+2. Extracts all sections (abstract, introduction, methods, results, discussion)
+3. Provides more content than abstracts alone
+4. Also cached as markdown
-### URLs
+### DOI (Digital Object Identifier)
-For `URL:https://example.com/page` or `https://example.com/page`:
+For `DOI:10.1234/example`:
-1. Makes HTTP GET request to fetch web page
-2. Extracts title from `<title>` tag
-3. Converts HTML to plain text (removes scripts, styles, navigation)
-4. Normalizes whitespace
-5. Caches as markdown with content type `html_converted`
+1. Queries Crossref API
+2. Fetches metadata and abstract (when available)
+3. Caches as markdown
-**Use cases for URLs:**
-- Online book chapters
-- Educational resources
-- Documentation pages
-- Any static web content
+### Local Files
-**Limitations:**
-- Works best with static HTML content
-- Does not execute JavaScript
-- Cannot access content behind authentication
-- Complex dynamic pages may not extract well
+For `file:./path/to/document.md`:
-### PubMed Central (PMC)
+1. Reads file from local filesystem
+2. Extracts title from first markdown heading (or uses filename)
+3. Content is used as-is (even HTML files are not converted to plain text)
+4. Caches to allow consistent validation
-For `PMC:12345`:
+Path resolution:
+- Absolute paths work directly
+- Relative paths use `reference_base_dir` config if set, otherwise current directory
-1. Queries PMC API for full-text XML
-2. Extracts all sections (abstract, introduction, methods, results, discussion)
-3. Provides more content than abstracts alone
-4. Also cached as markdown
+### URLs
+
+For `url:https://example.com/page`:
+
+1. Fetches page via HTTP GET
+2. Extracts title from `<title>` tag (for HTML)
+3. Content preserved as-is
+4. Cached like other sources
## Caching
diff --git a/docs/how-to/add-reference-source.md b/docs/how-to/add-reference-source.md
new file mode 100644
index 0000000..404d6fc
--- /dev/null
+++ b/docs/how-to/add-reference-source.md
@@ -0,0 +1,263 @@
+# Adding a New Reference Source
+
+The validator uses a plugin architecture that makes it easy to add support for new reference types. This guide shows how to create a custom reference source.
+
+## Overview
+
+Each reference source is a Python class that:
+
+1. Inherits from `ReferenceSource`
+2. Implements `prefix()` and `fetch()` methods
+3. Registers itself with the `ReferenceSourceRegistry`
+
+## Step 1: Create the Source Class
+
+Create a new file in `src/linkml_reference_validator/etl/sources/`:
+
+```python
+# src/linkml_reference_validator/etl/sources/arxiv.py
+"""arXiv reference source."""
+
+import logging
+from typing import Optional
+
+from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
+from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry
+
+logger = logging.getLogger(__name__)
+
+
+@ReferenceSourceRegistry.register
+class ArxivSource(ReferenceSource):
+ """Fetch references from arXiv."""
+
+ @classmethod
+ def prefix(cls) -> str:
+ """Return the prefix this source handles."""
+ return "arxiv"
+
+ def fetch(
+ self, identifier: str, config: ReferenceValidationConfig
+ ) -> Optional[ReferenceContent]:
+ """Fetch a paper from arXiv.
+
+ Args:
+ identifier: arXiv ID (e.g., '2301.07041')
+ config: Configuration for fetching
+
+ Returns:
+ ReferenceContent if successful, None otherwise
+ """
+ # Your implementation here
+ # Fetch from arXiv API, parse response, return ReferenceContent
+ ...
+```
+
+## Step 2: Implement the `fetch()` Method
+
+The `fetch()` method should:
+
+1. Accept an identifier (without the prefix)
+2. Fetch content from the external source
+3. Return a `ReferenceContent` object or `None` on failure
+
+```python
+def fetch(
+ self, identifier: str, config: ReferenceValidationConfig
+) -> Optional[ReferenceContent]:
+ """Fetch a paper from arXiv."""
+ import requests
+ import time
+
+ arxiv_id = identifier.strip()
+
+ # Respect rate limiting
+ time.sleep(config.rate_limit_delay)
+
+ # Fetch from arXiv API
+ url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
+ response = requests.get(url, timeout=30)
+
+ if response.status_code != 200:
+ logger.warning(f"Failed to fetch arxiv:{arxiv_id}")
+ return None
+
+ # Parse the response (arXiv returns Atom XML)
+ title, authors, abstract = self._parse_arxiv_response(response.text)
+
+ return ReferenceContent(
+ reference_id=f"arxiv:{arxiv_id}",
+ title=title,
+ content=abstract,
+ content_type="abstract_only",
+ authors=authors,
+ )
+```
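+
+The `_parse_arxiv_response` helper is yours to write. Here is a minimal sketch using only the standard library, assuming the Atom layout the arXiv API returns (a single `entry` element with `title`, `summary`, and `author/name` children):
+
+```python
+import xml.etree.ElementTree as ET
+
+
+def _parse_arxiv_response(self, xml_text: str) -> tuple[str, list[str], str]:
+    """Extract (title, authors, abstract) from an arXiv Atom response."""
+    ns = {"atom": "http://www.w3.org/2005/Atom"}
+    entry = ET.fromstring(xml_text).find("atom:entry", ns)
+    if entry is None:
+        return "", [], ""
+    title = (entry.findtext("atom:title", "", ns) or "").strip()
+    abstract = (entry.findtext("atom:summary", "", ns) or "").strip()
+    authors = [
+        (a.findtext("atom:name", "", ns) or "").strip()
+        for a in entry.findall("atom:author", ns)
+    ]
+    return title, authors, abstract
+```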
+
+## Step 3: Handle Errors Gracefully
+
+Since you're interfacing with external systems, wrap API calls in try/except:
+
+```python
+def fetch(self, identifier: str, config: ReferenceValidationConfig) -> Optional[ReferenceContent]:
+ try:
+ response = requests.get(url, timeout=30)
+ # ... process response
+ except Exception as e:
+ logger.warning(f"Failed to fetch arxiv:{identifier}: {e}")
+ return None
+```
+
+## Step 4: Register the Source
+
+The `@ReferenceSourceRegistry.register` decorator automatically registers your source when the module is imported.
+
+Add the import to `src/linkml_reference_validator/etl/sources/__init__.py`:
+
+```python
+from linkml_reference_validator.etl.sources.arxiv import ArxivSource
+
+__all__ = [
+ # ... existing exports
+ "ArxivSource",
+]
+```
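+
+To confirm registration, list the registered prefixes:
+
+```python
+from linkml_reference_validator.etl.sources import ReferenceSourceRegistry
+
+prefixes = [s.prefix() for s in ReferenceSourceRegistry.list_sources()]
+assert "arxiv" in prefixes
+```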
+
+## Step 5: Write Tests
+
+Create tests in `tests/test_sources.py`:
+
+```python
+import pytest
+from unittest.mock import MagicMock, patch
+
+from linkml_reference_validator.models import ReferenceValidationConfig
+from linkml_reference_validator.etl.sources.arxiv import ArxivSource
+
+
+class TestArxivSource:
+    """Tests for ArxivSource."""
+
+    @pytest.fixture
+    def source(self):
+        return ArxivSource()
+
+    @pytest.fixture
+    def config(self):
+        # Default config is fine here; network calls are mocked
+        return ReferenceValidationConfig()
+
+ def test_prefix(self, source):
+ assert source.prefix() == "arxiv"
+
+ def test_can_handle(self, source):
+ assert source.can_handle("arxiv:2301.07041")
+ assert not source.can_handle("PMID:12345")
+
+ @patch("linkml_reference_validator.etl.sources.arxiv.requests.get")
+ def test_fetch(self, mock_get, source, config):
+ mock_response = MagicMock()
+ mock_response.status_code = 200
+ mock_response.text = """...""" # Mock arXiv XML
+ mock_get.return_value = mock_response
+
+ result = source.fetch("2301.07041", config)
+
+ assert result is not None
+ assert result.reference_id == "arxiv:2301.07041"
+```
+
+## Optional: Custom `can_handle()` Method
+
+The default `can_handle()` checks if the reference starts with your prefix. Override it for custom matching:
+
+```python
+@classmethod
+def can_handle(cls, reference_id: str) -> bool:
+ """Handle arxiv: references and bare arXiv IDs."""
+ ref = reference_id.strip()
+ # Match prefix
+ if ref.lower().startswith("arxiv:"):
+ return True
+ # Match bare arXiv ID pattern (e.g., 2301.07041)
+ import re
+ return bool(re.match(r"^\d{4}\.\d{4,5}(v\d+)?$", ref))
+```
+
+## Complete Example
+
+Here's a complete implementation for a hypothetical "WikiData" source:
+
+```python
+# src/linkml_reference_validator/etl/sources/wikidata.py
+"""WikiData reference source."""
+
+import logging
+import time
+from typing import Optional
+
+import requests
+
+from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
+from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry
+
+logger = logging.getLogger(__name__)
+
+
+@ReferenceSourceRegistry.register
+class WikidataSource(ReferenceSource):
+ """Fetch reference content from WikiData items."""
+
+ @classmethod
+ def prefix(cls) -> str:
+ return "wikidata"
+
+ def fetch(
+ self, identifier: str, config: ReferenceValidationConfig
+ ) -> Optional[ReferenceContent]:
+ qid = identifier.strip().upper()
+ if not qid.startswith("Q"):
+ qid = f"Q{qid}"
+
+ time.sleep(config.rate_limit_delay)
+
+ url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
+
+ try:
+ response = requests.get(url, timeout=30)
+ if response.status_code != 200:
+ logger.warning(f"Failed to fetch wikidata:{qid}")
+ return None
+
+ data = response.json()
+ entity = data["entities"].get(qid, {})
+
+ # Extract label and description
+ labels = entity.get("labels", {})
+ descriptions = entity.get("descriptions", {})
+
+ title = labels.get("en", {}).get("value", qid)
+ description = descriptions.get("en", {}).get("value", "")
+
+ return ReferenceContent(
+ reference_id=f"wikidata:{qid}",
+ title=title,
+ content=description,
+ content_type="wikidata_description",
+ )
+
+ except Exception as e:
+ logger.warning(f"Error fetching wikidata:{qid}: {e}")
+ return None
+```
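+
+Once imported in `sources/__init__.py` (see Step 4), the new prefix works through the normal CLI. The supporting text below is a placeholder:
+
+```bash
+linkml-reference-validator validate text \
+  "Some text from the item's description" \
+  wikidata:Q42
+```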
+
+## Reference: ReferenceContent Fields
+
+The `ReferenceContent` model has these fields:
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `reference_id` | `str` | Full reference ID with prefix (e.g., `arxiv:2301.07041`) |
+| `title` | `Optional[str]` | Title of the reference |
+| `content` | `Optional[str]` | Main text content for validation |
+| `content_type` | `str` | Type indicator (e.g., `abstract_only`, `full_text`) |
+| `authors` | `Optional[list[str]]` | List of author names |
+| `journal` | `Optional[str]` | Journal/venue name |
+| `year` | `Optional[str]` | Publication year |
+| `doi` | `Optional[str]` | DOI if available |
+
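+For example, a minimal instance (placeholder values):
+
+```python
+ref = ReferenceContent(
+    reference_id="arxiv:2301.07041",
+    title="An Example Title",
+    content="Abstract text used for matching...",
+    content_type="abstract_only",
+    authors=["A. Author"],
+)
+```
+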
+## Tips
+
+- **Rate limiting**: Always respect `config.rate_limit_delay` between API calls
+- **Error handling**: Return `None` on failures, don't raise exceptions
+- **Logging**: Use `logger.warning()` for failures to aid debugging
+- **Caching**: The `ReferenceFetcher` handles caching automatically; your source only needs to fetch
+- **Testing**: Mock external API calls in tests to avoid network dependencies
diff --git a/docs/how-to/use-local-files-and-urls.md b/docs/how-to/use-local-files-and-urls.md
new file mode 100644
index 0000000..385f271
--- /dev/null
+++ b/docs/how-to/use-local-files-and-urls.md
@@ -0,0 +1,146 @@
+# Using Local Files and URLs as References
+
+In addition to PubMed IDs and DOIs, the validator supports local files and web URLs as reference sources. This is useful when your supporting text comes from internal documentation, research notes, or web pages.
+
+## Local File References
+
+Use the `file:` prefix to reference local files:
+
+```bash
+linkml-reference-validator validate text \
+ "JAK1 binds to the receptor complex" \
+ file:./research/jak-signaling-notes.md
+```
+
+### Supported File Types
+
+- **Markdown** (`.md`) - Title extracted from first `# heading`
+- **Plain text** (`.txt`) - Content used as-is
+- **HTML** (`.html`) - Content preserved including HTML entities
+
+### Path Resolution
+
+**Absolute paths** always work:
+
+```bash
+file:/Users/me/research/notes.md
+```
+
+**Relative paths** are resolved in order:
+
+1. If `reference_base_dir` is configured, paths resolve relative to it
+2. Otherwise, paths resolve relative to the current working directory
+
+### Configuring a Base Directory
+
+Set a base directory for all relative file references:
+
+```python
+from linkml_reference_validator.models import ReferenceValidationConfig
+from pathlib import Path
+
+config = ReferenceValidationConfig(
+ reference_base_dir=Path("./references"),
+)
+```
+
+Then `file:notes.md` resolves to `./references/notes.md`.
+
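+You can pass the same config to the fetcher (a sketch of the Python API):
+
+```python
+from linkml_reference_validator.etl.reference_fetcher import ReferenceFetcher
+
+fetcher = ReferenceFetcher(config)
+ref = fetcher.fetch("file:notes.md")  # reads ./references/notes.md
+```
+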
+### Example: Validating Against Research Notes
+
+Create a research file:
+
+```markdown
+# JAK-STAT Signaling Pathway
+
+JAK1 binds to the receptor complex and initiates downstream signaling.
+This leads to STAT phosphorylation and nuclear translocation.
+```
+
+Validate:
+
+```bash
+linkml-reference-validator validate text \
+ "JAK1 binds to the receptor complex" \
+ file:./jak-signaling.md
+```
+
+## URL References
+
+Use the `url:` prefix to reference web pages:
+
+```bash
+linkml-reference-validator validate text \
+ "Climate change affects biodiversity" \
+ url:https://example.org/climate-report.html
+```
+
+### Caching
+
+URLs are cached the same way as PMID and DOI references:
+
+- First fetch downloads and caches the content
+- Subsequent validations use the cached version
+- Use `--force-refresh` to re-fetch
+
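+For example, to re-fetch a page that may have changed:
+
+```bash
+linkml-reference-validator validate text \
+  "Climate change affects biodiversity" \
+  url:https://example.org/climate-report.html \
+  --force-refresh
+```
+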
+### Title Extraction
+
+For HTML pages, the title is extracted from the `<title>` tag. For other content types, the URL itself is used as the title.
+
+### Example: Validating Against a Web Page
+
+```bash
+# First validation fetches and caches
+linkml-reference-validator validate text \
+ "The quick brown fox jumps over the lazy dog" \
+ url:https://example.com/pangram-examples.html
+
+# Subsequent validations use cache
+linkml-reference-validator validate text \
+ "A quick brown fox" \
+ url:https://example.com/pangram-examples.html
+```
+
+## Using in Data Files
+
+Both file and URL references work in LinkML data files:
+
+```yaml
+# data.yaml
+- id: local-evidence
+ supporting_text: JAK1 binds to the receptor complex
+ reference: file:./research/jak-notes.md
+
+- id: web-evidence
+ supporting_text: Climate impacts are accelerating
+ reference: url:https://example.org/climate-report.html
+```
+
+## Reference Type Summary
+
+| Prefix | Example | Source |
+|--------|---------|--------|
+| `PMID:` | `PMID:16888623` | PubMed via NCBI Entrez |
+| `DOI:` | `DOI:10.1038/nature12373` | Crossref API |
+| `file:` | `file:./notes.md` | Local filesystem |
+| `url:` | `url:https://example.com` | Web (HTTP/HTTPS) |
+
+## Best Practices
+
+### For Local Files
+
+- Keep reference files in a dedicated directory
+- Use `reference_base_dir` for consistent path resolution
+- Use markdown for structured content with clear headings
+
+### For URLs
+
+- Prefer stable URLs (avoid query parameters that change)
+- Be aware that web content may change (cache helps with reproducibility)
+- Consider downloading important pages as local files for long-term stability
+
+## Limitations
+
+- **PDF files**: Not yet supported (planned for future)
+- **Authentication**: URLs requiring login are not supported
+- **Dynamic content**: JavaScript-rendered pages may not work
diff --git a/docs/how-to/validate-urls.md b/docs/how-to/validate-urls.md
index a85f617..77426e9 100644
--- a/docs/how-to/validate-urls.md
+++ b/docs/how-to/validate-urls.md
@@ -15,34 +15,29 @@ The linkml-reference-validator supports validating references that point to web
When a reference field contains a URL, the validator:
1. Fetches the web page content
-2. Extracts the page title
-3. Converts HTML to plain text
-4. Validates the extracted content against your supporting text
+2. Extracts the page title from the `<title>` tag (for HTML)
+3. Caches the content for future validations
+4. Validates your supporting text against the page content
## URL Format
-URLs can be specified in two ways:
-
-### Explicit URL Prefix
+Use the `url:` prefix to specify URL references:
```yaml
my_field:
value: "Some text from the web page..."
references:
- - "URL:https://example.com/book/chapter1"
+ - "url:https://example.com/book/chapter1"
```
-### Direct URL
+Or via CLI:
-```yaml
-my_field:
- value: "Some text from the web page..."
- references:
- - "https://example.com/book/chapter1"
+```bash
+linkml-reference-validator validate text \
+ "Some text from the web page" \
+ url:https://example.com/book/chapter1
```
-Both formats are equivalent. If a reference starts with `http://` or `https://`, it's automatically recognized as a URL reference.
-
## Example
Suppose you have an online textbook chapter at `https://example.com/biology/cell-structure` with the following content:
@@ -62,11 +57,10 @@ Suppose you have an online textbook chapter at `https://example.com/biology/cell
You can validate text extracted from this chapter:
-```yaml
-description:
- value: "The cell is the basic structural and functional unit of all living organisms"
- references:
- - "https://example.com/biology/cell-structure"
+```bash
+linkml-reference-validator validate text \
+ "The cell is the basic structural and functional unit of all living organisms" \
+ url:https://example.com/biology/cell-structure
```
## How URL Validation Works
@@ -80,39 +74,39 @@ When the validator encounters a URL reference, it:
- Respects rate limiting (configurable via `rate_limit_delay`)
- Handles timeouts (default 30 seconds)
-### 2. Content Extraction
-
-The fetcher extracts content from the HTML:
+### 2. Content Storage
-- **Title**: Extracted from the `<title>` tag
-- **Content**: HTML is converted to plain text using BeautifulSoup
-- **Cleanup**: Removes scripts, styles, navigation, headers, and footers
-- **Normalization**: Whitespace is normalized for better matching
+The fetcher stores:
-### 3. Content Type
+- **Title**: Extracted from the `<title>` tag (for HTML pages)
+- **Content**: The raw page content as received
+- **Content type**: Marked as `url` to distinguish from other reference types
-URL references are marked with content type `html_converted` to distinguish them from other reference types like abstracts or full-text articles.
+Note: The validator stores the raw page content as received, without HTML-to-text conversion. This preserves the original content, though HTML tags will be present in the cached file.
-### 4. Caching
+### 3. Caching
Fetched URL content is cached to disk in markdown format with YAML frontmatter:
```markdown
---
-reference_id: URL:https://example.com/biology/cell-structure
+reference_id: url:https://example.com/biology/cell-structure
title: "Chapter 3: Cell Structure and Function"
-content_type: html_converted
+content_type: url
---
# Chapter 3: Cell Structure and Function
## Content
-The cell is the basic structural and functional unit of all living organisms.
-Cells contain various organelles that perform specific functions...
+
+<html>
+<head>
+  <title>Chapter 3: Cell Structure and Function</title>
+</head>
+...
```
-Cache files are stored in the configured cache directory (default: `.linkml-reference-validator-cache/`).
+Cache files are stored in the configured cache directory (default: `references_cache/`).
## Configuration
@@ -145,15 +139,13 @@ URL validation is designed for static web pages. It may not work well with:
- Content behind paywalls
- Frequently changing content
-### HTML Structure
+### Raw Content
-The content extraction works by:
+The validator stores raw page content. For HTML pages:
-- Removing navigation, headers, and footers
-- Converting remaining HTML to text
-- Normalizing whitespace
-
-This works well for simple HTML but may not capture content perfectly from complex layouts.
+- HTML tags are preserved in the cache
+- The text normalization during validation handles most cases
+- Complex HTML layouts may require careful text extraction
### No Rendering
@@ -161,7 +153,6 @@ The fetcher downloads raw HTML and parses it directly. It does not:
- Execute JavaScript
- Render the page in a browser
-- Follow redirects automatically (may be added in future)
- Handle dynamic content
## Best Practices
@@ -170,10 +161,9 @@ The fetcher downloads raw HTML and parses it directly. It does not:
Choose URLs that are unlikely to change:
-- ✅ Versioned documentation: `https://docs.example.com/v1.0/chapter1`
-- ✅ Archived content: `https://archive.example.com/2024/article`
-- ❌ Blog posts with dates that might be reorganized
-- ❌ URLs with session parameters
+- Versioned documentation: `https://docs.example.com/v1.0/chapter1`
+- Archived content: `https://archive.example.com/2024/article`
+- Avoid URLs with session parameters
### 2. Verify Content Quality
@@ -181,15 +171,15 @@ After adding a URL reference, verify the extracted content:
```bash
# Check what was extracted
-cat .linkml-reference-validator-cache/URL_https___example.com_page.md
+cat references_cache/url_https___example.com_page.md
```
-Ensure the extracted text contains the relevant information you're referencing.
+Ensure the cached content contains the text you're referencing.
### 3. Cache Management
- Commit cache files to version control for reproducibility
-- Use `--force-refresh` to update cached content
+- Use `--force-refresh` to update cached content when pages change
- Periodically review cached URLs to ensure they're still accessible
### 4. Mix Reference Types
@@ -202,7 +192,7 @@ findings:
references:
- "PMID:12345678" # Research paper
- "DOI:10.1234/journal.article" # Another paper
- - "https://example.com/textbook/chapter5" # Textbook chapter
+ - "url:https://example.com/textbook/chapter5" # Textbook chapter
```
## Troubleshooting
@@ -216,15 +206,6 @@ If URL content isn't being fetched:
3. Check for rate limiting or IP blocks
4. Look for error messages in the logs
-### Incorrect Content Extraction
-
-If the wrong content is extracted:
-
-1. Inspect the cached markdown file
-2. Check if the page uses complex JavaScript
-3. Consider if the page structure requires custom parsing
-4. File an issue with the page URL for improvement
-
### Validation Failing
If validation fails for URL references:
@@ -234,19 +215,31 @@ If validation fails for URL references:
3. Check for whitespace or formatting differences
4. Consider if the page content has changed since caching
+### Force Refresh
+
+To re-fetch content for a URL that may have changed:
+
+```bash
+linkml-reference-validator validate text \
+ "Updated content" \
+ url:https://example.com/page \
+ --force-refresh
+```
+
## Comparison with Other Reference Types
-| Feature | PMID | DOI | URL |
-|---------|------|-----|-----|
-| Source | PubMed | Crossref | Any web page |
-| Content Type | Abstract + Full Text | Abstract | HTML converted |
-| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) |
-| Stability | High | High | Variable |
-| Access | Free for abstracts | Varies | Varies |
-| Caching | Yes | Yes | Yes |
+| Feature | PMID | DOI | URL | file |
+|---------|------|-----|-----|------|
+| Source | PubMed | Crossref | Any web page | Local filesystem |
+| Content Type | Abstract + Full Text | Abstract | Raw HTML/text | Raw file content |
+| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) | Minimal (title from heading) |
+| Stability | High | High | Variable | High (local control) |
+| Access | Free for abstracts | Varies | Varies | Always available |
+| Caching | Yes | Yes | Yes | Yes |
## See Also
+- [Using Local Files and URLs](use-local-files-and-urls.md) - Quick reference for file and URL sources
- [Validating DOIs](validate-dois.md) - For journal articles with DOIs
- [Validating OBO Files](validate-obo-files.md) - For ontology-specific validation
- [How It Works](../concepts/how-it-works.md) - Core validation concepts
diff --git a/docs/index.md b/docs/index.md
index ba92ac8..7308c79 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -2,11 +2,12 @@
**Validate quotes and excerpts against their source publications**
-linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC, DOIs via Crossref, and URLs, then performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.
+linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC, DOIs, local files, and URLs, then performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.
## Key Features
- **Deterministic validation** - No fuzzy matching or AI hallucinations
+- **Multiple reference sources** - PubMed, DOIs, local files, and URLs
- **Editorial convention support** - Handles `[clarifications]` and `...` ellipsis
- **Multiple interfaces** - CLI for quick checks, Python API for integration
- **LinkML integration** - Validates data files with `linkml:excerpt` annotations
diff --git a/docs/quickstart.md b/docs/quickstart.md
index c4a9b90..24d5d26 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -86,33 +86,26 @@ linkml-reference-validator validate text \
This works the same way as PMID validation - the reference is fetched and cached locally.
-## Validate Against a URL
+## Validate Against Local Files
-For online resources like book chapters, documentation, or educational content:
+You can also validate against local markdown, text, or HTML files:
```bash
linkml-reference-validator validate text \
- "The cell is the basic structural and functional unit of all living organisms" \
- https://example.com/biology/cell-structure
+ "JAK1 binds to the receptor complex" \
+ file:./research/jak-notes.md
```
-Or with explicit URL prefix:
+## Validate Against URLs
+
+Web pages can also be used as references:
```bash
linkml-reference-validator validate text \
- "The cell is the basic unit of life" \
- URL:https://example.com/biology/cells
+ "Climate change affects biodiversity" \
+ url:https://example.org/climate-report.html
```
-The validator will:
-1. Fetch the web page content
-2. Extract the title from the `<title>` tag
-3. Convert HTML to plain text (removing scripts, styles, navigation)
-4. Cache the content locally
-5. Validate your text against the extracted content
-
-**Note:** URL validation works best with static HTML pages and may not work well with JavaScript-heavy or dynamic content.
-
## Key Features
- **Automatic Caching**: References cached locally after first fetch
@@ -121,7 +114,8 @@ The validator will:
- **Deterministic Matching**: Substring-based (not AI/fuzzy matching)
- **PubMed & PMC**: Fetches from NCBI automatically
- **DOI Support**: Fetches metadata from Crossref API
-- **URL Support**: Validates against web content (books, docs, educational resources)
+- **Local Files**: Validate against markdown, text, or HTML files
+- **URL Support**: Validate against web pages
## Next Steps
diff --git a/mkdocs.yml b/mkdocs.yml
index d9510ed..2b8f0bb 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -34,6 +34,8 @@ nav:
- Validating OBO Files: how-to/validate-obo-files.md
- Validating DOIs: how-to/validate-dois.md
- Validating URLs: how-to/validate-urls.md
+ - Using Local Files and URLs: how-to/use-local-files-and-urls.md
+ - Adding a New Reference Source: how-to/add-reference-source.md
- Concepts:
- How It Works: concepts/how-it-works.md
- Editorial Conventions: concepts/editorial-conventions.md
diff --git a/src/linkml_reference_validator/etl/reference_fetcher.py b/src/linkml_reference_validator/etl/reference_fetcher.py
index 5fc377b..83f9282 100644
--- a/src/linkml_reference_validator/etl/reference_fetcher.py
+++ b/src/linkml_reference_validator/etl/reference_fetcher.py
@@ -1,17 +1,18 @@
-"""Fetching and caching of references from various sources."""
+"""Fetching and caching of references from various sources.
+
+This module provides the main ReferenceFetcher class that coordinates
+fetching from various sources (PMID, DOI, file, URL) using a plugin architecture.
+"""
import logging
import re
-import time
from pathlib import Path
from typing import Optional
from ruamel.yaml import YAML # type: ignore
-from Bio import Entrez # type: ignore
-from bs4 import BeautifulSoup # type: ignore
-import requests # type: ignore
from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
+from linkml_reference_validator.etl.sources import ReferenceSourceRegistry
logger = logging.getLogger(__name__)
@@ -19,19 +20,23 @@
class ReferenceFetcher:
"""Fetch and cache references from various sources.
- Currently supports:
+ Uses a plugin architecture to support multiple reference types:
- PMID (PubMed IDs)
- DOI (Digital Object Identifiers via Crossref API)
-
- Future support planned for:
- - URLs
- - Other databases
+ - file (local files)
+ - url (web URLs)
Examples:
>>> config = ReferenceValidationConfig()
>>> fetcher = ReferenceFetcher(config)
>>> # This would fetch from NCBI in real usage
>>> # ref = fetcher.fetch("PMID:12345678")
+
+ >>> # Local file support
+ >>> # ref = fetcher.fetch("file:./research/notes.md")
+
+ >>> # URL support
+ >>> # ref = fetcher.fetch("url:https://example.com/paper.html")
"""
def __init__(self, config: ReferenceValidationConfig):
@@ -48,15 +53,17 @@ def __init__(self, config: ReferenceValidationConfig):
"""
self.config = config
self._cache: dict[str, ReferenceContent] = {}
- Entrez.email = config.email # type: ignore
- def fetch(self, reference_id: str, force_refresh: bool = False) -> Optional[ReferenceContent]:
+ def fetch(
+ self, reference_id: str, force_refresh: bool = False
+ ) -> Optional[ReferenceContent]:
"""Fetch a reference by ID.
Supports various ID formats:
- PMID:12345678
- DOI:10.xxxx/yyyy
- - URL:https://... (future)
+ - file:./path/to/file.md
+ - url:https://example.com
Args:
reference_id: The reference identifier
@@ -70,28 +77,30 @@ def fetch(self, reference_id: str, force_refresh: bool = False) -> Optional[Refe
>>> fetcher = ReferenceFetcher(config)
>>> # Would fetch in real usage:
>>> # ref = fetcher.fetch("PMID:12345678")
+ >>> # ref = fetcher.fetch("file:./notes.md")
"""
+ # Check memory cache
if not force_refresh and reference_id in self._cache:
return self._cache[reference_id]
+ # Check disk cache
if not force_refresh:
cached = self._load_from_disk(reference_id)
if cached:
self._cache[reference_id] = cached
return cached
- prefix, identifier = self._parse_reference_id(reference_id)
-
- if prefix == "PMID":
- content = self._fetch_pmid(identifier)
- elif prefix == "DOI":
- content = self._fetch_doi(identifier)
- elif prefix == "URL":
- content = self._fetch_url(identifier)
- else:
- logger.warning(f"Unsupported reference type: {prefix}")
+ # Find appropriate source using registry
+ source_class = ReferenceSourceRegistry.get_source(reference_id)
+ if not source_class:
+ logger.warning(f"No source found for reference type: {reference_id}")
return None
+ # Parse identifier and fetch
+ _, identifier = self._parse_reference_id(reference_id)
+ source = source_class()
+ content = source.fetch(identifier, self.config)
+
if content:
self._cache[reference_id] = content
self._save_to_disk(content)
@@ -116,415 +125,24 @@ def _parse_reference_id(self, reference_id: str) -> tuple[str, str]:
('PMID', '12345678')
>>> fetcher._parse_reference_id("12345678")
('PMID', '12345678')
- >>> fetcher._parse_reference_id("URL:https://example.com/book/chapter1")
- ('URL', 'https://example.com/book/chapter1')
- >>> fetcher._parse_reference_id("https://example.com/direct")
- ('URL', 'https://example.com/direct')
+ >>> fetcher._parse_reference_id("file:./test.md")
+ ('file', './test.md')
+ >>> fetcher._parse_reference_id("url:https://example.com/page")
+ ('url', 'https://example.com/page')
"""
stripped = reference_id.strip()
-
- # Check if it's a direct URL (starts with http or https)
- if stripped.startswith(('http://', 'https://')):
- return "URL", stripped
-
+
# Standard prefix:identifier format
match = re.match(r"^([A-Za-z_]+)[:\s]+(.+)$", stripped)
if match:
- return match.group(1).upper(), match.group(2).strip()
-
- # Plain numeric ID defaults to PMID
- if stripped.isdigit():
- return "PMID", stripped
-
- return "UNKNOWN", stripped
-
- def _fetch_pmid(self, pmid: str) -> Optional[ReferenceContent]:
- """Fetch a publication from PubMed by PMID.
-
- Args:
- pmid: PubMed ID (without prefix)
-
- Returns:
- ReferenceContent if successful, None otherwise
- """
- time.sleep(self.config.rate_limit_delay)
-
- try:
- handle = Entrez.esummary(db="pubmed", id=pmid)
- records = Entrez.read(handle)
- handle.close()
-
- if not records:
- logger.warning(f"No records found for PMID:{pmid}")
- return None
-
- record = records[0] if isinstance(records, list) else records
-
- title = record.get("Title", "")
- authors = self._parse_authors(record.get("AuthorList", []))
- journal = record.get("Source", "")
- year = record.get("PubDate", "")[:4] if record.get("PubDate") else ""
- doi = record.get("DOI", "")
-
- abstract = self._fetch_abstract(pmid)
- full_text, content_type = self._fetch_pmc_fulltext(pmid)
-
- if full_text:
- content: Optional[str] = f"{abstract}\n\n{full_text}" if abstract else full_text
- else:
- content = abstract
- content_type = "abstract_only" if abstract else "unavailable"
-
- return ReferenceContent(
- reference_id=f"PMID:{pmid}",
- title=title,
- content=content,
- content_type=content_type,
- authors=authors,
- journal=journal,
- year=year,
- doi=doi,
- )
-
- except Exception as e:
- logger.error(f"Error fetching PMID:{pmid}: {e}")
- return None
-
- def _fetch_doi(self, doi: str) -> Optional[ReferenceContent]:
- """Fetch a publication from Crossref by DOI.
-
- Uses the Crossref API (https://api.crossref.org) to fetch metadata
- for a DOI.
-
- Args:
- doi: Digital Object Identifier (without prefix)
-
- Returns:
- ReferenceContent if successful, None otherwise
-
- Examples:
- >>> config = ReferenceValidationConfig()
- >>> fetcher = ReferenceFetcher(config)
- >>> # Would fetch in real usage:
- >>> # ref = fetcher._fetch_doi("10.1234/test")
- """
- time.sleep(self.config.rate_limit_delay)
-
- url = f"https://api.crossref.org/works/{doi}"
- headers = {
- "User-Agent": f"linkml-reference-validator/1.0 (mailto:{self.config.email})",
- }
-
- response = requests.get(url, headers=headers, timeout=30)
- if response.status_code != 200:
- logger.warning(f"Failed to fetch DOI:{doi} - status {response.status_code}")
- return None
-
- data = response.json()
- if data.get("status") != "ok":
- logger.warning(f"Crossref API error for DOI:{doi}")
- return None
-
- message = data.get("message", {})
-
- title_list = message.get("title", [])
- title = title_list[0] if title_list else ""
-
- authors = self._parse_crossref_authors(message.get("author", []))
-
- container_title = message.get("container-title", [])
- journal = container_title[0] if container_title else ""
-
- year = self._extract_crossref_year(message)
-
- abstract = self._clean_abstract(message.get("abstract", ""))
-
- return ReferenceContent(
- reference_id=f"DOI:{doi}",
- title=title,
- content=abstract if abstract else None,
- content_type="abstract_only" if abstract else "unavailable",
- authors=authors,
- journal=journal,
- year=year,
- doi=doi,
- )
-
- def _fetch_url(self, url: str) -> Optional[ReferenceContent]:
- """Fetch content from a URL.
-
- Fetches web content, extracts title and converts HTML to text.
- Intended for static pages like book chapters.
-
- Args:
- url: The URL to fetch
-
- Returns:
- ReferenceContent if successful, None otherwise
-
- Examples:
- >>> config = ReferenceValidationConfig()
- >>> fetcher = ReferenceFetcher(config)
- >>> # Would fetch in real usage:
- >>> # ref = fetcher._fetch_url("https://example.com/book/chapter1")
- """
- time.sleep(self.config.rate_limit_delay)
-
- headers = {
- "User-Agent": f"linkml-reference-validator/1.0 (mailto:{self.config.email})",
- }
-
- try:
- response = requests.get(url, headers=headers, timeout=30)
- if response.status_code != 200:
- logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
- return None
-
- soup = BeautifulSoup(response.text, "html.parser")
-
- # Extract title
- title_tag = soup.find("title")
- title = title_tag.get_text().strip() if title_tag else None
-
- # Convert HTML to text
- # Remove script and style elements
- for script in soup(["script", "style", "nav", "header", "footer"]):
- script.decompose()
-
- # Get text content
- content = soup.get_text()
-
- # Clean up text - normalize whitespace
- lines = (line.strip() for line in content.splitlines())
- content = "\n".join(line for line in lines if line)
-
- return ReferenceContent(
- reference_id=f"URL:{url}",
- title=title,
- content=content if content else None,
- content_type="html_converted",
- )
-
- except Exception as e:
- logger.error(f"Error fetching URL:{url}: {e}")
- return None
-
- def _parse_crossref_authors(self, authors: list) -> list[str]:
- """Parse author list from Crossref response.
-
- Args:
- authors: List of author dicts from Crossref
-
- Returns:
- List of formatted author names
-
- Examples:
- >>> config = ReferenceValidationConfig()
- >>> fetcher = ReferenceFetcher(config)
- >>> fetcher._parse_crossref_authors([{"given": "John", "family": "Smith"}])
- ['John Smith']
- >>> fetcher._parse_crossref_authors([{"family": "Smith"}])
- ['Smith']
- """
- result = []
- for author in authors:
- given = author.get("given", "")
- family = author.get("family", "")
- if given and family:
- result.append(f"{given} {family}")
- elif family:
- result.append(family)
- elif given:
- result.append(given)
- return result
-
- def _extract_crossref_year(self, message: dict) -> str:
- """Extract publication year from Crossref message.
-
- Tries multiple date fields in order of preference.
-
- Args:
- message: Crossref message dict
-
- Returns:
- Year as string, or empty string if not found
-
- Examples:
- >>> config = ReferenceValidationConfig()
- >>> fetcher = ReferenceFetcher(config)
- >>> fetcher._extract_crossref_year({"published-print": {"date-parts": [[2024, 1, 15]]}})
- '2024'
- >>> fetcher._extract_crossref_year({"published-online": {"date-parts": [[2023]]}})
- '2023'
- """
- for date_field in ["published-print", "published-online", "created", "issued"]:
- date_info = message.get(date_field, {})
- date_parts = date_info.get("date-parts", [[]])
- if date_parts and date_parts[0]:
- return str(date_parts[0][0])
- return ""
-
- def _clean_abstract(self, abstract: str) -> str:
- """Clean JATS/XML markup from abstract text.
-
- Args:
- abstract: Abstract text potentially containing JATS markup
-
- Returns:
- Clean abstract text
-
- Examples:
- >>> config = ReferenceValidationConfig()
- >>> fetcher = ReferenceFetcher(config)
- >>> fetcher._clean_abstract("Test abstract.")
- 'Test abstract.'
- """
- if not abstract:
- return ""
- soup = BeautifulSoup(abstract, "html.parser")
- return soup.get_text().strip()
-
- def _parse_authors(self, author_list: list) -> list[str]:
- """Parse author list from Entrez record.
-
- Args:
- author_list: List of author names from Entrez
-
- Returns:
- List of formatted author names
-
- Examples:
- >>> config = ReferenceValidationConfig()
- >>> fetcher = ReferenceFetcher(config)
- >>> fetcher._parse_authors(["Smith J", "Doe A"])
- ['Smith J', 'Doe A']
- """
- return [str(author) for author in author_list if author]
-
- def _fetch_abstract(self, pmid: str) -> Optional[str]:
- """Fetch abstract for a PMID.
-
- Args:
- pmid: PubMed ID
-
- Returns:
- Abstract text if available
- """
- time.sleep(self.config.rate_limit_delay)
-
- handle = Entrez.efetch(db="pubmed", id=pmid, rettype="abstract", retmode="text")
- abstract_text = handle.read()
- handle.close()
-
- if abstract_text and len(abstract_text) > 50:
- return str(abstract_text)
-
- return None
-
- def _fetch_pmc_fulltext(self, pmid: str) -> tuple[Optional[str], str]:
- """Attempt to fetch full text from PMC.
-
- Args:
- pmid: PubMed ID
-
- Returns:
- Tuple of (full_text, content_type)
- """
- pmcid = self._get_pmcid(pmid)
- if not pmcid:
- return None, "no_pmc"
-
- full_text = self._fetch_pmc_xml(pmcid)
- if full_text and len(full_text) > 1000:
- return full_text, "full_text_xml"
-
- full_text = self._fetch_pmc_html(pmcid)
- if full_text and len(full_text) > 1000:
- return full_text, "full_text_html"
-
- return None, "pmc_restricted"
-
- def _get_pmcid(self, pmid: str) -> Optional[str]:
- """Get PMC ID for a PubMed ID.
-
- Args:
- pmid: PubMed ID
-
- Returns:
- PMC ID if available
- """
- time.sleep(self.config.rate_limit_delay)
-
- handle = Entrez.elink(dbfrom="pubmed", db="pmc", id=pmid, linkname="pubmed_pmc")
- result = Entrez.read(handle)
- handle.close()
-
- if result and result[0].get("LinkSetDb"):
- links = result[0]["LinkSetDb"][0].get("Link", [])
- if links:
- return links[0]["Id"]
-
- return None
-
- def _fetch_pmc_xml(self, pmcid: str) -> Optional[str]:
- """Fetch full text from PMC XML API.
-
- Args:
- pmcid: PMC ID
-
- Returns:
- Extracted text from XML
- """
- time.sleep(self.config.rate_limit_delay)
-
- handle = Entrez.efetch(db="pmc", id=pmcid, rettype="xml", retmode="xml")
- xml_content = handle.read()
- handle.close()
-
- if isinstance(xml_content, bytes):
- xml_content = xml_content.decode("utf-8")
-
- if "cannot be obtained" in xml_content.lower() or "restricted" in xml_content.lower():
- return None
-
- soup = BeautifulSoup(xml_content, "xml")
- body = soup.find("body")
-
- if body:
- paragraphs = body.find_all("p")
- if paragraphs:
- text = "\n\n".join(p.get_text() for p in paragraphs)
- return text
-
- return None
-
- def _fetch_pmc_html(self, pmcid: str) -> Optional[str]:
- """Fetch full text from PMC HTML as fallback.
-
- Args:
- pmcid: PMC ID
-
- Returns:
- Extracted text from HTML
- """
- time.sleep(self.config.rate_limit_delay)
-
- url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmcid}/"
-
- response = requests.get(url, timeout=30)
- if response.status_code != 200:
- return None
-
- soup = BeautifulSoup(response.content, "html.parser")
- article_body = soup.find("div", class_="article-body") or soup.find("div", class_="tsec")
-
- if article_body:
- paragraphs = article_body.find_all("p")
- if paragraphs:
- text = "\n\n".join(p.get_text() for p in paragraphs)
- return text
-
- return None
+ prefix = match.group(1)
+ # Preserve case for file/url, uppercase for others
+ if prefix.lower() not in ("file", "url"):
+ prefix = prefix.upper()
+ return prefix, match.group(2).strip()
+        if stripped.isdigit():
+            return "PMID", stripped
+        return "UNKNOWN", stripped
def _get_cache_path(self, reference_id: str) -> Path:
"""Get the cache file path for a reference.
@@ -541,9 +159,9 @@ def _get_cache_path(self, reference_id: str) -> Path:
>>> path = fetcher._get_cache_path("PMID:12345678")
>>> path.name
'PMID_12345678.md'
- >>> path = fetcher._get_cache_path("URL:https://example.com/book/chapter1")
+ >>> path = fetcher._get_cache_path("url:https://example.com/book/chapter1")
>>> path.name
- 'URL_https___example.com_book_chapter1.md'
+ 'url_https___example.com_book_chapter1.md'
"""
safe_id = reference_id.replace(":", "_").replace("/", "_").replace("?", "_").replace("=", "_")
cache_dir = self.config.get_cache_dir()
@@ -589,12 +207,12 @@ def _quote_yaml_value(self, value: str) -> str:
# Check for values that YAML might misinterpret
lower_value = value.lower()
- if lower_value in ('true', 'false', 'yes', 'no', 'on', 'off', 'null', '~'):
+ if lower_value in ("true", "false", "yes", "no", "on", "off", "null", "~"):
needs_quote = True
if needs_quote:
# Escape any existing double quotes and wrap in double quotes
- escaped = value.replace('\\', '\\\\').replace('"', '\\"')
+ escaped = value.replace("\\", "\\\\").replace('"', '\\"')
return f'"{escaped}"'
return value
@@ -636,7 +254,9 @@ def _save_to_disk(self, reference: ReferenceContent) -> None:
journal_info += f" ({reference.year})"
lines.append(f"**Journal:** {journal_info}")
if reference.doi:
- lines.append(f"**DOI:** [{reference.doi}](https://doi.org/{reference.doi})")
+ lines.append(
+ f"**DOI:** [{reference.doi}](https://doi.org/{reference.doi})"
+ )
lines.append("")
lines.append("## Content")
lines.append("")
@@ -674,7 +294,9 @@ def _load_from_disk(self, reference_id: str) -> Optional[ReferenceContent]:
else:
return self._load_legacy_format(content_text, reference_id)
- def _load_markdown_format(self, content_text: str, reference_id: str) -> Optional[ReferenceContent]:
+ def _load_markdown_format(
+ self, content_text: str, reference_id: str
+ ) -> Optional[ReferenceContent]:
"""Load reference from markdown format with YAML frontmatter.
Args:
@@ -741,7 +363,9 @@ def _extract_content_from_markdown(self, body: str) -> str:
return body
- def _load_legacy_format(self, content_text: str, reference_id: str) -> Optional[ReferenceContent]:
+ def _load_legacy_format(
+ self, content_text: str, reference_id: str
+ ) -> Optional[ReferenceContent]:
"""Load reference from legacy text format.
Args:
@@ -764,9 +388,15 @@ def _load_legacy_format(self, content_text: str, reference_id: str) -> Optional[
key, value = line.split(":", 1)
metadata[key.strip()] = value.strip()
- content = "\n".join(lines[content_start:]).strip() if content_start < len(lines) else None
+ content = (
+ "\n".join(lines[content_start:]).strip()
+ if content_start < len(lines)
+ else None
+ )
- authors = metadata.get("Authors", "").split(", ") if metadata.get("Authors") else None
+ authors = (
+ metadata.get("Authors", "").split(", ") if metadata.get("Authors") else None
+ )
return ReferenceContent(
reference_id=metadata.get("ID", reference_id),
diff --git a/src/linkml_reference_validator/etl/sources/__init__.py b/src/linkml_reference_validator/etl/sources/__init__.py
new file mode 100644
index 0000000..1483397
--- /dev/null
+++ b/src/linkml_reference_validator/etl/sources/__init__.py
@@ -0,0 +1,31 @@
+"""Reference source plugins.
+
+This package provides pluggable reference sources for fetching content
+from various origins (PubMed, Crossref, local files, URLs).
+
+Examples:
+ >>> from linkml_reference_validator.etl.sources import ReferenceSourceRegistry
+ >>> sources = ReferenceSourceRegistry.list_sources()
+ >>> len(sources) >= 4
+ True
+"""
+
+from linkml_reference_validator.etl.sources.base import (
+ ReferenceSource,
+ ReferenceSourceRegistry,
+)
+
+# Import sources to register them
+from linkml_reference_validator.etl.sources.pmid import PMIDSource
+from linkml_reference_validator.etl.sources.doi import DOISource
+from linkml_reference_validator.etl.sources.file import FileSource
+from linkml_reference_validator.etl.sources.url import URLSource
+
+__all__ = [
+ "ReferenceSource",
+ "ReferenceSourceRegistry",
+ "PMIDSource",
+ "DOISource",
+ "FileSource",
+ "URLSource",
+]
diff --git a/src/linkml_reference_validator/etl/sources/base.py b/src/linkml_reference_validator/etl/sources/base.py
new file mode 100644
index 0000000..daa32e0
--- /dev/null
+++ b/src/linkml_reference_validator/etl/sources/base.py
@@ -0,0 +1,186 @@
+"""Base class and registry for reference sources.
+
+This module provides the plugin architecture for fetching reference content
+from various sources (PMID, DOI, local files, URLs, etc.).
+
+Examples:
+ >>> from linkml_reference_validator.etl.sources.base import ReferenceSourceRegistry
+ >>> source = ReferenceSourceRegistry.get_source("PMID:12345678")
+ >>> source.prefix()
+ 'PMID'
+"""
+
+import logging
+import re
+from abc import ABC, abstractmethod
+from typing import Optional
+
+from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
+
+logger = logging.getLogger(__name__)
+
+
+class ReferenceSource(ABC):
+ """Abstract base class for reference content sources.
+
+ Subclasses must implement:
+ - prefix(): Return the prefix this source handles (e.g., 'PMID', 'DOI')
+ - fetch(): Fetch content for a given identifier
+
+ Examples:
+ >>> class MySource(ReferenceSource):
+ ... @classmethod
+ ... def prefix(cls) -> str:
+ ... return "MY"
+ ... def fetch(self, identifier, config):
+ ... return None
+ >>> MySource.prefix()
+ 'MY'
+ """
+
+ @classmethod
+ @abstractmethod
+ def prefix(cls) -> str:
+ """Return the prefix this source handles.
+
+ Returns:
+ The prefix string (e.g., 'PMID', 'DOI', 'file', 'url')
+
+ Examples:
+ >>> from linkml_reference_validator.etl.sources.pmid import PMIDSource
+ >>> PMIDSource.prefix()
+ 'PMID'
+ """
+ ...
+
+ @classmethod
+ def can_handle(cls, reference_id: str) -> bool:
+ """Check if this source can handle the given reference ID.
+
+ Default implementation checks if reference_id starts with the prefix.
+
+ Args:
+ reference_id: The full reference ID (e.g., 'PMID:12345678')
+
+ Returns:
+ True if this source can handle the reference
+
+ Examples:
+ >>> from linkml_reference_validator.etl.sources.pmid import PMIDSource
+ >>> PMIDSource.can_handle("PMID:12345678")
+ True
+ >>> PMIDSource.can_handle("DOI:10.1234/test")
+ False
+ """
+ prefix = cls.prefix()
+ pattern = rf"^{re.escape(prefix)}[:\s]"
+ return bool(re.match(pattern, reference_id, re.IGNORECASE))
+
+ @abstractmethod
+ def fetch(
+ self, identifier: str, config: ReferenceValidationConfig
+ ) -> Optional[ReferenceContent]:
+ """Fetch content for the given identifier.
+
+ Args:
+ identifier: The identifier without prefix (e.g., '12345678' for PMID)
+ config: Configuration for fetching
+
+ Returns:
+ ReferenceContent if successful, None otherwise
+ """
+ ...
+
+
+class ReferenceSourceRegistry:
+ """Registry of available reference sources.
+
+ Sources are registered automatically when their modules are imported.
+ The registry is used by ReferenceFetcher to dispatch to the appropriate source.
+
+ Examples:
+ >>> from linkml_reference_validator.etl.sources.base import ReferenceSourceRegistry
+ >>> sources = ReferenceSourceRegistry.list_sources()
+ >>> len(sources) >= 4 # PMID, DOI, file, url
+ True
+ """
+
+ _sources: list[type[ReferenceSource]] = []
+
+ @classmethod
+ def register(cls, source_class: type[ReferenceSource]) -> type[ReferenceSource]:
+ """Register a source class.
+
+ Can be used as a decorator.
+
+ Args:
+ source_class: The source class to register
+
+ Returns:
+ The source class (for decorator usage)
+
+ Examples:
+ >>> from linkml_reference_validator.etl.sources.base import (
+ ... ReferenceSource, ReferenceSourceRegistry
+ ... )
+ >>> @ReferenceSourceRegistry.register
+ ... class TestSource(ReferenceSource):
+ ... @classmethod
+ ... def prefix(cls) -> str:
+ ... return "TEST"
+ ... def fetch(self, identifier, config):
+ ... return None
+ >>> "TEST" in [s.prefix() for s in ReferenceSourceRegistry.list_sources()]
+ True
+ """
+ if source_class not in cls._sources:
+ cls._sources.append(source_class)
+ logger.debug(f"Registered source: {source_class.prefix()}")
+ return source_class
+
+ @classmethod
+ def get_source(cls, reference_id: str) -> Optional[type[ReferenceSource]]:
+ """Find a source that can handle the given reference ID.
+
+ Args:
+ reference_id: The full reference ID (e.g., 'PMID:12345678')
+
+ Returns:
+ The source class if found, None otherwise
+
+ Examples:
+ >>> from linkml_reference_validator.etl.sources.base import ReferenceSourceRegistry
+ >>> ReferenceSourceRegistry.get_source("UNKNOWN:xyz") is None
+ True
+ """
+ for source_class in cls._sources:
+ if source_class.can_handle(reference_id):
+ return source_class
+ return None
+
+ @classmethod
+ def list_sources(cls) -> list[type[ReferenceSource]]:
+ """List all registered sources.
+
+ Returns:
+ List of registered source classes
+
+ Examples:
+ >>> from linkml_reference_validator.etl.sources import ReferenceSourceRegistry
+ >>> sources = ReferenceSourceRegistry.list_sources()
+ >>> isinstance(sources, list)
+ True
+ """
+ return list(cls._sources)
+
+ @classmethod
+ def clear(cls) -> None:
+ """Clear all registered sources (mainly for testing).
+
+ Examples:
+ >>> from linkml_reference_validator.etl.sources.base import ReferenceSourceRegistry
+ >>> ReferenceSourceRegistry.clear()
+ >>> len(ReferenceSourceRegistry._sources)
+ 0
+ """
+ cls._sources = []
diff --git a/src/linkml_reference_validator/etl/sources/doi.py b/src/linkml_reference_validator/etl/sources/doi.py
new file mode 100644
index 0000000..be52ab8
--- /dev/null
+++ b/src/linkml_reference_validator/etl/sources/doi.py
@@ -0,0 +1,182 @@
+"""DOI (Digital Object Identifier) reference source.
+
+Fetches publication metadata from Crossref API.
+
+Examples:
+ >>> from linkml_reference_validator.etl.sources.doi import DOISource
+ >>> DOISource.prefix()
+ 'DOI'
+ >>> DOISource.can_handle("DOI:10.1234/test")
+ True
+"""
+
+import logging
+import time
+from typing import Optional
+
+from bs4 import BeautifulSoup # type: ignore
+import requests # type: ignore
+
+from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
+from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry
+
+logger = logging.getLogger(__name__)
+
+
+@ReferenceSourceRegistry.register
+class DOISource(ReferenceSource):
+ """Fetch references from Crossref using DOI.
+
+ Uses the Crossref API (https://api.crossref.org) to fetch publication metadata.
+
+ Examples:
+ >>> source = DOISource()
+ >>> source.prefix()
+ 'DOI'
+ >>> source.can_handle("DOI:10.1234/test")
+ True
+ """
+
+ @classmethod
+ def prefix(cls) -> str:
+ """Return 'DOI' prefix.
+
+ Examples:
+ >>> DOISource.prefix()
+ 'DOI'
+ """
+ return "DOI"
+
+ def fetch(
+ self, identifier: str, config: ReferenceValidationConfig
+ ) -> Optional[ReferenceContent]:
+ """Fetch a publication from Crossref by DOI.
+
+ Args:
+ identifier: DOI (without prefix)
+ config: Configuration including rate limiting and email
+
+ Returns:
+ ReferenceContent if successful, None otherwise
+
+ Examples:
+ >>> from linkml_reference_validator.models import ReferenceValidationConfig
+ >>> config = ReferenceValidationConfig()
+ >>> source = DOISource()
+ >>> # Would fetch in real usage:
+ >>> # ref = source.fetch("10.1234/test", config)
+ """
+ doi = identifier.strip()
+ time.sleep(config.rate_limit_delay)
+
+ url = f"https://api.crossref.org/works/{doi}"
+ headers = {
+ "User-Agent": f"linkml-reference-validator/1.0 (mailto:{config.email})",
+ }
+
+ response = requests.get(url, headers=headers, timeout=30)
+ if response.status_code != 200:
+ logger.warning(f"Failed to fetch DOI:{doi} - status {response.status_code}")
+ return None
+
+ data = response.json()
+ if data.get("status") != "ok":
+ logger.warning(f"Crossref API error for DOI:{doi}")
+ return None
+
+ message = data.get("message", {})
+
+ title_list = message.get("title", [])
+ title = title_list[0] if title_list else ""
+
+ authors = self._parse_crossref_authors(message.get("author", []))
+
+ container_title = message.get("container-title", [])
+ journal = container_title[0] if container_title else ""
+
+ year = self._extract_crossref_year(message)
+
+ abstract = self._clean_abstract(message.get("abstract", ""))
+
+ return ReferenceContent(
+ reference_id=f"DOI:{doi}",
+ title=title,
+ content=abstract if abstract else None,
+ content_type="abstract_only" if abstract else "unavailable",
+ authors=authors,
+ journal=journal,
+ year=year,
+ doi=doi,
+ )
+
+ def _parse_crossref_authors(self, authors: list) -> list[str]:
+ """Parse author list from Crossref response.
+
+ Args:
+ authors: List of author dicts from Crossref
+
+ Returns:
+ List of formatted author names
+
+ Examples:
+ >>> source = DOISource()
+ >>> source._parse_crossref_authors([{"given": "John", "family": "Smith"}])
+ ['John Smith']
+ >>> source._parse_crossref_authors([{"family": "Smith"}])
+ ['Smith']
+ """
+ result = []
+ for author in authors:
+ given = author.get("given", "")
+ family = author.get("family", "")
+ if given and family:
+ result.append(f"{given} {family}")
+ elif family:
+ result.append(family)
+ elif given:
+ result.append(given)
+ return result
+
+ def _extract_crossref_year(self, message: dict) -> str:
+ """Extract publication year from Crossref message.
+
+ Tries multiple date fields in order of preference.
+
+ Args:
+ message: Crossref message dict
+
+ Returns:
+ Year as string, or empty string if not found
+
+ Examples:
+ >>> source = DOISource()
+ >>> source._extract_crossref_year({"published-print": {"date-parts": [[2024, 1, 15]]}})
+ '2024'
+ >>> source._extract_crossref_year({"published-online": {"date-parts": [[2023]]}})
+ '2023'
+ """
+ for date_field in ["published-print", "published-online", "created", "issued"]:
+ date_info = message.get(date_field, {})
+ date_parts = date_info.get("date-parts", [[]])
+ if date_parts and date_parts[0]:
+ return str(date_parts[0][0])
+ return ""
+
+ def _clean_abstract(self, abstract: str) -> str:
+ """Clean JATS/XML markup from abstract text.
+
+ Args:
+ abstract: Abstract text potentially containing JATS markup
+
+ Returns:
+ Clean abstract text
+
+ Examples:
+ >>> source = DOISource()
+ >>> source._clean_abstract("Test abstract.")
+ 'Test abstract.'
+ """
+ if not abstract:
+ return ""
+ soup = BeautifulSoup(abstract, "html.parser")
+ return soup.get_text().strip()
diff --git a/src/linkml_reference_validator/etl/sources/file.py b/src/linkml_reference_validator/etl/sources/file.py
new file mode 100644
index 0000000..fbb61b6
--- /dev/null
+++ b/src/linkml_reference_validator/etl/sources/file.py
@@ -0,0 +1,153 @@
+"""Local file reference source.
+
+Reads content from local files (markdown, text, HTML).
+
+Examples:
+ >>> from linkml_reference_validator.etl.sources.file import FileSource
+ >>> FileSource.prefix()
+ 'file'
+ >>> FileSource.can_handle("file:./notes.md")
+ True
+"""
+
+import logging
+import re
+from pathlib import Path
+from typing import Optional
+
+from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
+from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry
+
+logger = logging.getLogger(__name__)
+
+
+@ReferenceSourceRegistry.register
+class FileSource(ReferenceSource):
+ """Fetch reference content from local files.
+
+ Supports markdown (.md), plain text (.txt), and HTML (.html) files.
+ Content is read as-is without parsing (HTML entities preserved).
+
+ Path resolution:
+ - Absolute paths work directly
+ - Relative paths use reference_base_dir from config if set, otherwise CWD
+
+ Examples:
+ >>> source = FileSource()
+ >>> source.prefix()
+ 'file'
+ >>> source.can_handle("file:./notes.md")
+ True
+ """
+
+ @classmethod
+ def prefix(cls) -> str:
+ """Return 'file' prefix.
+
+ Examples:
+ >>> FileSource.prefix()
+ 'file'
+ """
+ return "file"
+
+ def fetch(
+ self, identifier: str, config: ReferenceValidationConfig
+ ) -> Optional[ReferenceContent]:
+ """Read content from a local file.
+
+ Args:
+ identifier: File path (without 'file:' prefix)
+ config: Configuration including reference_base_dir
+
+ Returns:
+ ReferenceContent if file exists and is readable, None otherwise
+
+ Examples:
+ >>> from linkml_reference_validator.models import ReferenceValidationConfig
+ >>> config = ReferenceValidationConfig()
+ >>> source = FileSource()
+ >>> # Would read file in real usage:
+ >>> # ref = source.fetch("./notes.md", config)
+ """
+ file_path = self._resolve_path(identifier, config)
+
+ if not file_path.exists():
+ logger.warning(f"File not found: {file_path}")
+ return None
+
+ if not file_path.is_file():
+ logger.warning(f"Not a file: {file_path}")
+ return None
+
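+ # Read as UTF-8 and keep content verbatim; the title comes from the first heading or the filename.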
+ try:
+ content = file_path.read_text(encoding="utf-8")
+ except (OSError, UnicodeDecodeError) as e:
+ logger.warning(f"Failed to read {file_path}: {e}")
+ return None
+ title = self._extract_title(content, file_path)
+
+ return ReferenceContent(
+ reference_id=f"file:{file_path}",
+ title=title,
+ content=content,
+ content_type="local_file",
+ )
+
+ def _resolve_path(self, identifier: str, config: ReferenceValidationConfig) -> Path:
+ """Resolve a file path from identifier.
+
+ - Absolute paths are used directly
+ - Relative paths use reference_base_dir if set, otherwise CWD
+
+ Args:
+ identifier: File path string
+ config: Configuration with optional reference_base_dir
+
+ Returns:
+ Resolved Path object
+
+ Examples:
+ >>> from linkml_reference_validator.models import ReferenceValidationConfig
+ >>> source = FileSource()
+ >>> config = ReferenceValidationConfig()
+ >>> # Absolute path stays absolute
+ >>> p = source._resolve_path("/tmp/test.md", config)
+ >>> p.is_absolute()
+ True
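+ >>> # With reference_base_dir set, relative paths resolve under it (POSIX shown):
+ >>> cfg = ReferenceValidationConfig(reference_base_dir=Path("/refs"))
+ >>> str(source._resolve_path("notes.md", cfg))
+ '/refs/notes.md'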
+ """
+ path = Path(identifier)
+
+ # Absolute paths are used directly
+ if path.is_absolute():
+ return path
+
+ # Relative paths: use base_dir if set, otherwise CWD
+ base_dir = getattr(config, "reference_base_dir", None)
+ if base_dir is not None:
+ return Path(base_dir) / path
+ else:
+ return Path.cwd() / path
+
+ def _extract_title(self, content: str, file_path: Path) -> str:
+ """Extract title from content or use filename.
+
+ For markdown files, extracts the first # heading.
+ Falls back to filename otherwise.
+
+ Args:
+ content: File content
+ file_path: Path to file
+
+ Returns:
+ Extracted title or filename
+
+ Examples:
+ >>> source = FileSource()
+ >>> source._extract_title("# My Title\\n\\nContent", Path("test.md"))
+ 'My Title'
+ >>> source._extract_title("No heading here", Path("notes.txt"))
+ 'notes.txt'
+ """
+ # Look for markdown heading
+ match = re.search(r"^#\s+(.+)$", content, re.MULTILINE)
+ if match:
+ return match.group(1).strip()
+
+ # Fall back to filename
+ return file_path.name
diff --git a/src/linkml_reference_validator/etl/sources/pmid.py b/src/linkml_reference_validator/etl/sources/pmid.py
new file mode 100644
index 0000000..6dca297
--- /dev/null
+++ b/src/linkml_reference_validator/etl/sources/pmid.py
@@ -0,0 +1,301 @@
+"""PMID (PubMed ID) reference source.
+
+Fetches publication content from PubMed/NCBI using the Entrez API.
+
+Examples:
+ >>> from linkml_reference_validator.etl.sources.pmid import PMIDSource
+ >>> PMIDSource.prefix()
+ 'PMID'
+ >>> PMIDSource.can_handle("PMID:12345678")
+ True
+"""
+
+import logging
+import re
+import time
+from typing import Optional
+
+from Bio import Entrez # type: ignore
+from bs4 import BeautifulSoup # type: ignore
+import requests # type: ignore
+
+from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
+from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry
+
+logger = logging.getLogger(__name__)
+
+
+@ReferenceSourceRegistry.register
+class PMIDSource(ReferenceSource):
+ """Fetch references from PubMed using PMID.
+
+ Uses the NCBI Entrez API to fetch publication metadata and content.
+
+ Examples:
+ >>> source = PMIDSource()
+ >>> source.prefix()
+ 'PMID'
+ >>> source.can_handle("PMID:12345678")
+ True
+ >>> source.can_handle("PMID 12345678")
+ True
+ """
+
+ @classmethod
+ def prefix(cls) -> str:
+ """Return 'PMID' prefix.
+
+ Examples:
+ >>> PMIDSource.prefix()
+ 'PMID'
+ """
+ return "PMID"
+
+ @classmethod
+ def can_handle(cls, reference_id: str) -> bool:
+ """Check if this is a PMID reference.
+
+ Handles formats:
+ - PMID:12345678
+ - PMID 12345678
+ - Plain digits (assumed to be PMID)
+
+ Examples:
+ >>> PMIDSource.can_handle("PMID:12345678")
+ True
+ >>> PMIDSource.can_handle("PMID 12345678")
+ True
+ >>> PMIDSource.can_handle("12345678")
+ True
+ >>> PMIDSource.can_handle("DOI:10.1234/test")
+ False
+ """
+ reference_id = reference_id.strip()
+ # Check for PMID prefix
+ if re.match(r"^PMID[:\s]", reference_id, re.IGNORECASE):
+ return True
+ # Plain digits are assumed to be PMIDs
+ if reference_id.isdigit():
+ return True
+ return False
+
+ def fetch(
+ self, identifier: str, config: ReferenceValidationConfig
+ ) -> Optional[ReferenceContent]:
+ """Fetch a publication from PubMed by PMID.
+
+ Args:
+ identifier: PubMed ID (without prefix)
+ config: Configuration including rate limiting and email
+
+ Returns:
+ ReferenceContent if successful, None otherwise
+
+ Examples:
+ >>> from linkml_reference_validator.models import ReferenceValidationConfig
+ >>> config = ReferenceValidationConfig()
+ >>> source = PMIDSource()
+ >>> # Would fetch in real usage:
+ >>> # ref = source.fetch("12345678", config)
+ """
+ pmid = identifier.strip()
+ Entrez.email = config.email # type: ignore
+
+ time.sleep(config.rate_limit_delay)
+
+ # External API call - handle network/API errors
+ try:
+ handle = Entrez.esummary(db="pubmed", id=pmid)
+ records = Entrez.read(handle)
+ handle.close()
+ except Exception as e:
+ logger.warning(f"Failed to fetch PMID:{pmid} from NCBI: {e}")
+ return None
+
+ if not records:
+ logger.warning(f"No records found for PMID:{pmid}")
+ return None
+
+ record = records[0] if isinstance(records, list) else records
+
+ title = record.get("Title", "")
+ authors = self._parse_authors(record.get("AuthorList", []))
+ journal = record.get("Source", "")
+ year = record.get("PubDate", "")[:4] if record.get("PubDate") else ""
+ doi = record.get("DOI", "")
+
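+ # Fetch the abstract first, then try to upgrade to PMC full text.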
+ try:
+ abstract = self._fetch_abstract(pmid, config)
+ full_text, content_type = self._fetch_pmc_fulltext(pmid, config)
+ except Exception as e:
+ logger.warning(f"Failed to fetch content for PMID:{pmid}: {e}")
+ abstract, full_text = None, None
+
+ if full_text:
+ content: Optional[str] = f"{abstract}\n\n{full_text}" if abstract else full_text
+ else:
+ content = abstract
+ content_type = "abstract_only" if abstract else "unavailable"
+
+ return ReferenceContent(
+ reference_id=f"PMID:{pmid}",
+ title=title,
+ content=content,
+ content_type=content_type,
+ authors=authors,
+ journal=journal,
+ year=year,
+ doi=doi,
+ )
+
+ def _parse_authors(self, author_list: list) -> list[str]:
+ """Parse author list from Entrez record.
+
+ Args:
+ author_list: List of author names from Entrez
+
+ Returns:
+ List of formatted author names
+
+ Examples:
+ >>> source = PMIDSource()
+ >>> source._parse_authors(["Smith J", "Doe A"])
+ ['Smith J', 'Doe A']
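+ >>> source._parse_authors([])
+ []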
+ """
+ return [str(author) for author in author_list if author]
+
+ def _fetch_abstract(
+ self, pmid: str, config: ReferenceValidationConfig
+ ) -> Optional[str]:
+ """Fetch abstract for a PMID.
+
+ Args:
+ pmid: PubMed ID
+ config: Configuration for rate limiting
+
+ Returns:
+ Abstract text if available
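+
+ Examples:
+ >>> source = PMIDSource()
+ >>> # Network call in real usage:
+ >>> # abstract = source._fetch_abstract("12345678", config)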
+ """
+ time.sleep(config.rate_limit_delay)
+
+ handle = Entrez.efetch(db="pubmed", id=pmid, rettype="abstract", retmode="text")
+ abstract_text = handle.read()
+ handle.close()
+
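+ # Heuristic: very short responses are placeholder text, not real abstracts.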
+ if abstract_text and len(abstract_text) > 50:
+ return str(abstract_text)
+
+ return None
+
+ def _fetch_pmc_fulltext(
+ self, pmid: str, config: ReferenceValidationConfig
+ ) -> tuple[Optional[str], str]:
+ """Attempt to fetch full text from PMC.
+
+ Args:
+ pmid: PubMed ID
+ config: Configuration for rate limiting
+
+ Returns:
+ Tuple of (full_text, content_type)
+ """
+ pmcid = self._get_pmcid(pmid, config)
+ if not pmcid:
+ return None, "no_pmc"
+
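+ # Prefer structured JATS XML, then fall back to the article HTML; very short extractions are treated as restricted.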
+ full_text = self._fetch_pmc_xml(pmcid, config)
+ if full_text and len(full_text) > 1000:
+ return full_text, "full_text_xml"
+
+ full_text = self._fetch_pmc_html(pmcid, config)
+ if full_text and len(full_text) > 1000:
+ return full_text, "full_text_html"
+
+ return None, "pmc_restricted"
+
+ def _get_pmcid(self, pmid: str, config: ReferenceValidationConfig) -> Optional[str]:
+ """Get PMC ID for a PubMed ID.
+
+ Args:
+ pmid: PubMed ID
+ config: Configuration for rate limiting
+
+ Returns:
+ PMC ID if available
+ """
+ time.sleep(config.rate_limit_delay)
+
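+ # elink maps a PubMed ID to its PMC counterpart, if one exists.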
+ handle = Entrez.elink(dbfrom="pubmed", db="pmc", id=pmid, linkname="pubmed_pmc")
+ result = Entrez.read(handle)
+ handle.close()
+
+ if result and result[0].get("LinkSetDb"):
+ links = result[0]["LinkSetDb"][0].get("Link", [])
+ if links:
+ return links[0]["Id"]
+
+ return None
+
+ def _fetch_pmc_xml(
+ self, pmcid: str, config: ReferenceValidationConfig
+ ) -> Optional[str]:
+ """Fetch full text from PMC XML API.
+
+ Args:
+ pmcid: PMC ID
+ config: Configuration for rate limiting
+
+ Returns:
+ Extracted text from XML
+ """
+ time.sleep(config.rate_limit_delay)
+
+ handle = Entrez.efetch(db="pmc", id=pmcid, rettype="xml", retmode="xml")
+ xml_content = handle.read()
+ handle.close()
+
+ if isinstance(xml_content, bytes):
+ xml_content = xml_content.decode("utf-8")
+
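+ # PMC returns a stub message instead of an HTTP error for restricted articles.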
+ if "cannot be obtained" in xml_content.lower() or "restricted" in xml_content.lower():
+ return None
+
+ soup = BeautifulSoup(xml_content, "xml")
+ body = soup.find("body")
+
+ if body:
+ paragraphs = body.find_all("p")
+ if paragraphs:
+ text = "\n\n".join(p.get_text() for p in paragraphs)
+ return text
+
+ return None
+
+ def _fetch_pmc_html(
+ self, pmcid: str, config: ReferenceValidationConfig
+ ) -> Optional[str]:
+ """Fetch full text from PMC HTML as fallback.
+
+ Args:
+ pmcid: PMC ID
+ config: Configuration for rate limiting
+
+ Returns:
+ Extracted text from HTML
+ """
+ time.sleep(config.rate_limit_delay)
+
+ url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmcid}/"
+
+ response = requests.get(url, timeout=30)
+ if response.status_code != 200:
+ return None
+
+ soup = BeautifulSoup(response.content, "html.parser")
+ article_body = soup.find("div", class_="article-body") or soup.find(
+ "div", class_="tsec"
+ )
+
+ if article_body:
+ paragraphs = article_body.find_all("p")
+ if paragraphs:
+ text = "\n\n".join(p.get_text() for p in paragraphs)
+ return text
+
+ return None
diff --git a/src/linkml_reference_validator/etl/sources/url.py b/src/linkml_reference_validator/etl/sources/url.py
new file mode 100644
index 0000000..d487fe7
--- /dev/null
+++ b/src/linkml_reference_validator/etl/sources/url.py
@@ -0,0 +1,117 @@
+"""URL reference source.
+
+Fetches content from web URLs.
+
+Examples:
+ >>> from linkml_reference_validator.etl.sources.url import URLSource
+ >>> URLSource.prefix()
+ 'url'
+ >>> URLSource.can_handle("url:https://example.com")
+ True
+"""
+
+import logging
+import re
+import time
+from typing import Optional
+
+import requests # type: ignore
+
+from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
+from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry
+
+logger = logging.getLogger(__name__)
+
+
+@ReferenceSourceRegistry.register
+class URLSource(ReferenceSource):
+ """Fetch reference content from web URLs.
+
+ Fetches HTML and plain text content. HTML is returned as-is (no parsing).
+ Content is cached to disk like other sources.
+
+ Examples:
+ >>> source = URLSource()
+ >>> source.prefix()
+ 'url'
+ >>> source.can_handle("url:https://example.com")
+ True
+ """
+
+ @classmethod
+ def prefix(cls) -> str:
+ """Return 'url' prefix.
+
+ Examples:
+ >>> URLSource.prefix()
+ 'url'
+ """
+ return "url"
+
+ def fetch(
+ self, identifier: str, config: ReferenceValidationConfig
+ ) -> Optional[ReferenceContent]:
+ """Fetch content from a URL.
+
+ Args:
+ identifier: URL (without 'url:' prefix)
+ config: Configuration including rate limiting
+
+ Returns:
+ ReferenceContent if successful, None otherwise
+
+ Examples:
+ >>> from linkml_reference_validator.models import ReferenceValidationConfig
+ >>> config = ReferenceValidationConfig()
+ >>> source = URLSource()
+ >>> # Would fetch in real usage:
+ >>> # ref = source.fetch("https://example.com", config)
+ """
+ url = identifier.strip()
+ time.sleep(config.rate_limit_delay)
+
+ headers = {
+ "User-Agent": f"linkml-reference-validator/1.0 (mailto:{config.email})",
+ }
+
+ try:
+ response = requests.get(url, headers=headers, timeout=30)
+ except requests.RequestException as e:
+ logger.warning(f"Failed to fetch URL:{url} - {e}")
+ return None
+ if response.status_code != 200:
+ logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
+ return None
+
+ content = response.text
+ title = self._extract_title(content, url)
+
+ return ReferenceContent(
+ reference_id=f"url:{url}",
+ title=title,
+ content=content,
+ content_type="url",
+ )
+
+ def _extract_title(self, content: str, url: str) -> str:
+ """Extract title from HTML content or use URL.
+
+ Looks for a <title> tag in the HTML. Falls back to the URL.
+
+ Args:
+ content: Page content
+ url: URL of the page
+
+ Returns:
+ Extracted title or URL
+
+ Examples:
+ >>> source = URLSource()
+ >>> source._extract_title("Page Title", "https://x.com")
+ 'Page Title'
+ >>> source._extract_title("plain text", "https://example.com/doc.txt")
+ 'https://example.com/doc.txt'
+ """
+ # Look for HTML title tag (simple regex, no BeautifulSoup)
+ match = re.search(r"]*>([^<]+)", content, re.IGNORECASE)
+ if match:
+ return match.group(1).strip()
+
+ # Fall back to URL
+ return url
diff --git a/src/linkml_reference_validator/models.py b/src/linkml_reference_validator/models.py
index 351a07b..dc6896b 100644
--- a/src/linkml_reference_validator/models.py
+++ b/src/linkml_reference_validator/models.py
@@ -345,6 +345,10 @@ class ReferenceValidationConfig(BaseModel):
default=Path("references_cache"),
description="Directory for caching downloaded references",
)
+ reference_base_dir: Optional[Path] = Field(
+ default=None,
+ description="Base directory for resolving relative file: references. If None, uses CWD.",
+ )
rate_limit_delay: float = Field(
default=0.5,
ge=0.0,
diff --git a/tests/test_reference_fetcher.py b/tests/test_reference_fetcher.py
index a78c8d2..15c567f 100644
--- a/tests/test_reference_fetcher.py
+++ b/tests/test_reference_fetcher.py
@@ -1,7 +1,7 @@
"""Tests for reference fetcher."""
import pytest
-from unittest.mock import Mock, patch, MagicMock
+from unittest.mock import patch, MagicMock
from linkml_reference_validator.models import ReferenceValidationConfig, ReferenceContent
from linkml_reference_validator.etl.reference_fetcher import ReferenceFetcher
@@ -35,15 +35,8 @@ def test_parse_reference_id(fetcher):
assert fetcher._parse_reference_id("pmid:12345678") == ("PMID", "12345678")
assert fetcher._parse_reference_id("12345678") == ("PMID", "12345678")
assert fetcher._parse_reference_id("DOI:10.1234/test") == ("DOI", "10.1234/test")
-
-
-def test_parse_authors(fetcher):
- """Test author list parsing."""
- authors = fetcher._parse_authors(["Smith J", "Doe A", "Johnson K"])
- assert authors == ["Smith J", "Doe A", "Johnson K"]
-
- authors = fetcher._parse_authors([])
- assert authors == []
+ assert fetcher._parse_reference_id("file:./test.md") == ("file", "./test.md")
+ assert fetcher._parse_reference_id("url:https://example.com") == ("url", "https://example.com")
def test_get_cache_path(fetcher):
@@ -159,102 +152,9 @@ def test_fetch_unsupported_type(fetcher):
assert result is None
-@patch("linkml_reference_validator.etl.reference_fetcher.Entrez")
-def test_fetch_pmid_mock(mock_entrez, fetcher):
- """Test fetching PMID with mocked Entrez."""
- mock_handle = MagicMock()
- mock_handle.read.return_value = [
- {
- "Title": "Test Article",
- "AuthorList": ["Smith J", "Doe A"],
- "Source": "Nature",
- "PubDate": "2024 Jan",
- "DOI": "10.1234/test",
- }
- ]
- mock_handle.__enter__ = Mock(return_value=mock_handle)
- mock_handle.__exit__ = Mock(return_value=False)
-
- mock_entrez.read.return_value = [
- {
- "Title": "Test Article",
- "AuthorList": ["Smith J", "Doe A"],
- "Source": "Nature",
- "PubDate": "2024 Jan",
- "DOI": "10.1234/test",
- }
- ]
- mock_entrez.esummary.return_value = mock_handle
- mock_entrez.efetch.return_value = MagicMock(read=lambda: "This is the abstract text.")
- mock_entrez.elink.return_value = MagicMock()
- mock_entrez.read.side_effect = [
- [
- {
- "Title": "Test Article",
- "AuthorList": ["Smith J", "Doe A"],
- "Source": "Nature",
- "PubDate": "2024 Jan",
- "DOI": "10.1234/test",
- }
- ],
- [{"LinkSetDb": []}],
- ]
-
- result = fetcher._fetch_pmid("12345678")
-
- assert result is not None
- assert result.reference_id == "PMID:12345678"
- assert result.title == "Test Article"
-
-
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
-def test_fetch_doi_mock(mock_get, fetcher):
- """Test fetching DOI with mocked requests."""
- mock_response = MagicMock()
- mock_response.status_code = 200
- mock_response.json.return_value = {
- "status": "ok",
- "message": {
- "title": ["Test DOI Article"],
- "author": [
- {"given": "John", "family": "Smith"},
- {"given": "Alice", "family": "Doe"},
- ],
- "container-title": ["Nature"],
- "published-print": {"date-parts": [[2024, 1, 15]]},
- "abstract": "This is the abstract of the test article.",
- "DOI": "10.1234/test.article",
- },
- }
- mock_get.return_value = mock_response
-
- result = fetcher._fetch_doi("10.1234/test.article")
-
- assert result is not None
- assert result.reference_id == "DOI:10.1234/test.article"
- assert result.title == "Test DOI Article"
- assert result.authors == ["John Smith", "Alice Doe"]
- assert result.journal == "Nature"
- assert result.year == "2024"
- assert result.doi == "10.1234/test.article"
- assert "This is the abstract" in result.content
-
-
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
-def test_fetch_doi_not_found(mock_get, fetcher):
- """Test fetching DOI that doesn't exist."""
- mock_response = MagicMock()
- mock_response.status_code = 404
- mock_get.return_value = mock_response
-
- result = fetcher._fetch_doi("10.1234/nonexistent")
-
- assert result is None
-
-
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+@patch("linkml_reference_validator.etl.sources.doi.requests.get")
def test_fetch_doi_via_fetch_method(mock_get, fetcher):
- """Test that fetch() correctly routes DOI requests to _fetch_doi."""
+ """Test that fetch() correctly routes DOI requests to DOISource."""
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.json.return_value = {
@@ -276,7 +176,7 @@ def test_fetch_doi_via_fetch_method(mock_get, fetcher):
assert result.title == "DOI Article via fetch()"
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+@patch("linkml_reference_validator.etl.sources.doi.requests.get")
def test_save_and_load_doi_from_disk(mock_get, fetcher, tmp_path):
"""Test saving and loading DOI reference from disk cache."""
mock_response = MagicMock()
@@ -310,117 +210,59 @@ def test_save_and_load_doi_from_disk(mock_get, fetcher, tmp_path):
assert result2.doi == "10.9999/cached.doi"
-def test_parse_url_reference_id(fetcher):
- """Test parsing URL reference IDs."""
- assert fetcher._parse_reference_id("URL:https://example.com/book/chapter1") == ("URL", "https://example.com/book/chapter1")
- assert fetcher._parse_reference_id("url:https://example.com/article") == ("URL", "https://example.com/article")
- assert fetcher._parse_reference_id("https://example.com/direct") == ("URL", "https://example.com/direct")
+def test_fetch_local_file(fetcher, tmp_path):
+ """Test fetching content from a local file."""
+ # Create a test file
+ test_file = tmp_path / "research.md"
+ test_file.write_text("# Research Notes\n\nThis is my research content.")
-
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
-def test_fetch_url_success(mock_get, fetcher):
- """Test fetching URL reference successfully."""
- mock_response = MagicMock()
- mock_response.status_code = 200
- mock_response.text = """
- <html>
- <head>
- <title>Chapter 1: Introduction to Biology</title>
- </head>
- <body>
- <h1>Chapter 1: Introduction to Biology</h1>
- <p>Biology is the natural science that studies life and living organisms.</p>
- <p>This chapter provides an overview of cellular structure and function.</p>
- <p>The cell is the basic unit of life.</p>
- </body>
- </html>
- """
- mock_get.return_value = mock_response
-
- result = fetcher.fetch("URL:https://example.com/biology-book/chapter1")
+ result = fetcher.fetch(f"file:{test_file}")
assert result is not None
- assert result.reference_id == "URL:https://example.com/biology-book/chapter1"
- assert result.title == "Chapter 1: Introduction to Biology"
- assert result.content_type == "html_converted"
- assert "Biology is the natural science" in result.content
- assert "basic unit of life" in result.content
+ assert "Research Notes" in result.title
+ assert "This is my research content." in result.content
+ assert result.content_type == "local_file"
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
-def test_fetch_url_no_title(mock_get, fetcher):
- """Test fetching URL with no title tag."""
+@patch("linkml_reference_validator.etl.sources.url.requests.get")
+def test_fetch_url(mock_get, fetcher):
+ """Test fetching content from a URL."""
mock_response = MagicMock()
mock_response.status_code = 200
- mock_response.text = """
- <html>
- <body>
- <h1>Main Heading</h1>
- <p>Content without title tag.</p>
- </body>
- </html>
- """
+ mock_response.text = "Web PagePage content here."
+ mock_response.headers = {"content-type": "text/html"}
mock_get.return_value = mock_response
- result = fetcher.fetch("URL:https://example.com/no-title")
+ result = fetcher.fetch("url:https://example.com/page")
assert result is not None
- assert result.reference_id == "URL:https://example.com/no-title"
- assert result.title is None
- assert "Main Heading" in result.content
+ assert result.title == "Web Page"
+ assert "Page content here." in result.content
+ assert result.content_type == "url"
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+@patch("linkml_reference_validator.etl.sources.url.requests.get")
def test_fetch_url_http_error(mock_get, fetcher):
"""Test fetching URL that returns HTTP error."""
mock_response = MagicMock()
mock_response.status_code = 404
mock_get.return_value = mock_response
- result = fetcher.fetch("URL:https://example.com/not-found")
+ result = fetcher.fetch("url:https://example.com/not-found")
assert result is None
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
-def test_fetch_url_request_exception(mock_get, fetcher):
- """Test fetching URL that raises request exception."""
- mock_get.side_effect = Exception("Network error")
-
- result = fetcher.fetch("URL:https://example.com/error")
-
- assert result is None
-
-
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
-def test_fetch_url_malformed_html(mock_get, fetcher):
- """Test fetching URL with malformed HTML.
-
- BeautifulSoup is very forgiving and will parse even malformed HTML.
- This test verifies that the fetcher doesn't crash on malformed input.
- """
- mock_response = MagicMock()
- mock_response.status_code = 200
- mock_response.text = "TestContent without closing tags"
- mock_get.return_value = mock_response
-
- result = fetcher.fetch("URL:https://example.com/malformed")
-
- assert result is not None
- assert result.title == "Test"
- assert "Content without closing tags" in result.content
-
-
def test_url_cache_path(fetcher):
"""Test cache path generation for URLs."""
- path = fetcher._get_cache_path("URL:https://example.com/book/chapter1")
- assert path.name == "URL_https___example.com_book_chapter1.md"
+ path = fetcher._get_cache_path("url:https://example.com/book/chapter1")
+ assert path.name == "url_https___example.com_book_chapter1.md"
- path = fetcher._get_cache_path("URL:https://example.com/path?param=value")
- assert path.name == "URL_https___example.com_path_param_value.md"
+ path = fetcher._get_cache_path("url:https://example.com/path?param=value")
+ assert path.name == "url_https___example.com_path_param_value.md"
-@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+@patch("linkml_reference_validator.etl.sources.url.requests.get")
def test_save_and_load_url_from_disk(mock_get, fetcher, tmp_path):
"""Test saving and loading URL reference from disk cache."""
mock_response = MagicMock()
@@ -431,21 +273,22 @@ def test_save_and_load_url_from_disk(mock_get, fetcher, tmp_path):
This content should be cached.
"""
+ mock_response.headers = {"content-type": "text/html"}
mock_get.return_value = mock_response
# First fetch - this should save to disk
- result1 = fetcher.fetch("URL:https://example.com/cached")
+ result1 = fetcher.fetch("url:https://example.com/cached")
assert result1 is not None
# Clear memory cache
fetcher._cache.clear()
# Second fetch - should load from disk without making HTTP request
- with patch("linkml_reference_validator.etl.reference_fetcher.requests.get") as mock_no_request:
- result2 = fetcher.fetch("URL:https://example.com/cached")
+ with patch("linkml_reference_validator.etl.sources.url.requests.get") as mock_no_request:
+ result2 = fetcher.fetch("url:https://example.com/cached")
mock_no_request.assert_not_called()
assert result2 is not None
- assert result2.reference_id == "URL:https://example.com/cached"
+ assert result2.reference_id == "url:https://example.com/cached"
assert result2.title == "Cached URL Content"
assert "This content should be cached" in result2.content
diff --git a/tests/test_sources.py b/tests/test_sources.py
new file mode 100644
index 0000000..91354ff
--- /dev/null
+++ b/tests/test_sources.py
@@ -0,0 +1,298 @@
+"""Tests for reference source plugins."""
+
+import pytest
+from unittest.mock import patch, MagicMock
+
+from linkml_reference_validator.models import ReferenceValidationConfig
+from linkml_reference_validator.etl.sources.base import ReferenceSourceRegistry
+from linkml_reference_validator.etl.sources.file import FileSource
+from linkml_reference_validator.etl.sources.url import URLSource
+from linkml_reference_validator.etl.sources.pmid import PMIDSource
+from linkml_reference_validator.etl.sources.doi import DOISource
+
+
+class TestReferenceSourceRegistry:
+ """Tests for the source registry."""
+
+ def test_registry_has_default_sources(self):
+ """Registry should have PMID, DOI, file, and url sources registered."""
+ sources = ReferenceSourceRegistry.list_sources()
+ prefixes = [s.prefix() for s in sources]
+ assert "PMID" in prefixes
+ assert "DOI" in prefixes
+ assert "file" in prefixes
+ assert "url" in prefixes
+
+ def test_get_source_for_pmid(self):
+ """Should return PMIDSource for PMID references."""
+ source = ReferenceSourceRegistry.get_source("PMID:12345678")
+ assert source is not None
+ assert source.prefix() == "PMID"
+
+ def test_get_source_for_doi(self):
+ """Should return DOISource for DOI references."""
+ source = ReferenceSourceRegistry.get_source("DOI:10.1234/test")
+ assert source is not None
+ assert source.prefix() == "DOI"
+
+ def test_get_source_for_file(self):
+ """Should return FileSource for file references."""
+ source = ReferenceSourceRegistry.get_source("file:./test.md")
+ assert source is not None
+ assert source.prefix() == "file"
+
+ def test_get_source_for_url(self):
+ """Should return URLSource for url references."""
+ source = ReferenceSourceRegistry.get_source("url:https://example.com")
+ assert source is not None
+ assert source.prefix() == "url"
+
+ def test_get_source_unknown(self):
+ """Should return None for unknown reference types."""
+ source = ReferenceSourceRegistry.get_source("UNKNOWN:12345")
+ assert source is None
+
+
+class TestFileSource:
+ """Tests for FileSource."""
+
+ @pytest.fixture
+ def config(self, tmp_path):
+ """Create test config."""
+ return ReferenceValidationConfig(
+ cache_dir=tmp_path / "cache",
+ rate_limit_delay=0.0,
+ )
+
+ @pytest.fixture
+ def source(self):
+ """Create FileSource instance."""
+ return FileSource()
+
+ def test_prefix(self, source):
+ """FileSource should have 'file' prefix."""
+ assert source.prefix() == "file"
+
+ def test_can_handle_file_prefix(self, source):
+ """Should handle file: references."""
+ assert source.can_handle("file:./test.md")
+ assert source.can_handle("file:/absolute/path.txt")
+ assert not source.can_handle("PMID:12345")
+
+ def test_fetch_markdown_file(self, source, config, tmp_path):
+ """Should read markdown file content."""
+ # Create test markdown file
+ test_file = tmp_path / "test.md"
+ test_file.write_text("# Test Document\n\nThis is test content.")
+
+ result = source.fetch(str(test_file), config)
+
+ assert result is not None
+ assert result.reference_id == f"file:{test_file}"
+ assert result.title == "Test Document"
+ assert "This is test content." in result.content
+ assert result.content_type == "local_file"
+
+ def test_fetch_plain_text_file(self, source, config, tmp_path):
+ """Should read plain text file content."""
+ test_file = tmp_path / "test.txt"
+ test_file.write_text("Plain text content here.")
+
+ result = source.fetch(str(test_file), config)
+
+ assert result is not None
+ assert "Plain text content here." in result.content
+ assert result.title == "test.txt" # Falls back to filename
+
+ def test_fetch_relative_path_with_base_dir(self, tmp_path):
+ """Should resolve relative paths using reference_base_dir."""
+ # Create base dir with test file
+ base_dir = tmp_path / "references"
+ base_dir.mkdir()
+ test_file = base_dir / "notes.md"
+ test_file.write_text("# Notes\n\nSome notes here.")
+
+ config = ReferenceValidationConfig(
+ cache_dir=tmp_path / "cache",
+ reference_base_dir=base_dir,
+ )
+ source = FileSource()
+
+ result = source.fetch("notes.md", config)
+
+ assert result is not None
+ assert "Some notes here." in result.content
+
+ def test_fetch_relative_path_cwd_fallback(self, source, config, tmp_path, monkeypatch):
+ """Should resolve relative paths from CWD if no base_dir set."""
+ # Create test file in tmp_path (simulating CWD)
+ test_file = tmp_path / "relative.md"
+ test_file.write_text("# Relative\n\nRelative content.")
+
+ # Change CWD to tmp_path
+ monkeypatch.chdir(tmp_path)
+
+ result = source.fetch("relative.md", config)
+
+ assert result is not None
+ assert "Relative content." in result.content
+
+ def test_fetch_nonexistent_file(self, source, config):
+ """Should return None for nonexistent files."""
+ result = source.fetch("/nonexistent/file.md", config)
+ assert result is None
+
+ def test_extract_title_from_markdown(self, source, config, tmp_path):
+ """Should extract title from first heading."""
+ test_file = tmp_path / "titled.md"
+ test_file.write_text("Some preamble\n\n# The Real Title\n\nContent here.")
+
+ result = source.fetch(str(test_file), config)
+
+ assert result is not None
+ assert result.title == "The Real Title"
+
+ def test_html_content_preserved(self, source, config, tmp_path):
+ """HTML content should be preserved as-is."""
+ test_file = tmp_path / "test.html"
+ test_file.write_text("Test & content
")
+
+ result = source.fetch(str(test_file), config)
+
+ assert result is not None
+ assert "&" in result.content # HTML entities preserved
+
+
+class TestURLSource:
+ """Tests for URLSource."""
+
+ @pytest.fixture
+ def config(self, tmp_path):
+ """Create test config."""
+ return ReferenceValidationConfig(
+ cache_dir=tmp_path / "cache",
+ rate_limit_delay=0.0,
+ )
+
+ @pytest.fixture
+ def source(self):
+ """Create URLSource instance."""
+ return URLSource()
+
+ def test_prefix(self, source):
+ """URLSource should have 'url' prefix."""
+ assert source.prefix() == "url"
+
+ def test_can_handle_url_prefix(self, source):
+ """Should handle url: references."""
+ assert source.can_handle("url:https://example.com")
+ assert source.can_handle("url:http://example.com/page")
+ assert not source.can_handle("PMID:12345")
+
+ @patch("linkml_reference_validator.etl.sources.url.requests.get")
+ def test_fetch_url_html(self, mock_get, source, config):
+ """Should fetch HTML content from URL."""
+ mock_response = MagicMock()
+ mock_response.status_code = 200
+ mock_response.text = "Test PageContent here"
+ mock_response.headers = {"content-type": "text/html"}
+ mock_get.return_value = mock_response
+
+ result = source.fetch("https://example.com/page", config)
+
+ assert result is not None
+ assert result.reference_id == "url:https://example.com/page"
+ assert "Content here" in result.content
+ assert result.content_type == "url"
+
+ @patch("linkml_reference_validator.etl.sources.url.requests.get")
+ def test_fetch_url_plain_text(self, mock_get, source, config):
+ """Should fetch plain text content from URL."""
+ mock_response = MagicMock()
+ mock_response.status_code = 200
+ mock_response.text = "Plain text content from URL"
+ mock_response.headers = {"content-type": "text/plain"}
+ mock_get.return_value = mock_response
+
+ result = source.fetch("https://example.com/text.txt", config)
+
+ assert result is not None
+ assert "Plain text content from URL" in result.content
+
+ @patch("linkml_reference_validator.etl.sources.url.requests.get")
+ def test_fetch_url_not_found(self, mock_get, source, config):
+ """Should return None for 404 responses."""
+ mock_response = MagicMock()
+ mock_response.status_code = 404
+ mock_get.return_value = mock_response
+
+ result = source.fetch("https://example.com/notfound", config)
+
+ assert result is None
+
+ @patch("linkml_reference_validator.etl.sources.url.requests.get")
+ def test_fetch_url_extracts_title(self, mock_get, source, config):
+ """Should extract title from HTML."""
+ mock_response = MagicMock()
+ mock_response.status_code = 200
+ mock_response.text = "Page Title HereContent"
+ mock_response.headers = {"content-type": "text/html"}
+ mock_get.return_value = mock_response
+
+ result = source.fetch("https://example.com", config)
+
+ assert result is not None
+ assert result.title == "Page Title Here"
+
+
+class TestPMIDSource:
+ """Tests for PMIDSource (refactored from ReferenceFetcher)."""
+
+ @pytest.fixture
+ def config(self, tmp_path):
+ """Create test config."""
+ return ReferenceValidationConfig(
+ cache_dir=tmp_path / "cache",
+ rate_limit_delay=0.0,
+ )
+
+ @pytest.fixture
+ def source(self):
+ """Create PMIDSource instance."""
+ return PMIDSource()
+
+ def test_prefix(self, source):
+ """PMIDSource should have 'PMID' prefix."""
+ assert source.prefix() == "PMID"
+
+ def test_can_handle_pmid(self, source):
+ """Should handle PMID references."""
+ assert source.can_handle("PMID:12345678")
+ assert source.can_handle("PMID 12345678")
+ assert not source.can_handle("DOI:10.1234/test")
+
+
+class TestDOISource:
+ """Tests for DOISource (refactored from ReferenceFetcher)."""
+
+ @pytest.fixture
+ def config(self, tmp_path):
+ """Create test config."""
+ return ReferenceValidationConfig(
+ cache_dir=tmp_path / "cache",
+ rate_limit_delay=0.0,
+ )
+
+ @pytest.fixture
+ def source(self):
+ """Create DOISource instance."""
+ return DOISource()
+
+ def test_prefix(self, source):
+ """DOISource should have 'DOI' prefix."""
+ assert source.prefix() == "DOI"
+
+ def test_can_handle_doi(self, source):
+ """Should handle DOI references."""
+ assert source.can_handle("DOI:10.1234/test")
+ assert not source.can_handle("PMID:12345678")