docs/concepts/how-it-works.md (31 additions, 33 deletions)

## Reference Fetching

The validator uses a **plugin architecture** to support multiple reference sources. Each source type is handled by a dedicated plugin that knows how to fetch and parse content from that source.
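
In practice, fetching is a prefix dispatch: the part of the reference ID before the colon selects a plugin, and the remainder is handed to that plugin as the identifier. A toy sketch of the idea (the real lookup lives in `ReferenceSourceRegistry`; the dict here is purely illustrative):

```python
# Toy sketch of prefix-based dispatch; the actual registry API may differ.
from typing import Optional

def fetch_reference(reference_id: str, sources: dict) -> Optional[str]:
    """Route a reference ID to the handler registered for its prefix."""
    prefix, identifier = reference_id.split(":", 1)
    handler = sources.get(prefix.lower())
    return handler(identifier) if handler else None

# Illustrative registry mapping prefixes to fetch functions
sources = {"pmid": lambda pmid: f"abstract for PMID {pmid}"}
print(fetch_reference("PMID:12345678", sources))
```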

### PubMed (PMID)

For `PMID:12345678`:

1. Queries NCBI E-utilities API
2. Fetches abstract and metadata
3. Parses XML response
4. Caches as markdown with YAML frontmatter
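
For illustration, the underlying E-utilities call looks roughly like this; a minimal sketch against the public `efetch` endpoint, not the validator's internal code (which adds rate limiting and caching):

```python
# Minimal sketch: fetch a PubMed record as XML via NCBI E-utilities.
import requests

def fetch_pubmed_xml(pmid: str) -> str:
    """Return the raw PubMed XML for a PMID."""
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {"db": "pubmed", "id": pmid, "retmode": "xml"}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.text
```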

### PubMed Central (PMC)

For `PMC:12345`:

1. Queries PMC API for full-text XML
2. Extracts all sections (abstract, introduction, methods, results, discussion)
3. Provides more content than abstracts alone
4. Also cached as markdown
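
The full-text lookup uses the same E-utilities endpoint with `db=pmc`; a minimal sketch (again, not the plugin's actual code):

```python
# Minimal sketch: fetch full-text JATS XML for PMC:12345.
import requests

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
params = {"db": "pmc", "id": "12345", "retmode": "xml"}
xml = requests.get(url, params=params, timeout=30).text
```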

### DOI (Digital Object Identifier)

For `DOI:10.1234/example`:

1. Queries Crossref API
2. Fetches metadata and abstract (when available)
3. Caches as markdown
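
A minimal sketch of the Crossref lookup (substitute a real DOI; the plugin's parsing is more thorough):

```python
# Minimal sketch: Crossref metadata for a DOI.
import requests

doi = "10.1234/example"  # placeholder; use a real DOI
response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
message = response.json()["message"]
title = message.get("title", [""])[0]  # Crossref returns titles as a list
abstract = message.get("abstract")     # present only for some records
```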

### Local Files

For `file:./path/to/document.md`:

1. Reads file from local filesystem
2. Extracts title from first markdown heading (or uses filename)
3. Content used as-is (even HTML files are not parsed)
4. Caches the content so validation is consistent across runs

Path resolution:
- Absolute paths work directly
- Relative paths resolve against the `reference_base_dir` config setting if set, otherwise against the current working directory
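
The rule above amounts to a few lines; a sketch assuming the base directory arrives as an optional string (not the validator's actual code):

```python
# Sketch of the path-resolution rule described above.
from pathlib import Path
from typing import Optional

def resolve_reference_path(path: str, reference_base_dir: Optional[str]) -> Path:
    p = Path(path)
    if p.is_absolute():
        return p  # absolute paths work directly
    base = Path(reference_base_dir) if reference_base_dir else Path.cwd()
    return base / p
```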

### URLs

For `url:https://example.com/page`:

1. Fetches page via HTTP GET
2. Extracts title from `<title>` tag (for HTML)
3. Content preserved as-is
4. Cached like other sources
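
A minimal sketch of the fetch and title extraction (the plugin may differ in details):

```python
# Minimal sketch: fetch a page and pull the title from its <title> tag.
import re
import requests

response = requests.get("https://example.com/page", timeout=30)
match = re.search(r"<title[^>]*>(.*?)</title>", response.text,
                  re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else None
content = response.text  # preserved as-is
```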

## Caching

docs/how-to/add-reference-source.md (new file, 263 additions)
# Adding a New Reference Source

The validator uses a plugin architecture that makes it easy to add support for new reference types. This guide shows how to create a custom reference source.

## Overview

Each reference source is a Python class that:

1. Inherits from `ReferenceSource`
2. Implements `prefix()` and `fetch()` methods
3. Registers itself with the `ReferenceSourceRegistry`

## Step 1: Create the Source Class

Create a new file in `src/linkml_reference_validator/etl/sources/`:

```python
# src/linkml_reference_validator/etl/sources/arxiv.py
"""arXiv reference source."""

import logging
from typing import Optional

from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry

logger = logging.getLogger(__name__)


@ReferenceSourceRegistry.register
class ArxivSource(ReferenceSource):
"""Fetch references from arXiv."""

@classmethod
def prefix(cls) -> str:
"""Return the prefix this source handles."""
return "arxiv"

def fetch(
self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
"""Fetch a paper from arXiv.

Args:
identifier: arXiv ID (e.g., '2301.07041')
config: Configuration for fetching

Returns:
ReferenceContent if successful, None otherwise
"""
# Your implementation here
# Fetch from arXiv API, parse response, return ReferenceContent
...
```

## Step 2: Implement the `fetch()` Method

The `fetch()` method should:

1. Accept an identifier (without the prefix)
2. Fetch content from the external source
3. Return a `ReferenceContent` object or `None` on failure

```python
def fetch(
self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
"""Fetch a paper from arXiv."""
import requests
import time

arxiv_id = identifier.strip()

# Respect rate limiting
time.sleep(config.rate_limit_delay)

# Fetch from arXiv API
url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
response = requests.get(url, timeout=30)

if response.status_code != 200:
logger.warning(f"Failed to fetch arxiv:{arxiv_id}")
return None

    # Parse the response (arXiv returns Atom XML); _parse_arxiv_response
    # is a private helper you would implement, e.g. with xml.etree
    title, authors, abstract = self._parse_arxiv_response(response.text)

return ReferenceContent(
reference_id=f"arxiv:{arxiv_id}",
title=title,
content=abstract,
content_type="abstract_only",
authors=authors,
)
```

## Step 3: Handle Errors Gracefully

Since you're interfacing with external systems, wrap API calls in try/except:

```python
def fetch(self, identifier: str, config: ReferenceValidationConfig) -> Optional[ReferenceContent]:
    url = f"http://export.arxiv.org/api/query?id_list={identifier}"
    try:
        response = requests.get(url, timeout=30)
        # ... process response
    except Exception as e:
        logger.warning(f"Failed to fetch arxiv:{identifier}: {e}")
        return None
```

## Step 4: Register the Source

The `@ReferenceSourceRegistry.register` decorator automatically registers your source when the module is imported.

Add the import to `src/linkml_reference_validator/etl/sources/__init__.py`:

```python
from linkml_reference_validator.etl.sources.arxiv import ArxivSource

__all__ = [
# ... existing exports
"ArxivSource",
]
```

## Step 5: Write Tests

Create tests in `tests/test_sources.py`:

```python
from unittest.mock import MagicMock, patch

import pytest

from linkml_reference_validator.etl.sources.arxiv import ArxivSource
from linkml_reference_validator.models import ReferenceValidationConfig


class TestArxivSource:
    """Tests for ArxivSource."""

    @pytest.fixture
    def source(self):
        return ArxivSource()

    @pytest.fixture
    def config(self):
        # Assumes the config model is default-constructible
        return ReferenceValidationConfig()

    def test_prefix(self, source):
        assert source.prefix() == "arxiv"

    def test_can_handle(self, source):
        assert source.can_handle("arxiv:2301.07041")
        assert not source.can_handle("PMID:12345")

    @patch("linkml_reference_validator.etl.sources.arxiv.requests.get")
    def test_fetch(self, mock_get, source, config):
        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.text = """..."""  # Mock arXiv XML
        mock_get.return_value = mock_response

        result = source.fetch("2301.07041", config)

        assert result is not None
        assert result.reference_id == "arxiv:2301.07041"
```

## Optional: Custom `can_handle()` Method

The default `can_handle()` checks if the reference starts with your prefix. Override it for custom matching:

```python
@classmethod
def can_handle(cls, reference_id: str) -> bool:
"""Handle arxiv: references and bare arXiv IDs."""
ref = reference_id.strip()
# Match prefix
if ref.lower().startswith("arxiv:"):
return True
# Match bare arXiv ID pattern (e.g., 2301.07041)
import re
return bool(re.match(r"^\d{4}\.\d{4,5}(v\d+)?$", ref))
```

## Complete Example

Here's a complete implementation for a hypothetical "WikiData" source:

```python
# src/linkml_reference_validator/etl/sources/wikidata.py
"""WikiData reference source."""

import logging
import time
from typing import Optional

import requests

from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry

logger = logging.getLogger(__name__)


@ReferenceSourceRegistry.register
class WikidataSource(ReferenceSource):
"""Fetch reference content from WikiData items."""

@classmethod
def prefix(cls) -> str:
return "wikidata"

def fetch(
self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
qid = identifier.strip().upper()
if not qid.startswith("Q"):
qid = f"Q{qid}"

time.sleep(config.rate_limit_delay)

url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

try:
response = requests.get(url, timeout=30)
if response.status_code != 200:
logger.warning(f"Failed to fetch wikidata:{qid}")
return None

data = response.json()
entity = data["entities"].get(qid, {})

# Extract label and description
labels = entity.get("labels", {})
descriptions = entity.get("descriptions", {})

title = labels.get("en", {}).get("value", qid)
description = descriptions.get("en", {}).get("value", "")

return ReferenceContent(
reference_id=f"wikidata:{qid}",
title=title,
content=description,
content_type="wikidata_description",
)

except Exception as e:
logger.warning(f"Error fetching wikidata:{qid}: {e}")
return None
```
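
Once the module is imported (and therefore registered), the source can be exercised directly. A hypothetical usage, assuming `ReferenceValidationConfig` can be constructed with defaults:

```python
from linkml_reference_validator.etl.sources.wikidata import WikidataSource
from linkml_reference_validator.models import ReferenceValidationConfig

source = WikidataSource()
assert source.can_handle("wikidata:Q42")

config = ReferenceValidationConfig()  # assumed default-constructible
content = source.fetch("Q42", config)
print(content.title if content else "fetch failed")
```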

## Reference: ReferenceContent Fields

The `ReferenceContent` model has these fields:

| Field | Type | Description |
|-------|------|-------------|
| `reference_id` | `str` | Full reference ID with prefix (e.g., `arxiv:2301.07041`) |
| `title` | `Optional[str]` | Title of the reference |
| `content` | `Optional[str]` | Main text content for validation |
| `content_type` | `str` | Type indicator (e.g., `abstract_only`, `full_text`) |
| `authors` | `Optional[list[str]]` | List of author names |
| `journal` | `Optional[str]` | Journal/venue name |
| `year` | `Optional[str]` | Publication year |
| `doi` | `Optional[str]` | DOI if available |

## Tips

- **Rate limiting**: Always respect `config.rate_limit_delay` between API calls
- **Error handling**: Return `None` on failures; don't raise exceptions
- **Logging**: Use `logger.warning()` for failures to aid debugging
- **Caching**: The `ReferenceFetcher` handles caching automatically; your source only needs to fetch
- **Testing**: Mock external API calls in tests to avoid network dependencies