
Conversation

@cmungall
Member

Summary

  • Refactors reference fetching from if/elif dispatch to a plugin architecture
  • Adds support for local file references (file:./path.md)
  • Adds support for URL references (url:https://example.com)
  • Adds reference_base_dir config option for relative path resolution

Changes

New Plugin Architecture

  • ReferenceSource abstract base class with prefix() and fetch() methods
  • ReferenceSourceRegistry for discovering and dispatching to sources (see the sketch below)
  • Existing PMID and DOI logic extracted to PMIDSource and DOISource
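
A minimal sketch of the plugin interface listed above, for orientation. The class and method names (ReferenceSource, ReferenceSourceRegistry, prefix(), fetch()) follow this summary, but the bodies, the registry method names, and the prefix matching shown here are assumptions; the actual can_handle() reportedly uses regex-based prefix matching.

```python
# Illustrative sketch only, not the PR's actual code. ReferenceContent is
# simplified, and register()/find_source() are hypothetical registry names.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceContent:
    """Simplified stand-in for the fetched-reference model in models.py."""
    reference_id: str
    title: str = ""
    content: Optional[str] = None


class ReferenceSource(ABC):
    """A plugin that knows how to fetch one kind of reference (PMID, DOI, file, url, ...)."""

    @abstractmethod
    def prefix(self) -> str:
        """Prefix handled by this source, e.g. 'file' or 'url'."""

    @abstractmethod
    def fetch(self, identifier: str, config) -> Optional[ReferenceContent]:
        """Fetch the reference content, or return None on failure."""

    def can_handle(self, reference_id: str) -> bool:
        # Plain prefix check for brevity; the PR's can_handle() uses regex-based matching.
        return reference_id.lower().startswith(self.prefix().lower() + ":")


class ReferenceSourceRegistry:
    """Collects sources and dispatches a reference ID to the first source that can handle it."""

    def __init__(self) -> None:
        self._sources: list[ReferenceSource] = []

    def register(self, source: ReferenceSource) -> None:
        self._sources.append(source)

    def find_source(self, reference_id: str) -> Optional[ReferenceSource]:
        return next((s for s in self._sources if s.can_handle(reference_id)), None)
```

With this shape, adding a new reference type means implementing one subclass and registering it, rather than extending an if/elif chain.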

New Reference Types

| Prefix  | Example                   | Source           |
| ------- | ------------------------- | ---------------- |
| `file:` | `file:./notes.md`         | Local filesystem |
| `url:`  | `url:https://example.com` | Web (HTTP/HTTPS) |
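
Relative `file:` paths resolve against the new reference_base_dir option. A minimal sketch of that resolution, assuming absolute paths are used as-is and a missing base directory falls back to the current working directory (both assumptions, not confirmed by the PR):

```python
from pathlib import Path
from typing import Optional


def resolve_file_reference(identifier: str, reference_base_dir: Optional[str]) -> Path:
    """Resolve the part after the 'file:' prefix, e.g. './notes.md'."""
    path = Path(identifier)
    if path.is_absolute() or reference_base_dir is None:
        return path  # absolute paths (or no configured base) are used as given
    return Path(reference_base_dir) / path


# e.g. resolve_file_reference("./notes.md", "docs/references") -> docs/references/notes.md
```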

Documentation

  • New how-to: Using Local Files and URLs
  • New how-to: Adding a New Reference Source
  • Updated quickstart and concepts docs

Test plan

  • All 231 existing tests pass
  • New tests for FileSource and URLSource (25 tests)
  • mypy passes with no issues
  • ruff passes with no issues
  • Doctests pass

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings December 16, 2025 23:35
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a plugin-based architecture for reference sources, replacing the previous if/elif dispatch pattern with a more extensible registry system. It adds support for local file references and URL references, alongside existing PMID and DOI support, and introduces a reference_base_dir configuration option for relative path resolution.

Key changes:

  • Introduced ReferenceSource abstract base class and ReferenceSourceRegistry for plugin management
  • Extracted PMID and DOI fetching logic into dedicated source classes (PMIDSource, DOISource)
  • Added new FileSource for local file references (markdown, text, HTML)
  • Added new URLSource for web page references

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.

Summary per file

| File | Description |
| ---- | ----------- |
| src/linkml_reference_validator/etl/sources/base.py | Defines the plugin architecture with ReferenceSource ABC and ReferenceSourceRegistry |
| src/linkml_reference_validator/etl/sources/__init__.py | Imports and registers all source plugins |
| src/linkml_reference_validator/etl/sources/pmid.py | Extracted PMID fetching logic from ReferenceFetcher with proper error handling |
| src/linkml_reference_validator/etl/sources/doi.py | Extracted DOI fetching logic from ReferenceFetcher |
| src/linkml_reference_validator/etl/sources/file.py | New plugin for reading local file references |
| src/linkml_reference_validator/etl/sources/url.py | New plugin for fetching web URL references |
| src/linkml_reference_validator/etl/reference_fetcher.py | Refactored to use plugin registry, removed direct PMID/DOI logic |
| src/linkml_reference_validator/models.py | Added reference_base_dir configuration field |
| tests/test_sources.py | Comprehensive tests for all source plugins (25 new tests) |
| tests/test_reference_fetcher.py | Updated tests to work with plugin architecture |
| docs/quickstart.md | Added examples for file and URL references |
| docs/index.md | Updated to mention multiple reference sources |
| docs/how-to/use-local-files-and-urls.md | New guide for file and URL references |
| docs/how-to/add-reference-source.md | New guide for creating custom source plugins |
| docs/concepts/how-it-works.md | Expanded to document all reference source types |
| mkdocs.yml | Added new how-to guide entries |

Comment on lines +77 to +82

```python
response = requests.get(url, headers=headers, timeout=30)
if response.status_code != 200:
    logger.warning(f"Failed to fetch DOI:{doi} - status {response.status_code}")
    return None

data = response.json()
```

Copilot AI Dec 16, 2025


Missing error handling for network request. The requests.get call can raise various exceptions (ConnectionError, Timeout, RequestException, etc.). PMIDSource wraps similar API calls in try-except blocks for consistency. Consider wrapping the request and JSON parsing in a try-except block that returns None and logs a warning on failure, similar to how PMIDSource handles Entrez API calls at lines 107-113 of pmid.py.
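
A minimal sketch of the wrapping suggested here (illustrative only; the helper name, the exact exceptions caught, and the logging wording are assumptions, not code from the PR):

```python
import logging
from typing import Any, Optional

import requests

logger = logging.getLogger(__name__)


def get_json_or_none(url: str, headers: dict, timeout: int = 30) -> Optional[Any]:
    """GET a URL and decode JSON, returning None and logging a warning on any failure."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        if response.status_code != 200:
            logger.warning(f"Failed to fetch {url} - status {response.status_code}")
            return None
        return response.json()
    except requests.RequestException as e:
        logger.warning(f"Request to {url} failed: {e}")
        return None
    except ValueError as e:  # response body was not valid JSON
        logger.warning(f"Invalid JSON from {url}: {e}")
        return None
```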

Comment on lines +77 to +80

```python
response = requests.get(url, headers=headers, timeout=30)
if response.status_code != 200:
    logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
    return None
```

Copilot AI Dec 16, 2025


Missing error handling for network request. The requests.get call can raise various exceptions (ConnectionError, Timeout, RequestException, etc.). For consistency with PMIDSource which wraps API calls in try-except blocks (see pmid.py:107-113), consider wrapping the HTTP request in a try-except block that returns None and logs a warning on failure.

Comment on lines +82 to +83

```python
content = file_path.read_text(encoding="utf-8")
title = self._extract_title(content, file_path)
```

Copilot AI Dec 16, 2025


Missing error handling for file reading. The read_text call can raise various exceptions (PermissionError, UnicodeDecodeError, OSError, etc.). For consistency with PMIDSource which wraps external operations in try-except blocks (see pmid.py:107-113), consider wrapping the file read operation in a try-except block that returns None and logs a warning on failure.
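
A sketch of the equivalent guard for the file read (again illustrative; the exception tuple and return behaviour are assumptions):

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_text_or_none(file_path: Path) -> Optional[str]:
    """Read a local reference file, returning None and logging a warning if it cannot be read."""
    try:
        return file_path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as e:  # OSError covers PermissionError, FileNotFoundError, ...
        logger.warning(f"Failed to read {file_path}: {e}")
        return None
```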

Comment on lines +50 to +110

```python
def fetch(
    self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
    """Fetch a publication from Crossref by DOI.

    Args:
        identifier: DOI (without prefix)
        config: Configuration including rate limiting and email

    Returns:
        ReferenceContent if successful, None otherwise

    Examples:
        >>> from linkml_reference_validator.models import ReferenceValidationConfig
        >>> config = ReferenceValidationConfig()
        >>> source = DOISource()
        >>> # Would fetch in real usage:
        >>> # ref = source.fetch("10.1234/test", config)
    """
    doi = identifier.strip()
    time.sleep(config.rate_limit_delay)

    url = f"https://api.crossref.org/works/{doi}"
    headers = {
        "User-Agent": f"linkml-reference-validator/1.0 (mailto:{config.email})",
    }

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        logger.warning(f"Failed to fetch DOI:{doi} - status {response.status_code}")
        return None

    data = response.json()
    if data.get("status") != "ok":
        logger.warning(f"Crossref API error for DOI:{doi}")
        return None

    message = data.get("message", {})

    title_list = message.get("title", [])
    title = title_list[0] if title_list else ""

    authors = self._parse_crossref_authors(message.get("author", []))

    container_title = message.get("container-title", [])
    journal = container_title[0] if container_title else ""

    year = self._extract_crossref_year(message)

    abstract = self._clean_abstract(message.get("abstract", ""))

    return ReferenceContent(
        reference_id=f"DOI:{doi}",
        title=title,
        content=abstract if abstract else None,
        content_type="abstract_only" if abstract else "unavailable",
        authors=authors,
        journal=journal,
        year=year,
        doi=doi,
    )
```

Copilot AI Dec 16, 2025


The implementation doesn't follow the error handling guidance in docs/how-to/add-reference-source.md (Step 3, lines 97-109), which recommends wrapping external API calls in try-except blocks. This creates a discrepancy between the documented best practices and the actual code. Consider either adding try-except error handling to match the documentation, or updating the documentation to reflect that basic HTTP status checking is sufficient.

Comment on lines +51 to +90

```python
def fetch(
    self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
    """Fetch content from a URL.

    Args:
        identifier: URL (without 'url:' prefix)
        config: Configuration including rate limiting

    Returns:
        ReferenceContent if successful, None otherwise

    Examples:
        >>> from linkml_reference_validator.models import ReferenceValidationConfig
        >>> config = ReferenceValidationConfig()
        >>> source = URLSource()
        >>> # Would fetch in real usage:
        >>> # ref = source.fetch("https://example.com", config)
    """
    url = identifier.strip()
    time.sleep(config.rate_limit_delay)

    headers = {
        "User-Agent": f"linkml-reference-validator/1.0 (mailto:{config.email})",
    }

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
        return None

    content = response.text
    title = self._extract_title(content, url)

    return ReferenceContent(
        reference_id=f"url:{url}",
        title=title,
        content=content,
        content_type="url",
    )
```

Copilot AI Dec 16, 2025


The implementation doesn't follow the error handling guidance in docs/how-to/add-reference-source.md (Step 3, lines 97-109), which recommends wrapping external API calls in try-except blocks. This creates a discrepancy between the documented best practices and the actual code. Consider either adding try-except error handling to match the documentation, or updating the documentation to reflect that basic HTTP status checking is sufficient.

```yaml
supporting_text: Climate impacts are accelerating
reference: url:https://example.org/climate-report.html
```


Copilot AI Dec 16, 2025


Using the url: reference type in LinkML data files means that reference IDs coming from untrusted or semi-trusted datasets can cause the validator to issue HTTP requests to arbitrary hosts, creating a server-side request forgery (SSRF) risk. In automated validation pipelines or services that run this tool on externally supplied data, an attacker could set reference: url:... to target internal services or cloud metadata endpoints, causing unexpected network access from inside your environment. To mitigate this, document and enforce restrictions such as limiting allowed URL schemes/hosts, or disabling url: references entirely when validating untrusted data.

Suggested change
> ⚠️ **Security Warning:**
> Using the `url:` reference type means that reference IDs coming from untrusted or semi-trusted datasets can cause the validator to issue HTTP requests to arbitrary hosts, creating a server-side request forgery (SSRF) risk.
> In automated validation pipelines or services that run this tool on externally supplied data, an attacker could set `reference: url:...` to target internal services or cloud metadata endpoints, causing unexpected network access from inside your environment.
>
> **To mitigate this risk:**
> - Limit allowed URL schemes/hosts when validating untrusted data
> - Or disable `url:` references entirely in such contexts
> - Always review and sanitize input data sources
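
A minimal sketch of the scheme/host restriction recommended here (the helper and the allowed_hosts setting are hypothetical, not part of this PR):

```python
from typing import Optional, Set
from urllib.parse import urlparse


def is_allowed_url(url: str, allowed_hosts: Optional[Set[str]] = None) -> bool:
    """Accept only http(s) URLs whose host appears on an explicit allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if not allowed_hosts:
        return False  # safe default when validating untrusted data: allow nothing
    return parsed.hostname in allowed_hosts


# is_allowed_url("https://example.org/climate-report.html", {"example.org"})  -> True
# is_allowed_url("http://169.254.169.254/latest/meta-data/", {"example.org"}) -> False
```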

Refactor reference fetching from if/elif dispatch to a plugin architecture:
- Add ReferenceSource base class and ReferenceSourceRegistry
- Extract PMID and DOI logic to dedicated source classes
- Add FileSource for local file references (file:./path.md)
- Add URLSource for web page references (url:https://...)
- Add reference_base_dir config for relative path resolution

Documentation:
- Add how-to guide for local files and URLs
- Add how-to guide for creating new reference sources
- Update quickstart and concepts docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@cmungall force-pushed the feature/plugin-reference-sources branch from 1a00b30 to 87ee0f0 on December 17, 2025 01:11
@claude
Contributor

claude bot commented Dec 17, 2025

Claude finished @cmungall's task


Pull Request Review

Overview

This PR successfully refactors the reference fetching system from an if/elif dispatch to a clean plugin architecture, adding support for local files and URLs. The implementation follows good software engineering principles with a well-designed abstract base class and registry pattern.

Strengths

Architecture & Design

  • Excellent plugin architecture: The ReferenceSource ABC and ReferenceSourceRegistry provide a clean, extensible foundation that follows SOLID principles
  • Clean separation of concerns: Each source type is isolated in its own module with focused responsibilities
  • Good abstractions: The can_handle() method with regex-based prefix matching is elegant and flexible
  • Comprehensive documentation: Extensive docstrings with doctests throughout the codebase

Implementation Quality

  • Strong test coverage: 25 new tests covering all source types with good edge case coverage
  • Type safety: Proper type hints throughout and passes mypy checks
  • Path resolution logic: Smart handling of absolute vs relative paths with configurable base directory
  • Content extraction: Thoughtful title extraction from HTML and markdown files

Code Quality

  • Follows project conventions: Adheres to CLAUDE.md guidelines with proper docstrings and test-driven approach
  • Good error handling patterns: Appropriate logging for debugging and troubleshooting
  • Clean migration: Successfully extracted PMID/DOI logic without breaking existing functionality

⚠️ Critical Security Concerns

1. Server-Side Request Forgery (SSRF) Risk

  • Issue: The URL source allows arbitrary HTTP requests to any host, creating SSRF vulnerability when processing untrusted data
  • Impact: In automated pipelines, malicious reference IDs like url:http://169.254.169.254/ could target cloud metadata services or internal networks
  • Recommendation: Add URL validation/allowlisting, especially for production deployments

2. Path Traversal Risk

  • Issue: File source accepts arbitrary paths without validation
  • Impact: file:../../../etc/passwd could read sensitive files outside intended directories
  • Recommendation: Add path validation to restrict access to intended directories only
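
A minimal sketch of the containment check recommended above (hypothetical helper; real code would also need to decide how symlinks and a missing reference_base_dir are handled):

```python
from pathlib import Path
from typing import Optional


def resolve_within_base(identifier: str, base_dir: str) -> Optional[Path]:
    """Resolve a 'file:' identifier and reject anything that escapes base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / identifier).resolve()
    try:
        candidate.relative_to(base)  # raises ValueError if candidate is outside base
    except ValueError:
        return None
    return candidate


# resolve_within_base("notes.md", "/data/refs")            -> /data/refs/notes.md
# resolve_within_base("../../../etc/passwd", "/data/refs") -> None
```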

⚠️ Error Handling Inconsistencies

Missing Exception Handling (as noted by Copilot reviewer):

  • URLSource.fetch() line 80: Missing try/catch for requests.get() and response.json()
  • DOISource.fetch() line 82: Same issue with unhandled network exceptions
  • FileSource.fetch() line 83: Missing try/catch for read_text() operations

Inconsistency with Documentation:

  • docs/how-to/add-reference-source.md recommends try/catch blocks (lines 99-109)
  • But URLSource and DOISource don't follow this pattern, unlike PMIDSource
  • Recommendation: Either add exception handling or update documentation to match actual implementation

⚠️ Code Quality Issues

CLAUDE.md Compliance:

  • Issue: Project guidelines state "AVOID try/except blocks, except when interfacing with external systems" (lines 22-24)
  • Conflict: New sources interface with external systems but lack proper exception handling
  • Recommendation: Add try/catch blocks for external API calls as these are the intended use case

📝 Minor Improvements

Performance:

  • Consider adding request timeouts and retry logic for robustness
  • URL source could benefit from content-type validation
  • File source could add file size limits for large files

Documentation:

  • Security warnings needed in docs/how-to/use-local-files-and-urls.md about SSRF risks
  • Consider adding configuration examples for production deployments

Test Coverage Assessment

  • Excellent coverage: 25 new tests cover all source types
  • Good mocking: Proper use of unittest.mock for external dependencies (a hypothetical example follows this list)
  • Edge cases: Tests cover error conditions, missing files, network failures
  • Registry testing: Comprehensive tests for plugin registration and dispatch
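
To illustrate the mocking pattern described above, a hypothetical test in the same style (not one of the 25 tests in this PR; the patch target assumes url.py imports requests at module level):

```python
from unittest.mock import MagicMock, patch

from linkml_reference_validator.etl.sources.url import URLSource
from linkml_reference_validator.models import ReferenceValidationConfig


@patch("linkml_reference_validator.etl.sources.url.requests.get")
def test_url_source_returns_none_on_http_error(mock_get):
    """A non-200 response should make URLSource.fetch return None."""
    mock_get.return_value = MagicMock(status_code=404)
    config = ReferenceValidationConfig()
    assert URLSource().fetch("https://example.com/missing", config) is None
```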

📊 Final Assessment

This is a well-architected refactor that successfully modernizes the codebase with a clean plugin system. The code quality is high and follows good software engineering practices. However, security concerns around SSRF and path traversal need addressing before production use with untrusted data.

Recommendation: Approve with required security fixes - address the exception handling consistency and add input validation for production safety.


Review Tasks

  • Read project guidelines (CLAUDE.md)
  • Analyze core plugin architecture (base.py, registry)
  • Review source implementations (PMID, DOI, File, URL)
  • Evaluate security concerns and error handling
  • Check test coverage and quality
  • Review documentation updates
  • Provide comprehensive feedback and recommendations

- Update to use lowercase `url:` prefix (matches implementation)
- Remove references to BeautifulSoup HTML conversion (we keep raw content)
- Add file reference type to comparison table
- Add cross-reference to use-local-files-and-urls.md
- Add to mkdocs.yml navigation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@claude
Contributor

claude bot commented Dec 17, 2025

Claude finished @cmungall's task


(Identical review to the one posted above.)

@dragon-ai-agent merged commit c8cab33 into main on Dec 17, 2025
5 checks passed