
Conversation

@cmungall
Member

Summary

  • Refactors reference fetching from if/elif dispatch to a plugin architecture
  • Adds support for local file references (file:./path.md)
  • Adds support for URL references (url:https://example.com)
  • Adds reference_base_dir config option for relative path resolution

Changes

New Plugin Architecture

  • ReferenceSource abstract base class with prefix() and fetch() methods
  • ReferenceSourceRegistry for discovering and dispatching to sources (see the sketch below)
  • Existing PMID and DOI logic extracted to PMIDSource and DOISource
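
A minimal sketch of the plugin interface listed above, for orientation. The class and method names (ReferenceSource, ReferenceSourceRegistry, prefix(), fetch()) follow this summary, but the bodies, the registry method names, and the prefix matching shown here are assumptions; the actual can_handle() reportedly uses regex-based prefix matching.

```python
# Illustrative sketch only, not the PR's actual code. ReferenceContent is
# simplified, and register()/find_source() are hypothetical registry names.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceContent:
    """Simplified stand-in for the fetched-reference model in models.py."""
    reference_id: str
    title: str = ""
    content: Optional[str] = None


class ReferenceSource(ABC):
    """A plugin that knows how to fetch one kind of reference (PMID, DOI, file, url, ...)."""

    @abstractmethod
    def prefix(self) -> str:
        """Prefix handled by this source, e.g. 'file' or 'url'."""

    @abstractmethod
    def fetch(self, identifier: str, config) -> Optional[ReferenceContent]:
        """Fetch the reference content, or return None on failure."""

    def can_handle(self, reference_id: str) -> bool:
        # Plain prefix check for brevity; the PR's can_handle() uses regex-based matching.
        return reference_id.lower().startswith(self.prefix().lower() + ":")


class ReferenceSourceRegistry:
    """Collects sources and dispatches a reference ID to the first source that can handle it."""

    def __init__(self) -> None:
        self._sources: list[ReferenceSource] = []

    def register(self, source: ReferenceSource) -> None:
        self._sources.append(source)

    def find_source(self, reference_id: str) -> Optional[ReferenceSource]:
        return next((s for s in self._sources if s.can_handle(reference_id)), None)
```

With this shape, adding a new reference type means implementing one subclass and registering it, rather than extending an if/elif chain.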

New Reference Types

| Prefix  | Example                   | Source           |
| ------- | ------------------------- | ---------------- |
| `file:` | `file:./notes.md`         | Local filesystem |
| `url:`  | `url:https://example.com` | Web (HTTP/HTTPS) |
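
Relative `file:` paths resolve against the new reference_base_dir option. A minimal sketch of that resolution, assuming absolute paths are used as-is and a missing base directory falls back to the current working directory (both assumptions, not confirmed by the PR):

```python
from pathlib import Path
from typing import Optional


def resolve_file_reference(identifier: str, reference_base_dir: Optional[str]) -> Path:
    """Resolve the part after the 'file:' prefix, e.g. './notes.md'."""
    path = Path(identifier)
    if path.is_absolute() or reference_base_dir is None:
        return path  # absolute paths (or no configured base) are used as given
    return Path(reference_base_dir) / path


# e.g. resolve_file_reference("./notes.md", "docs/references") -> docs/references/notes.md
```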

Documentation

  • New how-to: Using Local Files and URLs
  • New how-to: Adding a New Reference Source
  • Updated quickstart and concepts docs

Test plan

  • All 231 existing tests pass
  • New tests for FileSource and URLSource (25 tests)
  • mypy passes with no issues
  • ruff passes with no issues
  • Doctests pass

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings December 16, 2025 23:35
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a plugin-based architecture for reference sources, replacing the previous if/elif dispatch pattern with a more extensible registry system. It adds support for local file references and URL references, alongside existing PMID and DOI support, and introduces a reference_base_dir configuration option for relative path resolution.

Key changes:

  • Introduced ReferenceSource abstract base class and ReferenceSourceRegistry for plugin management
  • Extracted PMID and DOI fetching logic into dedicated source classes (PMIDSource, DOISource)
  • Added new FileSource for local file references (markdown, text, HTML)
  • Added new URLSource for web page references

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.

Summary per file

| File | Description |
| ---- | ----------- |
| src/linkml_reference_validator/etl/sources/base.py | Defines the plugin architecture with ReferenceSource ABC and ReferenceSourceRegistry |
| src/linkml_reference_validator/etl/sources/__init__.py | Imports and registers all source plugins |
| src/linkml_reference_validator/etl/sources/pmid.py | Extracted PMID fetching logic from ReferenceFetcher with proper error handling |
| src/linkml_reference_validator/etl/sources/doi.py | Extracted DOI fetching logic from ReferenceFetcher |
| src/linkml_reference_validator/etl/sources/file.py | New plugin for reading local file references |
| src/linkml_reference_validator/etl/sources/url.py | New plugin for fetching web URL references |
| src/linkml_reference_validator/etl/reference_fetcher.py | Refactored to use plugin registry, removed direct PMID/DOI logic |
| src/linkml_reference_validator/models.py | Added reference_base_dir configuration field |
| tests/test_sources.py | Comprehensive tests for all source plugins (25 new tests) |
| tests/test_reference_fetcher.py | Updated tests to work with plugin architecture |
| docs/quickstart.md | Added examples for file and URL references |
| docs/index.md | Updated to mention multiple reference sources |
| docs/how-to/use-local-files-and-urls.md | New guide for file and URL references |
| docs/how-to/add-reference-source.md | New guide for creating custom source plugins |
| docs/concepts/how-it-works.md | Expanded to document all reference source types |
| mkdocs.yml | Added new how-to guide entries |

Comment on lines +77 to +82

```python
response = requests.get(url, headers=headers, timeout=30)
if response.status_code != 200:
    logger.warning(f"Failed to fetch DOI:{doi} - status {response.status_code}")
    return None

data = response.json()
```

Copilot AI Dec 16, 2025


Missing error handling for network request. The requests.get call can raise various exceptions (ConnectionError, Timeout, RequestException, etc.). PMIDSource wraps similar API calls in try-except blocks for consistency. Consider wrapping the request and JSON parsing in a try-except block that returns None and logs a warning on failure, similar to how PMIDSource handles Entrez API calls at lines 107-113 of pmid.py.
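
A minimal sketch of the wrapping suggested here (illustrative only; the helper name, the exact exceptions caught, and the logging wording are assumptions, not code from the PR):

```python
import logging
from typing import Any, Optional

import requests

logger = logging.getLogger(__name__)


def get_json_or_none(url: str, headers: dict, timeout: int = 30) -> Optional[Any]:
    """GET a URL and decode JSON, returning None and logging a warning on any failure."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        if response.status_code != 200:
            logger.warning(f"Failed to fetch {url} - status {response.status_code}")
            return None
        return response.json()
    except requests.RequestException as e:
        logger.warning(f"Request to {url} failed: {e}")
        return None
    except ValueError as e:  # response body was not valid JSON
        logger.warning(f"Invalid JSON from {url}: {e}")
        return None
```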

Comment on lines +77 to +80

```python
response = requests.get(url, headers=headers, timeout=30)
if response.status_code != 200:
    logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
    return None
```

Copilot AI Dec 16, 2025


Missing error handling for network request. The requests.get call can raise various exceptions (ConnectionError, Timeout, RequestException, etc.). For consistency with PMIDSource which wraps API calls in try-except blocks (see pmid.py:107-113), consider wrapping the HTTP request in a try-except block that returns None and logs a warning on failure.

Comment on lines +82 to +83

```python
content = file_path.read_text(encoding="utf-8")
title = self._extract_title(content, file_path)
```

Copilot AI Dec 16, 2025


Missing error handling for file reading. The read_text call can raise various exceptions (PermissionError, UnicodeDecodeError, OSError, etc.). For consistency with PMIDSource which wraps external operations in try-except blocks (see pmid.py:107-113), consider wrapping the file read operation in a try-except block that returns None and logs a warning on failure.
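
A sketch of the equivalent guard for the file read (again illustrative; the exception tuple and return behaviour are assumptions):

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_text_or_none(file_path: Path) -> Optional[str]:
    """Read a local reference file, returning None and logging a warning if it cannot be read."""
    try:
        return file_path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as e:  # OSError covers PermissionError, FileNotFoundError, ...
        logger.warning(f"Failed to read {file_path}: {e}")
        return None
```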

Comment on lines +50 to +110

```python
def fetch(
    self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
    """Fetch a publication from Crossref by DOI.

    Args:
        identifier: DOI (without prefix)
        config: Configuration including rate limiting and email

    Returns:
        ReferenceContent if successful, None otherwise

    Examples:
        >>> from linkml_reference_validator.models import ReferenceValidationConfig
        >>> config = ReferenceValidationConfig()
        >>> source = DOISource()
        >>> # Would fetch in real usage:
        >>> # ref = source.fetch("10.1234/test", config)
    """
    doi = identifier.strip()
    time.sleep(config.rate_limit_delay)

    url = f"https://api.crossref.org/works/{doi}"
    headers = {
        "User-Agent": f"linkml-reference-validator/1.0 (mailto:{config.email})",
    }

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        logger.warning(f"Failed to fetch DOI:{doi} - status {response.status_code}")
        return None

    data = response.json()
    if data.get("status") != "ok":
        logger.warning(f"Crossref API error for DOI:{doi}")
        return None

    message = data.get("message", {})

    title_list = message.get("title", [])
    title = title_list[0] if title_list else ""

    authors = self._parse_crossref_authors(message.get("author", []))

    container_title = message.get("container-title", [])
    journal = container_title[0] if container_title else ""

    year = self._extract_crossref_year(message)

    abstract = self._clean_abstract(message.get("abstract", ""))

    return ReferenceContent(
        reference_id=f"DOI:{doi}",
        title=title,
        content=abstract if abstract else None,
        content_type="abstract_only" if abstract else "unavailable",
        authors=authors,
        journal=journal,
        year=year,
        doi=doi,
    )
```

Copilot AI Dec 16, 2025


The implementation doesn't follow the error handling guidance in docs/how-to/add-reference-source.md (Step 3, lines 97-109), which recommends wrapping external API calls in try-except blocks. This creates a discrepancy between the documented best practices and the actual code. Consider either adding try-except error handling to match the documentation, or updating the documentation to reflect that basic HTTP status checking is sufficient.

Comment on lines +51 to +90

```python
def fetch(
    self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
    """Fetch content from a URL.

    Args:
        identifier: URL (without 'url:' prefix)
        config: Configuration including rate limiting

    Returns:
        ReferenceContent if successful, None otherwise

    Examples:
        >>> from linkml_reference_validator.models import ReferenceValidationConfig
        >>> config = ReferenceValidationConfig()
        >>> source = URLSource()
        >>> # Would fetch in real usage:
        >>> # ref = source.fetch("https://example.com", config)
    """
    url = identifier.strip()
    time.sleep(config.rate_limit_delay)

    headers = {
        "User-Agent": f"linkml-reference-validator/1.0 (mailto:{config.email})",
    }

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
        return None

    content = response.text
    title = self._extract_title(content, url)

    return ReferenceContent(
        reference_id=f"url:{url}",
        title=title,
        content=content,
        content_type="url",
    )
```

Copilot AI Dec 16, 2025


The implementation doesn't follow the error handling guidance in docs/how-to/add-reference-source.md (Step 3, lines 97-109), which recommends wrapping external API calls in try-except blocks. This creates a discrepancy between the documented best practices and the actual code. Consider either adding try-except error handling to match the documentation, or updating the documentation to reflect that basic HTTP status checking is sufficient.

```yaml
supporting_text: Climate impacts are accelerating
reference: url:https://example.org/climate-report.html
```


Copilot AI Dec 16, 2025


Using the url: reference type in LinkML data files means that reference IDs coming from untrusted or semi-trusted datasets can cause the validator to issue HTTP requests to arbitrary hosts, creating a server-side request forgery (SSRF) risk. In automated validation pipelines or services that run this tool on externally supplied data, an attacker could set reference: url:... to target internal services or cloud metadata endpoints, causing unexpected network access from inside your environment. To mitigate this, document and enforce restrictions such as limiting allowed URL schemes/hosts, or disabling url: references entirely when validating untrusted data.

Suggested change
> ⚠️ **Security Warning:**
> Using the `url:` reference type means that reference IDs coming from untrusted or semi-trusted datasets can cause the validator to issue HTTP requests to arbitrary hosts, creating a server-side request forgery (SSRF) risk.
> In automated validation pipelines or services that run this tool on externally supplied data, an attacker could set `reference: url:...` to target internal services or cloud metadata endpoints, causing unexpected network access from inside your environment.
>
> **To mitigate this risk:**
> - Limit allowed URL schemes/hosts when validating untrusted data
> - Or disable `url:` references entirely in such contexts
> - Always review and sanitize input data sources
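
A minimal sketch of the scheme/host restriction recommended here (the helper and the allowed_hosts setting are hypothetical, not part of this PR):

```python
from typing import Optional, Set
from urllib.parse import urlparse


def is_allowed_url(url: str, allowed_hosts: Optional[Set[str]] = None) -> bool:
    """Accept only http(s) URLs whose host appears on an explicit allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if not allowed_hosts:
        return False  # safe default when validating untrusted data: allow nothing
    return parsed.hostname in allowed_hosts


# is_allowed_url("https://example.org/climate-report.html", {"example.org"})  -> True
# is_allowed_url("http://169.254.169.254/latest/meta-data/", {"example.org"}) -> False
```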

Refactor reference fetching from if/elif dispatch to a plugin architecture:
- Add ReferenceSource base class and ReferenceSourceRegistry
- Extract PMID and DOI logic to dedicated source classes
- Add FileSource for local file references (file:./path.md)
- Add URLSource for web page references (url:https://...)
- Add reference_base_dir config for relative path resolution

Documentation:
- Add how-to guide for local files and URLs
- Add how-to guide for creating new reference sources
- Update quickstart and concepts docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@cmungall force-pushed the feature/plugin-reference-sources branch from 1a00b30 to 87ee0f0 on December 17, 2025 01:11
@claude
Contributor

claude bot commented Dec 17, 2025

Claude finished @cmungall's task


Pull Request Review

Overview

This PR successfully refactors the reference fetching system from an if/elif dispatch to a clean plugin architecture, adding support for local files and URLs. The implementation follows good software engineering principles with a well-designed abstract base class and registry pattern.

Strengths

Architecture & Design

  • Excellent plugin architecture: The ReferenceSource ABC and ReferenceSourceRegistry provide a clean, extensible foundation that follows SOLID principles
  • Clean separation of concerns: Each source type is isolated in its own module with focused responsibilities
  • Good abstractions: The can_handle() method with regex-based prefix matching is elegant and flexible
  • Comprehensive documentation: Extensive docstrings with doctests throughout the codebase

Implementation Quality

  • Strong test coverage: 25 new tests covering all source types with good edge case coverage
  • Type safety: Proper type hints throughout and passes mypy checks
  • Path resolution logic: Smart handling of absolute vs relative paths with configurable base directory
  • Content extraction: Thoughtful title extraction from HTML and markdown files

Code Quality

  • Follows project conventions: Adheres to CLAUDE.md guidelines with proper docstrings and test-driven approach
  • Good error handling patterns: Appropriate logging for debugging and troubleshooting
  • Clean migration: Successfully extracted PMID/DOI logic without breaking existing functionality

⚠️ Critical Security Concerns

1. Server-Side Request Forgery (SSRF) Risk

  • Issue: The URL source allows arbitrary HTTP requests to any host, creating SSRF vulnerability when processing untrusted data
  • Impact: In automated pipelines, malicious reference IDs like url:http://169.254.169.254/ could target cloud metadata services or internal networks
  • Recommendation: Add URL validation/allowlisting, especially for production deployments

2. Path Traversal Risk

  • Issue: File source accepts arbitrary paths without validation
  • Impact: file:../../../etc/passwd could read sensitive files outside intended directories
  • Recommendation: Add path validation to restrict access to intended directories only
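
A minimal sketch of the containment check recommended above (hypothetical helper; real code would also need to decide how symlinks and a missing reference_base_dir are handled):

```python
from pathlib import Path
from typing import Optional


def resolve_within_base(identifier: str, base_dir: str) -> Optional[Path]:
    """Resolve a 'file:' identifier and reject anything that escapes base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / identifier).resolve()
    try:
        candidate.relative_to(base)  # raises ValueError if candidate is outside base
    except ValueError:
        return None
    return candidate


# resolve_within_base("notes.md", "/data/refs")            -> /data/refs/notes.md
# resolve_within_base("../../../etc/passwd", "/data/refs") -> None
```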

⚠️ Error Handling Inconsistencies

Missing Exception Handling (as noted by Copilot reviewer):

  • URLSource.fetch() line 80: Missing try/catch for requests.get() and response.json()
  • DOISource.fetch() line 82: Same issue with unhandled network exceptions
  • FileSource.fetch() line 83: Missing try/catch for read_text() operations

Inconsistency with Documentation:

  • docs/how-to/add-reference-source.md recommends try/catch blocks (lines 99-109)
  • But URLSource and DOISource don't follow this pattern, unlike PMIDSource
  • Recommendation: Either add exception handling or update documentation to match actual implementation

⚠️ Code Quality Issues

CLAUDE.md Compliance:

  • Issue: Project guidelines state "AVOID try/except blocks, except when interfacing with external systems" (lines 22-24)
  • Conflict: New sources interface with external systems but lack proper exception handling
  • Recommendation: Add try/catch blocks for external API calls as these are the intended use case

📝 Minor Improvements

Performance:

  • Consider adding request timeouts and retry logic for robustness
  • URL source could benefit from content-type validation
  • File source could add file size limits for large files

Documentation:

  • Security warnings needed in docs/how-to/use-local-files-and-urls.md about SSRF risks
  • Consider adding configuration examples for production deployments

Test Coverage Assessment

  • Excellent coverage: 25 new tests cover all source types
  • Good mocking: Proper use of unittest.mock for external dependencies (a hypothetical example follows this list)
  • Edge cases: Tests cover error conditions, missing files, network failures
  • Registry testing: Comprehensive tests for plugin registration and dispatch
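
To illustrate the mocking pattern described above, a hypothetical test in the same style (not one of the 25 tests in this PR; the patch target assumes url.py imports requests at module level):

```python
from unittest.mock import MagicMock, patch

from linkml_reference_validator.etl.sources.url import URLSource
from linkml_reference_validator.models import ReferenceValidationConfig


@patch("linkml_reference_validator.etl.sources.url.requests.get")
def test_url_source_returns_none_on_http_error(mock_get):
    """A non-200 response should make URLSource.fetch return None."""
    mock_get.return_value = MagicMock(status_code=404)
    config = ReferenceValidationConfig()
    assert URLSource().fetch("https://example.com/missing", config) is None
```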

📊 Final Assessment

This is a well-architected refactor that successfully modernizes the codebase with a clean plugin system. The code quality is high and follows good software engineering practices. However, security concerns around SSRF and path traversal need addressing before production use with untrusted data.

Recommendation: Approve with required security fixes - address the exception handling consistency and add input validation for production safety.


Review Tasks

  • Read project guidelines (CLAUDE.md)
  • Analyze core plugin architecture (base.py, registry)
  • Review source implementations (PMID, DOI, File, URL)
  • Evaluate security concerns and error handling
  • Check test coverage and quality
  • Review documentation updates
  • Provide comprehensive feedback and recommendations

- Update to use lowercase `url:` prefix (matches implementation)
- Remove references to BeautifulSoup HTML conversion (we keep raw content)
- Add file reference type to comparison table
- Add cross-reference to use-local-files-and-urls.md
- Add to mkdocs.yml navigation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@claude
Contributor

claude bot commented Dec 17, 2025

Claude finished @cmungall's task


(Identical review to the one posted above.)

@dragon-ai-agent merged commit c8cab33 into main on Dec 17, 2025
5 checks passed