Add plugin-based reference sources with file and URL support #14
Conversation
Pull request overview
This PR introduces a plugin-based architecture for reference sources, replacing the previous if/elif dispatch pattern with a more extensible registry system. It adds support for local file references and URL references, alongside existing PMID and DOI support, and introduces a reference_base_dir configuration option for relative path resolution.
Key changes:
- Introduced `ReferenceSource` abstract base class and `ReferenceSourceRegistry` for plugin management (a rough sketch of this shape follows the list)
- Extracted PMID and DOI fetching logic into dedicated source classes (`PMIDSource`, `DOISource`)
- Added new `FileSource` for local file references (markdown, text, HTML)
- Added new `URLSource` for web page references
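For orientation, here is a minimal sketch of what a prefix-keyed source registry of this kind can look like. The real `ReferenceSource` and `ReferenceSourceRegistry` live in `sources/base.py` and use `ReferenceValidationConfig` and `ReferenceContent` as their types; beyond the `prefix()` and `fetch()` methods named in this PR, the signatures and the registry's lookup method shown here are illustrative, not the actual implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ReferenceSource(ABC):
    """A pluggable source that knows how to fetch one kind of reference."""

    @abstractmethod
    def prefix(self) -> str:
        """Identifier prefix handled by this source, e.g. 'PMID', 'DOI', 'file', 'url'."""

    @abstractmethod
    def fetch(self, identifier: str, config: object) -> Optional[object]:
        """Return fetched reference content, or None if the reference is unavailable."""


class ReferenceSourceRegistry:
    """Dispatches a reference ID such as 'DOI:10.1234/test' to the matching source."""

    def __init__(self) -> None:
        self._sources: Dict[str, ReferenceSource] = {}

    def register(self, source: ReferenceSource) -> None:
        self._sources[source.prefix().lower()] = source

    def get_source(self, reference_id: str) -> Optional[ReferenceSource]:
        # Split "file:./notes.md" into prefix "file" and the rest, then look up the plugin
        prefix, _, _ = reference_id.partition(":")
        return self._sources.get(prefix.lower())
```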
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/linkml_reference_validator/etl/sources/base.py | Defines the plugin architecture with ReferenceSource ABC and ReferenceSourceRegistry |
| src/linkml_reference_validator/etl/sources/__init__.py | Imports and registers all source plugins |
| src/linkml_reference_validator/etl/sources/pmid.py | Extracted PMID fetching logic from ReferenceFetcher with proper error handling |
| src/linkml_reference_validator/etl/sources/doi.py | Extracted DOI fetching logic from ReferenceFetcher |
| src/linkml_reference_validator/etl/sources/file.py | New plugin for reading local file references |
| src/linkml_reference_validator/etl/sources/url.py | New plugin for fetching web URL references |
| src/linkml_reference_validator/etl/reference_fetcher.py | Refactored to use plugin registry, removed direct PMID/DOI logic |
| src/linkml_reference_validator/models.py | Added reference_base_dir configuration field |
| tests/test_sources.py | Comprehensive tests for all source plugins (25 new tests) |
| tests/test_reference_fetcher.py | Updated tests to work with plugin architecture |
| docs/quickstart.md | Added examples for file and URL references |
| docs/index.md | Updated to mention multiple reference sources |
| docs/how-to/use-local-files-and-urls.md | New guide for file and URL references |
| docs/how-to/add-reference-source.md | New guide for creating custom source plugins |
| docs/concepts/how-it-works.md | Expanded to document all reference source types |
| mkdocs.yml | Added new how-to guide entries |
```python
response = requests.get(url, headers=headers, timeout=30)
if response.status_code != 200:
    logger.warning(f"Failed to fetch DOI:{doi} - status {response.status_code}")
    return None

data = response.json()
```
Copilot
AI
Dec 16, 2025
Missing error handling for network request. The requests.get call can raise various exceptions (ConnectionError, Timeout, RequestException, etc.). PMIDSource wraps similar API calls in try-except blocks for consistency. Consider wrapping the request and JSON parsing in a try-except block that returns None and logs a warning on failure, similar to how PMIDSource handles Entrez API calls at line 107-113 of pmid.py.
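A minimal sketch of the kind of guard being suggested, written as a standalone helper rather than the actual `DOISource` code (the `_get_json` name and log wording are illustrative):

```python
import logging
from typing import Any, Optional

import requests

logger = logging.getLogger(__name__)


def _get_json(url: str, headers: dict, label: str) -> Optional[Any]:
    """GET a URL and decode JSON, returning None and logging a warning on any failure."""
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # turn non-2xx statuses into HTTPError
        return response.json()
    except (requests.RequestException, ValueError) as exc:
        # RequestException covers ConnectionError/Timeout/HTTPError; ValueError covers malformed JSON
        logger.warning(f"Failed to fetch {label} - {exc}")
        return None
```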
```python
response = requests.get(url, headers=headers, timeout=30)
if response.status_code != 200:
    logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
    return None
```
Copilot
AI
Dec 16, 2025
Missing error handling for network request. The requests.get call can raise various exceptions (ConnectionError, Timeout, RequestException, etc.). For consistency with PMIDSource which wraps API calls in try-except blocks (see pmid.py:107-113), consider wrapping the HTTP request in a try-except block that returns None and logs a warning on failure.
```python
content = file_path.read_text(encoding="utf-8")
title = self._extract_title(content, file_path)
```
Copilot
AI
Dec 16, 2025
Missing error handling for file reading. The read_text call can raise various exceptions (PermissionError, UnicodeDecodeError, OSError, etc.). For consistency with PMIDSource which wraps external operations in try-except blocks (see pmid.py:107-113), consider wrapping the file read operation in a try-except block that returns None and logs a warning on failure.
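A sketch of the equivalent guard for the file read, again as a standalone helper with an illustrative name (`_read_reference_file`) rather than the actual `FileSource` code:

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def _read_reference_file(file_path: Path) -> Optional[str]:
    """Read a local reference file as UTF-8, returning None and logging a warning on failure."""
    try:
        return file_path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        # OSError covers FileNotFoundError and PermissionError; UnicodeDecodeError covers bad encodings
        logger.warning(f"Failed to read reference file {file_path}: {exc}")
        return None
```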
```python
def fetch(
    self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
    """Fetch a publication from Crossref by DOI.

    Args:
        identifier: DOI (without prefix)
        config: Configuration including rate limiting and email

    Returns:
        ReferenceContent if successful, None otherwise

    Examples:
        >>> from linkml_reference_validator.models import ReferenceValidationConfig
        >>> config = ReferenceValidationConfig()
        >>> source = DOISource()
        >>> # Would fetch in real usage:
        >>> # ref = source.fetch("10.1234/test", config)
    """
    doi = identifier.strip()
    time.sleep(config.rate_limit_delay)

    url = f"https://api.crossref.org/works/{doi}"
    headers = {
        "User-Agent": f"linkml-reference-validator/1.0 (mailto:{config.email})",
    }

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        logger.warning(f"Failed to fetch DOI:{doi} - status {response.status_code}")
        return None

    data = response.json()
    if data.get("status") != "ok":
        logger.warning(f"Crossref API error for DOI:{doi}")
        return None

    message = data.get("message", {})

    title_list = message.get("title", [])
    title = title_list[0] if title_list else ""

    authors = self._parse_crossref_authors(message.get("author", []))

    container_title = message.get("container-title", [])
    journal = container_title[0] if container_title else ""

    year = self._extract_crossref_year(message)

    abstract = self._clean_abstract(message.get("abstract", ""))

    return ReferenceContent(
        reference_id=f"DOI:{doi}",
        title=title,
        content=abstract if abstract else None,
        content_type="abstract_only" if abstract else "unavailable",
        authors=authors,
        journal=journal,
        year=year,
        doi=doi,
    )
```
Copilot
AI
Dec 16, 2025
The implementation doesn't follow the error handling guidance in docs/how-to/add-reference-source.md (Step 3, lines 97-109), which recommends wrapping external API calls in try-except blocks. This creates a discrepancy between the documented best practices and the actual code. Consider either adding try-except error handling to match the documentation, or updating the documentation to reflect that basic HTTP status checking is sufficient.
```python
def fetch(
    self, identifier: str, config: ReferenceValidationConfig
) -> Optional[ReferenceContent]:
    """Fetch content from a URL.

    Args:
        identifier: URL (without 'url:' prefix)
        config: Configuration including rate limiting

    Returns:
        ReferenceContent if successful, None otherwise

    Examples:
        >>> from linkml_reference_validator.models import ReferenceValidationConfig
        >>> config = ReferenceValidationConfig()
        >>> source = URLSource()
        >>> # Would fetch in real usage:
        >>> # ref = source.fetch("https://example.com", config)
    """
    url = identifier.strip()
    time.sleep(config.rate_limit_delay)

    headers = {
        "User-Agent": f"linkml-reference-validator/1.0 (mailto:{config.email})",
    }

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
        return None

    content = response.text
    title = self._extract_title(content, url)

    return ReferenceContent(
        reference_id=f"url:{url}",
        title=title,
        content=content,
        content_type="url",
    )
```
Copilot
AI
Dec 16, 2025
The implementation doesn't follow the error handling guidance in docs/how-to/add-reference-source.md (Step 3, lines 97-109), which recommends wrapping external API calls in try-except blocks. This creates a discrepancy between the documented best practices and the actual code. Consider either adding try-except error handling to match the documentation, or updating the documentation to reflect that basic HTTP status checking is sufficient.
```yaml
supporting_text: Climate impacts are accelerating
reference: url:https://example.org/climate-report.html
```
Copilot
AI
Dec 16, 2025
Using the url: reference type in LinkML data files means that reference IDs coming from untrusted or semi-trusted datasets can cause the validator to issue HTTP requests to arbitrary hosts, creating a server-side request forgery (SSRF) risk. In automated validation pipelines or services that run this tool on externally supplied data, an attacker could set reference: url:... to target internal services or cloud metadata endpoints, causing unexpected network access from inside your environment. To mitigate this, document and enforce restrictions such as limiting allowed URL schemes/hosts, or disabling url: references entirely when validating untrusted data.
> ⚠️ **Security Warning:**
> Using the `url:` reference type means that reference IDs coming from untrusted or semi-trusted datasets can cause the validator to issue HTTP requests to arbitrary hosts, creating a server-side request forgery (SSRF) risk.
> In automated validation pipelines or services that run this tool on externally supplied data, an attacker could set `reference: url:...` to target internal services or cloud metadata endpoints, causing unexpected network access from inside your environment.
>
> **To mitigate this risk:**
> - Limit allowed URL schemes/hosts when validating untrusted data
> - Or disable `url:` references entirely in such contexts
> - Always review and sanitize input data sources
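One possible shape for such a restriction, sketched as a standalone check (the function name and the explicit host allowlist are illustrative suggestions, not something this PR implements):

```python
from urllib.parse import urlparse


def is_allowed_url(url: str, allowed_hosts: set) -> bool:
    """Allow only http(s) URLs whose host is on an explicit allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return parsed.hostname is not None and parsed.hostname.lower() in allowed_hosts


allowed = {"example.org"}
is_allowed_url("https://example.org/climate-report.html", allowed)   # True
is_allowed_url("http://169.254.169.254/latest/meta-data/", allowed)  # False: cloud metadata endpoint
```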
Refactor reference fetching from if/elif dispatch to a plugin architecture:

- Add ReferenceSource base class and ReferenceSourceRegistry
- Extract PMID and DOI logic to dedicated source classes
- Add FileSource for local file references (file:./path.md)
- Add URLSource for web page references (url:https://...)
- Add reference_base_dir config for relative path resolution

Documentation:

- Add how-to guide for local files and URLs
- Add how-to guide for creating new reference sources
- Update quickstart and concepts docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Force-pushed from 1a00b30 to 87ee0f0
Claude finished @cmungall's task — View job

**Pull Request Review**

**Overview**

This PR successfully refactors the reference fetching system from an if/elif dispatch to a clean plugin architecture, adding support for local files and URLs. The implementation follows good software engineering principles with a well-designed abstract base class and registry pattern.

**✅ Strengths**

- Architecture & Design
- Implementation Quality
- Code Quality
- Update to use lowercase `url:` prefix (matches implementation)
- Remove references to BeautifulSoup HTML conversion (we keep raw content)
- Add file reference type to comparison table
- Add cross-reference to use-local-files-and-urls.md
- Add to mkdocs.yml navigation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
- Add `FileSource` for local file references (`file:./path.md`)
- Add `URLSource` for web page references (`url:https://example.com`)
- Add `reference_base_dir` config option for relative path resolution (illustrated below)

Changes
New Plugin Architecture
- `ReferenceSource` abstract base class with `prefix()` and `fetch()` methods
- `ReferenceSourceRegistry` for discovering and dispatching to sources
- PMID and DOI logic extracted into dedicated `PMIDSource` and `DOISource` classes

New Reference Types
| Prefix | Example |
|---|---|
| `file:` | `file:./notes.md` |
| `url:` | `url:https://example.com` |

Documentation

- Add how-to guide for local files and URLs
- Add how-to guide for creating new reference sources
- Update quickstart and concepts docs
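As a small illustration of how the `file:` type and the `reference_base_dir` option from the Summary fit together (a sketch only; the exact resolution logic inside `FileSource`, and whether `reference_base_dir` is a string or a path, may differ):

```python
from pathlib import Path

from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig(reference_base_dir="docs/references")

# "file:./notes.md" -> strip the "file:" prefix, then resolve the relative path
# against the configured base directory
identifier = "./notes.md"
resolved = Path(config.reference_base_dir) / identifier
print(resolved)  # docs/references/notes.md
```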
Test plan
🤖 Generated with Claude Code