diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
index 55cd2d1..edff926 100644
--- a/docs/concepts/how-it-works.md
+++ b/docs/concepts/how-it-works.md
@@ -68,14 +68,48 @@ We don't use LLMs or semantic similarity because:
## Reference Fetching
+The validator supports multiple reference types:
+
### PubMed (PMID)
For `PMID:12345678`:
1. Queries NCBI E-utilities API
2. Fetches abstract and metadata
-3. Parses XML response with BeautifulSoup
-4. Caches as markdown with YAML frontmatter
+3. Attempts to retrieve full-text from PMC if available
+4. Parses XML response with BeautifulSoup
+5. Caches as markdown with YAML frontmatter
+
+### DOI (Digital Object Identifier)
+
+For `DOI:10.1234/journal.article`:
+
+1. Queries Crossref API for metadata
+2. Fetches abstract and bibliographic information
+3. Extracts title, authors, journal, year
+4. Caches abstract and metadata as markdown
+
+### URLs
+
+For `URL:https://example.com/page` or `https://example.com/page`:
+
+1. Makes HTTP GET request to fetch web page
+2. Extracts title from the `<title>` tag
+3. Converts HTML to plain text (removes scripts, styles, navigation)
+4. Normalizes whitespace
+5. Caches as markdown with content type `html_converted`
+
+**Use cases for URLs:**
+- Online book chapters
+- Educational resources
+- Documentation pages
+- Any static web content
+
+**Limitations:**
+- Works best with static HTML content
+- Does not execute JavaScript
+- Cannot access content behind authentication
+- Complex dynamic pages may not extract well
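+
+A URL reference sits in a data file like any other reference type (the format is shown in detail in the how-to guide):
+
+```yaml
+description:
+  value: "Supporting text quoted from the page"
+  references:
+    - "URL:https://example.com/page"
+```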
### PubMed Central (PMC)
diff --git a/docs/how-to/validate-urls.md b/docs/how-to/validate-urls.md
new file mode 100644
index 0000000..a85f617
--- /dev/null
+++ b/docs/how-to/validate-urls.md
@@ -0,0 +1,253 @@
+# Validating URL References
+
+This guide explains how to validate references that use URLs instead of traditional identifiers like PMIDs or DOIs.
+
+## Overview
+
+The linkml-reference-validator supports validating references that point to web content, such as:
+
+- Book chapters hosted online
+- Educational resources
+- Documentation pages
+- Blog posts or articles
+- Any static web content
+
+When a reference field contains a URL, the validator:
+
+1. Fetches the web page content
+2. Extracts the page title
+3. Converts HTML to plain text
+4. Validates the extracted content against your supporting text
+
+## URL Format
+
+URLs can be specified in two ways:
+
+### Explicit URL Prefix
+
+```yaml
+my_field:
+  value: "Some text from the web page..."
+  references:
+    - "URL:https://example.com/book/chapter1"
+```
+
+### Direct URL
+
+```yaml
+my_field:
+  value: "Some text from the web page..."
+  references:
+    - "https://example.com/book/chapter1"
+```
+
+Both formats are equivalent. If a reference starts with `http://` or `https://`, it's automatically recognized as a URL reference.
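+
+Internally, both forms parse to the same `(prefix, identifier)` pair, as the fetcher's doctests show:
+
+```python
+>>> fetcher._parse_reference_id("URL:https://example.com/book/chapter1")
+('URL', 'https://example.com/book/chapter1')
+>>> fetcher._parse_reference_id("https://example.com/book/chapter1")
+('URL', 'https://example.com/book/chapter1')
+```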
+
+## Example
+
+Suppose you have an online textbook chapter at `https://example.com/biology/cell-structure` with the following content:
+
+```html
+<html>
+<head>
+  <title>Chapter 3: Cell Structure and Function</title>
+</head>
+<body>
+  <h1>Cell Structure and Function</h1>
+  <p>The cell is the basic structural and functional unit of all living organisms.</p>
+  <p>Cells contain various organelles that perform specific functions...</p>
+</body>
+</html>
+```
+
+You can validate text extracted from this chapter:
+
+```yaml
+description:
+  value: "The cell is the basic structural and functional unit of all living organisms"
+  references:
+    - "https://example.com/biology/cell-structure"
+```
+
+## How URL Validation Works
+
+### 1. Content Fetching
+
+When the validator encounters a URL reference, it:
+
+- Makes an HTTP GET request to fetch the page
+- Uses a polite user agent header identifying the tool
+- Respects rate limiting (configurable via `rate_limit_delay`)
+- Handles timeouts (default 30 seconds)
+
+### 2. Content Extraction
+
+The fetcher extracts content from the HTML:
+
+- **Title**: Extracted from the `<title>` tag
+- **Content**: HTML is converted to plain text using BeautifulSoup
+- **Cleanup**: Removes scripts, styles, navigation, headers, and footers
+- **Normalization**: Whitespace is normalized for better matching
+
+### 3. Content Type
+
+URL references are marked with content type `html_converted` to distinguish them from other reference types like abstracts or full-text articles.
+
+### 4. Caching
+
+Fetched URL content is cached to disk in markdown format with YAML frontmatter:
+
+```markdown
+---
+reference_id: URL:https://example.com/biology/cell-structure
+title: "Chapter 3: Cell Structure and Function"
+content_type: html_converted
+---
+
+# Chapter 3: Cell Structure and Function
+
+## Content
+
+The cell is the basic structural and functional unit of all living organisms.
+Cells contain various organelles that perform specific functions...
+```
+
+Cache files are stored in the configured cache directory (default: `.linkml-reference-validator-cache/`).
+
+## Configuration
+
+URL fetching behavior can be configured:
+
+```yaml
+# config.yaml
+rate_limit_delay: 0.5 # Wait 0.5 seconds between requests
+email: "your-email@example.com" # Used in user agent
+cache_dir: ".cache/references" # Where to cache fetched content
+```
+
+Or via command-line:
+
+```bash
+linkml-reference-validator validate \
+  --cache-dir .cache \
+  --rate-limit-delay 0.5 \
+  my-data.yaml
+```
+
+## Limitations
+
+### Static Content Only
+
+URL validation is designed for static web pages. It may not work well with:
+
+- Dynamic content loaded via JavaScript
+- Pages requiring authentication
+- Content behind paywalls
+- Frequently changing content
+
+### HTML Structure
+
+The content extraction works by:
+
+- Removing navigation, headers, and footers
+- Converting remaining HTML to text
+- Normalizing whitespace
+
+This works well for simple HTML but may not capture content perfectly from complex layouts.
+
+### No Rendering
+
+The fetcher downloads raw HTML and parses it directly. It does not:
+
+- Execute JavaScript
+- Render the page in a browser
+- Handle dynamic content
+
+## Best Practices
+
+### 1. Use Stable URLs
+
+Choose URLs that are unlikely to change:
+
+- ✅ Versioned documentation: `https://docs.example.com/v1.0/chapter1`
+- ✅ Archived content: `https://archive.example.com/2024/article`
+- ❌ Blog posts with dates that might be reorganized
+- ❌ URLs with session parameters
+
+### 2. Verify Content Quality
+
+After adding a URL reference, verify the extracted content:
+
+```bash
+# Check what was extracted
+cat .linkml-reference-validator-cache/URL_https___example.com_page.md
+```
+
+Ensure the extracted text contains the relevant information you're referencing.
+
+### 3. Cache Management
+
+- Commit cache files to version control for reproducibility
+- Use `--force-refresh` to update cached content
+- Periodically review cached URLs to ensure they're still accessible
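+
+For example, to re-fetch cached pages whose content may have changed (using the same command shape as the configuration example above):
+
+```bash
+linkml-reference-validator validate \
+  --force-refresh \
+  my-data.yaml
+```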
+
+### 4. Mix Reference Types
+
+URL references work alongside PMIDs and DOIs:
+
+```yaml
+findings:
+ value: "Multiple studies confirm this relationship"
+ references:
+ - "PMID:12345678" # Research paper
+ - "DOI:10.1234/journal.article" # Another paper
+ - "https://example.com/textbook/chapter5" # Textbook chapter
+```
+
+## Troubleshooting
+
+### URL Not Fetching
+
+If URL content isn't being fetched:
+
+1. Check network connectivity
+2. Verify the URL is accessible in a browser
+3. Check for rate limiting or IP blocks
+4. Look for error messages in the logs
+
+### Incorrect Content Extraction
+
+If the wrong content is extracted:
+
+1. Inspect the cached markdown file
+2. Check if the page uses complex JavaScript
+3. Consider if the page structure requires custom parsing
+4. File an issue with the page URL for improvement
+
+### Validation Failing
+
+If validation fails for URL references:
+
+1. Check the cached content to see what was extracted
+2. Verify your supporting text actually appears on the page
+3. Check for whitespace or formatting differences
+4. Consider if the page content has changed since caching
+
+## Comparison with Other Reference Types
+
+| Feature | PMID | DOI | URL |
+|---------|------|-----|-----|
+| Source | PubMed | Crossref | Any web page |
+| Content Type | Abstract + Full Text | Abstract | HTML converted |
+| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) |
+| Stability | High | High | Variable |
+| Access | Free for abstracts | Varies | Varies |
+| Caching | Yes | Yes | Yes |
+
+## See Also
+
+- [Validating DOIs](validate-dois.md) - For journal articles with DOIs
+- [Validating OBO Files](validate-obo-files.md) - For ontology-specific validation
+- [How It Works](../concepts/how-it-works.md) - Core validation concepts
+- [CLI Reference](../reference/cli.md) - Command-line options
diff --git a/docs/index.md b/docs/index.md
index 52be20f..ba92ac8 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -2,7 +2,7 @@
**Validate quotes and excerpts against their source publications**
-linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC and performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.
+linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC, DOIs via Crossref, and URLs, then performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.
## Key Features
diff --git a/docs/quickstart.md b/docs/quickstart.md
index 5962b70..c4a9b90 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -86,6 +86,33 @@ linkml-reference-validator validate text \
This works the same way as PMID validation - the reference is fetched and cached locally.
+## Validate Against a URL
+
+For online resources like book chapters, documentation, or educational content:
+
+```bash
+linkml-reference-validator validate text \
+  "The cell is the basic structural and functional unit of all living organisms" \
+  https://example.com/biology/cell-structure
+```
+
+Or with explicit URL prefix:
+
+```bash
+linkml-reference-validator validate text \
+  "The cell is the basic unit of life" \
+  URL:https://example.com/biology/cells
+```
+
+The validator will:
+1. Fetch the web page content
+2. Extract the title from the `<title>` tag
+3. Convert HTML to plain text (removing scripts, styles, navigation)
+4. Cache the content locally
+5. Validate your text against the extracted content
+
+**Note:** URL validation works best with static HTML pages and may not work well with JavaScript-heavy or dynamic content.
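+
+URL references can also appear in data files, using the same reference-list format as PMIDs and DOIs:
+
+```yaml
+description:
+  value: "The cell is the basic structural and functional unit of all living organisms"
+  references:
+    - "https://example.com/biology/cell-structure"
+```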
+
## Key Features
- **Automatic Caching**: References cached locally after first fetch
@@ -94,6 +121,7 @@ This works the same way as PMID validation - the reference is fetched and cached
- **Deterministic Matching**: Substring-based (not AI/fuzzy matching)
- **PubMed & PMC**: Fetches from NCBI automatically
- **DOI Support**: Fetches metadata from Crossref API
+- **URL Support**: Validates against web content (books, docs, educational resources)
## Next Steps
diff --git a/mkdocs.yml b/mkdocs.yml
index 60aaa19..d9510ed 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -33,6 +33,7 @@ nav:
- How-To Guides:
- Validating OBO Files: how-to/validate-obo-files.md
- Validating DOIs: how-to/validate-dois.md
+ - Validating URLs: how-to/validate-urls.md
- Concepts:
- How It Works: concepts/how-it-works.md
- Editorial Conventions: concepts/editorial-conventions.md
diff --git a/src/linkml_reference_validator/etl/reference_fetcher.py b/src/linkml_reference_validator/etl/reference_fetcher.py
index 1a7b90c..5fc377b 100644
--- a/src/linkml_reference_validator/etl/reference_fetcher.py
+++ b/src/linkml_reference_validator/etl/reference_fetcher.py
@@ -86,6 +86,8 @@ def fetch(self, reference_id: str, force_refresh: bool = False) -> Optional[Refe
             content = self._fetch_pmid(identifier)
         elif prefix == "DOI":
             content = self._fetch_doi(identifier)
+        elif prefix == "URL":
+            content = self._fetch_url(identifier)
         else:
             logger.warning(f"Unsupported reference type: {prefix}")
             return None
@@ -100,7 +102,7 @@ def _parse_reference_id(self, reference_id: str) -> tuple[str, str]:
         """Parse a reference ID into prefix and identifier.

         Args:
-            reference_id: Reference ID like "PMID:12345678"
+            reference_id: Reference ID like "PMID:12345678" or URL

         Returns:
             Tuple of (prefix, identifier)
@@ -114,13 +116,27 @@ def _parse_reference_id(self, reference_id: str) -> tuple[str, str]:
            ('PMID', '12345678')
            >>> fetcher._parse_reference_id("12345678")
            ('PMID', '12345678')
+            >>> fetcher._parse_reference_id("URL:https://example.com/book/chapter1")
+            ('URL', 'https://example.com/book/chapter1')
+            >>> fetcher._parse_reference_id("https://example.com/direct")
+            ('URL', 'https://example.com/direct')
         """
-        match = re.match(r"^([A-Za-z_]+)[:\s]+(.+)$", reference_id.strip())
+        stripped = reference_id.strip()
+
+        # Check if it's a direct URL (starts with http:// or https://)
+        if stripped.startswith(('http://', 'https://')):
+            return "URL", stripped
+
+        # Standard prefix:identifier format
+        match = re.match(r"^([A-Za-z_]+)[:\s]+(.+)$", stripped)
         if match:
             return match.group(1).upper(), match.group(2).strip()
-        if reference_id.strip().isdigit():
-            return "PMID", reference_id.strip()
-        return "UNKNOWN", reference_id
+
+        # Plain numeric ID defaults to PMID
+        if stripped.isdigit():
+            return "PMID", stripped
+
+        return "UNKNOWN", stripped

     def _fetch_pmid(self, pmid: str) -> Optional[ReferenceContent]:
         """Fetch a publication from PubMed by PMID.
@@ -234,6 +250,65 @@ def _fetch_doi(self, doi: str) -> Optional[ReferenceContent]:
             doi=doi,
         )

+    def _fetch_url(self, url: str) -> Optional[ReferenceContent]:
+        """Fetch content from a URL.
+
+        Fetches web content, extracts the title, and converts HTML to text.
+        Intended for static pages like book chapters.
+
+        Args:
+            url: The URL to fetch
+
+        Returns:
+            ReferenceContent if successful, None otherwise
+
+        Examples:
+            >>> config = ReferenceValidationConfig()
+            >>> fetcher = ReferenceFetcher(config)
+            >>> # Would fetch in real usage:
+            >>> # ref = fetcher._fetch_url("https://example.com/book/chapter1")
+        """
+        time.sleep(self.config.rate_limit_delay)
+
+        headers = {
+            "User-Agent": f"linkml-reference-validator/1.0 (mailto:{self.config.email})",
+        }
+
+        try:
+            response = requests.get(url, headers=headers, timeout=30)
+            if response.status_code != 200:
+                logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}")
+                return None
+
+            soup = BeautifulSoup(response.text, "html.parser")
+
+            # Extract the page title, if present
+            title_tag = soup.find("title")
+            title = title_tag.get_text().strip() if title_tag else None
+
+            # Strip non-content elements before converting HTML to text
+            for element in soup(["script", "style", "nav", "header", "footer"]):
+                element.decompose()
+
+            # Get the remaining text content
+            content = soup.get_text()
+
+            # Normalize whitespace: strip each line and drop blank lines
+            lines = (line.strip() for line in content.splitlines())
+            content = "\n".join(line for line in lines if line)
+
+            return ReferenceContent(
+                reference_id=f"URL:{url}",
+                title=title,
+                content=content if content else None,
+                content_type="html_converted",
+            )
+
+        except Exception as e:
+            logger.error(f"Error fetching URL:{url}: {e}")
+            return None
+
     def _parse_crossref_authors(self, authors: list) -> list[str]:
         """Parse author list from Crossref response.
@@ -466,8 +541,11 @@ def _get_cache_path(self, reference_id: str) -> Path:
            >>> path = fetcher._get_cache_path("PMID:12345678")
            >>> path.name
            'PMID_12345678.md'
+            >>> path = fetcher._get_cache_path("URL:https://example.com/book/chapter1")
+            >>> path.name
+            'URL_https___example.com_book_chapter1.md'
         """
-        safe_id = reference_id.replace(":", "_").replace("/", "_")
+        safe_id = reference_id.replace(":", "_").replace("/", "_").replace("?", "_").replace("=", "_")
         cache_dir = self.config.get_cache_dir()
         return cache_dir / f"{safe_id}.md"
diff --git a/tests/test_reference_fetcher.py b/tests/test_reference_fetcher.py
index ee9673d..a78c8d2 100644
--- a/tests/test_reference_fetcher.py
+++ b/tests/test_reference_fetcher.py
@@ -308,3 +308,144 @@ def test_save_and_load_doi_from_disk(mock_get, fetcher, tmp_path):
     assert result2.reference_id == "DOI:10.9999/cached.doi"
     assert result2.title == "Cached DOI Article"
     assert result2.doi == "10.9999/cached.doi"
+
+
+def test_parse_url_reference_id(fetcher):
+    """Test parsing URL reference IDs."""
+    assert fetcher._parse_reference_id("URL:https://example.com/book/chapter1") == ("URL", "https://example.com/book/chapter1")
+    assert fetcher._parse_reference_id("url:https://example.com/article") == ("URL", "https://example.com/article")
+    assert fetcher._parse_reference_id("https://example.com/direct") == ("URL", "https://example.com/direct")
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_success(mock_get, fetcher):
+    """Test fetching URL reference successfully."""
+    mock_response = MagicMock()
+    mock_response.status_code = 200
+    mock_response.text = """
+    <html>
+    <head>
+        <title>Chapter 1: Introduction to Biology</title>
+    </head>
+    <body>
+        <h1>Chapter 1: Introduction to Biology</h1>
+        <p>Biology is the natural science that studies life and living organisms.</p>
+        <p>This chapter provides an overview of cellular structure and function.</p>
+        <p>The cell is the basic unit of life.</p>
+    </body>
+    </html>
+    """
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("URL:https://example.com/biology-book/chapter1")
+
+    assert result is not None
+    assert result.reference_id == "URL:https://example.com/biology-book/chapter1"
+    assert result.title == "Chapter 1: Introduction to Biology"
+    assert result.content_type == "html_converted"
+    assert "Biology is the natural science" in result.content
+    assert "basic unit of life" in result.content
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_no_title(mock_get, fetcher):
+    """Test fetching URL with no title tag."""
+    mock_response = MagicMock()
+    mock_response.status_code = 200
+    mock_response.text = """
+    <html>
+    <body>
+        <h1>Main Heading</h1>
+        <p>Content without title tag.</p>
+    </body>
+    </html>
+    """
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("URL:https://example.com/no-title")
+
+    assert result is not None
+    assert result.reference_id == "URL:https://example.com/no-title"
+    assert result.title is None
+    assert "Main Heading" in result.content
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_http_error(mock_get, fetcher):
+    """Test fetching URL that returns HTTP error."""
+    mock_response = MagicMock()
+    mock_response.status_code = 404
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("URL:https://example.com/not-found")
+
+    assert result is None
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_request_exception(mock_get, fetcher):
+    """Test fetching URL that raises request exception."""
+    mock_get.side_effect = Exception("Network error")
+
+    result = fetcher.fetch("URL:https://example.com/error")
+
+    assert result is None
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_malformed_html(mock_get, fetcher):
+    """Test fetching URL with malformed HTML.
+
+    BeautifulSoup is very forgiving and will parse even malformed HTML.
+    This test verifies that the fetcher doesn't crash on malformed input.
+    """
+    mock_response = MagicMock()
+    mock_response.status_code = 200
+    mock_response.text = "<html><title>Test</title><p>Content without closing tags"
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("URL:https://example.com/malformed")
+
+    assert result is not None
+    assert result.title == "Test"
+    assert "Content without closing tags" in result.content
+
+
+def test_url_cache_path(fetcher):
+    """Test cache path generation for URLs."""
+    path = fetcher._get_cache_path("URL:https://example.com/book/chapter1")
+    assert path.name == "URL_https___example.com_book_chapter1.md"
+
+    path = fetcher._get_cache_path("URL:https://example.com/path?param=value")
+    assert path.name == "URL_https___example.com_path_param_value.md"
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_save_and_load_url_from_disk(mock_get, fetcher, tmp_path):
+    """Test saving and loading URL reference from disk cache."""
+    mock_response = MagicMock()
+    mock_response.status_code = 200
+    mock_response.text = """
+    <html>
+    <head><title>Cached URL Content</title></head>
+    <body><p>This content should be cached.</p></body>
+    </html>
+    """
+    mock_get.return_value = mock_response
+
+    # First fetch - this should save to disk
+    result1 = fetcher.fetch("URL:https://example.com/cached")
+    assert result1 is not None
+
+    # Clear memory cache
+    fetcher._cache.clear()
+
+    # Second fetch - should load from disk without making an HTTP request
+    with patch("linkml_reference_validator.etl.reference_fetcher.requests.get") as mock_no_request:
+        result2 = fetcher.fetch("URL:https://example.com/cached")
+        mock_no_request.assert_not_called()
+
+    assert result2 is not None
+    assert result2.reference_id == "URL:https://example.com/cached"
+    assert result2.title == "Cached URL Content"
+    assert "This content should be cached" in result2.content