diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md index 55cd2d1..edff926 100644 --- a/docs/concepts/how-it-works.md +++ b/docs/concepts/how-it-works.md @@ -68,14 +68,48 @@ We don't use LLMs or semantic similarity because: ## Reference Fetching +The validator supports multiple reference types: + ### PubMed (PMID) For `PMID:12345678`: 1. Queries NCBI E-utilities API 2. Fetches abstract and metadata -3. Parses XML response with BeautifulSoup -4. Caches as markdown with YAML frontmatter +3. Attempts to retrieve full-text from PMC if available +4. Parses XML response with BeautifulSoup +5. Caches as markdown with YAML frontmatter + +### DOI (Digital Object Identifier) + +For `DOI:10.1234/journal.article`: + +1. Queries Crossref API for metadata +2. Fetches abstract and bibliographic information +3. Extracts title, authors, journal, year +4. Caches abstract and metadata as markdown + +### URLs + +For `URL:https://example.com/page` or `https://example.com/page`: + +1. Makes HTTP GET request to fetch web page +2. Extracts title from `` tag +3. Converts HTML to plain text (removes scripts, styles, navigation) +4. Normalizes whitespace +5. Caches as markdown with content type `html_converted` + +**Use cases for URLs:** +- Online book chapters +- Educational resources +- Documentation pages +- Any static web content + +**Limitations:** +- Works best with static HTML content +- Does not execute JavaScript +- Cannot access content behind authentication +- Complex dynamic pages may not extract well ### PubMed Central (PMC) diff --git a/docs/how-to/validate-urls.md b/docs/how-to/validate-urls.md new file mode 100644 index 0000000..a85f617 --- /dev/null +++ b/docs/how-to/validate-urls.md @@ -0,0 +1,253 @@ +# Validating URL References + +This guide explains how to validate references that use URLs instead of traditional identifiers like PMIDs or DOIs. + +## Overview + +The linkml-reference-validator supports validating references that point to web content, such as: + +- Book chapters hosted online +- Educational resources +- Documentation pages +- Blog posts or articles +- Any static web content + +When a reference field contains a URL, the validator: + +1. Fetches the web page content +2. Extracts the page title +3. Converts HTML to plain text +4. Validates the extracted content against your supporting text + +## URL Format + +URLs can be specified in two ways: + +### Explicit URL Prefix + +```yaml +my_field: + value: "Some text from the web page..." + references: + - "URL:https://example.com/book/chapter1" +``` + +### Direct URL + +```yaml +my_field: + value: "Some text from the web page..." + references: + - "https://example.com/book/chapter1" +``` + +Both formats are equivalent. If a reference starts with `http://` or `https://`, it's automatically recognized as a URL reference. + +## Example + +Suppose you have an online textbook chapter at `https://example.com/biology/cell-structure` with the following content: + +```html +<html> + <head> + <title>Chapter 3: Cell Structure and Function + + +
</title>
+  </head>
+  <body>
+    <h1>Cell Structure and Function</h1>
+    <p>The cell is the basic structural and functional unit of all living organisms.</p>
+    <p>Cells contain various organelles that perform specific functions...</p>
+ + +``` + +You can validate text extracted from this chapter: + +```yaml +description: + value: "The cell is the basic structural and functional unit of all living organisms" + references: + - "https://example.com/biology/cell-structure" +``` + +## How URL Validation Works + +### 1. Content Fetching + +When the validator encounters a URL reference, it: + +- Makes an HTTP GET request to fetch the page +- Uses a polite user agent header identifying the tool +- Respects rate limiting (configurable via `rate_limit_delay`) +- Handles timeouts (default 30 seconds) + +### 2. Content Extraction + +The fetcher extracts content from the HTML: + +- **Title**: Extracted from the `` tag +- **Content**: HTML is converted to plain text using BeautifulSoup +- **Cleanup**: Removes scripts, styles, navigation, headers, and footers +- **Normalization**: Whitespace is normalized for better matching + +### 3. Content Type + +URL references are marked with content type `html_converted` to distinguish them from other reference types like abstracts or full-text articles. + +### 4. Caching + +Fetched URL content is cached to disk in markdown format with YAML frontmatter: + +```markdown +--- +reference_id: URL:https://example.com/biology/cell-structure +title: "Chapter 3: Cell Structure and Function" +content_type: html_converted +--- + +# Chapter 3: Cell Structure and Function + +## Content + +The cell is the basic structural and functional unit of all living organisms. +Cells contain various organelles that perform specific functions... +``` + +Cache files are stored in the configured cache directory (default: `.linkml-reference-validator-cache/`). + +## Configuration + +URL fetching behavior can be configured: + +```yaml +# config.yaml +rate_limit_delay: 0.5 # Wait 0.5 seconds between requests +email: "your-email@example.com" # Used in user agent +cache_dir: ".cache/references" # Where to cache fetched content +``` + +Or via command-line: + +```bash +linkml-reference-validator validate \ + --cache-dir .cache \ + --rate-limit-delay 0.5 \ + my-data.yaml +``` + +## Limitations + +### Static Content Only + +URL validation is designed for static web pages. It may not work well with: + +- Dynamic content loaded via JavaScript +- Pages requiring authentication +- Content behind paywalls +- Frequently changing content + +### HTML Structure + +The content extraction works by: + +- Removing navigation, headers, and footers +- Converting remaining HTML to text +- Normalizing whitespace + +This works well for simple HTML but may not capture content perfectly from complex layouts. + +### No Rendering + +The fetcher downloads raw HTML and parses it directly. It does not: + +- Execute JavaScript +- Render the page in a browser +- Follow redirects automatically (may be added in future) +- Handle dynamic content + +## Best Practices + +### 1. Use Stable URLs + +Choose URLs that are unlikely to change: + +- ✅ Versioned documentation: `https://docs.example.com/v1.0/chapter1` +- ✅ Archived content: `https://archive.example.com/2024/article` +- ❌ Blog posts with dates that might be reorganized +- ❌ URLs with session parameters + +### 2. Verify Content Quality + +After adding a URL reference, verify the extracted content: + +```bash +# Check what was extracted +cat .linkml-reference-validator-cache/URL_https___example.com_page.md +``` + +Ensure the extracted text contains the relevant information you're referencing. + +### 3. 
Cache Management + +- Commit cache files to version control for reproducibility +- Use `--force-refresh` to update cached content +- Periodically review cached URLs to ensure they're still accessible + +### 4. Mix Reference Types + +URL references work alongside PMIDs and DOIs: + +```yaml +findings: + value: "Multiple studies confirm this relationship" + references: + - "PMID:12345678" # Research paper + - "DOI:10.1234/journal.article" # Another paper + - "https://example.com/textbook/chapter5" # Textbook chapter +``` + +## Troubleshooting + +### URL Not Fetching + +If URL content isn't being fetched: + +1. Check network connectivity +2. Verify the URL is accessible in a browser +3. Check for rate limiting or IP blocks +4. Look for error messages in the logs + +### Incorrect Content Extraction + +If the wrong content is extracted: + +1. Inspect the cached markdown file +2. Check if the page uses complex JavaScript +3. Consider if the page structure requires custom parsing +4. File an issue with the page URL for improvement + +### Validation Failing + +If validation fails for URL references: + +1. Check the cached content to see what was extracted +2. Verify your supporting text actually appears on the page +3. Check for whitespace or formatting differences +4. Consider if the page content has changed since caching + +## Comparison with Other Reference Types + +| Feature | PMID | DOI | URL | +|---------|------|-----|-----| +| Source | PubMed | Crossref | Any web page | +| Content Type | Abstract + Full Text | Abstract | HTML converted | +| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) | +| Stability | High | High | Variable | +| Access | Free for abstracts | Varies | Varies | +| Caching | Yes | Yes | Yes | + +## See Also + +- [Validating DOIs](validate-dois.md) - For journal articles with DOIs +- [Validating OBO Files](validate-obo-files.md) - For ontology-specific validation +- [How It Works](../concepts/how-it-works.md) - Core validation concepts +- [CLI Reference](../reference/cli.md) - Command-line options diff --git a/docs/index.md b/docs/index.md index 52be20f..ba92ac8 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,7 +2,7 @@ **Validate quotes and excerpts against their source publications** -linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC and performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`. +linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC, DOIs via Crossref, and URLs, then performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`. ## Key Features diff --git a/docs/quickstart.md b/docs/quickstart.md index 5962b70..c4a9b90 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -86,6 +86,33 @@ linkml-reference-validator validate text \ This works the same way as PMID validation - the reference is fetched and cached locally. 
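+
+You can also keep the excerpt and its DOI together in a data file and validate the whole file in one pass. A minimal sketch (the slot names here are placeholders for whatever your own schema defines):
+
+```yaml
+# my-data.yaml (hypothetical record)
+description:
+  value: "Some text quoted from the article..."
+  references:
+    - "DOI:10.1234/journal.article"
+```
+
+Then run `linkml-reference-validator validate my-data.yaml` to check every reference in the file.
+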
+## Validate Against a URL + +For online resources like book chapters, documentation, or educational content: + +```bash +linkml-reference-validator validate text \ + "The cell is the basic structural and functional unit of all living organisms" \ + https://example.com/biology/cell-structure +``` + +Or with explicit URL prefix: + +```bash +linkml-reference-validator validate text \ + "The cell is the basic unit of life" \ + URL:https://example.com/biology/cells +``` + +The validator will: +1. Fetch the web page content +2. Extract the title from the `<title>` tag +3. Convert HTML to plain text (removing scripts, styles, navigation) +4. Cache the content locally +5. Validate your text against the extracted content + +**Note:** URL validation works best with static HTML pages and may not work well with JavaScript-heavy or dynamic content. + ## Key Features - **Automatic Caching**: References cached locally after first fetch @@ -94,6 +121,7 @@ This works the same way as PMID validation - the reference is fetched and cached - **Deterministic Matching**: Substring-based (not AI/fuzzy matching) - **PubMed & PMC**: Fetches from NCBI automatically - **DOI Support**: Fetches metadata from Crossref API +- **URL Support**: Validates against web content (books, docs, educational resources) ## Next Steps diff --git a/mkdocs.yml b/mkdocs.yml index 60aaa19..d9510ed 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -33,6 +33,7 @@ nav: - How-To Guides: - Validating OBO Files: how-to/validate-obo-files.md - Validating DOIs: how-to/validate-dois.md + - Validating URLs: how-to/validate-urls.md - Concepts: - How It Works: concepts/how-it-works.md - Editorial Conventions: concepts/editorial-conventions.md diff --git a/src/linkml_reference_validator/etl/reference_fetcher.py b/src/linkml_reference_validator/etl/reference_fetcher.py index 1a7b90c..5fc377b 100644 --- a/src/linkml_reference_validator/etl/reference_fetcher.py +++ b/src/linkml_reference_validator/etl/reference_fetcher.py @@ -86,6 +86,8 @@ def fetch(self, reference_id: str, force_refresh: bool = False) -> Optional[Refe content = self._fetch_pmid(identifier) elif prefix == "DOI": content = self._fetch_doi(identifier) + elif prefix == "URL": + content = self._fetch_url(identifier) else: logger.warning(f"Unsupported reference type: {prefix}") return None @@ -100,7 +102,7 @@ def _parse_reference_id(self, reference_id: str) -> tuple[str, str]: """Parse a reference ID into prefix and identifier. 
Args: - reference_id: Reference ID like "PMID:12345678" + reference_id: Reference ID like "PMID:12345678" or URL Returns: Tuple of (prefix, identifier) @@ -114,13 +116,27 @@ def _parse_reference_id(self, reference_id: str) -> tuple[str, str]: ('PMID', '12345678') >>> fetcher._parse_reference_id("12345678") ('PMID', '12345678') + >>> fetcher._parse_reference_id("URL:https://example.com/book/chapter1") + ('URL', 'https://example.com/book/chapter1') + >>> fetcher._parse_reference_id("https://example.com/direct") + ('URL', 'https://example.com/direct') """ - match = re.match(r"^([A-Za-z_]+)[:\s]+(.+)$", reference_id.strip()) + stripped = reference_id.strip() + + # Check if it's a direct URL (starts with http or https) + if stripped.startswith(('http://', 'https://')): + return "URL", stripped + + # Standard prefix:identifier format + match = re.match(r"^([A-Za-z_]+)[:\s]+(.+)$", stripped) if match: return match.group(1).upper(), match.group(2).strip() - if reference_id.strip().isdigit(): - return "PMID", reference_id.strip() - return "UNKNOWN", reference_id + + # Plain numeric ID defaults to PMID + if stripped.isdigit(): + return "PMID", stripped + + return "UNKNOWN", stripped def _fetch_pmid(self, pmid: str) -> Optional[ReferenceContent]: """Fetch a publication from PubMed by PMID. @@ -234,6 +250,65 @@ def _fetch_doi(self, doi: str) -> Optional[ReferenceContent]: doi=doi, ) + def _fetch_url(self, url: str) -> Optional[ReferenceContent]: + """Fetch content from a URL. + + Fetches web content, extracts title and converts HTML to text. + Intended for static pages like book chapters. + + Args: + url: The URL to fetch + + Returns: + ReferenceContent if successful, None otherwise + + Examples: + >>> config = ReferenceValidationConfig() + >>> fetcher = ReferenceFetcher(config) + >>> # Would fetch in real usage: + >>> # ref = fetcher._fetch_url("https://example.com/book/chapter1") + """ + time.sleep(self.config.rate_limit_delay) + + headers = { + "User-Agent": f"linkml-reference-validator/1.0 (mailto:{self.config.email})", + } + + try: + response = requests.get(url, headers=headers, timeout=30) + if response.status_code != 200: + logger.warning(f"Failed to fetch URL:{url} - status {response.status_code}") + return None + + soup = BeautifulSoup(response.text, "html.parser") + + # Extract title + title_tag = soup.find("title") + title = title_tag.get_text().strip() if title_tag else None + + # Convert HTML to text + # Remove script and style elements + for script in soup(["script", "style", "nav", "header", "footer"]): + script.decompose() + + # Get text content + content = soup.get_text() + + # Clean up text - normalize whitespace + lines = (line.strip() for line in content.splitlines()) + content = "\n".join(line for line in lines if line) + + return ReferenceContent( + reference_id=f"URL:{url}", + title=title, + content=content if content else None, + content_type="html_converted", + ) + + except Exception as e: + logger.error(f"Error fetching URL:{url}: {e}") + return None + def _parse_crossref_authors(self, authors: list) -> list[str]: """Parse author list from Crossref response. 
@@ -466,8 +541,11 @@ def _get_cache_path(self, reference_id: str) -> Path: >>> path = fetcher._get_cache_path("PMID:12345678") >>> path.name 'PMID_12345678.md' + >>> path = fetcher._get_cache_path("URL:https://example.com/book/chapter1") + >>> path.name + 'URL_https___example.com_book_chapter1.md' """ - safe_id = reference_id.replace(":", "_").replace("/", "_") + safe_id = reference_id.replace(":", "_").replace("/", "_").replace("?", "_").replace("=", "_") cache_dir = self.config.get_cache_dir() return cache_dir / f"{safe_id}.md" diff --git a/tests/test_reference_fetcher.py b/tests/test_reference_fetcher.py index ee9673d..a78c8d2 100644 --- a/tests/test_reference_fetcher.py +++ b/tests/test_reference_fetcher.py @@ -308,3 +308,144 @@ def test_save_and_load_doi_from_disk(mock_get, fetcher, tmp_path): assert result2.reference_id == "DOI:10.9999/cached.doi" assert result2.title == "Cached DOI Article" assert result2.doi == "10.9999/cached.doi" + + +def test_parse_url_reference_id(fetcher): + """Test parsing URL reference IDs.""" + assert fetcher._parse_reference_id("URL:https://example.com/book/chapter1") == ("URL", "https://example.com/book/chapter1") + assert fetcher._parse_reference_id("url:https://example.com/article") == ("URL", "https://example.com/article") + assert fetcher._parse_reference_id("https://example.com/direct") == ("URL", "https://example.com/direct") + + +@patch("linkml_reference_validator.etl.reference_fetcher.requests.get") +def test_fetch_url_success(mock_get, fetcher): + """Test fetching URL reference successfully.""" + mock_response = MagicMock() + mock_response.status_code = 200 + mock_response.text = """ + <html> + <head> + <title>Chapter 1: Introduction to Biology + + +
</title>
+    </head>
+    <body>
+    <h1>Chapter 1: Introduction to Biology</h1>
+    <p>Biology is the natural science that studies life and living organisms.</p>
+    <p>This chapter provides an overview of cellular structure and function.</p>
+    <p>The cell is the basic unit of life.</p>
+    </body>
+    </html>
+    """
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("URL:https://example.com/biology-book/chapter1")
+
+    assert result is not None
+    assert result.reference_id == "URL:https://example.com/biology-book/chapter1"
+    assert result.title == "Chapter 1: Introduction to Biology"
+    assert result.content_type == "html_converted"
+    assert "Biology is the natural science" in result.content
+    assert "basic unit of life" in result.content
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_no_title(mock_get, fetcher):
+    """Test fetching URL with no title tag."""
+    mock_response = MagicMock()
+    mock_response.status_code = 200
+    mock_response.text = """
+    <html>
+    <body>
+    <h1>Main Heading</h1>
+    <p>Content without title tag.</p>
+    </body>
+    </html>
+    """
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("URL:https://example.com/no-title")
+
+    assert result is not None
+    assert result.reference_id == "URL:https://example.com/no-title"
+    assert result.title is None
+    assert "Main Heading" in result.content
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_http_error(mock_get, fetcher):
+    """Test fetching URL that returns HTTP error."""
+    mock_response = MagicMock()
+    mock_response.status_code = 404
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("URL:https://example.com/not-found")
+
+    assert result is None
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_request_exception(mock_get, fetcher):
+    """Test fetching URL that raises request exception."""
+    mock_get.side_effect = Exception("Network error")
+
+    result = fetcher.fetch("URL:https://example.com/error")
+
+    assert result is None
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_url_malformed_html(mock_get, fetcher):
+    """Test fetching URL with malformed HTML.
+
+    BeautifulSoup is very forgiving and will parse even malformed HTML.
+    This test verifies that the fetcher doesn't crash on malformed input.
+    """
+    mock_response = MagicMock()
+    mock_response.status_code = 200
+    mock_response.text = "<html><title>Test</title><body><p>Content without closing tags"
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("URL:https://example.com/malformed")
+
+    assert result is not None
+    assert result.title == "Test"
+    assert "Content without closing tags" in result.content
+
+
+def test_url_cache_path(fetcher):
+    """Test cache path generation for URLs."""
+    path = fetcher._get_cache_path("URL:https://example.com/book/chapter1")
+    assert path.name == "URL_https___example.com_book_chapter1.md"
+
+    path = fetcher._get_cache_path("URL:https://example.com/path?param=value")
+    assert path.name == "URL_https___example.com_path_param_value.md"
+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_save_and_load_url_from_disk(mock_get, fetcher, tmp_path):
+    """Test saving and loading URL reference from disk cache."""
+    mock_response = MagicMock()
+    mock_response.status_code = 200
+    mock_response.text = """
+    <html>
+    <head><title>Cached URL Content</title></head>
+    <body><p>This content should be cached.</p></body>
+ + """ + mock_get.return_value = mock_response + + # First fetch - this should save to disk + result1 = fetcher.fetch("URL:https://example.com/cached") + assert result1 is not None + + # Clear memory cache + fetcher._cache.clear() + + # Second fetch - should load from disk without making HTTP request + with patch("linkml_reference_validator.etl.reference_fetcher.requests.get") as mock_no_request: + result2 = fetcher.fetch("URL:https://example.com/cached") + mock_no_request.assert_not_called() + + assert result2 is not None + assert result2.reference_id == "URL:https://example.com/cached" + assert result2.title == "Cached URL Content" + assert "This content should be cached" in result2.content
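+
+
+@patch("linkml_reference_validator.etl.reference_fetcher.requests.get")
+def test_fetch_direct_url_without_prefix(mock_get, fetcher):
+    """Sketch of an additional case: a bare https:// reference (no URL: prefix).
+
+    Assumes only the behavior shown above: _parse_reference_id normalizes a
+    direct URL to the URL prefix, so fetch() should return content whose
+    reference_id carries the explicit prefix.
+    """
+    mock_response = MagicMock()
+    mock_response.status_code = 200
+    mock_response.text = "<html><head><title>Direct</title></head><body><p>Direct URL body.</p></body></html>"
+    mock_get.return_value = mock_response
+
+    result = fetcher.fetch("https://example.com/direct")
+
+    assert result is not None
+    assert result.reference_id == "URL:https://example.com/direct"
+    assert result.title == "Direct"
+    assert "Direct URL body" in result.content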