Merge pull request #13 from linkml/claude/issue-12-20251213-0032

dragon-ai-agent · web-flow · commit c4d8c17e85f2 · 2025-12-12T17:52:22.000-08:00
feat: Add generic URL checking functionality
diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
@@ -68,14 +68,48 @@ We don't use LLMs or semantic similarity because:
 
 ## Reference Fetching
 
+The validator supports multiple reference types:
+
 ### PubMed (PMID)
 
 For `PMID:12345678`:
 
 1. Queries NCBI E-utilities API
 2. Fetches abstract and metadata
-3. Parses XML response with BeautifulSoup
-4. Caches as markdown with YAML frontmatter
+3. Attempts to retrieve the full text from PMC if available
+4. Parses XML response with BeautifulSoup
+5. Caches as markdown with YAML frontmatter
+
+### DOI (Digital Object Identifier)
+
+For `DOI:10.1234/journal.article`:
+
+1. Queries Crossref API for metadata
+2. Fetches abstract and bibliographic information
+3. Extracts title, authors, journal, year
+4. Caches abstract and metadata as markdown
+
+### URLs
+
+For `URL:https://example.com/page` or `https://example.com/page`:
+
+1. Makes HTTP GET request to fetch web page
+2. Extracts title from `<title>` tag
+3. Converts HTML to plain text (removes scripts, styles, navigation)
+4. Normalizes whitespace
+5. Caches as markdown with content type `html_converted`
+
+**Use cases for URLs:**
+- Online book chapters
+- Educational resources
+- Documentation pages
+- Any static web content
+
+**Limitations:**
+- Works best with static HTML content
+- Does not execute JavaScript
+- Cannot access content behind authentication
+- Complex dynamic pages may not extract well
 
 ### PubMed Central (PMC)
 
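Note on the `how-it-works.md` hunk above: the new text implies that a reference string is routed by its prefix (`PMID:`, `DOI:`, `URL:`) or, for bare links, by an `http://` or `https://` scheme. A minimal sketch of that routing follows. The function name `classify_reference` and its return shape are hypothetical and are not taken from `reference_fetcher.py`, whose changes are not shown in this diff.

```python
from typing import Tuple

# Prefixes named in the documentation hunk above.
KNOWN_PREFIXES = {"PMID", "DOI", "URL"}


def classify_reference(ref: str) -> Tuple[str, str]:
    """Return (kind, identifier) for a reference string (hypothetical helper)."""
    ref = ref.strip()
    # Bare http(s) links are treated as URL references, as the docs describe.
    if ref.lower().startswith(("http://", "https://")):
        return ("URL", ref)
    # Otherwise split on the first colon only, so "URL:https://..." keeps its scheme.
    prefix, sep, rest = ref.partition(":")
    if sep and prefix.upper() in KNOWN_PREFIXES:
        return (prefix.upper(), rest.strip())
    raise ValueError(f"Unrecognized reference: {ref!r}")
```

Splitting on the first colon only is what lets the explicit `URL:` form and the bare form resolve to the same identifier.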
diff --git a/docs/how-to/validate-urls.md b/docs/how-to/validate-urls.md
@@ -0,0 +1,253 @@
+# Validating URL References
+
+This guide explains how to validate references that use URLs instead of traditional identifiers like PMIDs or DOIs.
+
+## Overview
+
+The linkml-reference-validator supports validating references that point to web content, such as:
+
+- Book chapters hosted online
+- Educational resources
+- Documentation pages
+- Blog posts or articles
+- Any static web content
+
+When a reference field contains a URL, the validator:
+
+1. Fetches the web page content
+2. Extracts the page title
+3. Converts HTML to plain text
+4. Validates the extracted content against your supporting text
+
+## URL Format
+
+URLs can be specified in two ways:
+
+### Explicit URL Prefix
+
+```yaml
+my_field:
+  value: "Some text from the web page..."
+  references:
+    - "URL:https://example.com/book/chapter1"
+```
+
+### Direct URL
+
+```yaml
+my_field:
+  value: "Some text from the web page..."
+  references:
+    - "https://example.com/book/chapter1"
+```
+
+Both formats are equivalent. If a reference starts with `http://` or `https://`, it's automatically recognized as a URL reference.
+
+## Example
+
+Suppose you have an online textbook chapter at `https://example.com/biology/cell-structure` with the following content:
+
+```html
+<html>
+  <head>
+    <title>Chapter 3: Cell Structure and Function</title>
+  </head>
+  <body>
+    <h1>Cell Structure and Function</h1>
+    <p>The cell is the basic structural and functional unit of all living organisms.</p>
+    <p>Cells contain various organelles that perform specific functions...</p>
+  </body>
+</html>
+```
+
+You can validate text extracted from this chapter:
+
+```yaml
+description:
+  value: "The cell is the basic structural and functional unit of all living organisms"
+  references:
+    - "https://example.com/biology/cell-structure"
+```
+
+## How URL Validation Works
+
+### 1. Content Fetching
+
+When the validator encounters a URL reference, it:
+
+- Makes an HTTP GET request to fetch the page
+- Uses a polite user agent header identifying the tool
+- Waits between requests to limit the request rate (configurable via `rate_limit_delay`)
+- Applies a request timeout (default 30 seconds)
+
+### 2. Content Extraction
+
+The fetcher extracts content from the HTML:
+
+- **Title**: Extracted from the `<title>` tag
+- **Content**: HTML is converted to plain text using BeautifulSoup
+- **Cleanup**: Removes scripts, styles, navigation, headers, and footers
+- **Normalization**: Whitespace is normalized for better matching
+
+### 3. Content Type
+
+URL references are marked with content type `html_converted` to distinguish them from other reference types like abstracts or full-text articles.
+
+### 4. Caching
+
+Fetched URL content is cached to disk in markdown format with YAML frontmatter:
+
+```markdown
+---
+reference_id: URL:https://example.com/biology/cell-structure
+title: "Chapter 3: Cell Structure and Function"
+content_type: html_converted
+---
+
+# Chapter 3: Cell Structure and Function
+
+## Content
+
+The cell is the basic structural and functional unit of all living organisms.
+Cells contain various organelles that perform specific functions...
+```
+
+Cache files are stored in the configured cache directory (default: `.linkml-reference-validator-cache/`).
+
+## Configuration
+
+URL fetching behavior can be configured:
+
+```yaml
+# config.yaml
+rate_limit_delay: 0.5  # Wait 0.5 seconds between requests
+email: "your-email@example.com"  # Used in user agent
+cache_dir: ".cache/references"  # Where to cache fetched content
+```
+
+Or via command-line:
+
+```bash
+linkml-reference-validator validate \
+  --cache-dir .cache \
+  --rate-limit-delay 0.5 \
+  my-data.yaml
+```
+
+## Limitations
+
+### Static Content Only
+
+URL validation is designed for static web pages. It may not work well with:
+
+- Dynamic content loaded via JavaScript
+- Pages requiring authentication
+- Content behind paywalls
+- Frequently changing content
+
+### HTML Structure
+
+The content extraction works by:
+
+- Removing navigation, headers, and footers
+- Converting remaining HTML to text
+- Normalizing whitespace
+
+This works well for simple HTML but may not capture content perfectly from complex layouts.
+
+### No Rendering
+
+The fetcher downloads raw HTML and parses it directly. It does not:
+
+- Execute JavaScript
+- Render the page in a browser
+- Follow redirects automatically (may be added in the future)
+- Handle dynamic content
+
+## Best Practices
+
+### 1. Use Stable URLs
+
+Choose URLs that are unlikely to change:
+
+- ✅ Versioned documentation: `https://docs.example.com/v1.0/chapter1`
+- ✅ Archived content: `https://archive.example.com/2024/article`
+- ❌ Blog posts whose dated URLs might be reorganized
+- ❌ URLs with session parameters
+
+### 2. Verify Content Quality
+
+After adding a URL reference, verify the extracted content:
+
+```bash
+# Check what was extracted
+cat .linkml-reference-validator-cache/URL_https___example.com_page.md
+```
+
+Ensure the extracted text contains the relevant information you're referencing.
+
+### 3. Cache Management
+
+- Commit cache files to version control for reproducibility
+- Use `--force-refresh` to update cached content
+- Periodically review cached URLs to ensure they're still accessible
+
+### 4. Mix Reference Types
+
+URL references work alongside PMIDs and DOIs:
+
+```yaml
+findings:
+  value: "Multiple studies confirm this relationship"
+  references:
+    - "PMID:12345678"  # Research paper
+    - "DOI:10.1234/journal.article"  # Another paper
+    - "https://example.com/textbook/chapter5"  # Textbook chapter
+```
+
+## Troubleshooting
+
+### URL Not Fetching
+
+If URL content isn't being fetched:
+
+1. Check network connectivity
+2. Verify the URL is accessible in a browser
+3. Check for rate limiting or IP blocks
+4. Look for error messages in the logs
+
+### Incorrect Content Extraction
+
+If the wrong content is extracted:
+
+1. Inspect the cached markdown file
+2. Check if the page uses complex JavaScript
+3. Consider whether the page structure requires custom parsing
+4. File an issue with the page URL for improvement
+
+### Validation Failing
+
+If validation fails for URL references:
+
+1. Check the cached content to see what was extracted
+2. Verify your supporting text actually appears on the page
+3. Check for whitespace or formatting differences
+4. Consider whether the page content has changed since caching
+
+## Comparison with Other Reference Types
+
+| Feature | PMID | DOI | URL |
+|---------|------|-----|-----|
+| Source | PubMed | Crossref | Any web page |
+| Content Type | Abstract + Full Text | Abstract | HTML converted |
+| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) |
+| Stability | High | High | Variable |
+| Access | Free for abstracts | Varies | Varies |
+| Caching | Yes | Yes | Yes |
+
+## See Also
+
+- [Validating DOIs](validate-dois.md) - For journal articles with DOIs
+- [Validating OBO Files](validate-obo-files.md) - For ontology-specific validation
+- [How It Works](../concepts/how-it-works.md) - Core validation concepts
+- [CLI Reference](../reference/cli.md) - Command-line options
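Note on `validate-urls.md` above: the "Content Extraction" section describes pulling the `<title>`, dropping scripts, styles, navigation, headers, and footers, and normalizing whitespace. The guide names BeautifulSoup for this; the sketch below swaps in the standard library's `html.parser` so that it runs without extra dependencies. The names `html_to_text` and `SKIPPED_TAGS` are hypothetical, and this is an illustration of the described steps, not the fetcher's actual code.

```python
import re
from html.parser import HTMLParser
from typing import List, Tuple

# Elements whose contents are dropped, following the cleanup list in the guide.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer"}


class _TextExtractor(HTMLParser):
    """Collects the <title> text and the visible text outside skipped elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside any skipped element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def html_to_text(html: str) -> Tuple[str, str]:
    """Return (title, plain_text) with whitespace collapsed to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    title = re.sub(r"\s+", " ", "".join(parser.title_parts)).strip()
    # Joining with a space keeps adjacent block elements apart; the trade-off is
    # that text split by inline tags mid-word would also gain a space.
    text = re.sub(r"\s+", " ", " ".join(parser.text_parts)).strip()
    return title, text
```

Applied to the example page in the guide, this would yield the chapter title plus the heading and paragraph text, which is the kind of content the cached markdown file is described as holding.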
diff --git a/docs/index.md b/docs/index.md
@@ -2,7 +2,7 @@
 
 **Validate quotes and excerpts against their source publications**
 
-linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC and performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.
+linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC, from Crossref (for DOIs), and from arbitrary URLs, then performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.
 
 ## Key Features
 
diff --git a/docs/quickstart.md b/docs/quickstart.md
@@ -86,6 +86,33 @@ linkml-reference-validator validate text \
 
 This works the same way as PMID validation - the reference is fetched and cached locally.
 
+## Validate Against a URL
+
+For online resources like book chapters, documentation, or educational content:
+
+```bash
+linkml-reference-validator validate text \
+  "The cell is the basic structural and functional unit of all living organisms" \
+  https://example.com/biology/cell-structure
+```
+
+Or with explicit URL prefix:
+
+```bash
+linkml-reference-validator validate text \
+  "The cell is the basic unit of life" \
+  URL:https://example.com/biology/cells
+```
+
+The validator will:
+1. Fetch the web page content
+2. Extract the title from the `<title>` tag
+3. Convert HTML to plain text (removing scripts, styles, navigation)
+4. Cache the content locally
+5. Validate your text against the extracted content
+
+**Note:** URL validation works best with static HTML pages and may not work well with JavaScript-heavy or dynamic content.
+
 ## Key Features
 
 - **Automatic Caching**: References cached locally after first fetch
@@ -94,6 +121,7 @@ This works the same way as PMID validation - the reference is fetched and cached
 - **Deterministic Matching**: Substring-based (not AI/fuzzy matching)
 - **PubMed & PMC**: Fetches from NCBI automatically
 - **DOI Support**: Fetches metadata from Crossref API
+- **URL Support**: Validates against web content (books, docs, educational resources)
 
 ## Next Steps
 
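Note on the `quickstart.md` hunks above: the final step, "Validate your text against the extracted content", relies on the deterministic substring matching that the docs mention. A rough sketch of that idea follows, under stated assumptions: whitespace is collapsed on both sides, an ellipsis `...` in the quote stands for omitted text between pieces that must appear in order, and matching is case-sensitive. Bracket handling is left out because its exact rules live in `editorial-conventions.md`, which this diff does not show. The function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def quote_is_supported(quote: str, source_text: str) -> bool:
    """Check that each ellipsis-separated piece of the quote appears, in order."""
    haystack = normalize(source_text)
    position = 0
    for piece in quote.split("..."):
        piece = normalize(piece)
        if not piece:
            continue  # ignore empty pieces from leading or trailing ellipses
        found = haystack.find(piece, position)
        if found == -1:
            return False
        # Continue searching after this piece so that order is enforced.
        position = found + len(piece)
    return True
```

Under these assumptions, the first quickstart command's quote would match a page containing that sentence verbatim, while a paraphrase such as "The cell is the basic unit of life" would not match unless the page contains those exact words.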
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -33,6 +33,7 @@ nav:
   - How-To Guides:
       - Validating OBO Files: how-to/validate-obo-files.md
       - Validating DOIs: how-to/validate-dois.md
+      - Validating URLs: how-to/validate-urls.md
   - Concepts:
       - How It Works: concepts/how-it-works.md
       - Editorial Conventions: concepts/editorial-conventions.md
diff --git a/src/linkml_reference_validator/etl/reference_fetcher.py b/src/linkml_reference_validator/etl/reference_fetcher.py
diff --git a/tests/test_reference_fetcher.py b/tests/test_reference_fetcher.py