38 changes: 36 additions & 2 deletions docs/concepts/how-it-works.md
@@ -68,14 +68,48 @@ We don't use LLMs or semantic similarity because:

## Reference Fetching

The validator supports multiple reference types:

### PubMed (PMID)

For `PMID:12345678`:

1. Queries NCBI E-utilities API
2. Fetches abstract and metadata
3. Attempts to retrieve full-text from PMC if available
4. Parses XML response with BeautifulSoup
5. Caches as markdown with YAML frontmatter
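
For illustration, the first two steps amount to a single E-utilities `efetch` call. The sketch below uses the public NCBI endpoint with the `requests` and BeautifulSoup libraries; the function name and return shape are illustrative, not the validator's actual code:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative sketch only; not the validator's actual implementation.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_pubmed_abstract(pmid: str, email: str) -> dict:
    params = {"db": "pubmed", "id": pmid, "retmode": "xml", "email": email}
    response = requests.get(EFETCH, params=params, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "xml")  # the "xml" parser requires lxml
    title = soup.find("ArticleTitle")
    abstract = soup.find("AbstractText")
    return {
        "title": title.get_text() if title else None,
        "abstract": abstract.get_text() if abstract else None,
    }
```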

### DOI (Digital Object Identifier)

For `DOI:10.1234/journal.article`:

1. Queries Crossref API for metadata
2. Fetches abstract and bibliographic information
3. Extracts title, authors, journal, year
4. Caches abstract and metadata as markdown
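
Similarly, the Crossref lookup is one JSON request to the public works endpoint. This is a sketch under that assumption; the function name, the polite `mailto` User-Agent, and the selected fields are illustrative rather than the validator's exact behavior:

```python
import requests

# Illustrative sketch only; not the validator's actual implementation.
def fetch_crossref_metadata(doi: str, email: str) -> dict:
    url = f"https://api.crossref.org/works/{doi}"
    headers = {"User-Agent": f"linkml-reference-validator (mailto:{email})"}
    message = requests.get(url, headers=headers, timeout=30).json()["message"]
    return {
        "title": (message.get("title") or [None])[0],
        "journal": (message.get("container-title") or [None])[0],
        "year": message.get("issued", {}).get("date-parts", [[None]])[0][0],
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in message.get("author", [])],
    }
```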

### URLs

For `URL:https://example.com/page` or `https://example.com/page`:

1. Makes HTTP GET request to fetch web page
2. Extracts title from `<title>` tag
3. Converts HTML to plain text (removes scripts, styles, navigation)
4. Normalizes whitespace
5. Caches as markdown with content type `html_converted`

**Use cases for URLs:**
- Online book chapters
- Educational resources
- Documentation pages
- Any static web content

**Limitations:**
- Works best with static HTML content
- Does not execute JavaScript
- Cannot access content behind authentication
- Complex dynamic pages may not extract well

### PubMed Central (PMC)

253 changes: 253 additions & 0 deletions docs/how-to/validate-urls.md
@@ -0,0 +1,253 @@
# Validating URL References

This guide explains how to validate references that use URLs instead of traditional identifiers like PMIDs or DOIs.

## Overview

The linkml-reference-validator supports validating references that point to web content, such as:

- Book chapters hosted online
- Educational resources
- Documentation pages
- Blog posts or articles
- Any static web content

When a reference field contains a URL, the validator:

1. Fetches the web page content
2. Extracts the page title
3. Converts HTML to plain text
4. Validates the extracted content against your supporting text

## URL Format

URLs can be specified in two ways:

### Explicit URL Prefix

```yaml
my_field:
value: "Some text from the web page..."
references:
- "URL:https://example.com/book/chapter1"
```

### Direct URL

```yaml
my_field:
value: "Some text from the web page..."
references:
- "https://example.com/book/chapter1"
```

Both formats are equivalent. If a reference starts with `http://` or `https://`, it's automatically recognized as a URL reference.
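
Conceptually, recognition is just a prefix check. The helpers below are hypothetical (they are not part of the validator's API) and only restate the rule above in code:

```python
def is_url_reference(ref: str) -> bool:
    # Hypothetical helper: a reference counts as a URL if it carries the
    # explicit URL: prefix or starts with a web scheme.
    return ref.startswith(("URL:", "http://", "https://"))

def normalize_url_reference(ref: str) -> str:
    # Strip the optional prefix so both forms resolve to the same URL.
    return ref[len("URL:"):] if ref.startswith("URL:") else ref
```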

## Example

Suppose you have an online textbook chapter at `https://example.com/biology/cell-structure` with the following content:

```html
<html>
<head>
<title>Chapter 3: Cell Structure and Function</title>
</head>
<body>
<h1>Cell Structure and Function</h1>
<p>The cell is the basic structural and functional unit of all living organisms.</p>
<p>Cells contain various organelles that perform specific functions...</p>
</body>
</html>
```

You can validate text extracted from this chapter:

```yaml
description:
value: "The cell is the basic structural and functional unit of all living organisms"
references:
- "https://example.com/biology/cell-structure"
```

## How URL Validation Works

### 1. Content Fetching

When the validator encounters a URL reference, it:

- Makes an HTTP GET request to fetch the page
- Uses a polite user agent header identifying the tool
- Respects rate limiting (configurable via `rate_limit_delay`)
- Handles timeouts (default 30 seconds)
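
A minimal sketch of this fetching step, assuming the `requests` library; the function and parameter names mirror the options described in this guide but are illustrative, not the fetcher's real signature:

```python
import time
import requests

# Minimal sketch of a polite fetch; names and defaults are illustrative.
def fetch_url(url: str, email: str, rate_limit_delay: float = 0.5,
              timeout: float = 30.0) -> str:
    headers = {"User-Agent": f"linkml-reference-validator (mailto:{email})"}
    time.sleep(rate_limit_delay)          # respect rate limiting between requests
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text
```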

### 2. Content Extraction

The fetcher extracts content from the HTML:

- **Title**: Extracted from the `<title>` tag
- **Content**: HTML is converted to plain text using BeautifulSoup
- **Cleanup**: Removes scripts, styles, navigation, headers, and footers
- **Normalization**: Whitespace is normalized for better matching
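
A rough sketch of this extraction with BeautifulSoup; the real fetcher may clean up more (or differently), so treat this as an approximation:

```python
from bs4 import BeautifulSoup

def extract_text(html: str) -> tuple[str, str]:
    """Return (title, plain_text) from raw HTML; a simplified sketch."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Drop boilerplate elements before converting to text
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    return title, " ".join(text.split())  # normalize whitespace
```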

### 3. Content Type

URL references are marked with content type `html_converted` to distinguish them from other reference types like abstracts or full-text articles.

### 4. Caching

Fetched URL content is cached to disk in markdown format with YAML frontmatter:

```markdown
---
reference_id: URL:https://example.com/biology/cell-structure
title: "Chapter 3: Cell Structure and Function"
content_type: html_converted
---

# Chapter 3: Cell Structure and Function

## Content

The cell is the basic structural and functional unit of all living organisms.
Cells contain various organelles that perform specific functions...
```

Cache files are stored in the configured cache directory (default: `.linkml-reference-validator-cache/`).
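
Because the cache is plain markdown with YAML frontmatter, it can be read back with ordinary tooling. The snippet below is only an illustration of parsing one cached file; it assumes the exact layout shown above and is not part of the validator's API:

```python
from pathlib import Path
import yaml

def read_cached_reference(path: str) -> tuple[dict, str]:
    """Split a cached file into its YAML frontmatter and markdown body."""
    text = Path(path).read_text(encoding="utf-8")
    _, frontmatter, body = text.split("---", 2)  # assumes the layout shown above
    return yaml.safe_load(frontmatter), body.strip()
```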

## Configuration

URL fetching behavior can be configured:

```yaml
# config.yaml
rate_limit_delay: 0.5 # Wait 0.5 seconds between requests
email: "[email protected]" # Used in user agent
cache_dir: ".cache/references" # Where to cache fetched content
```

Or via command-line:

```bash
linkml-reference-validator validate \
--cache-dir .cache \
--rate-limit-delay 0.5 \
my-data.yaml
```

## Limitations

### Static Content Only

URL validation is designed for static web pages. It may not work well with:

- Dynamic content loaded via JavaScript
- Pages requiring authentication
- Content behind paywalls
- Frequently changing content

### HTML Structure

The content extraction works by:

- Removing navigation, headers, and footers
- Converting remaining HTML to text
- Normalizing whitespace

This works well for simple HTML but may not capture content perfectly from complex layouts.

### No Rendering

The fetcher downloads raw HTML and parses it directly. It does not:

- Execute JavaScript
- Render the page in a browser
- Follow redirects automatically (this may be added in a future release)
- Handle dynamic content

## Best Practices

### 1. Use Stable URLs

Choose URLs that are unlikely to change:

- ✅ Versioned documentation: `https://docs.example.com/v1.0/chapter1`
- ✅ Archived content: `https://archive.example.com/2024/article`
- ❌ Blog posts under date-based URLs that might be reorganized
- ❌ URLs with session parameters

### 2. Verify Content Quality

After adding a URL reference, verify the extracted content:

```bash
# Check what was extracted
cat .linkml-reference-validator-cache/URL_https___example.com_page.md
```

Ensure the extracted text contains the relevant information you're referencing.

### 3. Cache Management

- Commit cache files to version control for reproducibility
- Use `--force-refresh` to update cached content
- Periodically review cached URLs to ensure they're still accessible

### 4. Mix Reference Types

URL references work alongside PMIDs and DOIs:

```yaml
findings:
value: "Multiple studies confirm this relationship"
references:
- "PMID:12345678" # Research paper
- "DOI:10.1234/journal.article" # Another paper
- "https://example.com/textbook/chapter5" # Textbook chapter
```

## Troubleshooting

### URL Not Fetching

If URL content isn't being fetched:

1. Check network connectivity
2. Verify the URL is accessible in a browser
3. Check for rate limiting or IP blocks
4. Look for error messages in the logs

### Incorrect Content Extraction

If the wrong content is extracted:

1. Inspect the cached markdown file
2. Check if the page uses complex JavaScript
3. Consider if the page structure requires custom parsing
4. File an issue with the page URL for improvement

### Validation Failing

If validation fails for URL references:

1. Check the cached content to see what was extracted
2. Verify your supporting text actually appears on the page
3. Check for whitespace or formatting differences
4. Consider if the page content has changed since caching
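
When debugging by hand, remember that matching is deterministic and substring-based. A whitespace-normalized check along these lines (a conceptual sketch that ignores the bracket and ellipsis conventions the validator also supports) is usually enough to see whether your supporting text really occurs in the cached content:

```python
def normalize_whitespace(text: str) -> str:
    # Collapse runs of spaces, tabs, and newlines into single spaces
    return " ".join(text.split())

def appears_in_source(excerpt: str, page_text: str) -> bool:
    # Plain substring check after whitespace normalization
    return normalize_whitespace(excerpt) in normalize_whitespace(page_text)
```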

## Comparison with Other Reference Types

| Feature | PMID | DOI | URL |
|---------|------|-----|-----|
| Source | PubMed | Crossref | Any web page |
| Content Type | Abstract + Full Text | Abstract | HTML converted |
| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) |
| Stability | High | High | Variable |
| Access | Free for abstracts | Varies | Varies |
| Caching | Yes | Yes | Yes |

## See Also

- [Validating DOIs](validate-dois.md) - For journal articles with DOIs
- [Validating OBO Files](validate-obo-files.md) - For ontology-specific validation
- [How It Works](../concepts/how-it-works.md) - Core validation concepts
- [CLI Reference](../reference/cli.md) - Command-line options
2 changes: 1 addition & 1 deletion docs/index.md
@@ -2,7 +2,7 @@

**Validate quotes and excerpts against their source publications**

linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC, DOIs via Crossref, and URLs, then performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.

## Key Features

28 changes: 28 additions & 0 deletions docs/quickstart.md
@@ -86,6 +86,33 @@ linkml-reference-validator validate text \

This works the same way as PMID validation - the reference is fetched and cached locally.

## Validate Against a URL

For online resources like book chapters, documentation, or educational content:

```bash
linkml-reference-validator validate text \
"The cell is the basic structural and functional unit of all living organisms" \
https://example.com/biology/cell-structure
```

Or with explicit URL prefix:

```bash
linkml-reference-validator validate text \
"The cell is the basic unit of life" \
URL:https://example.com/biology/cells
```

The validator will:
1. Fetch the web page content
2. Extract the title from the `<title>` tag
3. Convert HTML to plain text (removing scripts, styles, navigation)
4. Cache the content locally
5. Validate your text against the extracted content

**Note:** URL validation works best with static HTML pages and may not work well with JavaScript-heavy or dynamic content.

## Key Features

- **Automatic Caching**: References cached locally after first fetch
@@ -94,6 +121,7 @@ This works the same way as PMID validation - the reference is fetched and cached locally.
- **Deterministic Matching**: Substring-based (not AI/fuzzy matching)
- **PubMed & PMC**: Fetches from NCBI automatically
- **DOI Support**: Fetches metadata from Crossref API
- **URL Support**: Validates against web content (books, docs, educational resources)

## Next Steps

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -33,6 +33,7 @@ nav:
- How-To Guides:
- Validating OBO Files: how-to/validate-obo-files.md
- Validating DOIs: how-to/validate-dois.md
- Validating URLs: how-to/validate-urls.md
- Concepts:
- How It Works: concepts/how-it-works.md
- Editorial Conventions: concepts/editorial-conventions.md