Commit c4d8c17

Merge pull request #13 from linkml/claude/issue-12-20251213-0032
feat: Add generic URL checking functionality
2 parents 78fbb5d + 216fae2 commit c4d8c17

7 files changed: +544 -9 lines changed

docs/concepts/how-it-works.md

Lines changed: 36 additions & 2 deletions
@@ -68,14 +68,48 @@ We don't use LLMs or semantic similarity because:
6868

6969
## Reference Fetching
7070

71+
The validator supports multiple reference types:
72+
7173
### PubMed (PMID)
7274

7375
For `PMID:12345678`:
7476

7577
1. Queries NCBI E-utilities API
7678
2. Fetches abstract and metadata
77-
3. Parses XML response with BeautifulSoup
78-
4. Caches as markdown with YAML frontmatter
79+
3. Attempts to retrieve the full text from PMC, if available
80+
4. Parses XML response with BeautifulSoup
81+
5. Caches as markdown with YAML frontmatter
82+
83+
### DOI (Digital Object Identifier)
84+
85+
For `DOI:10.1234/journal.article`:
86+
87+
1. Queries Crossref API for metadata
88+
2. Fetches abstract and bibliographic information
89+
3. Extracts title, authors, journal, year
90+
4. Caches abstract and metadata as markdown
91+
92+
### URLs
93+
94+
For `URL:https://example.com/page` or `https://example.com/page`:
95+
96+
1. Makes HTTP GET request to fetch web page
97+
2. Extracts title from `<title>` tag
98+
3. Converts HTML to plain text (removes scripts, styles, navigation)
99+
4. Normalizes whitespace
100+
5. Caches as markdown with content type `html_converted`
101+
102+
**Use cases for URLs:**
103+
- Online book chapters
104+
- Educational resources
105+
- Documentation pages
106+
- Any static web content
107+
108+
**Limitations:**
109+
- Works best with static HTML content
110+
- Does not execute JavaScript
111+
- Cannot access content behind authentication
112+
- Complex dynamic pages may not extract well
79113

80114
### PubMed Central (PMC)
81115

docs/how-to/validate-urls.md

Lines changed: 253 additions & 0 deletions
@@ -0,0 +1,253 @@
1+
# Validating URL References
2+
3+
This guide explains how to validate references that use URLs instead of traditional identifiers like PMIDs or DOIs.
4+
5+
## Overview
6+
7+
The linkml-reference-validator supports validating references that point to web content, such as:
8+
9+
- Book chapters hosted online
10+
- Educational resources
11+
- Documentation pages
12+
- Blog posts or articles
13+
- Any static web content
14+
15+
When a reference field contains a URL, the validator:
16+
17+
1. Fetches the web page content
18+
2. Extracts the page title
19+
3. Converts HTML to plain text
20+
4. Validates the extracted content against your supporting text
21+
22+
## URL Format
23+
24+
URLs can be specified in two ways:
25+
26+
### Explicit URL Prefix
27+
28+
```yaml
29+
my_field:
30+
value: "Some text from the web page..."
31+
references:
32+
- "URL:https://example.com/book/chapter1"
33+
```
34+
35+
### Direct URL
36+
37+
```yaml
38+
my_field:
39+
value: "Some text from the web page..."
40+
references:
41+
- "https://example.com/book/chapter1"
42+
```
43+
44+
Both formats are equivalent. If a reference starts with `http://` or `https://`, it's automatically recognized as a URL reference.
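
The recognition rule above can be sketched in a few lines of Python. This is a minimal illustration, not the validator's actual code: the function name `normalize_reference` is hypothetical, and treating the `URL:` form as the canonical one is an assumption based on the `reference_id` shown in the cache example in this guide.

```python
from typing import Optional


def normalize_reference(ref: str) -> Optional[str]:
    """Return the ``URL:``-prefixed form of a URL reference.

    Returns None when ``ref`` is not a URL reference (for example a PMID
    or DOI). Hypothetical helper; matching is case-sensitive, which is an
    assumption.
    """
    ref = ref.strip()
    if ref.startswith("URL:"):
        # Explicit prefix: already in the assumed canonical form.
        return ref
    if ref.startswith(("http://", "https://")):
        # Direct URL: recognized by its scheme and given the prefix.
        return "URL:" + ref
    return None
```

Under this sketch, `"https://example.com/book/chapter1"` and `"URL:https://example.com/book/chapter1"` normalize to the same string, while `"PMID:12345678"` yields `None`.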
45+
46+
## Example
47+
48+
Suppose you have an online textbook chapter at `https://example.com/biology/cell-structure` with the following content:
49+
50+
```html
51+
<html>
52+
<head>
53+
<title>Chapter 3: Cell Structure and Function</title>
54+
</head>
55+
<body>
56+
<h1>Cell Structure and Function</h1>
57+
<p>The cell is the basic structural and functional unit of all living organisms.</p>
58+
<p>Cells contain various organelles that perform specific functions...</p>
59+
</body>
60+
</html>
61+
```
62+
63+
You can validate text extracted from this chapter:
64+
65+
```yaml
66+
description:
67+
value: "The cell is the basic structural and functional unit of all living organisms"
68+
references:
69+
- "https://example.com/biology/cell-structure"
70+
```
71+
72+
## How URL Validation Works
73+
74+
### 1. Content Fetching
75+
76+
When the validator encounters a URL reference, it:
77+
78+
- Makes an HTTP GET request to fetch the page
79+
- Uses a polite user agent header identifying the tool
80+
- Waits between requests to rate-limit itself (configurable via `rate_limit_delay`)
81+
- Handles timeouts (default 30 seconds)
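
As a rough sketch of the fetching behavior listed above, the following uses only the Python standard library. It is an assumption-laden illustration: the function names, the exact user agent string, and the use of `urllib` are all hypothetical and are not taken from the validator's source.

```python
import time
import urllib.request

DEFAULT_TIMEOUT = 30.0  # seconds, matching the default stated above


def build_request(url: str, email: str) -> urllib.request.Request:
    """Build a GET request with a user agent that identifies the tool.

    The user agent format here is hypothetical.
    """
    user_agent = f"linkml-reference-validator (mailto:{email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(
    url: str,
    email: str,
    rate_limit_delay: float = 0.5,
    timeout: float = DEFAULT_TIMEOUT,
) -> str:
    """Fetch a page as text after waiting ``rate_limit_delay`` seconds."""
    # Simplest possible rate limiting: sleep before every request.
    time.sleep(rate_limit_delay)
    request = build_request(url, email)
    # Note: urlopen follows redirects by default, which may differ from
    # the validator's own behavior.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `fetch_html("https://example.com/biology/cell-structure", "me@example.org")` would return the raw HTML, raising an exception from `urllib` if the request fails or times out.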
82+
83+
### 2. Content Extraction
84+
85+
The fetcher extracts content from the HTML:
86+
87+
- **Title**: Extracted from the `<title>` tag
88+
- **Content**: HTML is converted to plain text using BeautifulSoup
89+
- **Cleanup**: Removes scripts, styles, navigation, headers, and footers
90+
- **Normalization**: Whitespace is normalized for better matching
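
The extraction steps above can be illustrated with a small standard-library sketch. The validator itself is described as using BeautifulSoup; this example deliberately swaps in `html.parser.HTMLParser` so that it runs without third-party packages. The class and function names are hypothetical, and the set of skipped tags is an assumption derived from the cleanup list above.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Elements whose content is dropped (assumed from the cleanup list above).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collect the <title> text and the visible text of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def extract(html: str) -> Tuple[str, str]:
    """Return ``(title, text)`` with whitespace collapsed to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return title, text
```

Collapsing every run of whitespace to a single space means line breaks in the HTML source do not affect later substring matching.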
91+
92+
### 3. Content Type
93+
94+
URL references are marked with content type `html_converted` to distinguish them from other reference types like abstracts or full-text articles.
95+
96+
### 4. Caching
97+
98+
Fetched URL content is cached to disk in markdown format with YAML frontmatter:
99+
100+
```markdown
101+
---
102+
reference_id: URL:https://example.com/biology/cell-structure
103+
title: "Chapter 3: Cell Structure and Function"
104+
content_type: html_converted
105+
---
106+
107+
# Chapter 3: Cell Structure and Function
108+
109+
## Content
110+
111+
The cell is the basic structural and functional unit of all living organisms.
112+
Cells contain various organelles that perform specific functions...
113+
```
114+
115+
Cache files are stored in the configured cache directory (default: `.linkml-reference-validator-cache/`).
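
To make the cache format concrete, here is a minimal sketch that renders and writes an entry in the layout shown above. The function names are hypothetical. The file-naming rule (replace `:` and `/` with `_`) is an assumption inferred from the example path `URL_https___example.com_page.md` used elsewhere in this guide; the validator's real rule may differ.

```python
from pathlib import Path


def cache_filename(reference_id: str) -> str:
    """Map a reference ID to a file name (assumed rule, see lead-in)."""
    return reference_id.replace(":", "_").replace("/", "_") + ".md"


def render_cache_entry(reference_id: str, title: str, content: str) -> str:
    """Render markdown with YAML frontmatter in the layout shown above."""
    # Escape backslashes and double quotes so the quoted YAML title stays valid.
    escaped_title = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "---\n"
        f"reference_id: {reference_id}\n"
        f'title: "{escaped_title}"\n'
        "content_type: html_converted\n"
        "---\n"
        "\n"
        f"# {title}\n"
        "\n"
        "## Content\n"
        "\n"
        f"{content}\n"
    )


def write_cache_entry(
    cache_dir: str, reference_id: str, title: str, content: str
) -> Path:
    """Write the entry under ``cache_dir`` and return the file path."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / cache_filename(reference_id)
    path.write_text(
        render_cache_entry(reference_id, title, content), encoding="utf-8"
    )
    return path
```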
116+
117+
## Configuration
118+
119+
URL fetching behavior can be configured:
120+
121+
```yaml
122+
# config.yaml
123+
rate_limit_delay: 0.5 # Wait 0.5 seconds between requests
124+
email: "[email protected]" # Used in user agent
125+
cache_dir: ".cache/references" # Where to cache fetched content
126+
```
127+
128+
Or via command-line:
129+
130+
```bash
131+
linkml-reference-validator validate \
132+
--cache-dir .cache \
133+
--rate-limit-delay 0.5 \
134+
my-data.yaml
135+
```
136+
137+
## Limitations
138+
139+
### Static Content Only
140+
141+
URL validation is designed for static web pages. It may not work well with:
142+
143+
- Dynamic content loaded via JavaScript
144+
- Pages requiring authentication
145+
- Content behind paywalls
146+
- Frequently changing content
147+
148+
### HTML Structure
149+
150+
The content extraction works by:
151+
152+
- Removing navigation, headers, and footers
153+
- Converting remaining HTML to text
154+
- Normalizing whitespace
155+
156+
This works well for simple HTML but may not capture content perfectly from complex layouts.
157+
158+
### No Rendering
159+
160+
The fetcher downloads raw HTML and parses it directly. It does not:
161+
162+
- Execute JavaScript
163+
- Render the page in a browser
164+
- Follow redirects automatically (may be added in the future)
165+
- Handle dynamic content
166+
167+
## Best Practices
168+
169+
### 1. Use Stable URLs
170+
171+
Choose URLs that are unlikely to change:
172+
173+
- ✅ Versioned documentation: `https://docs.example.com/v1.0/chapter1`
174+
- ✅ Archived content: `https://archive.example.com/2024/article`
175+
- ❌ Blog posts with dated URLs that might be reorganized
176+
- ❌ URLs with session parameters
177+
178+
### 2. Verify Content Quality
179+
180+
After adding a URL reference, verify the extracted content:
181+
182+
```bash
183+
# Check what was extracted
184+
cat .linkml-reference-validator-cache/URL_https___example.com_page.md
185+
```
186+
187+
Ensure the extracted text contains the relevant information you're referencing.
188+
189+
### 3. Cache Management
190+
191+
- Commit cache files to version control for reproducibility
192+
- Use `--force-refresh` to update cached content
193+
- Periodically review cached URLs to ensure they're still accessible
194+
195+
### 4. Mix Reference Types
196+
197+
URL references work alongside PMIDs and DOIs:
198+
199+
```yaml
200+
findings:
201+
value: "Multiple studies confirm this relationship"
202+
references:
203+
- "PMID:12345678" # Research paper
204+
- "DOI:10.1234/journal.article" # Another paper
205+
- "https://example.com/textbook/chapter5" # Textbook chapter
206+
```
207+
208+
## Troubleshooting
209+
210+
### URL Not Fetching
211+
212+
If URL content isn't being fetched:
213+
214+
1. Check network connectivity
215+
2. Verify the URL is accessible in a browser
216+
3. Check for rate limiting or IP blocks
217+
4. Look for error messages in the logs
218+
219+
### Incorrect Content Extraction
220+
221+
If the wrong content is extracted:
222+
223+
1. Inspect the cached markdown file
224+
2. Check if the page uses complex JavaScript
225+
3. Consider whether the page structure requires custom parsing
226+
4. File an issue with the page URL for improvement
227+
228+
### Validation Failing
229+
230+
If validation fails for URL references:
231+
232+
1. Check the cached content to see what was extracted
233+
2. Verify your supporting text actually appears on the page
234+
3. Check for whitespace or formatting differences
235+
4. Consider whether the page content has changed since it was cached
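
When checking steps 2 and 3 by hand, a whitespace-insensitive substring test against the cached content can help. The sketch below is only a debugging aid under stated assumptions: the function names are hypothetical, the comparison is case-sensitive, and it does not reproduce the validator's full matching rules (for example, its handling of editorial brackets and ellipses).

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def appears_in(supporting_text: str, reference_content: str) -> bool:
    """Check whether the supporting text occurs in the reference content,
    ignoring differences in whitespace only."""
    needle = normalize_whitespace(supporting_text)
    haystack = normalize_whitespace(reference_content)
    return needle in haystack
```

If `appears_in` returns `False` for text you copied from the page, compare the cached file with the live page to see whether the wording, punctuation, or capitalization differs.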
236+
237+
## Comparison with Other Reference Types
238+
239+
| Feature | PMID | DOI | URL |
240+
|---------|------|-----|-----|
241+
| Source | PubMed | Crossref | Any web page |
242+
| Content Type | Abstract + Full Text | Abstract | HTML converted |
243+
| Metadata | Rich (authors, journal, etc.) | Rich | Minimal (title only) |
244+
| Stability | High | High | Variable |
245+
| Access | Free for abstracts | Varies | Varies |
246+
| Caching | Yes | Yes | Yes |
247+
248+
## See Also
249+
250+
- [Validating DOIs](validate-dois.md) - For journal articles with DOIs
251+
- [Validating OBO Files](validate-obo-files.md) - For ontology-specific validation
252+
- [How It Works](../concepts/how-it-works.md) - Core validation concepts
253+
- [CLI Reference](../reference/cli.md) - Command-line options

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
22

33
**Validate quotes and excerpts against their source publications**
44

5-
linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC and performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.
5+
linkml-reference-validator ensures that text excerpts in your data accurately match their cited sources. It fetches references from PubMed/PMC, DOIs via Crossref, and URLs, then performs deterministic substring matching with support for editorial conventions like brackets `[...]` and ellipsis `...`.
66

77
## Key Features
88

docs/quickstart.md

Lines changed: 28 additions & 0 deletions
@@ -86,6 +86,33 @@ linkml-reference-validator validate text \
8686

8787
This works the same way as PMID validation - the reference is fetched and cached locally.
8888

89+
## Validate Against a URL
90+
91+
For online resources like book chapters, documentation, or educational content:
92+
93+
```bash
94+
linkml-reference-validator validate text \
95+
"The cell is the basic structural and functional unit of all living organisms" \
96+
https://example.com/biology/cell-structure
97+
```
98+
99+
Or with explicit URL prefix:
100+
101+
```bash
102+
linkml-reference-validator validate text \
103+
"The cell is the basic unit of life" \
104+
URL:https://example.com/biology/cells
105+
```
106+
107+
The validator will:
108+
1. Fetch the web page content
109+
2. Extract the title from the `<title>` tag
110+
3. Convert HTML to plain text (removing scripts, styles, navigation)
111+
4. Cache the content locally
112+
5. Validate your text against the extracted content
113+
114+
**Note:** URL validation works best with static HTML pages and may not work well with JavaScript-heavy or dynamic content.
115+
89116
## Key Features
90117

91118
- **Automatic Caching**: References cached locally after first fetch
@@ -94,6 +121,7 @@ This works the same way as PMID validation - the reference is fetched and cached
94121
- **Deterministic Matching**: Substring-based (not AI/fuzzy matching)
95122
- **PubMed & PMC**: Fetches from NCBI automatically
96123
- **DOI Support**: Fetches metadata from Crossref API
124+
- **URL Support**: Validates against web content (books, docs, educational resources)
97125

98126
## Next Steps
99127

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ nav:
3333
- How-To Guides:
3434
- Validating OBO Files: how-to/validate-obo-files.md
3535
- Validating DOIs: how-to/validate-dois.md
36+
- Validating URLs: how-to/validate-urls.md
3637
- Concepts:
3738
- How It Works: concepts/how-it-works.md
3839
- Editorial Conventions: concepts/editorial-conventions.md
