This repository was archived by the owner on Oct 25, 2025. It is now read-only.

Commit 4c0342e

Author: Claude Code Plugin Developer
feat: add static link validation with HTTP accessibility checks
- Add validate-links.sh script with URL validation functions
- Implement HTTP HEAD request checking for link accessibility
- Support curl and wget with graceful fallback
- Add domain blacklist validation
- Integrate static link checking into search-wrapper.sh
- Add ENABLE_LINK_VALIDATION configuration option
- Create comprehensive unit tests for validation
- Add detailed VALIDATION.md documentation
- Return URL status (accessible/inaccessible/unknown) in results
1 parent f655b1b commit 4c0342e


4 files changed: 746 additions and 4 deletions


docs/VALIDATION.md

Lines changed: 336 additions & 0 deletions
@@ -0,0 +1,336 @@
# Search Result Validation

The Gemini Search Plugin includes comprehensive validation to ensure search results are relevant, accurate, and accessible.

## Validation Layers

### 1. False Positive Detection

**Purpose**: Filter out irrelevant search results

**How it works**:
- Calculates a relevance score by matching query terms against the result title, snippet, and URL
- Rejects results below the minimum relevance threshold of 50%
- Returns each result with its relevance score

**Example**:
```
Query: "Claude Code plugins"
Result: "Plugin Development Guide - Claude Code Documentation"
Relevance: 100% (all terms matched)
Status: VALID
```

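The scoring described above can be sketched roughly as follows. This is an illustration only, not the plugin's `calculate_relevance_score`: the function name `relevance_percent` is hypothetical, and the sketch assumes plain case-insensitive substring matching with integer division, so it will not reproduce any stemming or weighting the real scorer may apply.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: share of query terms found in the result text.
# Not the plugin's implementation; matching is a plain substring test.
relevance_percent() {
    local query haystack term
    local -a terms
    local matched=0

    # Lowercase both sides so the comparison is case-insensitive.
    query="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    haystack="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"

    # Split the query on whitespace without triggering glob expansion.
    read -r -a terms <<< "$query"
    if (( ${#terms[@]} == 0 )); then
        echo 0
        return 0
    fi

    for term in "${terms[@]}"; do
        if [[ "$haystack" == *"$term"* ]]; then
            matched=$((matched + 1))
        fi
    done

    # Integer percentage, rounded down.
    echo $(( matched * 100 / ${#terms[@]} ))
}

# Title, snippet, and URL are passed as one concatenated text.
relevance_percent "claude plugins" "Claude Plugins Guide https://claude.com/plugins"
```

A result would then be kept only when this value is at least 50, which matches the threshold stated above.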
### 2. Static Link Validation

**Purpose**: Verify URLs exist and are accessible

**How it works**:
- Sends an HTTP HEAD request to check URL accessibility
- Treats HTTP status codes 200-399 as valid
- Follows redirects (up to 5 by default)
- Times out after 10 seconds
- Falls back gracefully if no HTTP client is available

**Tools used** (in order of preference):
1. `curl` (primary)
2. `wget` (fallback)
3. Skip validation (if neither is available)

**Example**:
```
URL: https://docs.claude.com/plugins
Status: HTTP 200
Result: accessible ✓
```

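One way to implement the check described above is sketched below. The function names `is_success_status` and `probe_url` are hypothetical and this is not the code in `validate-links.sh`; the sketch assumes the documented defaults (`TIMEOUT_SECONDS=10`, `MAX_REDIRECTS=5`) and the documented preference for `curl` over `wget`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a HEAD-based accessibility check.
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-10}"
MAX_REDIRECTS="${MAX_REDIRECTS:-5}"

# Succeeds when the argument is a three-digit status code in the 200-399 range.
is_success_status() {
    [[ "$1" =~ ^[0-9]{3}$ ]] || return 1
    (( 10#$1 >= 200 && 10#$1 <= 399 ))
}

# Returns 0 = accessible, 1 = inaccessible, 2 = unknown (no HTTP client found).
probe_url() {
    local url="$1" code
    if command -v curl >/dev/null 2>&1; then
        # -I sends HEAD, -L follows redirects, -w prints the final status code.
        code="$(curl -s -o /dev/null -I -L \
            --max-redirs "$MAX_REDIRECTS" --max-time "$TIMEOUT_SECONDS" \
            -w '%{http_code}' "$url")"
        is_success_status "$code"
    elif command -v wget >/dev/null 2>&1; then
        # --spider checks the URL without downloading the body.
        wget --spider --quiet --timeout="$TIMEOUT_SECONDS" \
            --max-redirect="$MAX_REDIRECTS" "$url" || return 1
    else
        return 2
    fi
}
```

Calling `probe_url "https://docs.claude.com/plugins"` and inspecting `$?` would distinguish the three outcomes. Some servers reject HEAD requests, so a check like this can report a working page as inaccessible.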
### 3. URL Format Validation

**Purpose**: Ensure URLs have valid structure

**Checks**:
- Protocol: Must be `http://` or `https://`
- Domain: Valid domain name format
- Path: Optional, any valid path

**Examples**:
```
✓ https://docs.claude.com/plugins
✓ http://example.org/page
✗ not-a-url
✗ ftp://example.com
```

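The structural checks above could be approximated with a single regular expression. The function name `looks_like_http_url` and the pattern are assumptions made for illustration; the plugin's `validate_url_format` may be stricter or looser.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a structural URL check.
looks_like_http_url() {
    # Scheme, then a host of letters, digits, dots, and hyphens,
    # then an optional port and an optional path.
    local pattern='^https?://[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?(:[0-9]+)?(/.*)?$'
    [[ "$1" =~ $pattern ]]
}

for candidate in "https://docs.claude.com/plugins" "not-a-url" "ftp://example.com"; do
    if looks_like_http_url "$candidate"; then
        echo "valid:   $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```

The pattern is deliberately loose: it does not check top-level domains or label lengths, and it rejects a query string that follows the host without a path.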
### 4. Domain Blacklist

**Purpose**: Filter out test domains and local or loopback addresses

**Blacklisted domains**:
- `example.com`
- `test.com`
- `invalid.com`
- `localhost`
- `127.0.0.1`
- `0.0.0.0`
- `::1`
- `*.local`

**Example**:
```
URL: https://example.com/test
Status: INVALID (blacklisted domain)
```

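A blacklist lookup along these lines might extract the host and compare it against the list above. The function name `url_host_allowed` is hypothetical, and the sketch assumes exact host matching (plus the `*.local` suffix rule), so a subdomain such as `www.example.com` would pass here even if the plugin's `check_url_blacklist` rejects it. The return convention mirrors the one documented for `check_url_blacklist`: 0 means not blacklisted.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the plugin's check_url_blacklist may match differently.
# Returns 0 when the host is allowed and 1 when it is blacklisted.
url_host_allowed() {
    local host="${1#*://}"   # drop the scheme
    host="${host%%/*}"       # drop the path
    host="${host##*@}"       # drop any userinfo
    if [[ "$host" == "["* ]]; then
        # Bracketed IPv6 literal such as [::1]:8080
        host="${host#\[}"
        host="${host%%\]*}"
    else
        host="${host%%:*}"   # drop the port
    fi
    host="$(printf '%s' "$host" | tr '[:upper:]' '[:lower:]')"

    case "$host" in
        example.com|test.com|invalid.com|localhost|127.0.0.1|0.0.0.0|::1|*.local)
            return 1 ;;
        *)
            return 0 ;;
    esac
}
```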
## Configuration

### Enable/Disable Validation

```bash
# Enable static link validation (default)
export ENABLE_LINK_VALIDATION=true

# Disable static link validation (faster, less accurate)
export ENABLE_LINK_VALIDATION=false
```

### Timeout Configuration

```bash
# HTTP request timeout in seconds (default: 10)
export TIMEOUT_SECONDS=10

# Maximum HTTP redirects to follow (default: 5)
export MAX_REDIRECTS=5
```

### Relevance Threshold

The threshold is currently hardcoded to 50%. To change it, edit the condition in `scripts/search-wrapper.sh`:

```bash
# Line 175
if [[ $relevance_percentage -ge 50 ]] && [[ "$is_valid" == "true" ]]; then
```

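If the threshold were made configurable, one option would be to follow the same `${VAR:-default}` pattern the wrapper already uses for `ENABLE_LINK_VALIDATION`. The variable `RELEVANCE_THRESHOLD` and the function `meets_relevance_threshold` below are hypothetical; the plugin does not read such a variable today.

```bash
#!/usr/bin/env bash
# Hypothetical: RELEVANCE_THRESHOLD is not a variable the plugin reads today.
RELEVANCE_THRESHOLD="${RELEVANCE_THRESHOLD:-50}"

# Succeeds when the score is a whole number at or above the threshold.
meets_relevance_threshold() {
    local score="$1"
    [[ "$score" =~ ^[0-9]+$ ]] || return 1
    (( 10#$score >= 10#$RELEVANCE_THRESHOLD ))
}

if meets_relevance_threshold 85; then
    echo "85 passes a threshold of $RELEVANCE_THRESHOLD"
fi
```

The sketch assumes the threshold itself is a whole number; a non-numeric value would need its own validation.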
## Validation Output Format

Results include validation metadata:

```
VALID|85|accessible
```

Format: `STATUS|RELEVANCE_SCORE|URL_STATUS`

- **STATUS**: `VALID` or `INVALID`
- **RELEVANCE_SCORE**: 0-100 percentage
- **URL_STATUS**: `accessible`, `inaccessible`, or `unknown` (the link check did not run, for example because it is disabled)

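A caller could split this pipe-delimited line with `read`. The function name `parse_validation_result` is hypothetical; the fallback to `unknown` is an assumption that lets the sketch also accept the two-field `STATUS|RELEVANCE_SCORE` form the wrapper emitted before this commit.

```bash
#!/usr/bin/env bash
# Hypothetical consumer of the STATUS|RELEVANCE_SCORE|URL_STATUS line.
parse_validation_result() {
    local status score url_status
    # Split on the pipe character; a missing third field falls back to "unknown".
    IFS='|' read -r status score url_status <<< "$1"
    printf 'status=%s score=%s url_status=%s\n' \
        "$status" "$score" "${url_status:-unknown}"
}

parse_validation_result "VALID|85|accessible"
```

Setting `IFS` only on the `read` command keeps the change local to that one statement, so the rest of the script keeps its normal word splitting.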
## Performance Considerations

### With Link Validation Enabled

**Pros**:
- ✅ Filters out broken links
- ✅ Higher quality results
- ✅ Better user experience

**Cons**:
- ⏱️ Slower (adds ~1-2s per result)
- 🌐 Requires network access
- 💾 Not cached

**Best for**: Production use, critical searches

### With Link Validation Disabled

**Pros**:
- ⚡ Faster results
- 📡 Works offline
- 💨 Lower latency

**Cons**:
- ❌ May return broken links
- ⚠️ Lower quality assurance

**Best for**: Development, testing, offline use

## Testing Validation

### Unit Tests

Run validation tests:

```bash
bash tests/test-link-validation.sh
```

### Manual Testing

Test individual validation functions:

```bash
# Source the validation script
source scripts/validate-links.sh

# Test URL format
validate_url_format "https://docs.claude.com/plugins"
echo $? # 0 = valid, 1 = invalid

# Test URL exists
check_url_exists "https://docs.claude.com/plugins"
echo $? # 0 = exists, 1 = doesn't exist

# Test blacklist
check_url_blacklist "https://example.com/test"
echo $? # 0 = not blacklisted, 1 = blacklisted

# Calculate relevance
calculate_relevance_score "claude plugins" "Claude Plugin Guide" "Guide to plugins" "https://claude.com/plugins"
# Returns: 100
```

### Full Validation Test

```bash
bash scripts/validate-links.sh \
  "claude code plugins" \
  "Plugin Development Guide" \
  "https://docs.claude.com/plugins" \
  "Comprehensive guide to developing plugins for Claude Code"
```

Output:
```json
{
  "valid": true,
  "url": "https://docs.claude.com/plugins",
  "url_status": "accessible",
  "relevance_score": 100,
  "relevance_threshold": 50,
  "failure_reasons": []
}
```

## Debugging Validation Issues

### Enable Debug Logging

```bash
export LOG_FILE="/tmp/gemini-search-debug.log"

# Run a search (plugin slash command, not a shell command)
/search "your query"

# View logs
tail -f /tmp/gemini-search-debug.log | grep -E "Validating|accessible"
```

### Common Issues

#### Issue: All results marked INVALID

**Cause**: Link validation is timing out, so every URL is reported as inaccessible

**Solution**:
```bash
# Increase timeout
export TIMEOUT_SECONDS=30

# Or disable link validation
export ENABLE_LINK_VALIDATION=false
```

#### Issue: Validation too slow

**Cause**: HTTP requests are taking too long

**Solution**:
```bash
# Reduce timeout
export TIMEOUT_SECONDS=5

# Reduce max redirects
export MAX_REDIRECTS=2
```

#### Issue: "No HTTP client available"

**Cause**: Neither curl nor wget is installed

**Solution**:
```bash
# Install curl (Ubuntu/Debian)
sudo apt-get install curl

# Install curl (macOS)
brew install curl

# Install curl (Windows/Chocolatey)
choco install curl
```

## Validation Statistics

View validation performance:

```bash
/search-stats
```

Shows:
- Total searches
- Cache hit rate
- Average relevance scores (future feature)
- URL accessibility rate (future feature)

## Future Enhancements

Planned validation improvements:

- [ ] SSL certificate validation
- [ ] Content-type checking (HTML only)
- [ ] Duplicate URL detection
- [ ] Custom blacklist configuration
- [ ] Whitelist support
- [ ] Validation result caching
- [ ] Async validation (parallel checks)
- [ ] Configurable relevance thresholds
- [ ] Machine learning relevance scoring

## Best Practices

### For Users

1. **Enable link validation in production**
   - Ensures high-quality results
   - Prevents dead links

2. **Disable link validation for development**
   - Faster iteration
   - Works offline

3. **Monitor validation logs**
   - Identify patterns
   - Tune thresholds

### For Developers

1. **Test with validation enabled and disabled**
   - Ensure both modes work
   - Handle graceful degradation

2. **Add validation tests**
   - Test new validation rules
   - Prevent regressions

3. **Document validation behavior**
   - Update VALIDATION.md
   - Add examples

## Related Documentation

- [README.md](../README.md) - Overview and features
- [TESTING.md](../TESTING.md) - Testing guide
- [DEPLOYMENT.md](../DEPLOYMENT.md) - Deployment procedures
- [scripts/validate-links.sh](../scripts/validate-links.sh) - Validation implementation

scripts/search-wrapper.sh

Lines changed: 23 additions & 4 deletions
@@ -12,6 +12,8 @@ LOG_FILE="${LOG_FILE:-/tmp/gemini-search.log}"
 ERROR_LOG_FILE="${ERROR_LOG_FILE:-/tmp/gemini-search-errors.log}"
 MAX_RETRIES="${MAX_RETRIES:-3}"
 RETRY_DELAY="${RETRY_DELAY:-1}" # seconds
+ENABLE_LINK_VALIDATION="${ENABLE_LINK_VALIDATION:-true}" # Enable static link validation
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
 # Logging function
 log_message() {
@@ -150,14 +152,31 @@ validate_search_result() {
         log_message "DEBUG" "Result failed validation: contains invalid domain"
     fi
 
-    # Additional checks can be added here
-
+    # Enhanced validation: Check if URL exists (static link check)
+    local url_status="unknown"
+    if [[ "$ENABLE_LINK_VALIDATION" == "true" ]] && [[ "$is_valid" == "true" ]] && [[ -x "$SCRIPT_DIR/validate-links.sh" ]]; then
+        log_message "DEBUG" "Performing static link validation for: $url"
+
+        # Source validation functions
+        source "$SCRIPT_DIR/validate-links.sh"
+
+        # Check if URL exists
+        if check_url_exists "$url"; then
+            url_status="accessible"
+            log_message "DEBUG" "URL is accessible: $url"
+        else
+            url_status="inaccessible"
+            is_valid=false
+            log_message "DEBUG" "URL is not accessible: $url"
+        fi
+    fi
+
     # Return validation result
     if [[ $relevance_percentage -ge 50 ]] && [[ "$is_valid" == "true" ]]; then
-        echo "VALID|$relevance_percentage" # Valid result with relevance score
+        echo "VALID|$relevance_percentage|$url_status" # Valid result with relevance score and URL status
         return 0
     else
-        echo "INVALID|$relevance_percentage" # Invalid result with relevance score
+        echo "INVALID|$relevance_percentage|$url_status" # Invalid result with relevance score and URL status
         return 1
     fi
 }
