Description
This feature adds URL crawling capability to Doc Detective, enabling automatic discovery of additional test inputs by following links found in initial input documents. The crawler respects same-origin restrictions and provides configurable opt-in/opt-out behavior based on protocol.
Goals
- Reduce manual configuration effort by automatically discovering related documentation pages
- Maintain security and scope control through strict origin matching
- Provide sensible defaults while allowing user override via CLI
- Integrate seamlessly with existing input processing pipeline
Non-Goals
- Crawling across different origins
- Per-input crawl configuration
- Configurable crawl depth limits
- Custom origin allowlists
- Authentication handling for protected content
User Stories
As a technical writer, I want Doc Detective to automatically discover and test all pages in my documentation site so that I don't have to manually list every URL in my config.
As a technical writer, I want Doc Detective to find all linked local files when I specify a local path to crawl.
As a technical writer, I want crawling to respect my site's boundaries so that tests don't accidentally follow external links or navigate to unrelated content.
As a technical writer, I want to disable crawling for specific test runs so that I can test individual pages during development without processing the entire site.
Functional Requirements
Core Crawling Behavior
Requirement 1.1: URL Pattern Recognition
- The crawler MUST extract URLs from the following markup patterns:
  - XML sitemaps
  - HTML: `<a>` tags with `href` attributes
  - Markdown: `[text](url)` syntax
- The crawler MUST NOT extract URLs from any other markup patterns in the initial implementation
Requirement 1.2: Origin Matching
- The crawler MUST only follow URLs that strictly match the origin of the initial input URL
- Origin matching MUST compare protocol, domain, and port
- Example: `https://example.com:443/page1` can crawl to `https://example.com:443/page2` but NOT to:
  - `http://example.com/page2` (different protocol)
  - `https://subdomain.example.com/page2` (different domain)
  - `https://example.com:8080/page2` (different port)
Requirement 1.3: Relative Link Resolution
- When the initial input is a URL and the config contains an `origin` field, the crawler MUST resolve relative URLs against that origin
- When the config does NOT contain an `origin` field, the crawler MUST:
  - Skip relative URLs
  - Display a warning message indicating relative links were skipped
  - Continue processing absolute URLs normally
Requirement 1.4: Deduplication
- The crawler MUST track all visited URLs globally
- The crawler MUST NOT process the same URL more than once
- URL comparison for deduplication MUST be case-sensitive and exact (including query parameters and fragments)
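For illustration only, a minimal sketch of the exact-match deduplication described above; the `markVisited` helper is a hypothetical name, not part of this spec:

```js
// Sketch only: exact, case-sensitive string comparison via Set membership.
const visitedUrls = new Set();

function markVisited(url) {
  // No normalization: differing query parameters or fragments count as different URLs.
  if (visitedUrls.has(url)) return false; // Already processed; skip
  visitedUrls.add(url);
  return true;
}

markVisited('https://example.com/page?a=1'); // true (new)
markVisited('https://example.com/page?a=1'); // false (exact duplicate)
markVisited('https://example.com/page#top'); // true (fragment differs)
```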
Requirement 1.5: Input Array Management
- The crawler MUST append discovered URLs to the end of the inputs array
- The crawler MUST preserve the order of discovery
- Discovered URLs MUST be added as new input objects compatible with Doc Detective's existing input processing
Requirement 1.6: Crawl Limits
- The crawler MUST enforce an internal maximum of 10,000 URLs
- When this limit is reached, the crawler MUST:
  - Stop discovering new URLs
  - Log a warning indicating the limit was reached
  - Continue processing already-discovered URLs
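A minimal sketch of how the limit might gate discovery, assuming a shared crawl-state object; the names `MAX_CRAWL_URLS` and `enqueueUrl` are illustrative:

```js
// Sketch only: stop discovering new URLs once the internal cap is reached.
const MAX_CRAWL_URLS = 10000;

function enqueueUrl(state, url) {
  // state: { urlQueue: [], discovered: 0, limitWarned: false }
  if (state.discovered >= MAX_CRAWL_URLS) {
    if (!state.limitWarned) {
      console.warn(`Crawl limit of ${MAX_CRAWL_URLS} URLs reached; no new URLs will be discovered.`);
      state.limitWarned = true;
    }
    return; // Already-discovered URLs continue to be processed
  }
  state.urlQueue.push({ url });
  state.discovered += 1;
}
```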
Requirement 1.7: Parallelization
- The crawler SHOULD fetch URLs in parallel where possible to improve performance
- The implementation SHOULD use Node.js asynchronous patterns appropriately
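One way this could look, sketched with `Promise.allSettled` and the global `fetch` available in recent Node.js versions; the batch size is an arbitrary assumption:

```js
// Sketch only: fetch queued URLs in small parallel batches.
async function fetchBatch(urls, batchSize = 5) {
  const responses = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    // allSettled so a single failed fetch never rejects the whole batch
    const results = await Promise.allSettled(batch.map((url) => fetch(url)));
    for (const result of results) {
      if (result.status === 'fulfilled') responses.push(result.value);
    }
  }
  return responses;
}
```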
Default Behavior
Requirement 2.1: Protocol-Based Defaults
- Crawling MUST be enabled by default for inputs with `http://` or `https://` protocols
- Crawling MUST be disabled by default for inputs with any other protocol (e.g., `file://`)
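A sketch of the protocol check; the function name is illustrative, and inputs that are not parseable URLs (such as plain local paths) fall through to the disabled default:

```js
// Sketch only: crawling defaults on for HTTP(S) inputs, off for everything else.
function crawlEnabledByDefault(input) {
  try {
    const { protocol } = new URL(input);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false; // Not a parseable URL (e.g., a plain local path)
  }
}

crawlEnabledByDefault('https://example.com/docs'); // true
crawlEnabledByDefault('file:///home/user/docs/index.html'); // false
```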
Configuration
Requirement 3.1: Config File Field
- The config file MUST support a `crawl` boolean field
- When `crawl: true`, crawling is enabled regardless of protocol
- When `crawl: false`, crawling is disabled regardless of protocol
- When `crawl` is not specified, protocol-based defaults apply (Requirement 2.1)
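For illustration, a config snippet with the new field; aside from `crawl` and `origin`, the surrounding structure (such as the `input` field) is assumed rather than defined by this spec:

```json
{
  "input": "https://example.com/docs",
  "origin": "https://example.com",
  "crawl": true
}
```

Omitting `crawl` entirely leaves the protocol-based defaults from Requirement 2.1 in effect.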
Requirement 3.2: CLI Arguments
- The CLI MUST support a `--crawl` argument that sets `crawl: true`
- The CLI MUST support a `--no-crawl` argument that sets `crawl: false`
- CLI arguments MUST override the config file's `crawl` field
- CLI arguments MUST override protocol-based defaults
Requirement 3.3: Configuration Precedence
The order of precedence from highest to lowest:
- CLI arguments (`--crawl` or `--no-crawl`)
- Config file `crawl` field
- Protocol-based defaults
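A sketch of resolving the effective setting in this order; parameter names are illustrative, and `crawlEnabledByDefault` refers to the sketch under Requirement 2.1:

```js
// Sketch only: CLI flag > config file field > protocol-based default.
function resolveCrawlSetting({ cliCrawl, configCrawl, input }) {
  if (typeof cliCrawl === 'boolean') return cliCrawl;       // --crawl / --no-crawl
  if (typeof configCrawl === 'boolean') return configCrawl; // config file crawl field
  return crawlEnabledByDefault(input);                      // Requirement 2.1
}
```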
Error Handling
Requirement 4.1: Failed URL Fetches
- When a URL fetch fails (404, timeout, network error, etc.), the crawler MUST:
  - Log the error with sufficient detail for debugging
  - Continue crawling other URLs
  - NOT fail the entire test run
Requirement 4.2: Non-Document Content
- When a URL returns non-document content (images, PDFs, binaries, etc.), the crawler MUST:
  - Add the URL to the inputs array
  - Allow downstream processes to handle the content appropriately
Requirement 4.3: Timeout Handling
- The crawler MUST delegate timeout handling to the underlying fetch library
- The crawler MUST NOT implement custom timeout logic
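A sketch consistent with Requirements 4.1–4.3: failures are logged and skipped so the run continues, and timeout behavior is left entirely to the fetch library. The helper name is illustrative:

```js
// Sketch only: fetch a page without ever failing the test run.
async function tryFetchPage(url) {
  try {
    const response = await fetch(url); // No custom timeout; the fetch library decides
    if (!response.ok) {
      console.warn(`Crawl fetch failed for ${url}: HTTP ${response.status}`);
      return null;
    }
    return await response.text();
  } catch (error) {
    console.warn(`Crawl fetch failed for ${url}: ${error.message}`);
    return null; // Continue crawling other URLs
  }
}
```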
Technical Specifications
Implementation Location
The crawling functionality should be implemented as part of the input processing pipeline, executing before tests run but after initial config parsing.
Data Structures
Visited URLs Tracking:
```js
// Set for O(1) lookup and automatic deduplication
const visitedUrls = new Set();
```

URL Queue:
```js
// Queue of URLs to process, with associated metadata
const urlQueue = [
  {
    url: 'https://example.com/page',
    depth: 0, // Track for logging/debugging even though no max depth
    sourceUrl: 'https://example.com' // For debugging
  }
];
```

URL Extraction Functions
HTML Extraction:
```js
/**
 * Extracts URLs from HTML <a> tags with href attributes.
 *
 * @param {string} html - The HTML content to parse
 * @returns {string[]} - Array of extracted URLs
 */
function extractHtmlUrls(html) {
  // Implementation should use a proper HTML parser
  // Regular expressions are insufficient for robust HTML parsing
}
```
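A possible implementation sketch of the stub above. It assumes the cheerio HTML parser, which this spec does not mandate; any robust parser would work:

```js
// Sketch only: parse HTML with cheerio (assumed dependency) and collect href values.
const cheerio = require('cheerio');

function extractHtmlUrls(html) {
  const $ = cheerio.load(html);
  const urls = [];
  $('a[href]').each((_, element) => {
    urls.push($(element).attr('href'));
  });
  return urls;
}
```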
Markdown Extraction:
```js
/**
 * Extracts URLs from Markdown [text](url) syntax.
 *
 * @param {string} markdown - The Markdown content to parse
 * @returns {string[]} - Array of extracted URLs
 */
function extractMarkdownUrls(markdown) {
  // Implementation should handle escaped brackets and nested structures
  // Pattern: \[([^\]]+)\]\(([^)]+)\)
}
```
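A simplified sketch of the stub above using the pattern noted in its comments; a production version would also need to handle escaped brackets and nested structures:

```js
// Sketch only: capture the URL portion of [text](url) links.
function extractMarkdownUrls(markdown) {
  const linkPattern = /\[([^\]]+)\]\(([^)]+)\)/g;
  const urls = [];
  for (const match of markdown.matchAll(linkPattern)) {
    urls.push(match[2]); // Second capture group is the URL
  }
  return urls;
}
```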
Origin Matching
```js
/**
 * Compares two URLs for strict origin matching.
 *
 * @param {string} url1 - First URL to compare
 * @param {string} url2 - Second URL to compare
 * @returns {boolean} - True if origins match strictly
 */
function isSameOrigin(url1, url2) {
  // Use URL API for reliable parsing
  // Compare protocol, hostname, and port
}
```
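A sketch of the stub above using the WHATWG URL API; unparseable URLs are treated as non-matching:

```js
// Sketch only: strict comparison of protocol, hostname, and port.
function isSameOrigin(url1, url2) {
  try {
    const a = new URL(url1);
    const b = new URL(url2);
    return a.protocol === b.protocol && a.hostname === b.hostname && a.port === b.port;
  } catch {
    return false; // Unparseable URLs never match
  }
}
```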
Relative URL Resolution
```js
/**
 * Resolves a relative URL against a base origin.
 *
 * @param {string} relativeUrl - The relative URL to resolve
 * @param {string} baseOrigin - The origin to resolve against
 * @returns {string|null} - Resolved absolute URL or null if resolution fails
 */
function resolveRelativeUrl(relativeUrl, baseOrigin) {
  // Use URL API: new URL(relativeUrl, baseOrigin)
  // Handle edge cases and malformed URLs gracefully
}
```
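A sketch of the stub above; the URL constructor performs the resolution and any failure returns null:

```js
// Sketch only: resolve against the configured origin, or return null on failure.
function resolveRelativeUrl(relativeUrl, baseOrigin) {
  try {
    return new URL(relativeUrl, baseOrigin).href;
  } catch {
    return null; // Malformed input: the caller skips the link (and may warn)
  }
}

resolveRelativeUrl('/docs/page2', 'https://example.com'); // 'https://example.com/docs/page2'
resolveRelativeUrl('../guide', 'not a valid origin');     // null
```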
Testing Requirements
Unit Tests
Tests should be written using Mocha and should cover:
- URL Extraction:
  - Extract single URL from HTML
  - Extract multiple URLs from HTML
  - Ignore non-`<a>` tag URLs in HTML
  - Extract single URL from Markdown
  - Extract multiple URLs from Markdown
  - Ignore non-link URLs in Markdown
  - Handle malformed markup gracefully
- Origin Matching:
  - Same protocol, domain, and port returns true
  - Different protocol returns false
  - Different domain returns false
  - Different port returns false
  - Subdomain differences return false
- Relative URL Resolution:
  - Resolve relative path against origin
  - Resolve relative path with `../` navigation
  - Resolve absolute path (starting with `/`)
  - Return null for malformed relative URLs
  - Warn when `origin` config is missing
- Deduplication:
  - Same URL not processed twice
  - URLs with different fragments treated as different
  - URLs with different query parameters treated as different
- Configuration:
  - `--crawl` enables crawling
  - `--no-crawl` disables crawling
  - CLI overrides config file
  - Config file overrides defaults
  - Protocol-based defaults work correctly
- Error Handling:
  - 404 errors don't stop crawling
  - Network errors don't stop crawling
  - Timeout errors don't stop crawling
  - Non-document content added to inputs
- Limits:
  - 10,000 URL limit enforced
  - Warning logged when limit reached
  - Crawling stops at limit
Integration Tests
Integration tests should verify:
- End-to-End Crawling:
  - Starting with a single HTTPS URL crawls same-origin links
  - Discovered URLs added to inputs array
  - Inputs processed in correct order
- Cross-Protocol Behavior:
  - HTTPS URLs crawled by default
  - HTTP URLs crawled by default
  - File URLs not crawled by default
- Configuration Integration:
  - Config file `crawl` field respected
  - CLI arguments override config
  - `origin` field used for relative URL resolution
Success Metrics
- Users can test entire documentation sites with a single initial URL
- Crawling respects site boundaries (no cross-origin crawling)
- Performance remains acceptable for sites with hundreds of pages
- No false positives from incorrectly parsed URLs
- Clear error messages guide users when issues occur
Documentation Requirements
User-Facing Documentation
The following documentation must be created or updated:
- Configuration Reference:
  - Document the `crawl` boolean field
  - Document the `origin` field's role in relative URL resolution
  - Provide examples of enabled and disabled crawling
- CLI Reference:
  - Document the `--crawl` flag
  - Document the `--no-crawl` flag
  - Show examples of CLI usage
- Feature Guide:
  - Explain what URL crawling does
  - Explain origin restrictions
  - Provide use cases and examples
  - Document limitations (HTML `<a>` and Markdown only)
  - Explain the 10,000 URL limit
Code Documentation
- All public functions must have JSDoc comments
- Complex algorithms must have inline comments explaining the approach
- Edge cases must be documented in comments
Open Questions
None. All clarifications have been addressed.
Future Enhancements
The following features are explicitly out of scope for this initial implementation but may be considered for future versions:
- Configurable crawl depth limits
- Custom origin allowlists
- Additional markup pattern support (e.g., XML sitemaps, RSS feeds)
- Per-input crawl configuration
- Respect for `robots.txt`
- Rate limiting to avoid overwhelming servers
- Content-type filtering before adding to inputs
- Authentication handling for protected content