Input Crawling #74

@hawkeyexl

Description

This feature adds URL crawling to Doc Detective, enabling automatic discovery of additional test inputs by following links found in initial input documents. The crawler enforces strict same-origin restrictions and uses protocol-based defaults that users can override to opt in or out of crawling.

Goals

  • Reduce manual configuration effort by automatically discovering related documentation pages
  • Maintain security and scope control through strict origin matching
  • Provide sensible defaults while allowing user override via CLI
  • Integrate seamlessly with existing input processing pipeline

Non-Goals

  • Crawling across different origins
  • Per-input crawl configuration
  • Configurable crawl depth limits
  • Custom origin allowlists
  • Authentication handling for protected content

User Stories

As a technical writer, I want Doc Detective to automatically discover and test all pages in my documentation site so that I don't have to manually list every URL in my config.

As a technical writer, I want Doc Detective to find all linked local files when I specify a local path to crawl.

As a technical writer, I want crawling to respect my site's boundaries so that tests don't accidentally follow external links or navigate to unrelated content.

As a technical writer, I want to disable crawling for specific test runs so that I can test individual pages during development without processing the entire site.

Functional Requirements

Core Crawling Behavior

Requirement 1.1: URL Pattern Recognition

  • The crawler MUST extract URLs from the following markup patterns:
    • XML sitemaps
    • HTML: <a> tags with href attributes
    • Markdown: [text](url) syntax
  • The crawler MUST NOT extract URLs from any other markup patterns in the initial implementation

Requirement 1.2: Origin Matching

  • The crawler MUST only follow URLs that strictly match the origin of the initial input URL
  • Origin matching MUST compare protocol, domain, and port
  • Example: https://example.com:443/page1 can crawl to https://example.com:443/page2 but NOT to:
    • http://example.com/page2 (different protocol)
    • https://subdomain.example.com/page2 (different domain)
    • https://example.com:8080/page2 (different port)

Requirement 1.3: Relative Link Resolution

  • When the initial input is a URL and the config contains an origin field, the crawler MUST resolve relative URLs against that origin
  • When the config does NOT contain an origin field, the crawler MUST:
    • Skip relative URLs
    • Display a warning message indicating relative links were skipped
    • Continue processing absolute URLs normally
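
For illustration only, the rule above might look like the following sketch; the config shape and the helper name are assumptions for the example, not existing Doc Detective APIs.

// Sketch of Requirement 1.3; config.origin is the origin config field described above.
// Error handling for malformed URLs is omitted for brevity.
function resolveDiscoveredLink(href, config) {
  // Absolute URLs pass through untouched
  if (/^https?:\/\//i.test(href)) return href;

  // No origin configured: skip the relative link and warn
  if (!config || !config.origin) {
    console.warn(`Skipping relative link "${href}": no origin is configured.`);
    return null;
  }

  // Resolve against the configured origin
  return new URL(href, config.origin).href;
}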

Requirement 1.4: Deduplication

  • The crawler MUST track all visited URLs globally
  • The crawler MUST NOT process the same URL more than once
  • URL comparison for deduplication MUST be case-sensitive and exact (including query parameters and fragments)
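
As a concrete illustration, tracking visited URLs by their full string makes query and fragment variants distinct entries:

const visitedUrls = new Set();
visitedUrls.add('https://example.com/page');
visitedUrls.add('https://example.com/page?lang=en'); // different query string: distinct
visitedUrls.add('https://example.com/page#install'); // different fragment: distinct
console.log(visitedUrls.size); // 3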

Requirement 1.5: Input Array Management

  • The crawler MUST append discovered URLs to the end of the inputs array
  • The crawler MUST preserve the order of discovery
  • Discovered URLs MUST be added as new input objects compatible with Doc Detective's existing input processing

Requirement 1.6: Crawl Limits

  • The crawler MUST enforce an internal maximum of 10,000 URLs
  • When this limit is reached, the crawler MUST:
    • Stop discovering new URLs
    • Log a warning indicating the limit was reached
    • Continue processing already-discovered URLs
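
A sketch of the limit check, assuming the Set-based tracking described under Data Structures below; MAX_CRAWL_URLS and enqueueUrl are illustrative names, not existing Doc Detective identifiers.

const MAX_CRAWL_URLS = 10000;

function enqueueUrl(urlQueue, visitedUrls, url) {
  // Stop discovering new URLs once the internal cap is reached
  if (visitedUrls.size >= MAX_CRAWL_URLS) {
    console.warn(`Crawl limit of ${MAX_CRAWL_URLS} URLs reached; not queueing ${url}.`);
    return false;
  }
  if (!visitedUrls.has(url)) {
    visitedUrls.add(url);
    urlQueue.push({ url });
  }
  return true;
}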

Requirement 1.7: Parallelization

  • The crawler SHOULD fetch URLs in parallel where possible to improve performance
  • The implementation SHOULD use Node.js asynchronous patterns appropriately
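
One possible approach, shown as a sketch, is to fetch in small batches with Promise.allSettled so a single rejection does not abort the batch; the batch size and the fetchAndExtract helper (assumed to resolve to an array of discovered URLs) are illustrative choices, not requirements.

async function crawlBatch(urls, fetchAndExtract, batchSize = 5) {
  const discovered = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    // allSettled: one rejected fetch doesn't stop the others (see Requirement 4.1)
    const results = await Promise.allSettled(batch.map((url) => fetchAndExtract(url)));
    for (const result of results) {
      if (result.status === 'fulfilled') discovered.push(...result.value);
    }
  }
  return discovered;
}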

Default Behavior

Requirement 2.1: Protocol-Based Defaults

  • Crawling MUST be enabled by default for inputs with http:// or https:// protocols
  • Crawling MUST be disabled by default for inputs with any other protocol (e.g., file://)

Configuration

Requirement 3.1: Config File Field

  • The config file MUST support a crawl boolean field
  • When crawl: true, crawling is enabled regardless of protocol
  • When crawl: false, crawling is disabled regardless of protocol
  • When crawl is not specified, protocol-based defaults apply (Requirement 2.1)
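
For example, a config file might enable crawling explicitly; crawl and origin are the fields discussed in this spec, while the other fields and the exact file layout are illustrative.

{
  "input": "https://example.com/docs",
  "origin": "https://example.com",
  "crawl": true
}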

Requirement 3.2: CLI Arguments

  • The CLI MUST support a --crawl argument that sets crawl: true
  • The CLI MUST support a --no-crawl argument that sets crawl: false
  • CLI arguments MUST override the config file's crawl field
  • CLI arguments MUST override protocol-based defaults

Requirement 3.3: Configuration Precedence

The order of precedence, from highest to lowest:

  1. CLI arguments (--crawl or --no-crawl)
  2. Config file crawl field
  3. Protocol-based defaults
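
A sketch of this precedence chain; the parameter names are illustrative.

function resolveCrawlSetting(cliCrawl, configCrawl, inputUrl) {
  if (typeof cliCrawl === 'boolean') return cliCrawl;       // 1. CLI argument
  if (typeof configCrawl === 'boolean') return configCrawl; // 2. Config file field
  return /^https?:/i.test(inputUrl);                        // 3. Protocol-based default
}

// Examples:
resolveCrawlSetting(undefined, false, 'https://example.com'); // false: config beats default
resolveCrawlSetting(true, false, 'file:///docs/readme.md');   // true: CLI beats config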

Error Handling

Requirement 4.1: Failed URL Fetches

  • When a URL fetch fails (404, timeout, network error, etc.), the crawler MUST:
    • Log the error with sufficient detail for debugging
    • Continue crawling other URLs
    • NOT fail the entire test run
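
A sketch of the fetch-and-continue behavior, assuming a fetch-compatible client (global fetch in Node 18+, or whichever fetch library the project already uses); fetchPage is an illustrative wrapper name.

async function fetchPage(url) {
  try {
    const response = await fetch(url);
    if (!response.ok) {
      console.warn(`Crawl: ${url} returned HTTP ${response.status}; skipping.`);
      return null;
    }
    return await response.text();
  } catch (error) {
    // Network errors and timeouts are logged, never rethrown,
    // so one bad URL can't fail the whole test run.
    console.warn(`Crawl: fetching ${url} failed (${error.message}); skipping.`);
    return null;
  }
}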

Requirement 4.2: Non-Document Content

  • When a URL returns non-document content (images, PDFs, binaries, etc.), the crawler MUST:
    • Add the URL to the inputs array
    • Allow downstream processes to handle the content appropriately

Requirement 4.3: Timeout Handling

  • The crawler MUST delegate timeout handling to the underlying fetch library
  • The crawler MUST NOT implement custom timeout logic

Technical Specifications

Implementation Location

The crawling functionality should be implemented as part of the input processing pipeline, executing before tests run but after initial config parsing.

Data Structures

Visited URLs Tracking:

// Set for O(1) lookup and automatic deduplication
const visitedUrls = new Set();

URL Queue:

// Queue of URLs to process, with associated metadata
const urlQueue = [
  {
    url: 'https://example.com/page',
    depth: 0, // Tracked for logging/debugging even though there is no max depth
    sourceUrl: 'https://example.com' // For debugging
  }
];

URL Extraction Functions

HTML Extraction:

// Sketch: assumes an HTML parser such as cheerio is available as a dependency;
// regular expressions are insufficient for robust HTML parsing.
const cheerio = require('cheerio');

/**
 * Extracts URLs from HTML <a> tags with href attributes.
 *
 * @param {string} html - The HTML content to parse
 * @returns {string[]} - Array of extracted URLs
 */
function extractHtmlUrls(html) {
  const $ = cheerio.load(html);
  return $('a[href]')
    .map((index, element) => $(element).attr('href'))
    .get();
}

Markdown Extraction:

/**
 * Extracts URLs from Markdown [text](url) syntax.
 *
 * @param {string} markdown - The Markdown content to parse
 * @returns {string[]} - Array of extracted URLs
 */
function extractMarkdownUrls(markdown) {
  // Sketch using the basic link pattern; a full implementation should also
  // handle escaped brackets and nested structures.
  const linkPattern = /\[([^\]]+)\]\(([^)]+)\)/g;
  return [...markdown.matchAll(linkPattern)].map((match) => match[2]);
}
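
Sitemap Extraction:

Requirement 1.1 also lists XML sitemaps. The following is an illustrative sketch based on <loc> elements; a production implementation would likely use an XML parser rather than a regular expression.

/**
 * Extracts URLs from <loc> elements in an XML sitemap.
 *
 * @param {string} xml - The sitemap XML content to parse
 * @returns {string[]} - Array of extracted URLs
 */
function extractSitemapUrls(xml) {
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  return [...xml.matchAll(locPattern)].map((match) => match[1].trim());
}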

Origin Matching

/**
 * Compares two URLs for strict origin matching.
 *
 * @param {string} url1 - First URL to compare
 * @param {string} url2 - Second URL to compare
 * @returns {boolean} - True if origins match strictly
 */
function isSameOrigin(url1, url2) {
  // Use the URL API for reliable parsing; compare protocol, hostname, and port
  try {
    const a = new URL(url1);
    const b = new URL(url2);
    return a.protocol === b.protocol && a.hostname === b.hostname && a.port === b.port;
  } catch (error) {
    return false; // Malformed URLs never match
  }
}

Relative URL Resolution

/**
 * Resolves a relative URL against a base origin.
 *
 * @param {string} relativeUrl - The relative URL to resolve
 * @param {string} baseOrigin - The origin to resolve against
 * @returns {string|null} - Resolved absolute URL or null if resolution fails
 */
function resolveRelativeUrl(relativeUrl, baseOrigin) {
  // Use the URL API; return null for malformed or unresolvable URLs
  try {
    return new URL(relativeUrl, baseOrigin).href;
  } catch (error) {
    return null;
  }
}
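
Tying the helpers above together, a minimal crawl loop might look like the sketch below. It is sequential for clarity (parallelization per Requirement 1.7 is omitted), reuses the fetchPage wrapper from the Requirement 4.1 sketch, and omits the warning for skipped relative links for brevity.

async function crawl(startUrl, config) {
  const MAX_CRAWL_URLS = 10000; // internal cap (Requirement 1.6)
  const visitedUrls = new Set([startUrl]);
  const urlQueue = [{ url: startUrl, depth: 0, sourceUrl: null }];
  const discovered = [];

  while (urlQueue.length > 0) {
    const { url, depth } = urlQueue.shift();
    const body = await fetchPage(url); // see the Requirement 4.1 sketch
    if (body === null) continue;

    const candidates = [...extractHtmlUrls(body), ...extractMarkdownUrls(body)];
    for (const href of candidates) {
      // Stop discovering new URLs at the cap, but keep processing the queue
      if (visitedUrls.size >= MAX_CRAWL_URLS) break;

      // Absolute links pass through; relative links need config.origin (Requirement 1.3)
      const absolute = /^https?:\/\//i.test(href)
        ? href
        : config.origin && resolveRelativeUrl(href, config.origin);
      if (!absolute || !isSameOrigin(absolute, startUrl)) continue;
      if (visitedUrls.has(absolute)) continue;

      visitedUrls.add(absolute);
      urlQueue.push({ url: absolute, depth: depth + 1, sourceUrl: url });
      discovered.push(absolute);
    }
  }

  return discovered;
}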

Testing Requirements

Unit Tests

Tests should be written using Mocha and should cover the following (a sample test sketch appears after this list):

  1. URL Extraction:

    • Extract single URL from HTML
    • Extract multiple URLs from HTML
    • Ignore non-<a> tag URLs in HTML
    • Extract single URL from Markdown
    • Extract multiple URLs from Markdown
    • Ignore non-link URLs in Markdown
    • Handle malformed markup gracefully
  2. Origin Matching:

    • Same protocol, domain, and port returns true
    • Different protocol returns false
    • Different domain returns false
    • Different port returns false
    • Subdomain differences return false
  3. Relative URL Resolution:

    • Resolve relative path against origin
    • Resolve relative path with ../ navigation
    • Resolve absolute path (starting with /)
    • Return null for malformed relative URLs
    • Warn when origin config is missing
  4. Deduplication:

    • Same URL not processed twice
    • URLs with different fragments treated as different
    • URLs with different query parameters treated as different
  5. Configuration:

    • --crawl enables crawling
    • --no-crawl disables crawling
    • CLI overrides config file
    • Config file overrides defaults
    • Protocol-based defaults work correctly
  6. Error Handling:

    • 404 errors don't stop crawling
    • Network errors don't stop crawling
    • Timeout errors don't stop crawling
    • Non-document content added to inputs
  7. Limits:

    • 10,000 URL limit enforced
    • Warning logged when limit reached
    • Crawling stops at limit
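
A minimal Mocha sketch for the origin-matching cases above; the module path is illustrative, and assert comes from Node's core assert module.

const assert = require('assert');
const { isSameOrigin } = require('../src/crawl'); // illustrative path

describe('isSameOrigin', () => {
  it('returns true for the same protocol, domain, and port', () => {
    assert.strictEqual(
      isSameOrigin('https://example.com/page1', 'https://example.com/page2'),
      true
    );
  });

  it('returns false for a different protocol', () => {
    assert.strictEqual(
      isSameOrigin('https://example.com/page', 'http://example.com/page'),
      false
    );
  });

  it('returns false for a subdomain difference', () => {
    assert.strictEqual(
      isSameOrigin('https://example.com/page', 'https://docs.example.com/page'),
      false
    );
  });
});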

Integration Tests

Integration tests should verify:

  1. End-to-End Crawling:

    • Starting with a single HTTPS URL crawls same-origin links
    • Discovered URLs added to inputs array
    • Inputs processed in correct order
  2. Cross-Protocol Behavior:

    • HTTPS URLs crawled by default
    • HTTP URLs crawled by default
    • File URLs not crawled by default
  3. Configuration Integration:

    • Config file crawl field respected
    • CLI arguments override config
    • Origin field used for relative URL resolution

Success Metrics

  • Users can test entire documentation sites with a single initial URL
  • Crawling respects site boundaries (no cross-origin crawling)
  • Performance remains acceptable for sites with hundreds of pages
  • No false positives from incorrectly parsed URLs
  • Clear error messages guide users when issues occur

Documentation Requirements

User-Facing Documentation

The following documentation must be created or updated:

  1. Configuration Reference:

    • Document the crawl boolean field
    • Document the origin field's role in relative URL resolution
    • Provide examples of enabled and disabled crawling
  2. CLI Reference:

    • Document --crawl flag
    • Document --no-crawl flag
    • Show examples of CLI usage
  3. Feature Guide:

    • Explain what URL crawling does
    • Explain origin restrictions
    • Provide use cases and examples
    • Document limitations (XML sitemaps, HTML <a> tags, and Markdown links only)
    • Explain the 10,000 URL limit

Code Documentation

  • All public functions must have JSDoc comments
  • Complex algorithms must have inline comments explaining the approach
  • Edge cases must be documented in comments

Open Questions

None. All clarifications have been addressed.

Future Enhancements

The following features are explicitly out of scope for this initial implementation but may be considered for future versions:

  • Configurable crawl depth limits
  • Custom origin allowlists
  • Additional markup pattern support (e.g., RSS feeds)
  • Per-input crawl configuration
  • Respect for robots.txt
  • Rate limiting to avoid overwhelming servers
  • Content-type filtering before adding to inputs
  • Authentication handling for protected content
