Input Crawling #74

@hawkeyexl

Description

This feature adds URL crawling to Doc Detective, enabling automatic discovery of additional test inputs by following links found in initial input documents. The crawler enforces strict same-origin restrictions and uses protocol-based defaults that users can override to opt in or out of crawling.

Goals

  • Reduce manual configuration effort by automatically discovering related documentation pages
  • Maintain security and scope control through strict origin matching
  • Provide sensible defaults while allowing user override via CLI
  • Integrate seamlessly with existing input processing pipeline

Non-Goals

  • Crawling across different origins
  • Per-input crawl configuration
  • Configurable crawl depth limits
  • Custom origin allowlists
  • Authentication handling for protected content

User Stories

As a technical writer, I want Doc Detective to automatically discover and test all pages in my documentation site so that I don't have to manually list every URL in my config.

As a technical writer, I want Doc Detective to find all linked local files when I specify a local path to crawl.

As a technical writer, I want crawling to respect my site's boundaries so that tests don't accidentally follow external links or navigate to unrelated content.

As a technical writer, I want to disable crawling for specific test runs so that I can test individual pages during development without processing the entire site.

Functional Requirements

Core Crawling Behavior

Requirement 1.1: URL Pattern Recognition

  • The crawler MUST extract URLs from the following markup patterns:
    • XML sitemaps
    • HTML: <a> tags with href attributes
    • Markdown: [text](url) syntax
  • The crawler MUST NOT extract URLs from any other markup patterns in the initial implementation

Requirement 1.2: Origin Matching

  • The crawler MUST only follow URLs that strictly match the origin of the initial input URL
  • Origin matching MUST compare protocol, domain, and port
  • Example: https://example.com:443/page1 can crawl to https://example.com:443/page2 but NOT to:
    • http://example.com/page2 (different protocol)
    • https://subdomain.example.com/page2 (different domain)
    • https://example.com:8080/page2 (different port)

Requirement 1.3: Relative Link Resolution

  • When the initial input is a URL and the config contains an origin field, the crawler MUST resolve relative URLs against that origin
  • When the config does NOT contain an origin field, the crawler MUST:
    • Skip relative URLs
    • Display a warning message indicating relative links were skipped
    • Continue processing absolute URLs normally
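
For illustration only, the rule above might look like the following sketch; the config shape and the helper name are assumptions for the example, not existing Doc Detective APIs.

// Sketch of Requirement 1.3; config.origin is the origin config field described above.
// Error handling for malformed URLs is omitted for brevity.
function resolveDiscoveredLink(href, config) {
  // Absolute URLs pass through untouched
  if (/^https?:\/\//i.test(href)) return href;

  // No origin configured: skip the relative link and warn
  if (!config || !config.origin) {
    console.warn(`Skipping relative link "${href}": no origin is configured.`);
    return null;
  }

  // Resolve against the configured origin
  return new URL(href, config.origin).href;
}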

Requirement 1.4: Deduplication

  • The crawler MUST track all visited URLs globally
  • The crawler MUST NOT process the same URL more than once
  • URL comparison for deduplication MUST be case-sensitive and exact (including query parameters and fragments)
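
As a concrete illustration, tracking visited URLs by their full string makes query and fragment variants distinct entries:

const visitedUrls = new Set();
visitedUrls.add('https://example.com/page');
visitedUrls.add('https://example.com/page?lang=en'); // different query string: distinct
visitedUrls.add('https://example.com/page#install'); // different fragment: distinct
console.log(visitedUrls.size); // 3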

Requirement 1.5: Input Array Management

  • The crawler MUST append discovered URLs to the end of the inputs array
  • The crawler MUST preserve the order of discovery
  • Discovered URLs MUST be added as new input objects compatible with Doc Detective's existing input processing

Requirement 1.6: Crawl Limits

  • The crawler MUST enforce an internal maximum of 10,000 URLs
  • When this limit is reached, the crawler MUST:
    • Stop discovering new URLs
    • Log a warning indicating the limit was reached
    • Continue processing already-discovered URLs
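
A sketch of the limit check, assuming the Set-based tracking described under Data Structures below; MAX_CRAWL_URLS and enqueueUrl are illustrative names, not existing Doc Detective identifiers.

const MAX_CRAWL_URLS = 10000;

function enqueueUrl(urlQueue, visitedUrls, url) {
  // Stop discovering new URLs once the internal cap is reached
  if (visitedUrls.size >= MAX_CRAWL_URLS) {
    console.warn(`Crawl limit of ${MAX_CRAWL_URLS} URLs reached; not queueing ${url}.`);
    return false;
  }
  if (!visitedUrls.has(url)) {
    visitedUrls.add(url);
    urlQueue.push({ url });
  }
  return true;
}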

Requirement 1.7: Parallelization

  • The crawler SHOULD fetch URLs in parallel where possible to improve performance
  • The implementation SHOULD use Node.js asynchronous patterns appropriately
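
One possible approach, shown as a sketch, is to fetch in small batches with Promise.allSettled so a single rejection does not abort the batch; the batch size and the fetchAndExtract helper (assumed to resolve to an array of discovered URLs) are illustrative choices, not requirements.

async function crawlBatch(urls, fetchAndExtract, batchSize = 5) {
  const discovered = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    // allSettled: one rejected fetch doesn't stop the others (see Requirement 4.1)
    const results = await Promise.allSettled(batch.map((url) => fetchAndExtract(url)));
    for (const result of results) {
      if (result.status === 'fulfilled') discovered.push(...result.value);
    }
  }
  return discovered;
}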

Default Behavior

Requirement 2.1: Protocol-Based Defaults

  • Crawling MUST be enabled by default for inputs with http:// or https:// protocols
  • Crawling MUST be disabled by default for inputs with any other protocol (e.g., file://)

Configuration

Requirement 3.1: Config File Field

  • The config file MUST support a crawl boolean field
  • When crawl: true, crawling is enabled regardless of protocol
  • When crawl: false, crawling is disabled regardless of protocol
  • When crawl is not specified, protocol-based defaults apply (Requirement 2.1)
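
For example, a config file might enable crawling explicitly; crawl and origin are the fields discussed in this spec, while the other fields and the exact file layout are illustrative.

{
  "input": "https://example.com/docs",
  "origin": "https://example.com",
  "crawl": true
}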

Requirement 3.2: CLI Arguments

  • The CLI MUST support a --crawl argument that sets crawl: true
  • The CLI MUST support a --no-crawl argument that sets crawl: false
  • CLI arguments MUST override the config file's crawl field
  • CLI arguments MUST override protocol-based defaults

Requirement 3.3: Configuration Precedence

The order of precedence, from highest to lowest:

  1. CLI arguments (--crawl or --no-crawl)
  2. Config file crawl field
  3. Protocol-based defaults
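
A sketch of this precedence chain; the parameter names are illustrative.

function resolveCrawlSetting(cliCrawl, configCrawl, inputUrl) {
  if (typeof cliCrawl === 'boolean') return cliCrawl;       // 1. CLI argument
  if (typeof configCrawl === 'boolean') return configCrawl; // 2. Config file field
  return /^https?:/i.test(inputUrl);                        // 3. Protocol-based default
}

// Examples:
resolveCrawlSetting(undefined, false, 'https://example.com'); // false: config beats default
resolveCrawlSetting(true, false, 'file:///docs/readme.md');   // true: CLI beats config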

Error Handling

Requirement 4.1: Failed URL Fetches

  • When a URL fetch fails (404, timeout, network error, etc.), the crawler MUST:
    • Log the error with sufficient detail for debugging
    • Continue crawling other URLs
    • NOT fail the entire test run
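
A sketch of the fetch-and-continue behavior, assuming a fetch-compatible client (global fetch in Node 18+, or whichever fetch library the project already uses); fetchPage is an illustrative wrapper name.

async function fetchPage(url) {
  try {
    const response = await fetch(url);
    if (!response.ok) {
      console.warn(`Crawl: ${url} returned HTTP ${response.status}; skipping.`);
      return null;
    }
    return await response.text();
  } catch (error) {
    // Network errors and timeouts are logged, never rethrown,
    // so one bad URL can't fail the whole test run.
    console.warn(`Crawl: fetching ${url} failed (${error.message}); skipping.`);
    return null;
  }
}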

Requirement 4.2: Non-Document Content

  • When a URL returns non-document content (images, PDFs, binaries, etc.), the crawler MUST:
    • Add the URL to the inputs array
    • Allow downstream processes to handle the content appropriately

Requirement 4.3: Timeout Handling

  • The crawler MUST delegate timeout handling to the underlying fetch library
  • The crawler MUST NOT implement custom timeout logic

Technical Specifications

Implementation Location

The crawling functionality should be implemented as part of the input processing pipeline, executing before tests run but after initial config parsing.

Data Structures

Visited URLs Tracking:

// Set for O(1) lookup and automatic deduplication
const visitedUrls = new Set();

URL Queue:

// Queue of URLs to process, with associated metadata
const urlQueue = [
  {
    url: 'https://example.com/page',
    depth: 0, // Tracked for logging/debugging even though there is no max depth
    sourceUrl: 'https://example.com' // For debugging
  }
];

URL Extraction Functions

HTML Extraction:

// Sketch: assumes an HTML parser such as cheerio is available as a dependency;
// regular expressions are insufficient for robust HTML parsing.
const cheerio = require('cheerio');

/**
 * Extracts URLs from HTML <a> tags with href attributes.
 *
 * @param {string} html - The HTML content to parse
 * @returns {string[]} - Array of extracted URLs
 */
function extractHtmlUrls(html) {
  const $ = cheerio.load(html);
  return $('a[href]')
    .map((index, element) => $(element).attr('href'))
    .get();
}

Markdown Extraction:

/**
 * Extracts URLs from Markdown [text](url) syntax.
 *
 * @param {string} markdown - The Markdown content to parse
 * @returns {string[]} - Array of extracted URLs
 */
function extractMarkdownUrls(markdown) {
  // Sketch using the basic link pattern; a full implementation should also
  // handle escaped brackets and nested structures.
  const linkPattern = /\[([^\]]+)\]\(([^)]+)\)/g;
  return [...markdown.matchAll(linkPattern)].map((match) => match[2]);
}
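
Sitemap Extraction:

Requirement 1.1 also lists XML sitemaps. The following is an illustrative sketch based on <loc> elements; a production implementation would likely use an XML parser rather than a regular expression.

/**
 * Extracts URLs from <loc> elements in an XML sitemap.
 *
 * @param {string} xml - The sitemap XML content to parse
 * @returns {string[]} - Array of extracted URLs
 */
function extractSitemapUrls(xml) {
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  return [...xml.matchAll(locPattern)].map((match) => match[1].trim());
}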

Origin Matching

/**
 * Compares two URLs for strict origin matching.
 *
 * @param {string} url1 - First URL to compare
 * @param {string} url2 - Second URL to compare
 * @returns {boolean} - True if origins match strictly
 */
function isSameOrigin(url1, url2) {
  // Use the URL API for reliable parsing; compare protocol, hostname, and port
  try {
    const a = new URL(url1);
    const b = new URL(url2);
    return a.protocol === b.protocol && a.hostname === b.hostname && a.port === b.port;
  } catch (error) {
    return false; // Malformed URLs never match
  }
}

Relative URL Resolution

/**
 * Resolves a relative URL against a base origin.
 *
 * @param {string} relativeUrl - The relative URL to resolve
 * @param {string} baseOrigin - The origin to resolve against
 * @returns {string|null} - Resolved absolute URL or null if resolution fails
 */
function resolveRelativeUrl(relativeUrl, baseOrigin) {
  // Use the URL API; return null for malformed or unresolvable URLs
  try {
    return new URL(relativeUrl, baseOrigin).href;
  } catch (error) {
    return null;
  }
}
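
Tying the helpers above together, a minimal crawl loop might look like the sketch below. It is sequential for clarity (parallelization per Requirement 1.7 is omitted), reuses the fetchPage wrapper from the Requirement 4.1 sketch, and omits the warning for skipped relative links for brevity.

async function crawl(startUrl, config) {
  const MAX_CRAWL_URLS = 10000; // internal cap (Requirement 1.6)
  const visitedUrls = new Set([startUrl]);
  const urlQueue = [{ url: startUrl, depth: 0, sourceUrl: null }];
  const discovered = [];

  while (urlQueue.length > 0) {
    const { url, depth } = urlQueue.shift();
    const body = await fetchPage(url); // see the Requirement 4.1 sketch
    if (body === null) continue;

    const candidates = [...extractHtmlUrls(body), ...extractMarkdownUrls(body)];
    for (const href of candidates) {
      // Stop discovering new URLs at the cap, but keep processing the queue
      if (visitedUrls.size >= MAX_CRAWL_URLS) break;

      // Absolute links pass through; relative links need config.origin (Requirement 1.3)
      const absolute = /^https?:\/\//i.test(href)
        ? href
        : config.origin && resolveRelativeUrl(href, config.origin);
      if (!absolute || !isSameOrigin(absolute, startUrl)) continue;
      if (visitedUrls.has(absolute)) continue;

      visitedUrls.add(absolute);
      urlQueue.push({ url: absolute, depth: depth + 1, sourceUrl: url });
      discovered.push(absolute);
    }
  }

  return discovered;
}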

Testing Requirements

Unit Tests

Tests should be written using Mocha and should cover the following (a sample test sketch appears after this list):

  1. URL Extraction:

    • Extract single URL from HTML
    • Extract multiple URLs from HTML
    • Ignore non-<a> tag URLs in HTML
    • Extract single URL from Markdown
    • Extract multiple URLs from Markdown
    • Ignore non-link URLs in Markdown
    • Handle malformed markup gracefully
  2. Origin Matching:

    • Same protocol, domain, and port returns true
    • Different protocol returns false
    • Different domain returns false
    • Different port returns false
    • Subdomain differences return false
  3. Relative URL Resolution:

    • Resolve relative path against origin
    • Resolve relative path with ../ navigation
    • Resolve absolute path (starting with /)
    • Return null for malformed relative URLs
    • Warn when origin config is missing
  4. Deduplication:

    • Same URL not processed twice
    • URLs with different fragments treated as different
    • URLs with different query parameters treated as different
  5. Configuration:

    • --crawl enables crawling
    • --no-crawl disables crawling
    • CLI overrides config file
    • Config file overrides defaults
    • Protocol-based defaults work correctly
  6. Error Handling:

    • 404 errors don't stop crawling
    • Network errors don't stop crawling
    • Timeout errors don't stop crawling
    • Non-document content added to inputs
  7. Limits:

    • 10,000 URL limit enforced
    • Warning logged when limit reached
    • Crawling stops at limit
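
A minimal Mocha sketch for the origin-matching cases above; the module path is illustrative, and assert comes from Node's core assert module.

const assert = require('assert');
const { isSameOrigin } = require('../src/crawl'); // illustrative path

describe('isSameOrigin', () => {
  it('returns true for the same protocol, domain, and port', () => {
    assert.strictEqual(
      isSameOrigin('https://example.com/page1', 'https://example.com/page2'),
      true
    );
  });

  it('returns false for a different protocol', () => {
    assert.strictEqual(
      isSameOrigin('https://example.com/page', 'http://example.com/page'),
      false
    );
  });

  it('returns false for a subdomain difference', () => {
    assert.strictEqual(
      isSameOrigin('https://example.com/page', 'https://docs.example.com/page'),
      false
    );
  });
});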

Integration Tests

Integration tests should verify:

  1. End-to-End Crawling:

    • Starting with a single HTTPS URL crawls same-origin links
    • Discovered URLs added to inputs array
    • Inputs processed in correct order
  2. Cross-Protocol Behavior:

    • HTTPS URLs crawled by default
    • HTTP URLs crawled by default
    • File URLs not crawled by default
  3. Configuration Integration:

    • Config file crawl field respected
    • CLI arguments override config
    • Origin field used for relative URL resolution

Success Metrics

  • Users can test entire documentation sites with a single initial URL
  • Crawling respects site boundaries (no cross-origin crawling)
  • Performance remains acceptable for sites with hundreds of pages
  • No false positives from incorrectly parsed URLs
  • Clear error messages guide users when issues occur

Documentation Requirements

User-Facing Documentation

The following documentation must be created or updated:

  1. Configuration Reference:

    • Document the crawl boolean field
    • Document the origin field's role in relative URL resolution
    • Provide examples of enabled and disabled crawling
  2. CLI Reference:

    • Document --crawl flag
    • Document --no-crawl flag
    • Show examples of CLI usage
  3. Feature Guide:

    • Explain what URL crawling does
    • Explain origin restrictions
    • Provide use cases and examples
    • Document limitations (XML sitemaps, HTML <a> tags, and Markdown links only)
    • Explain the 10,000 URL limit

Code Documentation

  • All public functions must have JSDoc comments
  • Complex algorithms must have inline comments explaining the approach
  • Edge cases must be documented in comments

Open Questions

None. All clarifications have been addressed.

Future Enhancements

The following features are explicitly out of scope for this initial implementation but may be considered for future versions:

  • Configurable crawl depth limits
  • Custom origin allowlists
  • Additional markup pattern support (e.g., RSS feeds)
  • Per-input crawl configuration
  • Respect for robots.txt
  • Rate limiting to avoid overwhelming servers
  • Content-type filtering before adding to inputs
  • Authentication handling for protected content
