Conversation

Contributor

Copilot AI commented Oct 30, 2025

URL Crawling Feature Implementation - Sitemap.xml Only

Completed

  • Simplified to support only sitemap.xml URLs
  • Removed HTML <a> tag extraction
  • Removed Markdown link extraction
  • Kept only XML sitemap extraction
  • Limited processing to inputs that start with http:// or https:// AND end with sitemap.xml
  • Updated all tests to reflect the reduced scope
  • Pinned to [email protected] for schema support
  • Fixed origin matching to handle redirects correctly
  • All tests passing (66 tests)
  • Verified working with www.doc-detective.com/sitemap.xml (discovers 125 URLs)

Implementation Summary

Core Functionality:

  • extractXmlSitemapUrls() - Extracts URLs from <loc> tags in XML sitemaps
  • isSameOrigin() - Strict origin matching (protocol, hostname, port)
  • crawlSitemap() - Processes a single sitemap and returns same-origin URLs
    • Handles HTTP redirects by using the final URL for origin comparison (see the sketch below)
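
A minimal sketch of how these three functions could fit together. The function names come from the summary above, and axios plus the responseUrl redirect handling are mentioned elsewhere in this PR, but the regex-based <loc> extraction and the exact signatures here are assumptions, not the actual implementation in src/.

```js
const axios = require("axios");

// Pull every <loc> value out of a sitemap.xml body. The real code may use
// an XML parser instead of a regex; the result is the same list of URLs.
function extractXmlSitemapUrls(xml) {
  const matches = xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g);
  return [...matches].map((match) => match[1].trim());
}

// Strict origin check: protocol, hostname, and port must all match.
function isSameOrigin(urlA, urlB) {
  const a = new URL(urlA);
  const b = new URL(urlB);
  return a.protocol === b.protocol && a.hostname === b.hostname && a.port === b.port;
}

// Fetch one sitemap and return only the same-origin URLs it lists. The
// final URL after redirects (responseUrl) is used for the origin
// comparison, so www.example.com -> example.com redirects still match.
async function crawlSitemap(sitemapUrl) {
  const response = await axios.get(sitemapUrl);
  const finalUrl = response.request?.res?.responseUrl || sitemapUrl;
  return extractXmlSitemapUrls(response.data).filter((url) => isSameOrigin(url, finalUrl));
}
```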

Integration:

  • Added to qualifyFiles() in src/utils.js
  • Only processes URLs that match: (http|https)://...sitemap.xml
  • Discovered URLs appended to the inputs array for normal processing (see the sketch below)
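
Roughly how the integration could look. qualifyFiles() and crawlSitemap() are names used in this PR; isSitemapInput() and expandInputs() are hypothetical helpers for illustration, and the real logic lives inside qualifyFiles() in src/utils.js.

```js
// Only inputs that start with http:// or https:// and end with
// sitemap.xml are treated as sitemaps.
function isSitemapInput(input) {
  return /^https?:\/\//.test(input) && input.endsWith("sitemap.xml");
}

// Discovered same-origin URLs are appended to the end of the inputs
// array so they flow through normal input processing. Assumes the
// crawlSitemap() sketch above and that config.crawl defaults to false.
async function expandInputs(inputs, config) {
  if (!config.crawl) return inputs;
  const expanded = [...inputs];
  for (const input of inputs) {
    if (isSitemapInput(input)) {
      expanded.push(...(await crawlSitemap(input)));
    }
  }
  return expanded;
}
```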

Configuration:

  • crawl boolean field: defaults to false in [email protected] schema
  • Must be explicitly set to true to enable sitemap processing
  • Can be set via crawl: true in the config or via --crawl on the CLI (example below)
  • Field properly validated in schema (no workaround needed)
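
For example, a config that opts in might look like the snippet below. Only the crawl field is confirmed by this PR; the input field name is an assumption about the surrounding Doc Detective config.

```json
{
  "input": "https://www.doc-detective.com/sitemap.xml",
  "crawl": true
}
```

The same effect can be had for a single run by passing --crawl on the command line.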

Testing:

  • 24 unit tests for crawler functions
  • 6 integration tests for end-to-end behavior
  • 3 config validation tests
  • All focused on sitemap.xml processing only
  • Tested successfully with real sitemap from www.doc-detective.com

Dependencies:

Original prompt

This section details the original issue you should resolve.

<issue_title>Input Crawling</issue_title>
<issue_description>This feature adds URL crawling capability to Doc Detective, enabling automatic discovery of additional test inputs by following links found in initial input documents. The crawler respects same-origin restrictions and provides configurable opt-in/opt-out behavior based on protocol.

Goals

  • Reduce manual configuration effort by automatically discovering related documentation pages
  • Maintain security and scope control through strict origin matching
  • Provide sensible defaults while allowing user override via CLI
  • Integrate seamlessly with existing input processing pipeline

Non-Goals

  • Crawling across different origins
  • Per-input crawl configuration
  • Configurable crawl depth limits
  • Custom origin allowlists
  • Authentication handling for protected content

User Stories

As a technical writer, I want Doc Detective to automatically discover and test all pages in my documentation site so that I don't have to manually list every URL in my config.

As a technical writer, I want Doc Detective to find all linked local files when I specify a local path to crawl.

As a technical writer, I want crawling to respect my site's boundaries so that tests don't accidentally follow external links or navigate to unrelated content.

As a technical writer, I want to disable crawling for specific test runs so that I can test individual pages during development without processing the entire site.

Functional Requirements

Core Crawling Behavior

Requirement 1.1: URL Pattern Recognition

  • The crawler MUST extract URLs from the following markup patterns:
    • XML sitemaps
    • HTML: <a> tags with href attributes
    • Markdown: [text](url) syntax
  • The crawler MUST NOT extract URLs from any other markup patterns in the initial implementation

Requirement 1.2: Origin Matching

  • The crawler MUST only follow URLs that strictly match the origin of the initial input URL
  • Origin matching MUST compare protocol, domain, and port
  • Example: https://example.com:443/page1 can crawl to https://example.com:443/page2 but NOT to:
    • http://example.com/page2 (different protocol)
    • https://subdomain.example.com/page2 (different domain)
    • https://example.com:8080/page2 (different port)
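
A quick way to sanity-check this rule with Node's built-in URL class (illustration only; isSameOrigin() in this PR is described above as comparing protocol, hostname, and port directly):

```js
// url.origin combines protocol, host, and port, with default ports
// normalized away, so :443 and no port compare as equal for https.
const base = new URL("https://example.com:443/page1");
console.log(new URL("https://example.com/page2").origin === base.origin);           // true
console.log(new URL("http://example.com/page2").origin === base.origin);            // false (protocol)
console.log(new URL("https://subdomain.example.com/page2").origin === base.origin); // false (domain)
console.log(new URL("https://example.com:8080/page2").origin === base.origin);      // false (port)
```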

Requirement 1.3: Relative Link Resolution

  • When the initial input is a URL and the config contains an origin field, the crawler MUST resolve relative URLs against that origin
  • When the config does NOT contain an origin field, the crawler MUST:
    • Skip relative URLs
    • Display a warning message indicating relative links were skipped
    • Continue processing absolute URLs normally
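
A sketch of this behavior, assuming a hypothetical resolveLink() helper (not a function named in this PR):

```js
// Resolve a link against the configured origin; skip relative links with
// a warning when no origin is configured.
function resolveLink(href, origin) {
  if (/^https?:\/\//.test(href)) return href; // already absolute
  if (!origin) {
    console.warn(`Skipping relative link (no origin configured): ${href}`);
    return null;
  }
  return new URL(href, origin).href; // e.g. "/docs" -> "https://example.com/docs"
}
```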

Requirement 1.4: Deduplication

  • The crawler MUST track all visited URLs globally
  • The crawler MUST NOT process the same URL more than once
  • URL comparison for deduplication MUST be case-sensitive and exact (including query parameters and fragments)
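
A plain Set of the raw URL strings satisfies this requirement, since it is exact and case-sensitive with no normalization of query parameters or fragments (sketch only; not necessarily the PR's actual data structure):

```js
const visited = new Set();

// Returns true the first time a URL is seen, false on any repeat.
function shouldVisit(url) {
  if (visited.has(url)) return false;
  visited.add(url);
  return true;
}
```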

Requirement 1.5: Input Array Management

  • The crawler MUST append discovered URLs to the end of the inputs array
  • The crawler MUST preserve the order of discovery
  • Discovered URLs MUST be added as new input objects compatible with Doc Detective's existing input processing

Requirement 1.6: Crawl Limits

  • The crawler MUST enforce an internal maximum of 10,000 URLs
  • When this limit is reached, the crawler MUST:
    • Stop discovering new URLs
    • Log a warning indicating the limit was reached
    • Continue processing already-discovered URLs
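
A sketch of how the limit could be enforced while preserving discovery order (the 10,000 cap comes from the requirement; the helper name is hypothetical):

```js
const MAX_CRAWL_URLS = 10000;

// Append newly discovered URLs until the cap is reached, then log a
// warning and ignore further discoveries; already-queued URLs keep processing.
function addDiscovered(inputs, discovered) {
  for (const url of discovered) {
    if (inputs.length >= MAX_CRAWL_URLS) {
      console.warn(`Crawl limit of ${MAX_CRAWL_URLS} URLs reached; ignoring further discoveries.`);
      break;
    }
    inputs.push(url);
  }
  return inputs;
}
```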

Requirement 1.7: Parallelization

  • The crawler SHOULD fetch URLs in parallel where possible to improve performance
  • The implementation SHOULD use Node.js asynchronous patterns appropriately
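
One idiomatic way to satisfy this in Node (sketch only; assumes axios as elsewhere in this PR):

```js
const axios = require("axios");

// Fetch a batch of URLs concurrently; Promise.allSettled keeps the
// successful responses even when some fetches fail.
async function fetchAll(urls) {
  const results = await Promise.allSettled(urls.map((url) => axios.get(url)));
  return results
    .filter((result) => result.status === "fulfilled")
    .map((result) => result.value.data);
}
```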

Default Behavior

Requirement 2.1: Protocol-Based Defaults

  • Crawling MUST be enabled by default for inputs with http:// or https:// protocols
  • Crawling MUST be disabled by default for inputs with any other protocol (e.g., file://)
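
The default can be derived directly from the input string (sketch; the helper name is hypothetical):

```js
// Crawling defaults to enabled only for http(s) inputs.
function defaultCrawlFor(input) {
  return input.startsWith("http://") || input.startsWith("https://");
}

defaultCrawlFor("https://example.com/sitemap.xml"); // true
defaultCrawlFor("file:///docs/readme.md");          // false
```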

Configuration

Requirement 3.1: Config File Field

  • The config file MUST support a crawl boolean field
  • When crawl: true, crawling is enabled regardless of protocol
  • When crawl: false, crawling is disabled regardless of protocol
  • When crawl is not specified, protocol-based defaults apply (Requirement 2.1)

Requirement 3.2: CLI Arguments

  • The CLI MUST support a --crawl argument that sets crawl: true
  • The CLI MUST support a --no-crawl argument that sets crawl: false
  • CLI arguments MUST override the config file's crawl field
  • CLI arguments MUST override protocol-based defaults

Requirement 3.3: Configuration Precedence
The order of precedence from highest to lowest:

  1. CLI arguments (--crawl or --no-crawl)
  2. Config file crawl field
  3. Protocol-based defaults
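
The precedence chain boils down to a few lines (sketch; resolveCrawl() is a hypothetical name, and cliCrawl/configCrawl are undefined when the user did not set them):

```js
function resolveCrawl(cliCrawl, configCrawl, input) {
  if (typeof cliCrawl === "boolean") return cliCrawl;       // 1. --crawl / --no-crawl
  if (typeof configCrawl === "boolean") return configCrawl; // 2. crawl field in config file
  return input.startsWith("http://") || input.startsWith("https://"); // 3. protocol-based default
}
```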

Error Handling

Requirement 4.1: Failed URL Fetches

  • When a URL fetch fails (404, timeout, network error, etc.), the crawler MUST:
    • Log the er...


@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI changed the title from "[WIP] Add URL crawling capability to Doc Detective" to "Add URL crawling for automatic documentation discovery" on Oct 30, 2025
Copilot AI requested a review from hawkeyexl October 30, 2025 13:34
Copilot finished work on behalf of hawkeyexl October 30, 2025 13:34
@hawkeyexl
Contributor

@copilot Reduce the scope. I want to keep this integration clean and small. Limit support to inputs that begin with "http://" or "https://" and end in "sitemap.xml". If we want to expand to other crawling behavior in the future, that's fine. But we only need sitemap.xml support for initial release.

- Remove HTML and Markdown URL extraction
- Only process URLs ending with sitemap.xml
- Simplify crawler to crawlSitemap function
- Update tests to reflect reduced scope
- 66 tests passing (down from 96)

Co-authored-by: hawkeyexl <[email protected]>
Contributor Author

Copilot AI commented Oct 31, 2025

@copilot Reduce the scope. I want to keep this integration clean and small. Limit support to inputs that begin with "http://" or "https://" and end in "sitemap.xml". If we want to expand to other craw...

Done in commit 6c5378f. Reduced scope to only support URLs that begin with "http://" or "https://" and end with "sitemap.xml".

Changes:

  • Removed HTML <a> tag and Markdown link extraction
  • Only process sitemap.xml URLs
  • Simplified crawlUrls() to crawlSitemap()
  • Updated all tests to focus on sitemap processing
  • 66 tests passing (down from 96)

Copilot finished work on behalf of hawkeyexl October 31, 2025 00:17
…ound

- Update package.json to use [email protected]
- Remove code that strips/restores crawl field during validation (now in schema)
- Simplify utils.js logic since config.crawl always defined (defaults to false)
- Update tests to reflect new default behavior (crawl defaults to false)
- All 66 tests passing

Co-authored-by: hawkeyexl <[email protected]>
Copilot AI changed the title from "Add URL crawling for automatic documentation discovery" to "Add sitemap.xml processing for automatic documentation discovery" on Oct 31, 2025
Copilot AI requested a review from hawkeyexl October 31, 2025 00:39
Copilot finished work on behalf of hawkeyexl October 31, 2025 00:39
@hawkeyexl
Contributor

@copilot check this works against www.doc-detective.com/sitemap.xml.

- Update crawlSitemap to track final URL after axios redirects
- Use responseUrl from axios response for origin comparison
- Fixes issue where www.doc-detective.com redirects to doc-detective.com
- Now correctly discovers 125 URLs from doc-detective.com sitemap
- All 66 tests passing

Co-authored-by: hawkeyexl <[email protected]>
Contributor Author

Copilot AI commented Nov 4, 2025

@copilot check this works against www.doc-detective.com/sitemap.xml.

Tested and working in commit b11bd50. The sitemap at www.doc-detective.com/sitemap.xml is processed successfully:

  • Follows redirect from www.doc-detective.com → doc-detective.com
  • Discovers 125 URLs from the sitemap
  • All URLs correctly pass origin matching (doc-detective.com)

Example discovered URLs:

Copilot finished work on behalf of hawkeyexl November 4, 2025 13:49