Conversation

Contributor

Copilot AI commented Oct 30, 2025

URL Crawling Feature Implementation - Sitemap.xml Only

Completed

  • Simplified to support only sitemap.xml URLs
  • Removed HTML <a> tag extraction
  • Removed Markdown link extraction
  • Kept only XML sitemap extraction
  • Limited processing to inputs that start with http:// or https:// AND end with sitemap.xml
  • Updated all tests to reflect the reduced scope
  • Pinned to [email protected] for schema support
  • Fixed origin matching to handle redirects correctly
  • All tests passing (66 tests)
  • Verified working with www.doc-detective.com/sitemap.xml (discovers 125 URLs)

Implementation Summary

Core Functionality:

  • extractXmlSitemapUrls() - Extracts URLs from <loc> tags in XML sitemaps
  • isSameOrigin() - Strict origin matching (protocol, hostname, port)
  • crawlSitemap() - Processes a single sitemap and returns same-origin URLs
    • Handles HTTP redirects by using the final URL for origin comparison (see the sketch below)
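
A minimal sketch of how these three functions could fit together. The function names come from the summary above, and axios plus the responseUrl redirect handling are mentioned elsewhere in this PR, but the regex-based <loc> extraction and the exact signatures here are assumptions, not the actual implementation in src/.

```js
const axios = require("axios");

// Pull every <loc> value out of a sitemap.xml body. The real code may use
// an XML parser instead of a regex; the result is the same list of URLs.
function extractXmlSitemapUrls(xml) {
  const matches = xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g);
  return [...matches].map((match) => match[1].trim());
}

// Strict origin check: protocol, hostname, and port must all match.
function isSameOrigin(urlA, urlB) {
  const a = new URL(urlA);
  const b = new URL(urlB);
  return a.protocol === b.protocol && a.hostname === b.hostname && a.port === b.port;
}

// Fetch one sitemap and return only the same-origin URLs it lists. The
// final URL after redirects (responseUrl) is used for the origin
// comparison, so www.example.com -> example.com redirects still match.
async function crawlSitemap(sitemapUrl) {
  const response = await axios.get(sitemapUrl);
  const finalUrl = response.request?.res?.responseUrl || sitemapUrl;
  return extractXmlSitemapUrls(response.data).filter((url) => isSameOrigin(url, finalUrl));
}
```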

Integration:

  • Added to qualifyFiles() in src/utils.js
  • Only processes URLs that match: (http|https)://...sitemap.xml
  • Discovered URLs appended to the inputs array for normal processing (see the sketch below)
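
Roughly how the integration could look. qualifyFiles() and crawlSitemap() are names used in this PR; isSitemapInput() and expandInputs() are hypothetical helpers for illustration, and the real logic lives inside qualifyFiles() in src/utils.js.

```js
// Only inputs that start with http:// or https:// and end with
// sitemap.xml are treated as sitemaps.
function isSitemapInput(input) {
  return /^https?:\/\//.test(input) && input.endsWith("sitemap.xml");
}

// Discovered same-origin URLs are appended to the end of the inputs
// array so they flow through normal input processing. Assumes the
// crawlSitemap() sketch above and that config.crawl defaults to false.
async function expandInputs(inputs, config) {
  if (!config.crawl) return inputs;
  const expanded = [...inputs];
  for (const input of inputs) {
    if (isSitemapInput(input)) {
      expanded.push(...(await crawlSitemap(input)));
    }
  }
  return expanded;
}
```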

Configuration:

  • crawl boolean field: defaults to false in [email protected] schema
  • Must be explicitly set to true to enable sitemap processing
  • Can be set via crawl: true in the config or via --crawl on the CLI (example below)
  • Field properly validated in schema (no workaround needed)
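
For example, a config that opts in might look like the snippet below. Only the crawl field is confirmed by this PR; the input field name is an assumption about the surrounding Doc Detective config.

```json
{
  "input": "https://www.doc-detective.com/sitemap.xml",
  "crawl": true
}
```

The same effect can be had for a single run by passing --crawl on the command line.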

Testing:

  • 24 unit tests for crawler functions
  • 6 integration tests for end-to-end behavior
  • 3 config validation tests
  • All focused on sitemap.xml processing only
  • Tested successfully with real sitemap from www.doc-detective.com

Dependencies:

Original prompt

This section details the original issue you should resolve.

<issue_title>Input Crawling</issue_title>
<issue_description>This feature adds URL crawling capability to Doc Detective, enabling automatic discovery of additional test inputs by following links found in initial input documents. The crawler respects same-origin restrictions and provides configurable opt-in/opt-out behavior based on protocol.

Goals

  • Reduce manual configuration effort by automatically discovering related documentation pages
  • Maintain security and scope control through strict origin matching
  • Provide sensible defaults while allowing user override via CLI
  • Integrate seamlessly with existing input processing pipeline

Non-Goals

  • Crawling across different origins
  • Per-input crawl configuration
  • Configurable crawl depth limits
  • Custom origin allowlists
  • Authentication handling for protected content

User Stories

As a technical writer, I want Doc Detective to automatically discover and test all pages in my documentation site so that I don't have to manually list every URL in my config.

As a technical writer, I want Doc Detective to find all linked local files when I specify a local path to crawl.

As a technical writer, I want crawling to respect my site's boundaries so that tests don't accidentally follow external links or navigate to unrelated content.

As a technical writer, I want to disable crawling for specific test runs so that I can test individual pages during development without processing the entire site.

Functional Requirements

Core Crawling Behavior

Requirement 1.1: URL Pattern Recognition

  • The crawler MUST extract URLs from the following markup patterns:
    • XML sitemaps
    • HTML: <a> tags with href attributes
    • Markdown: [text](url) syntax
  • The crawler MUST NOT extract URLs from any other markup patterns in the initial implementation

Requirement 1.2: Origin Matching

  • The crawler MUST only follow URLs that strictly match the origin of the initial input URL
  • Origin matching MUST compare protocol, domain, and port
  • Example: https://example.com:443/page1 can crawl to https://example.com:443/page2 but NOT to:
    • http://example.com/page2 (different protocol)
    • https://subdomain.example.com/page2 (different domain)
    • https://example.com:8080/page2 (different port)
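
A quick way to sanity-check this rule with Node's built-in URL class (illustration only; isSameOrigin() in this PR is described above as comparing protocol, hostname, and port directly):

```js
// url.origin combines protocol, host, and port, with default ports
// normalized away, so :443 and no port compare as equal for https.
const base = new URL("https://example.com:443/page1");
console.log(new URL("https://example.com/page2").origin === base.origin);           // true
console.log(new URL("http://example.com/page2").origin === base.origin);            // false (protocol)
console.log(new URL("https://subdomain.example.com/page2").origin === base.origin); // false (domain)
console.log(new URL("https://example.com:8080/page2").origin === base.origin);      // false (port)
```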

Requirement 1.3: Relative Link Resolution

  • When the initial input is a URL and the config contains an origin field, the crawler MUST resolve relative URLs against that origin
  • When the config does NOT contain an origin field, the crawler MUST:
    • Skip relative URLs
    • Display a warning message indicating relative links were skipped
    • Continue processing absolute URLs normally
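
A sketch of this behavior, assuming a hypothetical resolveLink() helper (not a function named in this PR):

```js
// Resolve a link against the configured origin; skip relative links with
// a warning when no origin is configured.
function resolveLink(href, origin) {
  if (/^https?:\/\//.test(href)) return href; // already absolute
  if (!origin) {
    console.warn(`Skipping relative link (no origin configured): ${href}`);
    return null;
  }
  return new URL(href, origin).href; // e.g. "/docs" -> "https://example.com/docs"
}
```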

Requirement 1.4: Deduplication

  • The crawler MUST track all visited URLs globally
  • The crawler MUST NOT process the same URL more than once
  • URL comparison for deduplication MUST be case-sensitive and exact (including query parameters and fragments)
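
A plain Set of the raw URL strings satisfies this requirement, since it is exact and case-sensitive with no normalization of query parameters or fragments (sketch only; not necessarily the PR's actual data structure):

```js
const visited = new Set();

// Returns true the first time a URL is seen, false on any repeat.
function shouldVisit(url) {
  if (visited.has(url)) return false;
  visited.add(url);
  return true;
}
```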

Requirement 1.5: Input Array Management

  • The crawler MUST append discovered URLs to the end of the inputs array
  • The crawler MUST preserve the order of discovery
  • Discovered URLs MUST be added as new input objects compatible with Doc Detective's existing input processing

Requirement 1.6: Crawl Limits

  • The crawler MUST enforce an internal maximum of 10,000 URLs
  • When this limit is reached, the crawler MUST:
    • Stop discovering new URLs
    • Log a warning indicating the limit was reached
    • Continue processing already-discovered URLs
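
A sketch of how the limit could be enforced while preserving discovery order (the 10,000 cap comes from the requirement; the helper name is hypothetical):

```js
const MAX_CRAWL_URLS = 10000;

// Append newly discovered URLs until the cap is reached, then log a
// warning and ignore further discoveries; already-queued URLs keep processing.
function addDiscovered(inputs, discovered) {
  for (const url of discovered) {
    if (inputs.length >= MAX_CRAWL_URLS) {
      console.warn(`Crawl limit of ${MAX_CRAWL_URLS} URLs reached; ignoring further discoveries.`);
      break;
    }
    inputs.push(url);
  }
  return inputs;
}
```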

Requirement 1.7: Parallelization

  • The crawler SHOULD fetch URLs in parallel where possible to improve performance
  • The implementation SHOULD use Node.js asynchronous patterns appropriately
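
One idiomatic way to satisfy this in Node (sketch only; assumes axios as elsewhere in this PR):

```js
const axios = require("axios");

// Fetch a batch of URLs concurrently; Promise.allSettled keeps the
// successful responses even when some fetches fail.
async function fetchAll(urls) {
  const results = await Promise.allSettled(urls.map((url) => axios.get(url)));
  return results
    .filter((result) => result.status === "fulfilled")
    .map((result) => result.value.data);
}
```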

Default Behavior

Requirement 2.1: Protocol-Based Defaults

  • Crawling MUST be enabled by default for inputs with http:// or https:// protocols
  • Crawling MUST be disabled by default for inputs with any other protocol (e.g., file://)
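
The default can be derived directly from the input string (sketch; the helper name is hypothetical):

```js
// Crawling defaults to enabled only for http(s) inputs.
function defaultCrawlFor(input) {
  return input.startsWith("http://") || input.startsWith("https://");
}

defaultCrawlFor("https://example.com/sitemap.xml"); // true
defaultCrawlFor("file:///docs/readme.md");          // false
```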

Configuration

Requirement 3.1: Config File Field

  • The config file MUST support a crawl boolean field
  • When crawl: true, crawling is enabled regardless of protocol
  • When crawl: false, crawling is disabled regardless of protocol
  • When crawl is not specified, protocol-based defaults apply (Requirement 2.1)

Requirement 3.2: CLI Arguments

  • The CLI MUST support a --crawl argument that sets crawl: true
  • The CLI MUST support a --no-crawl argument that sets crawl: false
  • CLI arguments MUST override the config file's crawl field
  • CLI arguments MUST override protocol-based defaults

Requirement 3.3: Configuration Precedence
The order of precedence from highest to lowest:

  1. CLI arguments (--crawl or --no-crawl)
  2. Config file crawl field
  3. Protocol-based defaults
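
The precedence chain boils down to a few lines (sketch; resolveCrawl() is a hypothetical name, and cliCrawl/configCrawl are undefined when the user did not set them):

```js
function resolveCrawl(cliCrawl, configCrawl, input) {
  if (typeof cliCrawl === "boolean") return cliCrawl;       // 1. --crawl / --no-crawl
  if (typeof configCrawl === "boolean") return configCrawl; // 2. crawl field in config file
  return input.startsWith("http://") || input.startsWith("https://"); // 3. protocol-based default
}
```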

Error Handling

Requirement 4.1: Failed URL Fetches

  • When a URL fetch fails (404, timeout, network error, etc.), the crawler MUST:
    • Log the er...


@coderabbitai
Contributor

coderabbitai bot commented Oct 30, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI changed the title from "[WIP] Add URL crawling capability to Doc Detective" to "Add URL crawling for automatic documentation discovery" on Oct 30, 2025
Copilot AI requested a review from hawkeyexl October 30, 2025 13:34
Copilot finished work on behalf of hawkeyexl October 30, 2025 13:34
@hawkeyexl
Contributor

@copilot Reduce the scope. I want to keep this integration clean and small. Limit support to inputs that begin with "http://" or "https://" and end in "sitemap.xml". If we want to expand to other crawling behavior in the future, that's fine. But we only need sitemap.xml support for initial release.

- Remove HTML and Markdown URL extraction
- Only process URLs ending with sitemap.xml
- Simplify crawler to crawlSitemap function
- Update tests to reflect reduced scope
- 66 tests passing (down from 96)

Co-authored-by: hawkeyexl <[email protected]>
Contributor Author

Copilot AI commented Oct 31, 2025

@copilot Reduce the scope. I want to keep this integration clean and small. Limit support to inputs that begin with "http://" or "https://" and end in "sitemap.xml". If we want to expand to other craw...

Done in commit 6c5378f. Reduced scope to only support URLs that begin with "http://" or "https://" and end with "sitemap.xml".

Changes:

  • Removed HTML <a> tag and Markdown link extraction
  • Only process sitemap.xml URLs
  • Simplified crawlUrls() to crawlSitemap()
  • Updated all tests to focus on sitemap processing
  • 66 tests passing (down from 96)

Copilot finished work on behalf of hawkeyexl October 31, 2025 00:17
…ound

- Update package.json to use [email protected]
- Remove code that strips/restores crawl field during validation (now in schema)
- Simplify utils.js logic since config.crawl always defined (defaults to false)
- Update tests to reflect new default behavior (crawl defaults to false)
- All 66 tests passing

Co-authored-by: hawkeyexl <[email protected]>
Copilot AI changed the title from "Add URL crawling for automatic documentation discovery" to "Add sitemap.xml processing for automatic documentation discovery" on Oct 31, 2025
Copilot AI requested a review from hawkeyexl October 31, 2025 00:39
Copilot finished work on behalf of hawkeyexl October 31, 2025 00:39
@hawkeyexl
Contributor

@copilot check this works against www.doc-detective.com/sitemap.xml.

- Update crawlSitemap to track final URL after axios redirects
- Use responseUrl from axios response for origin comparison
- Fixes issue where www.doc-detective.com redirects to doc-detective.com
- Now correctly discovers 125 URLs from doc-detective.com sitemap
- All 66 tests passing

Co-authored-by: hawkeyexl <[email protected]>
Contributor Author

Copilot AI commented Nov 4, 2025

@copilot check this works against www.doc-detective.com/sitemap.xml.

Tested and working in commit b11bd50. The sitemap at www.doc-detective.com/sitemap.xml is processed successfully:

  • Follows redirect from www.doc-detective.com → doc-detective.com
  • Discovers 125 URLs from the sitemap
  • All URLs correctly pass origin matching (doc-detective.com)

Example discovered URLs:

Copilot finished work on behalf of hawkeyexl November 4, 2025 13:49