Add sitemap.xml processing for automatic documentation discovery #75
base: main
Conversation
@copilot Reduce the scope. I want to keep this integration clean and small. Limit support to inputs that begin with "http://" or "https://" and end in "sitemap.xml". If we want to expand to other crawling behavior in the future, that's fine. But we only need sitemap.xml support for initial release.
- Remove HTML and Markdown URL extraction
- Only process URLs ending with sitemap.xml
- Simplify crawler to crawlSitemap function
- Update tests to reflect reduced scope
- 66 tests passing (down from 96)

Co-authored-by: hawkeyexl <[email protected]>
Done in commit 6c5378f. Reduced scope to only support URLs that begin with "http://" or "https://" and end with "sitemap.xml".
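A minimal sketch of the kind of eligibility check this reduced scope implies (the function name and placement are hypothetical, not the PR's actual code):

```js
// Hypothetical guard matching the reduced scope: only absolute http(s) URLs
// that end in sitemap.xml are handed to the crawler.
function isSitemapUrl(input) {
  return /^https?:\/\//.test(input) && input.endsWith("sitemap.xml");
}

isSitemapUrl("https://example.com/sitemap.xml");    // true
isSitemapUrl("https://example.com/docs/page.html"); // false (not a sitemap)
isSitemapUrl("./docs/sitemap.xml");                 // false (not an http(s) URL)
```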
…ound
- Update package.json to use [email protected]
- Remove code that strips/restores crawl field during validation (now in schema)
- Simplify utils.js logic since config.crawl always defined (defaults to false)
- Update tests to reflect new default behavior (crawl defaults to false)
- All 66 tests passing

Co-authored-by: hawkeyexl <[email protected]>
@copilot check this works against www.doc-detective.com/sitemap.xml.
- Update crawlSitemap to track final URL after axios redirects
- Use responseUrl from axios response for origin comparison
- Fixes issue where www.doc-detective.com redirects to doc-detective.com
- Now correctly discovers 125 URLs from doc-detective.com sitemap
- All 66 tests passing

Co-authored-by: hawkeyexl <[email protected]>
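The fix relies on reading the final, post-redirect URL off the axios response. A minimal sketch, assuming Node's default http adapter (where follow-redirects exposes the final URL as `res.responseUrl`); the helper name is hypothetical:

```js
const axios = require("axios");

// Fetch a sitemap and report the URL we actually ended up at after redirects,
// so same-origin filtering compares against the final origin, not the requested one.
async function fetchSitemap(sitemapUrl) {
  const response = await axios.get(sitemapUrl);
  const finalUrl = response.request?.res?.responseUrl || sitemapUrl;
  return { xml: response.data, finalUrl };
}

// e.g. requesting https://www.doc-detective.com/sitemap.xml ends up at
// https://doc-detective.com/sitemap.xml, and that final origin is the one
// the discovered <loc> URLs get compared against.
```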
Tested and working in commit b11bd50. The sitemap at www.doc-detective.com/sitemap.xml successfully processes:
Example discovered URLs:
URL Crawling Feature Implementation - Sitemap.xml Only
Completed
`<a>` tag extraction

Implementation Summary
Core Functionality:
- `extractXmlSitemapUrls()` - Extracts URLs from `<loc>` tags in XML sitemaps
- `isSameOrigin()` - Strict origin matching (protocol, hostname, port)
- `crawlSitemap()` - Processes a single sitemap and returns same-origin URLs
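For illustration, a minimal sketch of what these three helpers might look like; this is a simplified assumption, not the PR's actual code (the fetcher is injected so the axios details stay out of the way):

```js
// Pull every <loc>…</loc> value out of a sitemap XML string.
function extractXmlSitemapUrls(xml) {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1].trim());
}

// Strict origin match: protocol, hostname, and port must all agree.
function isSameOrigin(candidateUrl, baseUrl) {
  try {
    const a = new URL(candidateUrl);
    const b = new URL(baseUrl);
    return a.protocol === b.protocol && a.hostname === b.hostname && a.port === b.port;
  } catch {
    return false; // anything unparsable is never same-origin
  }
}

// Fetch one sitemap and return deduplicated, same-origin URLs.
// fetchXml is assumed to return { xml, finalUrl } (see the redirect sketch above).
async function crawlSitemap(sitemapUrl, fetchXml) {
  const { xml, finalUrl } = await fetchXml(sitemapUrl);
  const urls = extractXmlSitemapUrls(xml);
  return [...new Set(urls)].filter((url) => isSameOrigin(url, finalUrl));
}
```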
Integration:
- Integrated into `qualifyFiles()` in `src/utils.js`
- Only inputs matching `(http|https)://...sitemap.xml` are processed
Configuration:
- `crawl` boolean field: defaults to `false` in [email protected] schema
- Set to `true` to enable sitemap processing
- Enable via `crawl: true` in config or `--crawl` via CLI

Testing:
Dependencies:
- `crawl` field schema support

Original prompt
This section details the original issue you should resolve.
<issue_title>Input Crawling</issue_title>
<issue_description>This feature adds URL crawling capability to Doc Detective, enabling automatic discovery of additional test inputs by following links found in initial input documents. The crawler respects same-origin restrictions and provides configurable opt-in/opt-out behavior based on protocol.
Goals
Non-Goals
User Stories
As a technical writer, I want Doc Detective to automatically discover and test all pages in my documentation site so that I don't have to manually list every URL in my config.
As a technical writer, I want Doc Detective to find all linked local files when I specify a local path to crawl.
As a technical writer, I want crawling to respect my site's boundaries so that tests don't accidentally follow external links or navigate to unrelated content.
As a technical writer, I want to disable crawling for specific test runs so that I can test individual pages during development without processing the entire site.
Functional Requirements
Core Crawling Behavior
Requirement 1.1: URL Pattern Recognition
- `<a>` tags with `href` attributes
- `[text](url)` syntax

Requirement 1.2: Origin Matching
`https://example.com:443/page1` can crawl to `https://example.com:443/page2` but NOT to:
- `http://example.com/page2` (different protocol)
- `https://subdomain.example.com/page2` (different domain)
- `https://example.com:8080/page2` (different port)
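A small illustration of this strictness with the WHATWG URL API, using the example URLs above; note that `new URL(...).port` is empty for a protocol's default port, so `:443` on an https URL compares equal to no explicit port:

```js
const base = new URL("https://example.com:443/page1");

for (const candidate of [
  "https://example.com:443/page2",       // same origin: allowed
  "http://example.com/page2",            // different protocol
  "https://subdomain.example.com/page2", // different domain
  "https://example.com:8080/page2",      // different port
]) {
  const u = new URL(candidate);
  const sameOrigin =
    u.protocol === base.protocol &&
    u.hostname === base.hostname &&
    u.port === base.port; // "" for default ports, e.g. 443 on https
  console.log(candidate, sameOrigin); // true only for the first URL
}
```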
Requirement 1.3: Relative Link Resolution
- When an `origin` field is specified, the crawler MUST resolve relative URLs against that origin
- When no `origin` field is specified, the crawler MUST:
Requirement 1.5: Input Array Management
Requirement 1.6: Crawl Limits
Requirement 1.7: Parallelization
Default Behavior
Requirement 2.1: Protocol-Based Defaults
- Crawling is enabled by default for inputs using `http://` or `https://` protocols
- Crawling is disabled by default for local file inputs (`file://`)

Configuration
Requirement 3.1: Config File Field
- The config supports a `crawl` boolean field
- When `crawl: true`, crawling is enabled regardless of protocol
- When `crawl: false`, crawling is disabled regardless of protocol
- When `crawl` is not specified, protocol-based defaults apply (Requirement 2.1)
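For reference, a minimal config sketch with the field from this requirement; the `input` value is only illustrative:

```json
{
  "input": "https://example.com/sitemap.xml",
  "crawl": true
}
```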
Requirement 3.2: CLI Arguments
- A `--crawl` argument that sets `crawl: true`
- A `--no-crawl` argument that sets `crawl: false`
- CLI arguments take precedence over the config file's `crawl` field

Requirement 3.3: Configuration Precedence
The order of precedence from highest to lowest:
1. CLI arguments (`--crawl` or `--no-crawl`)
2. Config file `crawl` field
3. Protocol-based defaults (Requirement 2.1)
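Read as a fall-through, this precedence might look like the following hypothetical sketch (the protocol default assumes Requirement 2.1 as reconstructed above):

```js
// Hypothetical resolution of the effective crawl setting, highest precedence first.
function resolveCrawl({ cliCrawl, configCrawl, protocol }) {
  if (cliCrawl !== undefined) return cliCrawl;       // --crawl / --no-crawl
  if (configCrawl !== undefined) return configCrawl; // config file crawl field
  return protocol === "http:" || protocol === "https:"; // protocol-based default
}

resolveCrawl({ cliCrawl: false, configCrawl: true, protocol: "https:" }); // false
resolveCrawl({ configCrawl: true, protocol: "file:" });                   // true
resolveCrawl({ protocol: "file:" });                                      // false
```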
Error Handling

Requirement 4.1: Failed URL Fetches