Feature Description
Add a skill-seekers sync-config subcommand that crawls a docs site's navigation and automatically diffs/updates the start_urls in an existing config file against what's currently live on the site.
Use Case
Docs sites add new pages over time. Keeping a config's start_urls in sync currently requires manually visiting the live site, collecting all nav links, and comparing them against the config: a tedious, error-prone process. For example, the Claude Code docs recently added 11 new pages (how-claude-code-works, features-overview, agent-teams, fast-mode, etc.) and renamed iam → authentication, none of which were reflected in the existing config until manually audited.
Proposed Solution
Add a sync-config subcommand (and corresponding skill-seekers-sync-config entry point) that:
- Loads an existing config JSON
- Visits a set of "nav seed" pages, either from a new optional `nav_seed_urls` config field, or defaulting to the existing `start_urls`
- Collects all internal links matching `url_patterns.include` from those pages
- Diffs discovered URLs against the config's `start_urls`
- Reports added/removed pages to stdout
- With `--apply`, writes the updated `start_urls` back to the config file (preserving formatting and other fields)
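The diff and report steps above could be sketched roughly as follows. This is a minimal illustration rather than the project's implementation: `normalize`, `diff_start_urls`, and `format_report` are hypothetical names, the URL normalization rules are an assumption, and the crawl step is assumed to have already produced a list of discovered URLs.

```python
from urllib.parse import urlparse


def normalize(url):
    """Drop query, fragment, and trailing slash so equivalent URLs compare equal.

    These normalization rules are an assumption, not documented behavior.
    """
    parsed = urlparse(url)
    path = parsed.path.rstrip("/") or "/"
    return f"{parsed.scheme}://{parsed.netloc}{path}"


def diff_start_urls(configured, discovered):
    """Return (added, removed) URL lists, each sorted for stable output."""
    configured_set = {normalize(u) for u in configured}
    discovered_set = {normalize(u) for u in discovered}
    added = sorted(discovered_set - configured_set)
    removed = sorted(configured_set - discovered_set)
    return added, removed


def format_report(added, removed):
    """Render a dry-run report in the spirit of the example output."""
    lines = [f"+ {urlparse(u).path} (new)" for u in added]
    lines += [f"- {urlparse(u).path} (removed)" for u in removed]
    lines.append(f"{len(added)} new, {len(removed)} removed")
    return "\n".join(lines)
```

A plain set difference cannot tell that a removed page was renamed, so reporting "replaced by" hints would need extra logic (for example, following redirects) that this sketch leaves out.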
```bash
# Dry-run: show what's changed
skill-seekers sync-config --config configs/claude-code.json
# Apply changes to the config file
skill-seekers sync-config --config configs/claude-code.json --apply
# Example output:
# + /docs/en/agent-teams (new)
# + /docs/en/fast-mode (new)
# - /docs/en/iam (removed, replaced by /docs/en/authentication)
# Config has 3 new pages and 1 removed page.
# Run with --apply to update configs/claude-code.json
```

The new optional config field:
```json
{
  "nav_seed_urls": [
    "https://code.claude.com/docs/en/overview",
    "https://code.claude.com/docs/en/sub-agents",
    "https://code.claude.com/docs/en/setup",
    "https://code.claude.com/docs/en/settings",
    "https://code.claude.com/docs/en/cli-reference"
  ]
}
```

Alternatives Considered
- Sitemap parsing: Ideal, but many docs sites (including Claude Code's) don't publish a sitemap.
- Rely on BFS scraper alone: The existing scraper already discovers all linked pages via BFS, so `start_urls` don't strictly need to enumerate every page. However, explicitly listing pages in the config is useful for documentation and for catching orphaned/unlinked pages.
- Manual auditing: What we do today (visit section pages, extract links, compare manually). Works but doesn't scale.
Expected Impact
- Priority: Medium
- Effort: S
- Users Affected: Anyone maintaining configs for long-lived docs sites that update regularly (e.g. Claude Code, React, Django, FastAPI docs).
Additional Context
Discovered while auditing configs/claude-code.json against the live site. The Claude Code docs added 11 new pages and renamed one (iam → authentication) since the config was last updated, none detectable without a manual browser audit.
The scraper's existing BFS logic in doc_scraper.py (scrape_all()) already handles link extraction; sync-config would reuse or factor out that link-collection logic in a lightweight, non-scraping pass.
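As a rough idea of what a lightweight, non-scraping link-collection pass might look like, here is a sketch using only the standard library. It does not reproduce the logic in `scrape_all()`; `LinkCollector` and `collect_nav_links` are hypothetical names, and treating `url_patterns.include` / `url_patterns.exclude` as plain substring matches is an assumption about their semantics.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute, fragment-free hrefs from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL, then drop "#fragment".
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.add(absolute)


def collect_nav_links(html, base_url, include=(), exclude=()):
    """Return sorted same-host links that pass the include/exclude filters.

    Patterns are treated as substrings (an assumption); an empty include
    list matches everything.
    """
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return sorted(
        link
        for link in collector.links
        if urlparse(link).netloc == host
        and (not include or any(p in link for p in include))
        and not any(p in link for p in exclude)
    )
```

The function takes already-fetched HTML so that fetching stays separate from parsing, which keeps the pass easy to test without network access.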
Acceptance Criteria
- `skill-seekers sync-config --config <path>` prints a diff of new/removed URLs vs the config's `start_urls`
- `--apply` flag writes updated `start_urls` back to the config JSON in-place
- Respects `url_patterns.include` / `url_patterns.exclude` from the config when filtering discovered links
- Supports optional `nav_seed_urls` config field; falls back to `start_urls` as seeds if not set
- MCP tool `sync_config` added alongside existing config tools
- Tests added in `tests/test_sync_config.py`
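Two of these criteria, the seed fallback and the `--apply` write-back, might be sketched as below. The helper names `resolve_seed_urls` and `apply_start_urls` are hypothetical, and re-serializing with a fixed two-space indent is an assumption: it keeps key order and other fields but only approximates preserving the file's original formatting.

```python
import json
from pathlib import Path


def resolve_seed_urls(config):
    """Prefer nav_seed_urls; fall back to start_urls when it is missing or empty."""
    return config.get("nav_seed_urls") or config.get("start_urls", [])


def apply_start_urls(config_path, new_start_urls, indent=2):
    """Rewrite only start_urls in the config file, leaving other fields as loaded."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Deduplicate and sort so repeated runs produce a stable file.
    config["start_urls"] = sorted(set(new_start_urls))
    path.write_text(
        json.dumps(config, indent=indent, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

Sorting keeps the output deterministic, at the cost of discarding any hand-chosen ordering in the existing `start_urls`; an implementation that cares about ordering could append new URLs instead.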