[FEATURE] Add sync-config command to detect and update config start_urls from live docs sites #306

@miethe

🚀 Feature Description

Add a skill-seekers sync-config subcommand that crawls a docs site's navigation, diffs the start_urls in an existing config file against the pages currently live on the site, and optionally updates the config.

💡 Use Case

Docs sites add new pages over time. Keeping a config's start_urls in sync currently requires manually visiting the live site, collecting all nav links, and comparing them against the config, which is tedious and error-prone. For example, the Claude Code docs recently added 11 new pages (how-claude-code-works, features-overview, agent-teams, fast-mode, etc.) and renamed iam → authentication, none of which were reflected in the existing config until manually audited.

📋 Proposed Solution

Add a sync-config subcommand (and corresponding skill-seekers-sync-config entry point) that:

  1. Loads an existing config JSON
  2. Visits a set of "nav seed" pages β€” either from a new optional nav_seed_urls config field, or defaulting to the existing start_urls
  3. Collects all internal links matching url_patterns.include from those pages
  4. Diffs discovered URLs against the config's start_urls
  5. Reports added/removed pages to stdout
  6. With --apply, writes the updated start_urls back to the config file (preserving formatting and other fields)

# Dry-run: show what's changed
skill-seekers sync-config --config configs/claude-code.json

# Apply changes to the config file
skill-seekers sync-config --config configs/claude-code.json --apply

# Example output:
# + /docs/en/agent-teams        (new)
# + /docs/en/authentication     (new)
# + /docs/en/fast-mode          (new)
# - /docs/en/iam                (removed, replaced by /docs/en/authentication)
# Config has 3 new pages and 1 removed page.
# Run with --apply to update configs/claude-code.json
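
The diff-and-report steps (4 and 5) could be sketched roughly as below. This is only an illustration under assumptions: `ConfigDiff`, `diff_start_urls`, and `format_report` are hypothetical names, not existing skill-seekers APIs, and treating a trailing slash as insignificant is a guess about how URLs should be normalized.

```python
from dataclasses import dataclass


@dataclass
class ConfigDiff:
    added: list    # live on the site but missing from start_urls
    removed: list  # listed in start_urls but no longer found on the site


def diff_start_urls(start_urls, discovered_urls):
    """Compare the config's start_urls with URLs discovered on the live site."""
    # Assumption: a trailing slash does not distinguish two pages.
    current = {u.rstrip("/") for u in start_urls}
    live = {u.rstrip("/") for u in discovered_urls}
    return ConfigDiff(added=sorted(live - current), removed=sorted(current - live))


def format_report(diff, config_path):
    """Render the diff as '+ url (new)' / '- url (removed)' lines plus a summary."""
    lines = [f"+ {url}  (new)" for url in diff.added]
    lines += [f"- {url}  (removed)" for url in diff.removed]
    if lines:
        lines.append(
            f"Config has {len(diff.added)} new and {len(diff.removed)} removed page(s)."
        )
        lines.append(f"Run with --apply to update {config_path}")
    else:
        lines.append("Config is up to date.")
    return "\n".join(lines)
```

Detecting renames (the "replaced by" hint in the example output) is not covered here; a plain set difference only sees one removal and one addition.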

The new optional config field:

{
  "nav_seed_urls": [
    "https://code.claude.com/docs/en/overview",
    "https://code.claude.com/docs/en/sub-agents",
    "https://code.claude.com/docs/en/setup",
    "https://code.claude.com/docs/en/settings",
    "https://code.claude.com/docs/en/cli-reference"
  ]
}
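
The fallback from nav_seed_urls to start_urls might look something like this minimal sketch. `resolve_seed_urls` is a hypothetical helper name, and treating an empty nav_seed_urls list the same as a missing field is an assumption.

```python
def resolve_seed_urls(config):
    """Pick the pages to visit when looking for nav links.

    Prefers the optional nav_seed_urls field and falls back to start_urls.
    Assumption: an empty nav_seed_urls list counts as "not set".
    """
    seeds = config.get("nav_seed_urls") or config.get("start_urls", [])
    # Drop duplicates while keeping the order given in the config.
    return list(dict.fromkeys(seeds))
```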

🔄 Alternatives Considered

  • Sitemap parsing: Ideal, but many docs sites (including Claude Code's) don't publish a sitemap.
  • Rely on BFS scraper alone: The existing scraper already discovers all linked pages via BFS, so start_urls don't strictly need to enumerate every page. However, explicitly listing pages in the config is useful for documentation and for catching orphaned/unlinked pages.
  • Manual auditing: What we do today (visit section pages, extract links, compare manually). Works but doesn't scale.

📊 Expected Impact

  • Priority: Medium
  • Effort: S
  • Users Affected: Anyone maintaining configs for long-lived docs sites that update regularly (e.g. Claude Code, React, Django, FastAPI docs).

πŸ“ Additional Context

Discovered while auditing configs/claude-code.json against the live site. The Claude Code docs added 11 new pages and renamed one (iam → authentication) since the config was last updated, none of which were detectable without a manual browser audit.

The scraper's existing BFS logic in doc_scraper.py (scrape_all()) already handles link extraction; sync-config would reuse or factor out that link-collection logic in a lightweight, non-scraping pass.
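
For illustration, a lightweight link-collection pass could look like the sketch below. It does not reuse the code in doc_scraper.py; it uses only the standard library, and `collect_links` is a hypothetical name. Two assumptions are baked in: "internal" means the same host as the page, and url_patterns.include / url_patterns.exclude are matched as plain substrings.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_links(html, page_url, include=(), exclude=()):
    """Return sorted internal links on one page that pass the pattern filters."""
    parser = _LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    found = set()
    for href in parser.hrefs:
        # Resolve relative links and drop #fragments so anchors don't count as pages.
        url, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(url).netloc != host:
            continue  # external link (or mailto:, etc.)
        if include and not any(pattern in url for pattern in include):
            continue
        if any(pattern in url for pattern in exclude):
            continue
        found.add(url)
    return sorted(found)
```

Taking the HTML as a string keeps fetching separate, so the same function works with whatever HTTP client the scraper already uses.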

✅ Acceptance Criteria

  • skill-seekers sync-config --config <path> prints a diff of new/removed URLs vs the config's start_urls
  • --apply flag writes updated start_urls back to the config JSON in-place
  • Respects url_patterns.include / url_patterns.exclude from the config when filtering discovered links
  • Supports optional nav_seed_urls config field; falls back to start_urls as seeds if not set
  • MCP tool sync_config added alongside existing config tools
  • Tests added in tests/test_sync_config.py
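
As a rough sketch of the --apply criterion, the write-back step could be as small as the following. `apply_start_urls` is a hypothetical name. The sketch keeps all other fields and their key order, but it re-serializes the file with 2-space indentation, so it only "preserves formatting" if the config already uses that style (an assumption).

```python
import json
from pathlib import Path


def apply_start_urls(config_path, new_start_urls):
    """Rewrite start_urls in a config JSON file, leaving other fields untouched."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assigning to an existing key keeps its position in the file;
    # a missing start_urls key would be appended at the end.
    config["start_urls"] = sorted(set(new_start_urls))
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return config
```

A test in tests/test_sync_config.py could write a small config to a temporary directory, call the function, and assert that only start_urls changed.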
