Add option to write JSONL file with data on URLs not queued #966

Draft
tw4l wants to merge 15 commits into main from issue-965-urls-not-queued-list
tw4l (Member) commented Feb 5, 2026

Fixes #965

Adds a --listNotQueued argument. When it is set, the crawler writes a reports/notQueued.jsonl file containing one entry, with the following fields, for each URL that was encountered but not queued:

  • url
  • seedUrl
  • depth
  • reason (one of outOfScope, pageLimit, or robotsTxt)
  • ts
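
To make the record shape concrete, here is a minimal sketch of what one line of reports/notQueued.jsonl might look like and how a consumer could tally entries by reason. The field names and the three reason values come from the list above; the field types (a numeric depth, an ISO 8601 string for ts), the helper names toJsonlLine and countByReason, and the sample URLs are assumptions made for illustration, not the crawler's actual implementation.

```typescript
// Hypothetical shape of one line in reports/notQueued.jsonl.
// Field names follow the PR description; the types are assumptions.
type NotQueuedReason = "outOfScope" | "pageLimit" | "robotsTxt";

interface NotQueuedEntry {
  url: string;
  seedUrl: string;
  depth: number; // assumed: numeric link depth from the seed
  reason: NotQueuedReason;
  ts: string; // assumed: ISO 8601 timestamp
}

const KNOWN_REASONS: readonly string[] = ["outOfScope", "pageLimit", "robotsTxt"];

// Serialize one entry as a single JSONL line (one JSON object per line).
function toJsonlLine(entry: NotQueuedEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Parse JSONL text and count entries per reason, skipping blank lines.
function countByReason(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const entry = JSON.parse(line) as NotQueuedEntry;
    if (KNOWN_REASONS.indexOf(entry.reason) === -1) {
      throw new Error("unexpected reason: " + String(entry.reason));
    }
    counts[entry.reason] = (counts[entry.reason] || 0) + 1;
  }
  return counts;
}

// Illustrative sample data only; these URLs are made up.
const sampleEntries: NotQueuedEntry[] = [
  { url: "https://example.com/other", seedUrl: "https://example.com/", depth: 1, reason: "outOfScope", ts: "2026-02-05T12:00:00.000Z" },
  { url: "https://example.com/private", seedUrl: "https://example.com/", depth: 2, reason: "robotsTxt", ts: "2026-02-05T12:00:01.000Z" },
  { url: "https://example.org/", seedUrl: "https://example.com/", depth: 1, reason: "outOfScope", ts: "2026-02-05T12:00:02.000Z" },
];

const sampleJsonl = sampleEntries.map(toJsonlLine).join("");
console.log(countByReason(sampleJsonl));
```

Because each entry is a complete JSON object on its own line, the file can be appended to during the crawl and processed line by line afterwards without loading it as a single document.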

The reports/ directory is new but could be expanded with other crawl-time reporting moving forward. I also considered using a directory like excludedPages/, but making a directory for a single file felt silly, and we already use "exclude" to mean a specific thing that is just a subset of why a URL might be encountered but not queued.

tw4l changed the title from "Add option to write page JSONL file with all pages not queued" to "Add option to write JSONL file with data on URLs not queued" on Feb 5, 2026
tw4l force-pushed the issue-965-urls-not-queued-list branch from 25a80f2 to 4f78dde on February 10, 2026 21:19
tw4l marked this pull request as ready for review on February 10, 2026 22:52
tw4l requested a review from ikreymer on February 10, 2026 22:52
tw4l marked this pull request as draft on February 11, 2026 15:11
tw4l (Member, Author) commented Feb 11, 2026

Moving back to draft while I work on adding the new reports directory to the WACZ.

tw4l force-pushed the issue-965-urls-not-queued-list branch from 220b267 to 9897393 on February 11, 2026 18:33
Development

Successfully merging this pull request may close these issues.

Keep track of URLs not crawled/queued