Handle Browsertrix's now-normalized URLs when importing WARCs #909

Merged
Mr0grog merged 2 commits into main from browsertrix-made-some-impactful-changes
Feb 24, 2026

Mr0grog commented Feb 24, 2026

In v1.11.2, Browsertrix started normalizing seed URLs, so it no longer necessarily requests the actual configured URLs from the seed file. That means we can no longer just look for the seeds in the resulting WARCs when we want to extract page data: the request/response records in the WARCs will sometimes use URLs that are not semantically equivalent to the seeds.

For more on this, see:

We already do some normalization, but only that which is strictly equivalent in a URL (e.g. lower-casing hostnames). This adds a second function for making “matchable” URLs that is used instead of the normalized URL in places where we need to find WARC records that match seed URLs. (Crucially, this is only used for matching — when we import data into other systems, we always use the actual URL from the WARC record.)

Right now matchable_url() makes pretty minimal changes (it just sorts the query parameters), but we may want to make it more aggressive (at the extreme end, we could just use SURT). As it stands, we’d have to change this again if Browsertrix made its normalizations more aggressive.
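
For illustration, here is a rough sketch of the idea, not the code in this PR. It assumes matchable_url() only needs to sort query parameters, and index_by_matchable_url() is a hypothetical helper showing how records can be looked up by matchable key while keeping their actual WARC URLs for import.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def matchable_url(url: str) -> str:
    """Return a comparison key for a URL. Use it for matching only."""
    parts = urlsplit(url)
    # Sort parameters by name, then value, so ordering differences vanish.
    # parse_qsl/urlencode also re-encode values (e.g. "%20" becomes "+"),
    # which is fine here because both sides pass through this function.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return parts._replace(query=urlencode(params)).geturl()


def index_by_matchable_url(record_urls):
    """Hypothetical helper: map each matchable key to the actual URLs."""
    index = {}
    for url in record_urls:
        index.setdefault(matchable_url(url), []).append(url)
    return index


# The seed as configured, and the URL as it appears in the WARC record.
seed = "https://example.gov/data?b=2&a=1"
warc_urls = ["https://example.gov/data?a=1&b=2"]

index = index_by_matchable_url(warc_urls)
# Look up by matchable key, but get back the URL from the WARC record.
matches = index.get(matchable_url(seed), [])
```

The lookup goes through the matchable key on both sides, and what comes back is the URL exactly as recorded in the WARC, which is the value that would be imported.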

Instead of RequestRecords and RequestIndexes we have HttpExchangeWhatevers, since these are about requests *and* responses.

Mr0grog merged commit 021fe49 into main Feb 24, 2026
12 checks passed
Mr0grog deleted the browsertrix-made-some-impactful-changes branch February 24, 2026 10:59
github-project-automation bot moved this from Inbox to Done in Web Monitoring Feb 24, 2026

Mr0grog commented Feb 24, 2026

NOTE: another refactor I’d had sitting around came along for the ride here. It’s just renaming a couple models and variables to be a little clearer about what they represent.

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-crawler that referenced this pull request Feb 24, 2026
This is basically just pulling the following into the crawler import pipeline: edgi-govdata-archiving/web-monitoring-processing#909
Mr0grog added a commit that referenced this pull request Feb 24, 2026
This is another bit of Browsertrix normalization I missed in #909.
Mr0grog added a commit that referenced this pull request Feb 24, 2026
This is another bit of Browsertrix normalization I missed in #909. Makes WARC import a little more stable and reliable.
Mr0grog added a commit that referenced this pull request Feb 25, 2026
A couple more items following on from #909 and #910 that came up when I did a more exhaustive check of *all* our URLs in current Browsertrix, and of the source for the normalizer Browsertrix uses.

This should cover everything until Browsertrix releases an update that changes how it normalizes.