Handle Browsertrix's now-normalized URLs when importing WARCs by Mr0grog · Pull Request #909 · edgi-govdata-archiving/web-monitoring-processing

Mr0grog · 2026-02-24T10:57:37Z

In v1.11.2, Browsertrix started normalizing seed URLs, so it no longer necessarily requests the actual configured URLs from the seed file. That means we can no longer just look for the seeds in the resulting WARCs when we want to extract page data, because the request/response records in the WARCs will sometimes use URLs that are not semantically equivalent to the seeds.

For more on this, see:

bug(normalize): normalize urls for seeds, address array-valued headers webrecorder/browsertrix-crawler#959
Seed URL normalization in v1.11.2 changes what URLs are fetched and recorded webrecorder/browsertrix-crawler#982

We already do some normalization, but only that which is strictly equivalent in a URL (e.g. lower-casing hostnames). This adds a second function for making “matchable” URLs that is used instead of the normalized URL in places where we need to find WARC records that match seed URLs. (Crucially, this is only used for matching — when we import data into other systems, we always use the actual URL from the WARC record.)

Right now matchable_url() makes pretty minimal changes (it just sorts the query parameters), but we may want to make it more aggressive (at the extreme end, we could just use SURT). As it stands, we’d have to change this again if Browsertrix made its normalizations more aggressive.
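For illustration, here is a minimal sketch of the matching idea, not the code in this PR. It assumes matching keys are built by sorting query parameters with the standard library's `urllib.parse`. The helpers `index_by_matchable_url()` and `find_records()` are hypothetical names made up for this example, and this `matchable_url()` is only a guess at the general shape of the real one.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def matchable_url(url: str) -> str:
    """Sketch: build a key for *comparing* URLs. Never store this value."""
    parts = urlsplit(url)
    # Sort query parameters so that ordering differences between the seed
    # and the recorded URL do not prevent a match.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def index_by_matchable_url(record_urls):
    """Hypothetical helper: map each matching key to the actual record URLs."""
    index = {}
    for url in record_urls:
        index.setdefault(matchable_url(url), []).append(url)
    return index


def find_records(index, seed_url):
    """Hypothetical helper: look up a seed, returning the URLs as recorded."""
    return index.get(matchable_url(seed_url), [])


# The crawler recorded a reordered query string; the seed still matches,
# and what comes back is the URL exactly as it appears in the WARC record.
recorded = ["https://example.gov/page?a=1&b=2"]
index = index_by_matchable_url(recorded)
matches = find_records(index, "https://example.gov/page?b=2&a=1")
```

Note that `parse_qsl()` and `urlencode()` may also re-encode query values, which is acceptable for a lookup key but is one more reason to import the record's own URL rather than the key.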

Mr0grog · 2026-02-24T11:00:12Z

NOTE: another refactor I’d had sitting around came along for the ride here. It’s just renaming a couple models and variables to be a little clearer about what they represent.

This is basically just pulling the following into the crawler import pipeline: edgi-govdata-archiving/web-monitoring-processing#909

A couple more items following on from #909 and #910 that came up when I did a more exhaustive check of *all* our URLs in current Browsertrix, and of the source for the normalizer Browsertrix uses. This should cover everything until Browsertrix releases an update that changes how it normalizes.

Mr0grog added 2 commits February 24, 2026 00:06

Rename models and variables for clarity

7cadeb7

Instead of RequestRecords and RequestIndexes we have HttpExchangeWhatevers, since these are about requests *and* responses.

Match Browsertrix's normalized URLs when importing

ed3ebe2

Mr0grog added this to Web Monitoring Feb 24, 2026

github-project-automation bot moved this to Inbox in Web Monitoring Feb 24, 2026

Mr0grog merged commit 021fe49 into main Feb 24, 2026
12 checks passed

Mr0grog deleted the browsertrix-made-some-impactful-changes branch February 24, 2026 10:59

github-project-automation bot moved this from Inbox to Done in Web Monitoring Feb 24, 2026

Mr0grog added a commit that referenced this pull request Feb 24, 2026

Trim trailing slashes when matching URLs

386932c

This is another bit of Browsertrix normalization I missed in #909.

Mr0grog mentioned this pull request Feb 24, 2026

Trim trailing slashes when matching URLs #910

Merged

Mr0grog added a commit that referenced this pull request Feb 24, 2026

Trim trailing slashes when matching URLs (#910)

47beb9f

This is another bit of Browsertrix normalization I missed in #909. Makes WARC import a little more stable and reliable.

Mr0grog mentioned this pull request Feb 25, 2026

Browsertrix normalization handling, take 3 #912

Merged

Mr0grog merged 2 commits into main from browsertrix-made-some-impactful-changes