Handle Browsertrix's now-normalized URLs when importing WARCs #909
Merged
Instead of RequestRecords and RequestIndexes we have HttpExchangeWhatevers, since these are about requests *and* responses.
NOTE: another refactor I’d had sitting around came along for the ride here. It’s just renaming a couple models and variables to be a little clearer about what they represent.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-crawler that referenced this pull request (Feb 24, 2026)
This is basically just pulling the following into the crawler import pipeline: edgi-govdata-archiving/web-monitoring-processing#909
Mr0grog added a commit that referenced this pull request (Feb 24, 2026)
This is another bit of Browsertrix normalization I missed in #909.
Mr0grog added a commit that referenced this pull request (Feb 24, 2026)
This is another bit of Browsertrix normalization I missed in #909. Makes WARC import a little more stable and reliable.
In v1.11.2, Browsertrix started normalizing seed URLs, so it no longer necessarily requests the exact URLs configured in the seed file. That means we can no longer just look for the seeds in the resulting WARCs when we want to extract page data: the request/response records in the WARCs will sometimes use URLs that are not semantically equivalent to the seeds.
For more on this, see:
We already do some normalization, but only that which is strictly equivalent in a URL (e.g. lower-casing hostnames). This adds a second function for making “matchable” URLs that is used instead of the normalized URL in places where we need to find WARC records that match seed URLs. (Crucially, this is only used for matching — when we import data into other systems, we always use the actual URL from the WARC record.)
Right now, matchable_url() makes pretty minimal changes (it just sorts the query parameters), but we may want to make it more aggressive (at the extreme end, we could just use SURT). As it stands, we’d have to change this again if Browsertrix made its normalizations more aggressive.
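To make the idea concrete, here is a minimal sketch of what a query-sorting matchable_url() and a seed lookup could look like. It is not the code from this PR: the function body, the index_records() and find_record_urls() helpers, and the example.gov URLs are all assumptions for illustration. The key point it tries to show is that the matchable form is only a lookup key, while the values returned are the actual URLs from the records.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit


def matchable_url(url: str) -> str:
    """Hypothetical sketch: build a key for matching URLs, not for storage."""
    parts = urlsplit(url)
    # Sort the query parameters so that ordering differences do not
    # prevent a match. keep_blank_values=True keeps parameters like "a=".
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # Note: urlencode() re-encodes values, so the result may differ in
    # percent-encoding from the input. That is acceptable for a lookup key.
    return parts._replace(query=urlencode(params)).geturl()


def index_records(record_urls):
    """Hypothetical helper: map each matchable key to the actual record URLs."""
    index = defaultdict(list)
    for url in record_urls:
        index[matchable_url(url)].append(url)
    return index


def find_record_urls(seed_url, index):
    """Return the actual record URLs whose matchable form equals the seed's."""
    return index.get(matchable_url(seed_url), [])


# Made-up example: the crawler reordered the seed's query parameters.
records = ["https://example.gov/search?a=1&b=2", "https://example.gov/about"]
index = index_records(records)
found = find_record_urls("https://example.gov/search?b=2&a=1", index)
```

Here `found` holds the URL exactly as it appears in the record, which is what would be imported into other systems. A more aggressive key (such as SURT) would only require swapping the body of matchable_url() in this sketch.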