-
-
Notifications
You must be signed in to change notification settings - Fork 126
bug(normalize): normalize urls for seeds, address array-valued headers #953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: add-click-links
Are you sure you want to change the base?
bug(normalize): normalize urls for seeds, address array-valued headers #953
Conversation
|
I ran into this issue last night while trying to archive some papers and studies hosted by EPA that we believe are at risk of deletion (e.g. https://ordspub.epa.gov/ords/eims/eimscomm.getfile?p_download_id=552173) and which contain duplicate This solution seems like it would work 🎉, BUT I kinda feel like this is an upstream issue in Warcio… concatenating with a comma is not quite faithful to the original response, and it seems like it would be better if Warcio was only choosy about which multi-valued WARC record headers were allowed, and not about the HTTP headers that are in the WARC record’s payload. It seems like the same test and multi-value function/method is being used for both cases, and that’s not ideal. |
|
Nevermind, I spoke too soon! Looks like it did get fixed upstream in 2.4.8: webrecorder/warcio.js#94 |
|
@badgersec thanks for the fix, I think the normalization definitely makes sense! We are in the process of fixing the header issue as well, @Mr0grog, there's actually a follow-up PR: |
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
Rebased on #952 -- test results inline. command: result: |
URL Normalizing for seeds
The crawler failed on URLs with query parameters because state.ts normalized them alphabetically when queueing but seeds.ts didn't normalize before matching, causing "Page no longer in scope" errors.
results in "Page not in scope":
whereas, in a previous version, 1.9.2 resulted in is success.
Different array-valued headers fix:
Running a crawl with additional extraHops results in "not a valid multi value header":
This was because
recorder.tscalledmultiValueHeader()on all array-valued headers, but warcio.js 2.4+ only allows set-cookie, warc-concurrent-to, and warc-protocol, causing "not a valid multi value header" errors on Facebook and other sites that return different array-valued headers.