Conversation
Include the actual failure reason (timeout, connection error, HTTP status code) in retry and warning messages so it is clear why archive.org requests failed. Increase the CDX timeout from 30s to 60s.
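A minimal sketch of how the failure reason might be surfaced in those retry/warning messages. `retry_reason` and the message format are illustrative, not the project's actual helpers, and the real code presumably keys off its HTTP client's exception types rather than the generic built-ins used here:

```python
def retry_reason(exc=None, status=None):
    """Build a human-readable reason for a failed archive.org request.

    Hypothetical helper: maps an exception or HTTP status code to the
    short reason string included in retry/warning log messages.
    """
    if exc is not None:
        if isinstance(exc, TimeoutError):
            return "timeout"
        if isinstance(exc, ConnectionError):
            return "connection error"
        return type(exc).__name__
    if status is not None:
        return f"HTTP {status}"
    return "unknown error"


print(f"Retrying CDX request: {retry_reason(exc=TimeoutError())}")  # Retrying CDX request: timeout
print(f"Retrying CDX request: {retry_reason(status=429)}")          # Retrying CDX request: HTTP 429
```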
```python
assert "archive_url" in finding.data, (
    f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
)
assert "web.archive.org" in finding.data["archive_url"], (
```
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High test
Copilot Autofix
AI 10 days ago
In general, the way to fix incomplete URL substring sanitization is to parse the URL using a standard library, extract the hostname, and then compare that hostname (or a suffix of it) to the expected allowed host, instead of checking for a substring in the raw URL string.
In this specific case, we should change the assertion that currently does assert "web.archive.org" in finding.data["archive_url"] so that it parses archive_url with urllib.parse.urlparse, extracts .hostname, and asserts that the hostname is exactly web.archive.org. This preserves the intended functionality (“archive_url should be archive.org URL”) while avoiding arbitrary substring matches. Concretely, within TestWaybackParameters.check, around lines 309–315, we will introduce a local variable such as archive_url_host = urlparse(finding.data["archive_url"]).hostname and assert archive_url_host == "web.archive.org". To do this, we must import urlparse from urllib.parse at the top of the test file, alongside the existing unquote import. No other behavior in the tests needs to change.
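The difference between the substring check and the hostname check can be shown in a short snippet. `attacker_url` below is a made-up URL that embeds `web.archive.org` in its path: the naive substring test accepts it, while the `urlparse(...).hostname` comparison rejects it.

```python
from urllib.parse import urlparse

# Hypothetical example URLs for illustration only.
attacker_url = "https://evil.example/web.archive.org/snapshot"
legit_url = "https://web.archive.org/web/2023/https://example.com"

# The naive substring check passes for BOTH URLs:
assert "web.archive.org" in attacker_url
assert "web.archive.org" in legit_url

# Comparing the parsed hostname only accepts the real archive.org host:
assert urlparse(attacker_url).hostname != "web.archive.org"
assert urlparse(legit_url).hostname == "web.archive.org"
```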
```diff
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse

 from werkzeug.wrappers import Response

@@ -310,8 +310,10 @@
     assert "archive_url" in finding.data, (
         f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
     )
-    assert "web.archive.org" in finding.data["archive_url"], (
-        f"Hunt FINDING archive_url should be archive.org URL, got: {finding.data['archive_url']}"
+    archive_url_host = urlparse(finding.data["archive_url"]).hostname
+    assert archive_url_host == "web.archive.org", (
+        f"Hunt FINDING archive_url should be archive.org URL, got host: {archive_url_host}, "
+        f"full URL: {finding.data['archive_url']}"
     )

     # WEB_PARAMETERs from archived content should also have archive_url
```
bro it's a draft, step off
📊 Performance Benchmark Report
📈 Detailed Results (All Benchmarks)
🎯 Performance Summary: ✅ No significant performance changes detected (all changes <10%)
🐍 Python Version: 3.11.14
- Add max_records option (default 100000) for the CDX API limit
- Only retry archive fetches on connection errors/429, not on definitive HTTP status codes
- Change the "Loading archived URLs" message from hugeinfo to verbose
- Update the retry test to use ReadError instead of 503
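A sketch of the retry policy described in the second bullet, assuming the real code distinguishes connection-level exceptions (such as the ReadError the updated test uses) from definitive HTTP status codes. `should_retry` is an illustrative name, not the project's API, and stdlib exception types stand in for the HTTP client's own:

```python
def should_retry(status=None, exc=None):
    """Decide whether an archive fetch should be retried.

    Retry on transient failures: connection-level errors (the real code
    would also catch its HTTP client's ReadError) and HTTP 429 rate
    limiting. Definitive status codes (404, 403, 500, ...) are not
    retried, since repeating the request will not change the answer.
    """
    if exc is not None:
        return isinstance(exc, (ConnectionError, TimeoutError))
    return status == 429


assert should_retry(exc=ConnectionError())   # transient: retry
assert should_retry(status=429)              # rate limited: retry
assert not should_retry(status=404)          # definitive: give up
assert not should_retry(status=503)          # definitive per the new policy
```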
Paddingoracle fix
TBA