Conversation
Include the actual failure reason (timeout, connection error, HTTP status code) in retry and warning messages so it is clear why archive.org requests failed. Increase the CDX timeout from 30s to 60s.
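A minimal sketch of how the failure reason might be surfaced in those retry/warning messages. `retry_reason` and the message format are illustrative, not the project's actual helpers, and the real code presumably keys off its HTTP client's exception types rather than the generic built-ins used here:

```python
def retry_reason(exc=None, status=None):
    """Build a human-readable reason for a failed archive.org request.

    Hypothetical helper: maps an exception or HTTP status code to the
    short reason string included in retry/warning log messages.
    """
    if exc is not None:
        if isinstance(exc, TimeoutError):
            return "timeout"
        if isinstance(exc, ConnectionError):
            return "connection error"
        return type(exc).__name__
    if status is not None:
        return f"HTTP {status}"
    return "unknown error"


print(f"Retrying CDX request: {retry_reason(exc=TimeoutError())}")  # Retrying CDX request: timeout
print(f"Retrying CDX request: {retry_reason(status=429)}")          # Retrying CDX request: HTTP 429
```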
```python
assert "archive_url" in finding.data, (
    f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
)
assert "web.archive.org" in finding.data["archive_url"], (
```
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High test
Copilot Autofix
AI 10 days ago
In general, the way to fix incomplete URL substring sanitization is to parse the URL using a standard library, extract the hostname, and then compare that hostname (or a suffix of it) to the expected allowed host, instead of checking for a substring in the raw URL string.
In this specific case, we should change the assertion that currently does assert "web.archive.org" in finding.data["archive_url"] so that it parses archive_url with urllib.parse.urlparse, extracts .hostname, and asserts that the hostname is exactly web.archive.org. This preserves the intended functionality (“archive_url should be archive.org URL”) while avoiding arbitrary substring matches. Concretely, within TestWaybackParameters.check, around lines 309–315, we will introduce a local variable such as archive_url_host = urlparse(finding.data["archive_url"]).hostname and assert archive_url_host == "web.archive.org". To do this, we must import urlparse from urllib.parse at the top of the test file, alongside the existing unquote import. No other behavior in the tests needs to change.
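The difference between the substring check and the hostname check can be shown in a short snippet. `attacker_url` below is a made-up URL that embeds `web.archive.org` in its path: the naive substring test accepts it, while the `urlparse(...).hostname` comparison rejects it.

```python
from urllib.parse import urlparse

# Hypothetical example URLs for illustration only.
attacker_url = "https://evil.example/web.archive.org/snapshot"
legit_url = "https://web.archive.org/web/2023/https://example.com"

# The naive substring check passes for BOTH URLs:
assert "web.archive.org" in attacker_url
assert "web.archive.org" in legit_url

# Comparing the parsed hostname only accepts the real archive.org host:
assert urlparse(attacker_url).hostname != "web.archive.org"
assert urlparse(legit_url).hostname == "web.archive.org"
```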
```diff
@@ -1,5 +1,5 @@
 import re
-from urllib.parse import unquote
+from urllib.parse import unquote, urlparse

 from werkzeug.wrappers import Response

@@ -310,8 +310,10 @@
     assert "archive_url" in finding.data, (
         f"Hunt FINDING should have archive_url for provenance, got: {finding.data}"
     )
-    assert "web.archive.org" in finding.data["archive_url"], (
-        f"Hunt FINDING archive_url should be archive.org URL, got: {finding.data['archive_url']}"
+    archive_url_host = urlparse(finding.data["archive_url"]).hostname
+    assert archive_url_host == "web.archive.org", (
+        f"Hunt FINDING archive_url should be archive.org URL, got host: {archive_url_host}, "
+        f"full URL: {finding.data['archive_url']}"
     )

     # WEB_PARAMETERs from archived content should also have archive_url
```
bro it's a draft, step off
📊 Performance Benchmark Report
📈 Detailed Results (All Benchmarks)
🎯 Performance Summary: ✅ No significant performance changes detected (all changes <10%)
🐍 Python Version: 3.11.14
- Add max_records option (default 100000) for the CDX API limit
- Only retry archive fetches on connection errors/429, not on definitive HTTP status codes
- Change the "Loading archived URLs" message from hugeinfo to verbose
- Update the retry test to use ReadError instead of 503
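A sketch of the retry policy described in the second bullet, assuming the real code distinguishes connection-level exceptions (such as the ReadError the updated test uses) from definitive HTTP status codes. `should_retry` is an illustrative name, not the project's API, and stdlib exception types stand in for the HTTP client's own:

```python
def should_retry(status=None, exc=None):
    """Decide whether an archive fetch should be retried.

    Retry on transient failures: connection-level errors (the real code
    would also catch its HTTP client's ReadError) and HTTP 429 rate
    limiting. Definitive status codes (404, 403, 500, ...) are not
    retried, since repeating the request will not change the answer.
    """
    if exc is not None:
        return isinstance(exc, (ConnectionError, TimeoutError))
    return status == 429


assert should_retry(exc=ConnectionError())   # transient: retry
assert should_retry(status=429)              # rate limited: retry
assert not should_retry(status=404)          # definitive: give up
assert not should_retry(status=503)          # definitive per the new policy
```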
Paddingoracle fix
TBA