Add Browser processor using WebDriver BiDi by ato · Pull Request #653 · internetarchive/heritrix3

ato · 2025-05-23T11:52:12Z

The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run behaviors that interact with the page such as scrolling down and extracting links.

This needs some more testing and more error handling but is working for small crawls with Firefox and ChromeDriver.

This differs from my previous attempt (ExtractorChrome #403) in a number of ways:

- Uses the new WebDriver BiDi standard instead of the Chrome DevTools Protocol (CDP). The new protocol is mostly browser-agnostic, more consistent and hopefully more stable. It doesn't quite do everything CDP does, but most of the important stuff is there.
- Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs (like request bodies), but in practice I also found the proxy method loads pages faster and more reliably, perhaps because responses can be streamed incrementally, which helps a lot for large resources and server-sent events. It does unfortunately make remote browser support a little more difficult (but it should still be quite doable).
- Builds on the new FetchHTTP2 module (FetchHTTP2: A new fetch module for HTTP/2 and HTTP/3 #649), which does connection pooling, making subrequests a lot faster. The original FetchHTTP opens a new connection for every request, which is quite slow for browsing.
- The Browser processor can be configured with a list of behavior beans, making it extensible.
- The BiDi protocol is described with typed interfaces and records implemented by a Java Proxy. It's mostly synchronous for ease of use, but where useful you can define an asynchronous method by returning CompletableFuture.
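
To illustrate the last point, here is a minimal sketch of how a typed interface might be backed by `java.lang.reflect.Proxy`. It is not the code in this PR: `BiDiProxySketch`, `Transport`, `create` and `toParams` are hypothetical names, the `Transport` interface merely stands in for the WebSocket connection, and results are left as plain maps. Only the command names `browsingContext.navigate` and `browsingContext.close` and their `context`/`url` parameters come from the BiDi specification.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.lang.reflect.RecordComponent;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class BiDiProxySketch {
    // Hypothetical parameter records; component names mirror the BiDi parameters.
    public record Navigate(String context, String url) {}
    public record Close(String context) {}

    /** Hypothetical typed view of the browsingContext module. */
    public interface BrowsingContext {
        // Synchronous: the proxy waits for the reply before returning.
        Map<String, Object> navigate(Navigate params);
        // Asynchronous: returning CompletableFuture skips the wait.
        CompletableFuture<Map<String, Object>> close(Close params);
    }

    /** Stand-in (assumption) for the connection that carries commands to the browser. */
    public interface Transport {
        CompletableFuture<Map<String, Object>> send(String method, Map<String, Object> params);
    }

    /** Creates a proxy that turns each interface call into a "module.method" command. */
    public static <T> T create(Class<T> module, Transport transport) {
        String name = module.getSimpleName();
        String prefix = Character.toLowerCase(name.charAt(0)) + name.substring(1);
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getDeclaringClass() == Object.class) {
                // Not a protocol command; this sketch only supports toString().
                if (method.getName().equals("toString")) return prefix + " proxy";
                throw new UnsupportedOperationException(method.getName());
            }
            CompletableFuture<Map<String, Object>> reply =
                    transport.send(prefix + "." + method.getName(), toParams((Record) args[0]));
            if (method.getReturnType() == CompletableFuture.class) {
                return reply;      // caller decides when to wait
            }
            return reply.join();   // block until the reply arrives
        };
        return module.cast(Proxy.newProxyInstance(
                module.getClassLoader(), new Class<?>[] {module}, handler));
    }

    /** Copies record components into a map keyed by component name. */
    static Map<String, Object> toParams(Record record) throws ReflectiveOperationException {
        Map<String, Object> params = new LinkedHashMap<>();
        for (RecordComponent component : record.getClass().getRecordComponents()) {
            params.put(component.getName(), component.getAccessor().invoke(record));
        }
        return params;
    }

    public static void main(String[] args) {
        // Fake transport that prints each command and echoes its parameters back.
        Transport echo = (method, params) -> {
            System.out.println("send " + method + " " + params);
            return CompletableFuture.completedFuture(params);
        };
        BrowsingContext context = create(BrowsingContext.class, echo);
        context.navigate(new Navigate("ctx-1", "https://example.org/"));
        context.close(new Close("ctx-1")).join();
    }
}
```

Deciding between blocking and non-blocking purely from the declared return type keeps the interfaces readable: most callers get plain method calls, and only the methods that need to overlap with other work pay the cost of handling a future.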

Some obvious areas for future development:

- More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts
- Support for remote WebDrivers (e.g. Selenium Server or cloud browser services)
- UI integration for monitoring the browser
- I just did basic record mapping on top of the json.org library Heritrix is already using, but as Heritrix's JSON needs evolve it might make sense to switch to something like Jackson.
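
For readers unfamiliar with the idea, basic record mapping can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: it uses a plain `Map` as a stand-in for a json.org `JSONObject` so the example stays self-contained, and `RecordMapperSketch`, `fromMap`, `NavigationInfo` and `Event` are hypothetical names.

```java
import java.lang.reflect.RecordComponent;
import java.util.Map;

public class RecordMapperSketch {
    // Hypothetical records loosely shaped like a BiDi event; not taken from the PR.
    public record NavigationInfo(String context, String url) {}
    public record Event(String method, NavigationInfo params) {}

    /**
     * Builds a record by matching map keys to record component names and
     * calling the canonical constructor. Nested maps are mapped recursively
     * when the component type is itself a record. Missing keys become null,
     * so primitive components need a value to be present.
     */
    @SuppressWarnings("unchecked")
    public static <T extends Record> T fromMap(Class<T> type, Map<String, ?> json) {
        RecordComponent[] components = type.getRecordComponents();
        Class<?>[] types = new Class<?>[components.length];
        Object[] values = new Object[components.length];
        for (int i = 0; i < components.length; i++) {
            types[i] = components[i].getType();
            Object value = json.get(components[i].getName());
            if (value instanceof Map<?, ?> nested && types[i].isRecord()) {
                value = fromMap(types[i].asSubclass(Record.class), (Map<String, ?>) nested);
            }
            values[i] = value;
        }
        try {
            // The canonical constructor takes the components in declaration order.
            return type.getDeclaredConstructor(types).newInstance(values);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot map to " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> json = Map.of(
                "method", "browsingContext.load",
                "params", Map.of("context", "ctx-1", "url", "https://example.org/"));
        System.out.println(fromMap(Event.class, json));
    }
}
```

A mapper like this handles strings and nested records but not lists, enums or optional fields, which is the kind of gap a library such as Jackson would close.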

The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run behaviors that interact with the page such as scrolling down and extracting links.

This differs from my previous attempt (ExtractorChrome) in a few ways:

- Uses the new WebDriver BiDi standard instead of the Chrome DevTools Protocol. The new protocol is mostly browser-agnostic, more consistent and hopefully more stable.
- Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs. Even so, in practice I found the proxy method loads pages faster and more reliably, likely because responses can be streamed incrementally, which helps a lot for large resources or server-sent events.
- Even when HTTP/2 is unavailable, the new FetchHTTP2 module does connection pooling, which makes loading browser requests a lot faster. The original FetchHTTP opened a new connection for every request.
- The Browser processor can be configured with a list of behavior beans, making it more customizable and extensible.

Obvious areas for future development:

- More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts
- Support for remote WebDrivers (e.g. Selenium Server or cloud services)


ato added 4 commits June 4, 2025 17:44

Browser: Disable downloads in Firefox and Chrome

5131ea4

This hopefully will stop us filling up ~/Downloads with random junk.

Browser: Handle navigation abort from downloads starting

2210cb7

Browser: Add processor report

1161d87

ato force-pushed the bidi branch from ad3b99e to 1161d87 Compare June 5, 2025 01:48

ato merged commit 52bbd80 into master Jun 9, 2025
7 checks passed

ato deleted the bidi branch June 9, 2025 00:36
