Add Browser processor using WebDriver BiDi #653

Merged
ato merged 4 commits into master from bidi
Jun 9, 2025

ato commented May 23, 2025

The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run behaviors that interact with the page such as scrolling down and extracting links.

This needs more testing and error handling, but it is working for small crawls with Firefox and ChromeDriver.

This differs from my previous attempt (ExtractorChrome #403) in a number of ways:

  • Uses the new WebDriver BiDi standard instead of the Chrome DevTools Protocol (CDP). The new protocol is mostly browser-agnostic, more consistent and hopefully more stable. It doesn't quite do everything CDP does, but most of the important stuff is there.

  • Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs (like request bodies), but in practice I also found that the proxy method loads pages faster and more reliably, perhaps because responses can be streamed incrementally, which helps a lot for large resources and server-sent events. It does unfortunately make remote browser support a little more difficult (but it should still be quite doable).

  • Builds on the new FetchHTTP2 module (FetchHTTP2: A new fetch module for HTTP/2 and HTTP/3 #649), which does connection pooling, making subrequests a lot faster. The original FetchHTTP opens a new connection for every request, which is quite slow for browsing.

  • The Browser processor can be configured with a list of behavior beans making it extensible.

  • The BiDi protocol is described with typed interfaces and records implemented by a Java Proxy. It's mostly synchronous for ease of use, but where useful you can define an asynchronous method by returning CompletableFuture.
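To make the last point more concrete, here is a minimal sketch of how a typed interface could be backed by `java.lang.reflect.Proxy`, with the return type deciding whether a call blocks. It is not the code from this PR: `BiDiProxySketch`, `Transport`, `create` and the shape of `NavigateResult` are hypothetical names invented for illustration, the rule that maps a method to a command name is an assumption, and JSON encoding and decoding are left out entirely.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class BiDiProxySketch {
    /** Hypothetical result record; the real protocol types may differ. */
    record NavigateResult(String url) {}

    /** Hypothetical typed description of one protocol module. */
    interface BrowsingContext {
        // Plain return type: the proxy blocks until the reply arrives.
        NavigateResult navigate(String context, String url);
        // CompletableFuture return type: the proxy returns immediately.
        CompletableFuture<NavigateResult> reload(String context);
    }

    /** Stand-in for the connection; a real one would serialize to JSON and send it. */
    interface Transport {
        CompletableFuture<Object> send(String command, Object[] params);
    }

    /** Creates a dynamic proxy that turns each interface method into a command. */
    static <T> T create(Class<T> module, Transport transport) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getDeclaringClass() == Object.class) {
                // toString/hashCode/equals also reach the handler, so answer them locally.
                return switch (method.getName()) {
                    case "toString" -> module.getSimpleName() + " proxy";
                    case "hashCode" -> System.identityHashCode(proxy);
                    default -> proxy == args[0]; // equals
                };
            }
            // Assumed naming rule: BrowsingContext.navigate -> "browsingContext.navigate"
            String name = module.getSimpleName();
            String command = Character.toLowerCase(name.charAt(0)) + name.substring(1)
                    + "." + method.getName();
            CompletableFuture<Object> reply = transport.send(command, args);
            if (method.getReturnType() == CompletableFuture.class) {
                return reply;       // asynchronous method: hand back the future
            }
            return reply.join();    // synchronous method: wait for the result
        };
        return module.cast(Proxy.newProxyInstance(
                module.getClassLoader(), new Class<?>[] {module}, handler));
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        // Fake transport that records the command and replies immediately.
        Transport fake = (command, params) -> {
            sent.add(command);
            return CompletableFuture.completedFuture(new NavigateResult("https://example.org/"));
        };
        BrowsingContext ctx = create(BrowsingContext.class, fake);
        System.out.println(ctx.navigate("ctx-1", "https://example.org/"));
        System.out.println(ctx.reload("ctx-1").join());
        System.out.println(sent);
    }
}
```

The appeal of this pattern is that adding a protocol command only means adding a method to an interface, and callers choose blocking or non-blocking behavior through the declared return type.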

Some obvious areas for future development:

  • More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts

  • Support for remote WebDrivers (e.g. Selenium Server or cloud browser services)

  • UI integration for monitoring the browser

  • I just did basic record mapping on top of the json.org library Heritrix is already using, but as Heritrix's JSON needs evolve it might make sense to switch to something like Jackson.

ato added 4 commits June 4, 2025 17:44
The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run behaviors that interact with the page such as scrolling down and extracting links.

This differs from my previous attempt (ExtractorChrome) in a few ways:

- Uses the new WebDriver BiDi standard instead of the Chrome DevTools Protocol (CDP). The new protocol is mostly browser-agnostic, more consistent and hopefully more stable.

- Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs. Even so, in practice I found the proxy method loads pages faster and more reliably, likely because responses can be streamed incrementally, which helps a lot for large resources or server-sent events.

- Even when HTTP/2 is unavailable, the new FetchHTTP2 module does connection pooling which makes loading browser requests a lot faster. The original FetchHTTP opened a new connection for every request.

- The Browser processor can be configured with a list of behavior beans making it more customizable and extensible.
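As a rough illustration of the behavior-list idea, the sketch below runs a configured list of behaviors in order against a loaded page. Every name in it (`BehaviorSketch`, `Page`, `Behavior`, `ScrollDown`, `ExtractLinks`, `FakePage`, `runBehaviors`) is hypothetical and chosen for this example only; the actual interfaces in the PR may look quite different, and in Heritrix the list would presumably come from the crawl's bean configuration rather than from code.

```java
import java.util.ArrayList;
import java.util.List;

public class BehaviorSketch {
    /** Hypothetical view of a loaded page; a real one would talk to the browser. */
    interface Page {
        void scrollToBottom();
        List<String> links();
    }

    /** Hypothetical behavior contract: one unit of interaction with the page. */
    interface Behavior {
        void run(Page page, List<String> discovered);
    }

    /** Scrolls so that lazily loaded content has a chance to appear. */
    static class ScrollDown implements Behavior {
        public void run(Page page, List<String> discovered) {
            page.scrollToBottom();
        }
    }

    /** Collects whatever links the page currently exposes. */
    static class ExtractLinks implements Behavior {
        public void run(Page page, List<String> discovered) {
            discovered.addAll(page.links());
        }
    }

    /** In-memory page used only for this demo: one extra link shows up after scrolling. */
    static class FakePage implements Page {
        private boolean scrolled;
        public void scrollToBottom() { scrolled = true; }
        public List<String> links() {
            return scrolled ? List.of("/a", "/lazy") : List.of("/a");
        }
    }

    /** Runs each configured behavior in order and returns the links they found. */
    static List<String> runBehaviors(List<Behavior> behaviors, Page page) {
        List<String> discovered = new ArrayList<>();
        for (Behavior behavior : behaviors) {
            behavior.run(page, discovered);
        }
        return discovered;
    }

    public static void main(String[] args) {
        List<Behavior> configured = List.of(new ScrollDown(), new ExtractLinks());
        System.out.println(runBehaviors(configured, new FakePage()));
    }
}
```

Because the order of the list matters here (scrolling before extraction finds the lazily loaded link), a list of beans gives the operator control over both which behaviors run and in what sequence.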

Obvious areas for future development:

- More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts

- Support for remote WebDrivers (e.g. Selenium Server or cloud services)

This hopefully will stop us filling up ~/Downloads with random junk.
ato merged commit 52bbd80 into master Jun 9, 2025
7 checks passed
ato deleted the bidi branch June 9, 2025 00:36