The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run behaviors that interact with the page such as scrolling down and extracting links.

This differs from my previous attempt (ExtractorChrome) in a few ways:
- Uses the new WebDriver BiDi standard instead of the Chrome DevTools Protocol. The new protocol is mostly browser-agnostic, more consistent and hopefully more stable.
- Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs. Even so, in practice I found the proxy method loads pages faster and more reliably, likely because responses can be streamed incrementally, which helps a lot for large resources or server-sent events.
- Even when HTTP/2 is unavailable, the new FetchHTTP2 module does connection pooling, which makes loading browser requests a lot faster. The original FetchHTTP opened a new connection for every request.
- The Browser processor can be configured with a list of behavior beans, making it more customizable and extensible.

Obvious areas for future development:
- More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts
- Support for remote WebDrivers (e.g. Selenium Server or cloud services)
This hopefully will stop us filling up ~/Downloads with random junk.
The Browser processor can load a fetched page in a local web browser, record any requests the browser makes and run behaviors that interact with the page such as scrolling down and extracting links.
This needs some more testing and more error handling but is working for small crawls with Firefox and ChromeDriver.
This differs from my previous attempt (ExtractorChrome #403) in a number of ways:
- Uses the new WebDriver BiDi standard instead of the Chrome DevTools Protocol. The new protocol is mostly browser-agnostic, more consistent and hopefully more stable. It doesn't quite do everything CDP does, but most of the important stuff is there.
- Uses a MITM proxy instead of CDP request interception for recording sub-resources. That's partly because BiDi is still missing some key interception APIs (like request bodies), but in practice I also found the proxy method loads pages faster and more reliably, perhaps because responses can be streamed incrementally, which helps a lot for large resources and server-sent events. It does unfortunately make remote browser support a little more difficult (but it should still be quite doable).
- Builds on the new FetchHTTP2 module (FetchHTTP2: A new fetch module for HTTP/2 and HTTP/3, #649), which does connection pooling, making subrequests a lot faster. The original FetchHTTP opens a new connection for every request, which is quite slow for browsing.
- The Browser processor can be configured with a list of behavior beans, making it extensible.
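To illustrate the last point, here is a minimal sketch of how a list of behaviors could be run in order against a loaded page. The `Behavior` interface, the `Page` class and `runAll` are hypothetical names made up for this example, not the actual API in this PR.

```java
import java.util.ArrayList;
import java.util.List;

public class BehaviorChainSketch {
    /** Hypothetical stand-in for a loaded page; a real one would wrap a browser tab. */
    static class Page {
        final String url;
        final List<String> log = new ArrayList<>();

        Page(String url) {
            this.url = url;
        }
    }

    /** Hypothetical behavior contract: one action performed against a loaded page. */
    interface Behavior {
        void run(Page page);
    }

    /** Runs each configured behavior in list order. */
    static void runAll(List<Behavior> behaviors, Page page) {
        for (Behavior behavior : behaviors) {
            behavior.run(page);
        }
    }

    public static void main(String[] args) {
        Page page = new Page("https://example.org/");
        // In a real crawl configuration these would be beans rather than lambdas.
        List<Behavior> behaviors = List.of(
                p -> p.log.add("scroll down"),
                p -> p.log.add("extract links"));
        runAll(behaviors, page);
        System.out.println(page.log);
    }
}
```

Keeping each behavior behind a single-method interface means new ones can be added to the list without touching the processor itself.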
The BiDi protocol is described with typed interfaces and records, implemented by a Java Proxy. It's mostly synchronous for ease of use, but where useful you can define an asynchronous method by returning CompletableFuture.
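As a rough sketch of that idea (not the code in this PR), a `java.lang.reflect.Proxy` can turn calls on a typed interface into protocol commands, blocking unless the method returns `CompletableFuture`. The `BrowsingContext` interface, the `Transport` stand-in, the `Async` suffix convention and the positional parameter list are all assumptions made for illustration; real BiDi commands take named parameters and travel over a WebSocket.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class BiDiProxySketch {
    /** Hypothetical typed view of one protocol module. */
    interface BrowsingContext {
        String navigate(String context, String url);                         // blocks for the reply
        CompletableFuture<String> navigateAsync(String context, String url); // returns immediately
    }

    /** Hypothetical stand-in for the connection: sends a command, completes with the reply. */
    interface Transport {
        CompletableFuture<Object> send(String method, List<Object> params);
    }

    @SuppressWarnings("unchecked")
    static <T> T create(Class<T> module, String prefix, Transport transport) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Handle equals/hashCode/toString locally instead of sending them as commands.
            if (method.getDeclaringClass() == Object.class) {
                if (method.getName().equals("equals")) return proxy == args[0];
                if (method.getName().equals("hashCode")) return System.identityHashCode(proxy);
                return module.getSimpleName() + " proxy";
            }
            boolean async = method.getReturnType() == CompletableFuture.class;
            String name = method.getName();
            if (async && name.endsWith("Async")) {
                name = name.substring(0, name.length() - "Async".length());
            }
            List<Object> params = args == null ? List.of() : Arrays.asList(args);
            CompletableFuture<Object> reply = transport.send(prefix + "." + name, params);
            // Asynchronous methods hand back the future; synchronous ones wait for it.
            return async ? reply : reply.join();
        };
        return (T) Proxy.newProxyInstance(module.getClassLoader(), new Class<?>[]{module}, handler);
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        Transport fake = (method, params) -> {
            sent.add(method + params);
            return CompletableFuture.completedFuture("nav-1");
        };
        BrowsingContext ctx = create(BrowsingContext.class, "browsingContext", fake);
        System.out.println(ctx.navigate("ctx-1", "https://example.org/"));
        System.out.println(ctx.navigateAsync("ctx-1", "https://example.org/").join());
        System.out.println(sent);
    }
}
```

With the fake transport above, both calls send the same `browsingContext.navigate` command; only the calling style differs.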
Some obvious areas for future development:
- More Behavior beans: take screenshots, save the rendered DOM, run Browsertrix-compatible behavior scripts
- Support for remote WebDrivers (e.g. Selenium Server or cloud browser services)
- UI integration for monitoring the browser
I just did basic record mapping on top of the json.org library Heritrix is already using, but as Heritrix's JSON needs evolve it might make sense to switch to something like Jackson.
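For readers unfamiliar with the approach, basic record mapping can be sketched with reflection over a record's components. This is an illustration under assumptions, not the PR's implementation: it uses a plain `Map` in place of a json.org object so it runs without dependencies, `NavigateResult` is a hypothetical type, and it only handles flat records with reference-typed fields.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.RecordComponent;
import java.util.Map;

public class RecordMappingSketch {
    /** Hypothetical result type; component names mirror the JSON keys. */
    record NavigateResult(String navigation, String url) {}

    /** Builds a record by looking up each component name as a key in the parsed JSON. */
    static <T extends Record> T fromJson(Map<String, ?> json, Class<T> type) {
        RecordComponent[] components = type.getRecordComponents();
        Class<?>[] paramTypes = new Class<?>[components.length];
        Object[] values = new Object[components.length];
        for (int i = 0; i < components.length; i++) {
            paramTypes[i] = components[i].getType();
            values[i] = json.get(components[i].getName()); // missing keys become null
        }
        try {
            // The canonical constructor takes the components in declaration order.
            Constructor<T> canonical = type.getDeclaredConstructor(paramTypes);
            canonical.setAccessible(true);
            return canonical.newInstance(values);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map JSON to " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> json = Map.of("navigation", "nav-1", "url", "https://example.org/");
        System.out.println(fromJson(json, NavigateResult.class));
    }
}
```

Nested records, lists and primitives would each need extra handling, which is part of why a library like Jackson may become attractive later.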