@jhy
Owner

@jhy jhy commented Sep 24, 2025

Added support for using the re2j regular expression engine for CSS selectors, which ensures linear-time performance for regex evaluation. This enables safe handling of arbitrary user-supplied query regexes.

To enable, add the com.google.re2j dependency to your classpath, e.g.:

<dependency>
  <groupId>com.google.re2j</groupId>
  <artifactId>re2j</artifactId>
  <version>1.8</version>
</dependency>

jhy added 2 commits September 24, 2025 16:26
Re2j is a linear-time regex engine. This enables jsoup users to now safely accept arbitrary CSS regex-based queries.
@jhy jhy marked this pull request as ready for review September 24, 2025 07:05
@jhy jhy merged commit f939ccf into master Sep 25, 2025
28 checks passed
@jhy jhy deleted the re2j branch September 25, 2025 04:43
anonyein added a commit to anonyein/jsoup that referenced this pull request Sep 25, 2025
Added option to use re2j for CSS query regexes (jhy#2407)

private final com.google.re2j.Pattern re2jPattern;

private Re2jRegex(com.google.re2j.Pattern re2jPattern) {

Concrete references to com.google.re2j.* classes mean that re2j is effectively non-optional: without it on the classpath, the Android R8 shrinker throws an error.

This was discovered while attempting to adopt the new jsoup version in the AnkiDroid project. After shrinker errors in the past, we implemented "release mode" emulator testing at a non-trivial development cost, and that testing surfaced the problem:

ankidroid/Anki-Android#19985
https://github.com/ankidroid/Anki-Android/actions/runs/20636686551/job/59315193515#step:8:298

An initial attempt to solve the issue simply added the observably non-optional re2j transitive dependency. That is unsatisfactory: we have also implemented an APK size comparison tool to see the impact of such changes, and re2j appears to add approximately 70 kilobytes to our APK size. That is not huge, but it is best avoided for something that, in our use case, will not result in a user-perceivable benefit.

Can this intended-to-be-optional transitive dependency be made actually optional here? Perhaps through reflective usage in the Re2jRegex helper, similar to the "has re2j" detection elsewhere?
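
To make the reflective idea concrete, here is a minimal sketch of what such a helper might look like. It is not jsoup's actual code: the class name OptionalRegex and its methods are hypothetical, and it assumes only the public re2j calls Pattern.compile(String), Pattern.matcher(CharSequence) and Matcher.find(). No com.google.re2j type appears in a signature or field, so the class compiles and loads without re2j present. Whether R8 accepts this without extra keep or dontwarn rules is untested here.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.function.Predicate;

/** Hypothetical sketch of a reflection-only bridge to re2j; not jsoup's Re2jRegex. */
final class OptionalRegex {
    // Referenced only as a string, so there is no compile-time link to re2j.
    private static final String RE2J_PATTERN = "com.google.re2j.Pattern";

    private OptionalRegex() {}

    /** Returns true if the re2j Pattern class can be found on the classpath. */
    static boolean hasRe2j() {
        try {
            Class.forName(RE2J_PATTERN, false, OptionalRegex.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Compiles regex into a "find" predicate: re2j when available, else java.util.regex. */
    static Predicate<String> compile(String regex) {
        if (hasRe2j()) {
            try {
                Class<?> patternClass = Class.forName(RE2J_PATTERN);
                Object pattern = patternClass.getMethod("compile", String.class).invoke(null, regex);
                Method matcher = patternClass.getMethod("matcher", CharSequence.class);
                Method find = matcher.getReturnType().getMethod("find");
                return input -> {
                    try {
                        return (Boolean) find.invoke(matcher.invoke(pattern, input));
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException("re2j call failed", e);
                    }
                };
            } catch (InvocationTargetException e) {
                // re2j rejected the pattern itself; report that rather than silently switching engines.
                throw new IllegalArgumentException("Invalid regex: " + regex, e.getCause());
            } catch (ReflectiveOperationException e) {
                // Unexpected re2j API shape: fall through to the JDK engine below.
            }
        }
        java.util.regex.Pattern jdkPattern = java.util.regex.Pattern.compile(regex);
        return input -> jdkPattern.matcher(input).find();
    }
}
```

With this shape, a caller would use OptionalRegex.compile("\\d+").test("abc123") and get the same answer from either engine for simple patterns. The trade-off is that reflective calls are slower and lose compile-time checking, so a real implementation would probably cache the Method lookups once per class load.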

Thanks!
