-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Added option to use re2j for CSS query regexes #2407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| package org.jsoup.helper; | ||
|
|
||
| /** | ||
| re2j-backed Regex implementation; must only be touched when re2j is on the classpath. | ||
| */ | ||
| final class Re2jRegex extends Regex { | ||
| private static final java.util.regex.Pattern unused = java.util.regex.Pattern.compile(""); | ||
|
|
||
| private final com.google.re2j.Pattern re2jPattern; | ||
|
|
||
| private Re2jRegex(com.google.re2j.Pattern re2jPattern) { | ||
| super(unused); | ||
| this.re2jPattern = re2jPattern; | ||
| } | ||
|
|
||
| public static Regex compile(String regex) { | ||
| try { | ||
| return new Re2jRegex(com.google.re2j.Pattern.compile(regex)); | ||
| } catch (RuntimeException e) { | ||
| throw new ValidationException("Pattern syntax error: " + e.getMessage()); | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public Matcher matcher(CharSequence input) { | ||
| return new Re2jMatcher(re2jPattern.matcher(input)); | ||
| } | ||
|
|
||
| @Override | ||
| public String toString() { | ||
| return re2jPattern.toString(); | ||
| } | ||
|
|
||
| private static final class Re2jMatcher implements Matcher { | ||
| private final com.google.re2j.Matcher delegate; | ||
|
|
||
| Re2jMatcher(com.google.re2j.Matcher delegate) { | ||
| this.delegate = delegate; | ||
| } | ||
|
|
||
| @Override | ||
| public boolean find() { | ||
| return delegate.find(); | ||
| } | ||
| } | ||
| } | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| package org.jsoup.helper; | ||
|
|
||
| import org.jsoup.internal.SharedConstants; | ||
|
|
||
| import java.lang.reflect.InvocationTargetException; | ||
| import java.lang.reflect.Method; | ||
| import java.util.regex.Pattern; | ||
| import java.util.regex.PatternSyntaxException; | ||
|
|
||
| /** | ||
| A regular expression abstraction. Allows jsoup to optionally use the re2j regular expression engine (linear time) | ||
| instead of the JDK's backtracking regex implementation. | ||
|
|
||
| <p>If the {@code com.google.re2j} library is found on the classpath, by default it will be used. You can override this | ||
| by setting {@code -Djsoup.useRe2j=false} to explicitly disable, and use the JDK regex engine.</p> | ||
|
|
||
| <p>(Currently this a simplified implementation for jsoup's specific use; can extend as required.)</p> | ||
| */ | ||
| public class Regex { | ||
| private static final boolean hasRe2j = hasRe2j(); | ||
|
|
||
| private final Pattern jdkPattern; | ||
|
|
||
| Regex(Pattern jdkPattern) { | ||
| this.jdkPattern = jdkPattern; | ||
| } | ||
|
|
||
| /** | ||
| Compile a regex, using re2j if enabled and available; otherwise JDK regex. | ||
|
|
||
| @param regex the regex to compile | ||
| @return the compiled regex | ||
| @throws ValidationException if the regex is invalid | ||
| */ | ||
| public static Regex compile(String regex) { | ||
| if (hasRe2j && wantsRe2j()) { | ||
| return Re2jRegex.compile(regex); | ||
| } | ||
|
|
||
| try { | ||
| return new Regex(Pattern.compile(regex)); | ||
| } catch (PatternSyntaxException e) { | ||
| throw new ValidationException("Pattern syntax error: " + e.getMessage()); | ||
| } | ||
| } | ||
|
|
||
| /** Wraps an existing JDK Pattern (for API compat); doesn't switch */ | ||
| public static Regex fromPattern(Pattern pattern) { | ||
| return new Regex(pattern); | ||
| } | ||
|
|
||
| static boolean wantsRe2j() { | ||
| return Boolean.parseBoolean(System.getProperty(SharedConstants.UseRe2j, "true")); | ||
| } | ||
|
|
||
| static void wantsRe2j(boolean use) { | ||
| System.setProperty(SharedConstants.UseRe2j, Boolean.toString(use)); | ||
| } | ||
|
|
||
| static boolean hasRe2j() { | ||
| try { | ||
| Class<?> re2 = Class.forName("com.google.re2j.Pattern", false, Regex.class.getClassLoader()); // check if re2j is in classpath | ||
| try { | ||
| // if it is, and we are on JVM9+, we need to dork around with modules, because re2j doesn't publish a module name. | ||
| // done via reflection so we can still run on JVM 8. | ||
| // todo remove if re2j publishes as a module | ||
| Class<?> moduleCls = Class.forName("java.lang.Module"); | ||
| Method getModule = Class.class.getMethod("getModule"); | ||
| Object jsoupMod = getModule.invoke(Regex.class); | ||
| Object re2Mod = getModule.invoke(re2); | ||
| boolean reads = (boolean) moduleCls.getMethod("canRead", moduleCls).invoke(jsoupMod, re2Mod); | ||
| if (!reads) moduleCls.getMethod("addReads", moduleCls).invoke(jsoupMod, re2Mod); | ||
| } catch (ClassNotFoundException ignore) { | ||
| // jvm8 - no Module class; so we can use as-is | ||
| } | ||
| return true; | ||
| } catch (ClassNotFoundException e) { | ||
| return false; // no re2j | ||
| } catch (ReflectiveOperationException e) { | ||
| // unexpectedly couldn’t wire modules on 9+; return false to avoid IllegalAccessError later | ||
| System.err.println("Warning: (bug? please report) couldn't access re2j from jsoup due to modules: " + e); | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| public Matcher matcher(CharSequence input) { | ||
| return new JdkMatcher(jdkPattern.matcher(input)); | ||
| } | ||
|
|
||
| @Override | ||
| public String toString() { | ||
| return jdkPattern.toString(); | ||
| } | ||
|
|
||
| public interface Matcher { | ||
| boolean find(); | ||
| } | ||
|
|
||
| private static final class JdkMatcher implements Matcher { | ||
| private final java.util.regex.Matcher delegate; | ||
|
|
||
| JdkMatcher(java.util.regex.Matcher delegate) { | ||
| this.delegate = delegate; | ||
| } | ||
|
|
||
| @Override | ||
| public boolean find() { | ||
| return delegate.find(); | ||
| } | ||
| } | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concrete references to com.google.re2j.* classes means re2j is non-optional or the android r8 shrinker throws an error
This was discovered while attempting to adopt the new jsoup version in the AnkiDroid project, where we have at a non-trivial development cost implemented "release mode" emulator testing after having shrinker errors in the past. It surfaced the problem
ankidroid/Anki-Android#19985
https://github.com/ankidroid/Anki-Android/actions/runs/20636686551/job/59315193515#step:8:298
An initial attempt to solve the issue was made by simply adding the observationally-non-optional re2j transitive but that is unsatisfactory as we have also implemented an APK size comparison tool to see the impact of such changes, and re2j appears to add approximately 70 kilobytes to our APK size. That is not huge but is best avoided for something that is in our use case not going to result in a user-perceivable benefit
Can this intended-to-be-optional transitive be made actually optional here? Perhaps through reflective usage in the Re2jRegex helper similar to the "has re2j" detection elsewhere?
Thanks!