Skip to content

Commit 8f5d947

Browse files
committed
automata: make behavior more consistent when WhichCaptures::None is used
Specifically, when used with `meta::Regex`. Before this PR, if callers built a `meta::Regex` with `WhichCaptures::None`, then it was possible for `find` to _sometimes_ return `Some` and _sometimes_ return `None`, just based on the sequence of previous search calls. In particular, when `WhichCaptures::None` is used, some regex engines (like the `PikeVM`) cannot report match offsets while some (like the lazy DFA) can. This meant that if the meta regex engine _happened_ to select the lazy DFA for a `Regex::find` call, then it would return `Some`. But if it _happened_ to select the Pike VM, then it would return `None`. Since engine selection can be influenced by the haystack itself, this leads to the behavior of `find` being tied to the contents of the haystack. Instead, what we should do is make it so anything that returns match offsets on a `meta::Regex` will always return `None` when `WhichCaptures::None` is used, _even_ if `Regex::is_match` returns `true`. (Yes, this is a weird option and it's crazy that `Regex::is_match` can return `true` while `Regex::find` can return `None`. This was already true before this PR and is a result of a very low level option that optimizes for memory usage in specific circumstances. This sort of whacky behavior can't be observed in the `regex` crate API. Only in `regex-automata`.)
1 parent 977feeb commit 8f5d947

File tree

2 files changed

+86
-7
lines changed

2 files changed

+86
-7
lines changed

regex-automata/src/meta/regex.rs

Lines changed: 64 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -917,7 +917,9 @@ impl Regex {
917917
/// ```
918918
#[inline]
919919
pub fn search(&self, input: &Input<'_>) -> Option<Match> {
920-
if self.imp.info.is_impossible(input) {
920+
if self.imp.info.captures_disabled()
921+
|| self.imp.info.is_impossible(input)
922+
{
921923
return None;
922924
}
923925
let mut guard = self.pool.get();
@@ -973,7 +975,9 @@ impl Regex {
973975
/// ```
974976
#[inline]
975977
pub fn search_half(&self, input: &Input<'_>) -> Option<HalfMatch> {
976-
if self.imp.info.is_impossible(input) {
978+
if self.imp.info.captures_disabled()
979+
|| self.imp.info.is_impossible(input)
980+
{
977981
return None;
978982
}
979983
let mut guard = self.pool.get();
@@ -1128,7 +1132,9 @@ impl Regex {
11281132
input: &Input<'_>,
11291133
slots: &mut [Option<NonMaxUsize>],
11301134
) -> Option<PatternID> {
1131-
if self.imp.info.is_impossible(input) {
1135+
if self.imp.info.captures_disabled()
1136+
|| self.imp.info.is_impossible(input)
1137+
{
11321138
return None;
11331139
}
11341140
let mut guard = self.pool.get();
@@ -1242,7 +1248,9 @@ impl Regex {
12421248
cache: &mut Cache,
12431249
input: &Input<'_>,
12441250
) -> Option<Match> {
1245-
if self.imp.info.is_impossible(input) {
1251+
if self.imp.info.captures_disabled()
1252+
|| self.imp.info.is_impossible(input)
1253+
{
12461254
return None;
12471255
}
12481256
self.imp.strat.search(cache, input)
@@ -1284,7 +1292,9 @@ impl Regex {
12841292
cache: &mut Cache,
12851293
input: &Input<'_>,
12861294
) -> Option<HalfMatch> {
1287-
if self.imp.info.is_impossible(input) {
1295+
if self.imp.info.captures_disabled()
1296+
|| self.imp.info.is_impossible(input)
1297+
{
12881298
return None;
12891299
}
12901300
self.imp.strat.search_half(cache, input)
@@ -1437,7 +1447,9 @@ impl Regex {
14371447
input: &Input<'_>,
14381448
slots: &mut [Option<NonMaxUsize>],
14391449
) -> Option<PatternID> {
1440-
if self.imp.info.is_impossible(input) {
1450+
if self.imp.info.captures_disabled()
1451+
|| self.imp.info.is_impossible(input)
1452+
{
14411453
return None;
14421454
}
14431455
self.imp.strat.search_slots(cache, input, slots)
@@ -1982,6 +1994,19 @@ impl RegexInfo {
19821994
self.props_union().look_set_suffix().contains(Look::End)
19831995
}
19841996

1997+
/// Returns true when the regex's NFA lacks capture states.
1998+
///
1999+
/// In this case, some regex engines (like the PikeVM) are unable to report
2000+
/// match offsets, while some (like the lazy DFA can). To avoid whether a
2001+
/// match or not is reported based on engine selection, routines that
2002+
/// return match offsets will _always_ report `None` when this is true.
2003+
///
2004+
/// Yes, this is a weird case and it's a little fucked up. But
2005+
/// `WhichCaptures::None` comes with an appropriate warning.
2006+
fn captures_disabled(&self) -> bool {
2007+
matches!(self.config().get_which_captures(), WhichCaptures::None)
2008+
}
2009+
19852010
/// Returns true if and only if it is known that a match is impossible
19862011
/// for the given input. This is useful for short-circuiting and avoiding
19872012
/// running the regex engine if it's known no match can be reported.
@@ -2645,7 +2670,12 @@ impl Config {
26452670
/// Setting this to `WhichCaptures::None` is usually not the right thing to
26462671
/// do. When no capture states are compiled, some regex engines (such as
26472672
/// the `PikeVM`) won't be able to report match offsets. This will manifest
2648-
/// as no match being found.
2673+
/// as no match being found. Indeed, in order to enforce consistent
2674+
/// behavior, the meta regex engine will always report `None` for routines
2675+
/// that return match offsets even if one of its regex engines could
2676+
/// service the request. This avoids "match or not" behavior from being
2677+
/// influenced by user input (since user input can influence the selection
2678+
/// of the regex engine).
26492679
///
26502680
/// # Example
26512681
///
@@ -2694,6 +2724,33 @@ impl Config {
26942724
///
26952725
/// Ok::<(), Box<dyn std::error::Error>>(())
26962726
/// ```
2727+
///
2728+
/// # Example: strange `Regex::find` behavior
2729+
///
2730+
/// As noted above, when using [`WhichCaptures::None`], this means that
2731+
/// `Regex::is_match` could return `true` while `Regex::find` returns
2732+
/// `None`:
2733+
///
2734+
/// ```
2735+
/// use regex_automata::{
2736+
/// meta::Regex,
2737+
/// nfa::thompson::WhichCaptures,
2738+
/// Input,
2739+
/// Match,
2740+
/// Span,
2741+
/// };
2742+
///
2743+
/// let re = Regex::builder()
2744+
/// .configure(Regex::config().which_captures(WhichCaptures::None))
2745+
/// .build(r"foo([0-9]+)bar")?;
2746+
/// let hay = "foo123bar";
2747+
///
2748+
/// assert!(re.is_match(hay));
2749+
/// assert_eq!(re.find(hay), None);
2750+
/// assert_eq!(re.search_half(&Input::new(hay)), None);
2751+
///
2752+
/// Ok::<(), Box<dyn std::error::Error>>(())
2753+
/// ```
26972754
pub fn which_captures(mut self, which_captures: WhichCaptures) -> Config {
26982755
self.which_captures = Some(which_captures);
26992756
self

regex-automata/src/nfa/thompson/compiler.rs

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -563,6 +563,28 @@ pub enum WhichCaptures {
563563
/// This is useful when capture states are either not needed (for example,
564564
/// if one is only trying to build a DFA) or if they aren't supported (for
565565
/// example, a reverse NFA).
566+
///
567+
/// # Warning
568+
///
569+
/// Callers must be exceedingly careful when using this
570+
/// option. In particular, not all regex engines support
571+
/// reporting match spans when using this option (for example,
572+
/// [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM) or
573+
/// [`BoundedBacktracker`](crate::nfa::thompson::backtrack::BoundedBacktracker)).
574+
///
575+
/// Perhaps more confusingly, using this option with such an
576+
/// engine means that an `is_match` routine could report `true`
577+
/// when `find` reports `None`. This is generally not something
578+
/// that _should_ happen, but the low level control provided by
579+
/// this crate makes it possible.
580+
///
581+
/// Similarly, any regex engines (like [`meta::Regex`](crate::meta::Regex))
582+
/// should always return `None` from `find` routines when this option is
583+
/// used, even if _some_ of its internal engines could find the match
584+
/// boundaries. This is because inputs from user data could influence
585+
/// engine selection, and thus influence whether a match is found or not.
586+
/// Indeed, `meta::Regex::find` will always return `None` when configured
587+
/// with this option.
566588
None,
567589
}
568590

0 commit comments

Comments
 (0)