Commit 55daf8e

RobinMalfait, adamwathan, and thecrypticace authored
Ensure the oxide parser has feature parity with the stable RegEx parser (#11389)
* WIP
* use `parse` instead of `defaultExtractor`
* skip `Vue` describe block
* add a few more dedicated arbitrary values/properties tests
* use parallel parsing
* split up Vue tests
* add some Rust specific tests
* setup parse candidate strings test system

  These tests will run against the `Regex` and `Rust` based parsers. We have groups of classes of various shapes and forms + variants, rendered in various template situations (plain, html, Vue, ...)

  + enable all skipped tests
* ensure we also validate the classes with variants

  The classes with variants are built in the `templateTable` function, so we get them out again by using the optional arguments of the `test.each` cb function.
* cleanup test suite
* add "anti-test" tests

  To make sure that we are _not_ parsing out certain values given a certain input.
* Add ParseAction enum
* Restart parsing following an arbitrary parse failure
* Split variants off before validating the utility part
* Collapse candidate from the end when validation fails
* Support `<`, and `>` in variant position
* fix error
* format parser.rs
* Refactor
* Update editorconfig
* wip
* wip
* Refactor
* Refactor
* Simplify
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* run `cargo clippy --fix`
* run `cargo fmt`
* implement `cargo clippy` suggestions

  These were not applied using `cargo clippy --fix`
* only allow `.` in the candidate part when surrounded by 0-9

  This is only in the candidate part, not the arbitrary part.
* % characters can only appear at the end after digits
* > and < should only be part of variants (start OR end)

  It can technically be inside the candidate when we have stacked variants:

  ```
  dark:<sm:underline
  dark:md>:underline
  ```
* handle parsing utilities within quotes, parens or brackets
* mark `pt-1.5` as an expected value sliced out from `["pt-1.5"]`
* Add cursor abstraction
* wip
* disable the oxideParser if using a custom `prefix` or `separator`
* update tests
* Use cursor abstraction
* Refactor more code toward use of global cursor
* wip
* simplify
* Simplify
* Simplify
* Simplify
* Cleanup
* wip
* Simplify
* wip
* Simplify
* Handle candidates ending with % sign
* Tweak code a bit
* fmt
* Simplify
* Add cursor details to trace
* cargo fmt
* use preferred `zoom-0.5` name instead of `zoom-.5`
* drop over-extracted utilities in oxide parser

  The RegEx parser does extract `underline` from

  ```html
  <div class="peer-aria-[labelledby='a_b']:underline"></div>
  ```

  ... but that's not needed and is not happening in the oxide parser.

  This means that we have to make the output check a little bit different, but the checks are explicit based on the feature flag.
* allow extracting variants+utilities inside `{}` for the oxide parser
* characters in candidates such as `group-${id}` should not be allowed
* do not extract any of the following candidate `w-[foo-bar]w-[bar-baz]`
* ensure we can consume the full candidate and discard it
* Add fast skipping of whitespace
* Use fast skipping whenever possible
* Add fast skipping to benchmark
* Hand-tune to generate more optimized assembly
* Move code around a bit

  This makes sure all the fancy SIMD stuff is as early as possible. This results in an extremely minor perf increase.
* Undo tweak

  no meaningful perf difference in real world scenarios
* Disable fast skipping for now

  It needs to be done in a different spot so it doesn’t affect how things are returned
* Change test names
* Fix normalize config error
* cleanup a bit
* Cleanup
* Extract validation result enum
* Cleanup comments
* Simplify
* Fix formatting
* Run clippy
* wip
* add `md>` under the special characters test set

---------

Co-authored-by: Adam Wathan <[email protected]>
Co-authored-by: Jordan Pittman <[email protected]>
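Two of the character rules named in the commit message (`.` only between digits, `%` only at the end after digits) can be illustrated with a minimal sketch. The helper names below are hypothetical and this is not the parser code from the commit, which is not shown on this page; it only restates the two rules as standalone checks over the candidate part.

```rust
/// Hypothetical helper: returns true when every `.` in the candidate part
/// has an ASCII digit directly before and directly after it.
fn dots_are_between_digits(candidate: &[u8]) -> bool {
    candidate.iter().enumerate().all(|(i, &byte)| {
        if byte != b'.' {
            return true;
        }
        let prev_is_digit = i > 0 && candidate[i - 1].is_ascii_digit();
        let next_is_digit = candidate
            .get(i + 1)
            .map_or(false, |next| next.is_ascii_digit());
        prev_is_digit && next_is_digit
    })
}

/// Hypothetical helper: returns true when the candidate has no `%`, or when
/// its first `%` is the final byte and directly follows an ASCII digit.
fn percent_is_trailing_after_digit(candidate: &[u8]) -> bool {
    match candidate.iter().position(|&byte| byte == b'%') {
        None => true,
        Some(i) => i + 1 == candidate.len() && i > 0 && candidate[i - 1].is_ascii_digit(),
    }
}
```

Under these assumptions `pt-1.5` passes the dot check while `zoom-.5` does not, which matches the message's preference for `zoom-0.5`.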
1 parent e572dc6 commit 55daf8e

12 files changed

+1379
-217
lines changed

.editorconfig

Lines changed: 8 additions & 0 deletions
@@ -7,3 +7,11 @@ end_of_line = lf
 charset = utf-8
 trim_trailing_whitespace = true
 insert_final_newline = true
+
+[*.rs]
+indent_style = space
+indent_size = 4
+end_of_line = lf
+charset = utf-8
+trim_trailing_whitespace = true
+insert_final_newline = true

oxide/crates/core/benches/parse_candidates.rs

Lines changed: 12 additions & 0 deletions
@@ -31,6 +31,18 @@ pub fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("parse_candidate_strings (real world)", |b| {
         b.iter(|| parse(include_bytes!("./fixtures/template-499.html")))
     });
+
+    let mut group = c.benchmark_group("sample-size-example");
+    group.sample_size(10);
+
+    group.bench_function("parse_candidate_strings (fast space skipping)", |b| {
+        let count = 10_000;
+        let crazy1 = format!("{}underline", " ".repeat(count));
+        let crazy2 = crazy1.repeat(count);
+        let crazy3 = crazy2.as_bytes();
+
+        b.iter(|| parse(black_box(crazy3)))
+    });
 }

 criterion_group!(benches, criterion_benchmark);
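To get a feel for the size of the input the new benchmark feeds to `parse`, here is a small sketch that rebuilds the same string shape with a configurable count. The function names are hypothetical and not part of the commit; only the construction (`count` spaces followed by `underline`, repeated `count` times) is taken from the diff.

```rust
/// Hypothetical helper: `count` spaces followed by `underline`,
/// repeated `count` times, mirroring `crazy1` and `crazy2` in the benchmark.
fn build_whitespace_heavy_input(count: usize) -> String {
    let unit = format!("{}underline", " ".repeat(count));
    unit.repeat(count)
}

/// Byte length of that input, computed without allocating it.
fn whitespace_heavy_input_len(count: usize) -> usize {
    (count + "underline".len()) * count
}
```

With `count = 10_000` the length works out to 100,090,000 bytes, roughly 100 MB, which is plausibly why the group lowers the sample size to 10.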

oxide/crates/core/src/cursor.rs

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
+use std::{ascii::escape_default, fmt::Display};
+
+#[derive(Debug, Clone)]
+pub struct Cursor<'a> {
+    // The input we're scanning
+    pub input: &'a [u8],
+
+    // The location of the cursor in the input
+    pub pos: usize,
+
+    /// Is the cursor at the start of the input
+    pub at_start: bool,
+
+    /// Is the cursor at the end of the input
+    pub at_end: bool,
+
+    /// The previously consumed character
+    /// If `at_start` is true, this will be NUL
+    pub prev: u8,
+
+    /// The current character
+    pub curr: u8,
+
+    /// The upcoming character (if any)
+    /// If `at_end` is true, this will be NUL
+    pub next: u8,
+}
+
+impl<'a> Cursor<'a> {
+    pub fn new(input: &'a [u8]) -> Self {
+        let mut cursor = Self {
+            input,
+            pos: 0,
+            at_start: true,
+            at_end: false,
+            prev: 0x00,
+            curr: 0x00,
+            next: 0x00,
+        };
+        cursor.move_to(0);
+        cursor
+    }
+
+    pub fn rewind_by(&mut self, amount: usize) {
+        self.move_to(self.pos.saturating_sub(amount));
+    }
+
+    pub fn advance_by(&mut self, amount: usize) {
+        self.move_to(self.pos.saturating_add(amount));
+    }
+
+    pub fn move_to(&mut self, pos: usize) {
+        let len = self.input.len();
+        let pos = pos.clamp(0, len);
+
+        self.pos = pos;
+        self.at_start = pos == 0;
+        self.at_end = pos + 1 >= len;
+
+        self.prev = if pos > 0 { self.input[pos - 1] } else { 0x00 };
+        self.curr = if pos < len { self.input[pos] } else { 0x00 };
+        self.next = if pos + 1 < len {
+            self.input[pos + 1]
+        } else {
+            0x00
+        };
+    }
+}
+
+impl<'a> Display for Cursor<'a> {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        let len = self.input.len().to_string();
+
+        let pos = format!("{: >len_count$}", self.pos, len_count = len.len());
+        write!(f, "{}/{} ", pos, len)?;
+
+        if self.at_start {
+            write!(f, "S ")?;
+        } else if self.at_end {
+            write!(f, "E ")?;
+        } else {
+            write!(f, "M ")?;
+        }
+
+        fn to_str(c: u8) -> String {
+            if c == 0x00 {
+                "NUL".into()
+            } else {
+                format!("{:?}", escape_default(c).to_string())
+            }
+        }
+
+        write!(
+            f,
+            "[{} {} {}]",
+            to_str(self.prev),
+            to_str(self.curr),
+            to_str(self.next)
+        )
+    }
+}
+
+#[cfg(test)]
+mod test {
+    use super::*;
+
+    #[test]
+    fn test_cursor() {
+        let mut cursor = Cursor::new(b"hello world");
+        assert_eq!(cursor.pos, 0);
+        assert!(cursor.at_start);
+        assert!(!cursor.at_end);
+        assert_eq!(cursor.prev, 0x00);
+        assert_eq!(cursor.curr, b'h');
+        assert_eq!(cursor.next, b'e');
+
+        cursor.advance_by(1);
+        assert_eq!(cursor.pos, 1);
+        assert!(!cursor.at_start);
+        assert!(!cursor.at_end);
+        assert_eq!(cursor.prev, b'h');
+        assert_eq!(cursor.curr, b'e');
+        assert_eq!(cursor.next, b'l');
+
+        // Advancing too far should stop at the end
+        cursor.advance_by(10);
+        assert_eq!(cursor.pos, 11);
+        assert!(!cursor.at_start);
+        assert!(cursor.at_end);
+        assert_eq!(cursor.prev, b'd');
+        assert_eq!(cursor.curr, 0x00);
+        assert_eq!(cursor.next, 0x00);
+
+        // Can't advance past the end
+        cursor.advance_by(1);
+        assert_eq!(cursor.pos, 11);
+        assert!(!cursor.at_start);
+        assert!(cursor.at_end);
+        assert_eq!(cursor.prev, b'd');
+        assert_eq!(cursor.curr, 0x00);
+        assert_eq!(cursor.next, 0x00);
+
+        cursor.rewind_by(1);
+        assert_eq!(cursor.pos, 10);
+        assert!(!cursor.at_start);
+        assert!(cursor.at_end);
+        assert_eq!(cursor.prev, b'l');
+        assert_eq!(cursor.curr, b'd');
+        assert_eq!(cursor.next, 0x00);
+
+        cursor.rewind_by(10);
+        assert_eq!(cursor.pos, 0);
+        assert!(cursor.at_start);
+        assert!(!cursor.at_end);
+        assert_eq!(cursor.prev, 0x00);
+        assert_eq!(cursor.curr, b'h');
+        assert_eq!(cursor.next, b'e');
+    }
+}
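The value of a cursor that precomputes `prev`, `curr`, and `next` is that scanning code can look one byte in either direction without bounds checks of its own. The sketch below shows that pattern with a trimmed-down stand-in, `MiniCursor`, and a hypothetical `word_spans` scanner. Neither is part of the commit, and the example assumes the input contains no NUL bytes, since NUL doubles as the "no neighbour" marker.

```rust
/// Simplified stand-in for the `Cursor` in the diff (hypothetical, reduced
/// to the fields this example needs).
struct MiniCursor<'a> {
    input: &'a [u8],
    pos: usize,
    prev: u8,
    curr: u8,
    next: u8,
}

impl<'a> MiniCursor<'a> {
    fn new(input: &'a [u8]) -> Self {
        let mut cursor = Self { input, pos: 0, prev: 0, curr: 0, next: 0 };
        cursor.move_to(0);
        cursor
    }

    fn move_to(&mut self, pos: usize) {
        // Clamp to the input length; out-of-range neighbours read as NUL.
        let pos = pos.min(self.input.len());
        self.pos = pos;
        self.prev = if pos > 0 { self.input[pos - 1] } else { 0 };
        self.curr = self.input.get(pos).copied().unwrap_or(0);
        self.next = self.input.get(pos + 1).copied().unwrap_or(0);
    }

    fn advance_by(&mut self, amount: usize) {
        self.move_to(self.pos.saturating_add(amount));
    }
}

/// Hypothetical scanner: returns `(start, end)` byte ranges (end exclusive)
/// of whitespace-separated words, using only `prev`, `curr`, and `next`.
fn word_spans(input: &[u8]) -> Vec<(usize, usize)> {
    let mut cursor = MiniCursor::new(input);
    let mut spans = Vec::new();
    let mut start = 0;

    while cursor.pos < input.len() {
        if !cursor.curr.is_ascii_whitespace() {
            // A word starts when the previous byte is missing or whitespace.
            if cursor.prev == 0 || cursor.prev.is_ascii_whitespace() {
                start = cursor.pos;
            }
            // A word ends when the next byte is missing or whitespace.
            if cursor.next == 0 || cursor.next.is_ascii_whitespace() {
                spans.push((start, cursor.pos + 1));
            }
        }
        cursor.advance_by(1);
    }

    spans
}
```

For the input `flex  md:underline` this yields the ranges `(0, 4)` and `(6, 18)`. A real candidate parser needs far more validation than a whitespace split; the point here is only the neighbour-lookup style.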

oxide/crates/core/src/fast_skip.rs

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+use crate::cursor::Cursor;
+
+const STRIDE: usize = 16;
+type Mask = [bool; STRIDE];
+
+#[inline(always)]
+pub fn fast_skip(cursor: &Cursor) -> Option<usize> {
+    // If we don't have enough bytes left to check then bail early
+    if cursor.pos + STRIDE >= cursor.input.len() {
+        return None;
+    }
+
+    if !cursor.curr.is_ascii_whitespace() {
+        return None;
+    }
+
+    let mut offset = 1;
+
+    // SAFETY: We've already checked (indirectly) that this index is valid
+    let remaining = unsafe { cursor.input.get_unchecked(cursor.pos..) };
+
+    // NOTE: This loop uses primitives designed to be auto-vectorized
+    // Do not change this loop without benchmarking the results
+    // And checking the generated assembly using godbolt.org
+    for (i, chunk) in remaining.chunks_exact(STRIDE).enumerate() {
+        let value = load(chunk);
+        let is_whitespace = is_ascii_whitespace(value);
+        let is_all_whitespace = all_true(is_whitespace);
+
+        if is_all_whitespace {
+            offset = (i + 1) * STRIDE;
+        } else {
+            break;
+        }
+    }
+
+    Some(cursor.pos + offset)
+}
+
+#[inline(always)]
+fn load(input: &[u8]) -> [u8; STRIDE] {
+    let mut value = [0u8; STRIDE];
+    value.copy_from_slice(input);
+    value
+}
+
+#[inline(always)]
+fn eq(input: [u8; STRIDE], val: u8) -> Mask {
+    let mut res = [false; STRIDE];
+    for n in 0..STRIDE {
+        res[n] = input[n] == val
+    }
+    res
+}
+
+#[inline(always)]
+fn or(a: [bool; STRIDE], b: [bool; STRIDE]) -> [bool; STRIDE] {
+    let mut res = [false; STRIDE];
+    for n in 0..STRIDE {
+        res[n] = a[n] | b[n];
+    }
+    res
+}
+
+#[inline(always)]
+fn all_true(a: [bool; STRIDE]) -> bool {
+    let mut res = true;
+    for item in a.iter().take(STRIDE) {
+        res &= item;
+    }
+    res
+}
+
+#[inline(always)]
+fn is_ascii_whitespace(value: [u8; STRIDE]) -> [bool; STRIDE] {
+    let whitespace_1 = eq(value, b'\t');
+    let whitespace_2 = eq(value, b'\n');
+    let whitespace_3 = eq(value, b'\x0C');
+    let whitespace_4 = eq(value, b'\r');
+    let whitespace_5 = eq(value, b' ');
+
+    or(
+        or(
+            or(or(whitespace_1, whitespace_2), whitespace_3),
+            whitespace_4,
+        ),
+        whitespace_5,
+    )
+}
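The return convention of `fast_skip` is easier to see in a plain scalar version. The sketch below is an illustrative reference, not code from the commit: `skip_whitespace_chunks` is a hypothetical name, it takes the input slice and position directly instead of a `Cursor`, and it uses the standard `u8::is_ascii_whitespace` (which matches the same five bytes compared in the diff) rather than the hand-written mask helpers, so it says nothing about auto-vectorization.

```rust
const STRIDE: usize = 16;

/// Scalar reference for the chunked whitespace skip (hypothetical helper).
/// Returns `None` when fewer than `STRIDE + 1` bytes remain or when the
/// byte at `pos` is not whitespace; otherwise returns the position to
/// jump to.
fn skip_whitespace_chunks(input: &[u8], pos: usize) -> Option<usize> {
    if pos + STRIDE >= input.len() {
        return None;
    }
    if !input[pos].is_ascii_whitespace() {
        return None;
    }

    // Count leading 16-byte chunks that consist entirely of whitespace.
    let full_chunks = input[pos..]
        .chunks_exact(STRIDE)
        .take_while(|chunk| chunk.iter().all(u8::is_ascii_whitespace))
        .count();

    // With no full whitespace chunk, step over only the current byte.
    let offset = if full_chunks == 0 { 1 } else { full_chunks * STRIDE };
    Some(pos + offset)
}
```

Note that the skip only advances in whole strides: with 40 leading spaces and `pos = 0`, the result is position 32, and the remaining 8 spaces are left for the regular byte-by-byte path.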

oxide/crates/core/src/lib.rs

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@ use tracing::event;
 use walkdir::WalkDir;

 pub mod candidate;
+pub mod cursor;
+pub mod fast_skip;
 pub mod glob;
 pub mod location;
 pub mod modifier;
