
Commit ac25155

Use ICU4X built-in data (#482)

robertbastian and taj-p authored
The data that is currently being generated in `parley_data` is the same data that ICU4X ships with. However, using `try_new_unstable` constructors with custom data providers can be less efficient than enabling the `compiled_data` feature, as these constructors do runtime lookups and branching, whereas most `compiled_data` constructors are `const`. Benchmarks look neutral:

```
Default Style - arabic 20 characters    [ 9.9 us ... 9.7 us ]     -1.46%*
Default Style - latin 20 characters     [ 4.5 us ... 4.3 us ]     -4.27%*
Default Style - japanese 20 characters  [ 9.1 us ... 8.9 us ]     -2.30%*
Default Style - arabic 1 paragraph      [ 55.5 us ... 55.6 us ]   +0.13%
Default Style - latin 1 paragraph       [ 18.2 us ... 17.9 us ]   -1.49%*
Default Style - japanese 1 paragraph    [ 76.8 us ... 76.9 us ]   +0.16%
Default Style - arabic 4 paragraph      [ 234.0 us ... 235.1 us ] +0.48%
Default Style - latin 4 paragraph       [ 69.0 us ... 68.2 us ]   -1.05%*
Default Style - japanese 4 paragraph    [ 131.9 us ... 136.0 us ] +3.11%
Styled - arabic 20 characters           [ 11.3 us ... 11.3 us ]   -0.43%
Styled - latin 20 characters            [ 6.3 us ... 6.3 us ]     -0.99%
Styled - japanese 20 characters         [ 9.9 us ... 9.7 us ]     -1.80%*
Styled - arabic 1 paragraph             [ 59.4 us ... 58.5 us ]   -1.40%
Styled - latin 1 paragraph              [ 23.7 us ... 23.3 us ]   -1.82%*
Styled - japanese 1 paragraph           [ 86.6 us ... 87.5 us ]   +1.05%*
Styled - arabic 4 paragraph             [ 251.7 us ... 252.5 us ] +0.32%
Styled - latin 4 paragraph              [ 90.4 us ... 89.1 us ]   -1.45%*
Styled - japanese 4 paragraph           [ 123.7 us ... 124.0 us ] +0.24%
```

---------

Co-authored-by: Taj Pereira <taj@canva.com>
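To make the contrast concrete, here is a minimal sketch of the two construction styles; the `grapheme_segmenter` helper, the `main` wrapper, and the sample string are invented for illustration, while the constructor calls themselves are the ones visible in the diff below.

```rust
use icu_segmenter::{GraphemeClusterSegmenter, GraphemeClusterSegmenterBorrowed};

// With the `compiled_data` feature, the constructor is `const`: the segmenter is
// resolved at compile time against data baked into icu_segmenter, and each call
// simply hands out a borrow of that 'static data.
fn grapheme_segmenter() -> GraphemeClusterSegmenterBorrowed<'static> {
    const { GraphemeClusterSegmenter::new() }
}

// Before this change, the equivalent value was built at runtime from a custom
// data provider (see the deleted code in parley/src/analysis/mod.rs below):
//
//     GraphemeClusterSegmenter::try_new_unstable(&PROVIDER).unwrap()
//
// which performs data lookups and error branching on every construction.

fn main() {
    // Boundaries include the start and end of the string; "a\u{0301}" (a + combining
    // acute accent) segments as a single grapheme cluster.
    let breaks: Vec<usize> = grapheme_segmenter().segment_str("a\u{0301}b").collect();
    println!("{breaks:?}"); // expected: [0, 3, 4]
}
```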
1 parent 4f25d22 commit ac25155

25 files changed (+110, -1814 lines)
Cargo.lock

Lines changed: 24 additions & 763 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 3 additions & 8 deletions
```diff
@@ -39,15 +39,10 @@ hashbrown = { version = "0.16.1", default-features = false, features = [
 ] }
 icu_codepointtrie_builder = { version = "0.5.1", default-features = false, features = ["wasm"] }
 icu_collections = { version = "2.1.1", default-features = false }
-icu_locale = { version = "2.1.1", default-features = false }
 icu_locale_core = { version = "2.1.1", default-features = false }
-icu_normalizer = { version = "~2.1.1", default-features = false }
-icu_properties = { version = "~2.1.2", default-features = false }
-icu_provider = { version = "2.1.1", default-features = false }
-icu_provider_adapters = { version = "2.1.1", default-features = false }
-icu_provider_export = { version = "2.1.1", default-features = false }
-icu_provider_source = { version = "2.1.1", default-features = false }
-icu_segmenter = { version = "~2.1.1", default-features = false }
+icu_normalizer = { version = "2.1.1", default-features = false }
+icu_properties = { version = "2.1.2", default-features = false }
+icu_segmenter = { version = "2.1.2", default-features = false }
 linebender_resource_handle = { version = "0.1.1", default-features = false }
 parley = { version = "0.7.0", default-features = false, path = "parley" }
 parley_data = { path = "parley_data", default-features = false }
```

parley/Cargo.toml

Lines changed: 3 additions & 7 deletions
```diff
@@ -34,13 +34,9 @@ parley_data = { workspace = true, features = ["baked"] }
 accesskit = { workspace = true, optional = true }
 hashbrown = { workspace = true }
 harfrust = { workspace = true }
-icu_collections = { workspace = true }
-icu_normalizer = { workspace = true }
-icu_properties = { workspace = true }
-icu_provider = { workspace = true }
-icu_segmenter = { workspace = true, features = ["auto"] }
-# Used in ICU4X baked data sources
-zerovec = { workspace = true }
+icu_normalizer = { workspace = true, features = ["compiled_data"] }
+icu_properties = { workspace = true, features = ["compiled_data"] }
+icu_segmenter = { workspace = true, features = ["compiled_data"] }
 
 [dev-dependencies]
 parley_dev = { workspace = true }
```

parley/src/analysis/cluster.rs

Lines changed: 1 addition & 1 deletion
```diff
@@ -117,7 +117,7 @@ impl CharCluster {
 
 #[inline(always)]
 fn contributes_to_shaping(ch: char, analysis_data_sources: &AnalysisDataSources) -> bool {
-    let props = analysis_data_sources.composite.properties(ch as u32);
+    let props = analysis_data_sources.properties(ch);
     crate::analysis::contributes_to_shaping(props.general_category(), props.script())
 }
 
```
parley/src/analysis/mod.rs

Lines changed: 47 additions & 78 deletions
```diff
@@ -2,12 +2,10 @@
 // SPDX-License-Identifier: Apache-2.0 OR MIT
 
 pub(crate) mod cluster;
-mod provider;
 
 use alloc::vec::Vec;
 use core::marker::PhantomData;
 
-use crate::analysis::provider::PROVIDER;
 use crate::resolve::{RangedStyle, ResolvedStyle};
 use crate::{Brush, LayoutContext, WordBreak};
 
@@ -19,107 +17,80 @@ use icu_properties::props::{BidiMirroringGlyph, GeneralCategory, GraphemeCluster
 use icu_properties::{
     CodePointMapData, CodePointMapDataBorrowed, PropertyNamesShort, PropertyNamesShortBorrowed,
 };
-use icu_segmenter::options::{LineBreakOptions, LineBreakWordOption, WordBreakOptions};
+use icu_segmenter::options::{LineBreakOptions, LineBreakWordOption, WordBreakInvariantOptions};
 use icu_segmenter::{
     GraphemeClusterSegmenter, GraphemeClusterSegmenterBorrowed, LineSegmenter,
     LineSegmenterBorrowed, WordSegmenter, WordSegmenterBorrowed,
 };
-use parley_data::CompositeProps;
-
-pub(crate) struct AnalysisDataSources {
-    grapheme_segmenter: GraphemeClusterSegmenter,
-    word_segmenter: WordSegmenter,
-    line_segmenters: LineSegmenters,
-    composing_normalizer: CanonicalComposition,
-    decomposing_normalizer: CanonicalDecomposition,
-    script_short_name: PropertyNamesShort<Script>,
-    brackets: CodePointMapData<BidiMirroringGlyph>,
-
-    composite: CompositeProps,
-}
-
-#[derive(Default)]
-struct LineSegmenters {
-    normal: Option<LineSegmenter>,
-    keep_all: Option<LineSegmenter>,
-    break_all: Option<LineSegmenter>,
-}
+use parley_data::Properties;
 
-impl LineSegmenters {
-    fn get(&mut self, word_break_strength: WordBreak) -> LineSegmenterBorrowed<'_> {
-        let segmenter = match word_break_strength {
-            WordBreak::Normal => &mut self.normal,
-            WordBreak::KeepAll => &mut self.keep_all,
-            WordBreak::BreakAll => &mut self.break_all,
-        };
-
-        segmenter
-            .get_or_insert_with(|| {
-                let mut line_break_opts = LineBreakOptions::default();
-                let word_break_strength_icu = match word_break_strength {
-                    WordBreak::Normal => LineBreakWordOption::Normal,
-                    WordBreak::BreakAll => LineBreakWordOption::BreakAll,
-                    WordBreak::KeepAll => LineBreakWordOption::KeepAll,
-                };
-                line_break_opts.word_option = Some(word_break_strength_icu);
-                LineSegmenter::try_new_auto_unstable(&PROVIDER, line_break_opts)
-                    .expect("Failed to create LineSegmenter")
-            })
-            .as_borrowed()
-    }
-}
+pub(crate) struct AnalysisDataSources;
 
 impl AnalysisDataSources {
     pub(crate) fn new() -> Self {
-        Self {
-            grapheme_segmenter: GraphemeClusterSegmenter::try_new_unstable(&PROVIDER).unwrap(),
-            word_segmenter: WordSegmenter::try_new_lstm_unstable(
-                &PROVIDER,
-                WordBreakOptions::default(),
-            )
-            .unwrap(),
-            line_segmenters: LineSegmenters::default(),
-            composing_normalizer: CanonicalComposition::try_new_unstable(&PROVIDER).unwrap(),
-            decomposing_normalizer: CanonicalDecomposition::try_new_unstable(&PROVIDER).unwrap(),
-            script_short_name: PropertyNamesShort::<Script>::try_new_unstable(&PROVIDER).unwrap(),
-            brackets: CodePointMapData::<BidiMirroringGlyph>::try_new_unstable(&PROVIDER).unwrap(),
-            composite: CompositeProps,
-        }
+        Self
     }
 
     #[inline(always)]
-    pub(crate) fn composite(&self) -> &CompositeProps {
-        &self.composite
+    pub(crate) fn properties(&self, c: char) -> Properties {
+        Properties::get(c)
     }
 
     #[inline(always)]
     pub(crate) fn grapheme_segmenter(&self) -> GraphemeClusterSegmenterBorrowed<'_> {
-        self.grapheme_segmenter.as_borrowed()
+        const { GraphemeClusterSegmenter::new() }
     }
 
     #[inline(always)]
-    fn word_segmenter(&self) -> WordSegmenterBorrowed<'_> {
-        self.word_segmenter.as_borrowed()
+    fn word_segmenter(&self) -> WordSegmenterBorrowed<'static> {
+        const { WordSegmenter::new_for_non_complex_scripts(WordBreakInvariantOptions::default()) }
+    }
+
+    #[inline(always)]
+    fn line_segmenter(&self, word_break_strength: WordBreak) -> LineSegmenterBorrowed<'static> {
+        match word_break_strength {
+            WordBreak::Normal => {
+                const {
+                    let mut opt = LineBreakOptions::default();
+                    opt.word_option = Some(LineBreakWordOption::Normal);
+                    LineSegmenter::new_for_non_complex_scripts(opt)
+                }
+            }
+            WordBreak::BreakAll => {
+                const {
+                    let mut opt = LineBreakOptions::default();
+                    opt.word_option = Some(LineBreakWordOption::BreakAll);
+                    LineSegmenter::new_for_non_complex_scripts(opt)
+                }
+            }
+            WordBreak::KeepAll => {
+                const {
+                    let mut opt = LineBreakOptions::default();
+                    opt.word_option = Some(LineBreakWordOption::KeepAll);
+                    LineSegmenter::new_for_non_complex_scripts(opt)
+                }
+            }
+        }
     }
 
     #[inline(always)]
     fn composing_normalizer(&self) -> CanonicalCompositionBorrowed<'_> {
-        self.composing_normalizer.as_borrowed()
+        const { CanonicalComposition::new() }
     }
 
     #[inline(always)]
     fn decomposing_normalizer(&self) -> CanonicalDecompositionBorrowed<'_> {
-        self.decomposing_normalizer.as_borrowed()
+        const { CanonicalDecomposition::new() }
    }
 
     #[inline(always)]
-    pub(crate) fn script_short_name(&self) -> PropertyNamesShortBorrowed<'_, Script> {
-        self.script_short_name.as_borrowed()
+    pub(crate) fn script_short_name(&self) -> PropertyNamesShortBorrowed<'static, Script> {
+        PropertyNamesShort::new()
     }
 
     #[inline(always)]
-    pub(crate) fn brackets(&self) -> CodePointMapDataBorrowed<'_, BidiMirroringGlyph> {
-        self.brackets.as_borrowed()
+    fn brackets(&self) -> CodePointMapDataBorrowed<'_, BidiMirroringGlyph> {
+        const { CodePointMapData::new() }
     }
 }
 
@@ -354,8 +325,7 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, mut text: &str)
         if substring_index == 0 && last {
             let mut lb_iter = lcx
                 .analysis_data_sources
-                .line_segmenters
-                .get(word_break_strength)
+                .line_segmenter(word_break_strength)
                 .segment_str(substring);
 
             let _first = lb_iter.next();
@@ -378,8 +348,7 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, mut text: &str)
 
             let line_boundaries_iter = lcx
                 .analysis_data_sources
-                .line_segmenters
-                .get(word_break_strength)
+                .line_segmenter(word_break_strength)
                 .segment_str(substring);
 
             let mut substring_chars = substring.chars();
@@ -455,7 +424,7 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, mut text: &str)
             (boundary, ch)
         });
 
-    let composite = lcx.analysis_data_sources.composite();
+    let properties = |c| lcx.analysis_data_sources.properties(c);
 
     let mut needs_bidi_resolution = false;
 
@@ -466,7 +435,7 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, mut text: &str)
         // character's index, but we need our iterators to align, and the rest are simply
         // character-indexed.
         .fold(false, |is_mandatory_linebreak, (boundary, ch)| {
-            let properties = composite.properties(ch as u32);
+            let properties = properties(ch);
             let script = properties.script();
             let grapheme_cluster_break = properties.grapheme_cluster_break();
             let bidi_class = properties.bidi_class();
@@ -496,7 +465,7 @@ pub(crate) fn analyze_text<B: Brush>(lcx: &mut LayoutContext<B>, mut text: &str)
             };
 
             needs_bidi_resolution |= crate::bidi::needs_bidi_resolution(bidi_class);
-            // TODO: maybe extend CompositeProps to u64 to fit BidiMirroringGlyph
+            // TODO: maybe extend Properties to u64 to fit BidiMirroringGlyph
             let bracket = lcx.analysis_data_sources.brackets().get(ch);
 
             lcx.info.push((
```

parley/src/analysis/provider.rs

Lines changed: 0 additions & 41 deletions
This file was deleted.

parley_data/Cargo.toml

Lines changed: 2 additions & 3 deletions
```diff
@@ -10,12 +10,11 @@ publish = false
 
 [features]
 default = ["baked"]
-baked = ["dep:zerovec"]
-datagen = []
+baked = ["dep:icu_collections", "dep:zerovec"]
 
 [dependencies]
 icu_properties = { workspace = true }
-icu_collections = { workspace = true }
+icu_collections = { workspace = true, optional = true }
 zerovec = { workspace = true, optional = true }
 
 [lints]
```

parley_data/README.md

Lines changed: 3 additions & 6 deletions
```diff
@@ -1,17 +1,14 @@
 # Unicode Data
 
-`parley_data` packages the Unicode data that Parley's text analysis and shaping pipeline needs at runtime. It exposes a locale-invariant `CompositePropsV1` provider backed by a compact `CodePointTrie`, allowing the engine to obtain all required character properties with a single lookup.
+`parley_data` packages the Unicode data that Parley's text analysis and shaping pipeline needs at runtime. It exposes a locale-invariant `CompositeProps` data backed by a compact `CodePointTrie`, allowing the engine to obtain all required character properties with a single lookup.
 
 ## What is included
 
-- `CompositePropsV1Data`, a trie that holds script, general category, grapheme cluster break, bidi class, and several emoji-related flags per scalar value.
-- Re-exported ICU4X data providers for grapheme, word, and line breaking, plus Unicode normalization tables used by Parley.
-- An implementation of `unicode_bidi::BidiDataSource`, making the composite provider plug directly into Unicode bidi processing.
+- `CompositeProps`, a trie that holds script, general category, grapheme cluster break, bidi class, and several emoji-related flags per scalar value.
 
 ## Cargo features
 
 - `baked` *(default)* embeds pre-generated ICU4X and composite data from `src/generated`, enabling use in `no_std` targets without a filesystem.
-- `datagen` enables serialization, `databake`, and ICU provider export traits so the crate can participate in regeneration workflows. This feature is intended for developer use when refreshing the baked data and principally used by `../parley_data_gen`.
 
 ## Regenerating the baked data
 
@@ -26,7 +23,7 @@ The generator downloads the latest ICU4X upstream data and recomputes the compos
 
 ## Why have this crate?
 
-You may wonder why we can't simply run `parley_data_gen` within a `build.rs` file of `Parley`. Although being possible, that option increases build time by over a minute and requires a `std` compatible environment.
+You may wonder why we can't simply run `parley_data_gen` within a `build.rs` file of `Parley`. Although being possible, that option increases build time and requires a `std` compatible environment.
 
 ## License
 
```
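For orientation, a minimal usage sketch of the single-lookup `Properties` value this README describes. It is only meaningful inside the Parley workspace (the crate is unpublished); the `describe` helper, the sample characters, and the Debug formatting are assumptions for illustration, while `Properties::get` and the accessors are the ones used in the mod.rs diff above.

```rust
use parley_data::Properties;

// One trie lookup per scalar value returns a packed Properties record; the
// individual fields are then read through cheap accessors.
fn describe(c: char) {
    let props = Properties::get(c);
    println!(
        "{c:?}: script={:?} category={:?} bidi={:?} gcb={:?}",
        props.script(),
        props.general_category(),
        props.bidi_class(),
        props.grapheme_cluster_break(),
    );
}

fn main() {
    for c in ['a', 'م', '語'] {
        describe(c);
    }
}
```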

parley_data/src/generated/composite/mod.rs

Lines changed: 0 additions & 8 deletions
This file was deleted.

0 commit comments
