Using the ICU library and CLDR data for tokenization? #6913
jack-pappas started this conversation in Language Support
Have the spaCy authors considered utilizing the ICU (International Components for Unicode) library for implementing at least some parts of the language support / tokenization? I'm new to spaCy and NLP in general, but at a glance it seems like there's a lot of overlap between the `Language` implementations in spaCy and what's provided by ICU; utilizing ICU in spaCy would mean it'd be able to support every language (incl. dialects) supported by Unicode.

A few other benefits I can see for spaCy if using ICU: for example, a tokenizer might split `123 미국 달러` into `['123', '미국', '달러']`, but using the CLDR data provided through ICU it might be possible to parse that string into an object like `CurrencyAmount(value=123.0, iso4217='USD')`.

I've already seen at least one project where someone's using ICU to implement a "universal" language tokenizer (in Python): https://github.com/mingruimingrui/ICU-tokenizer
Alternatively, if spaCy doesn't want to take a dependency on ICU, the CLDR data (the data that drives the various ICU libraries) could be preprocessed during the spaCy CI pipeline to extract relevant information and codegen parts of the various `Language` implementations.