Using the ICU library and CLDR data for tokenization? #6913
jack-pappas started this conversation in Language Support
Have the spaCy authors considered utilizing the ICU (International Components for Unicode) library for implementing at least some parts of the language support / tokenization? I'm new to spaCy and NLP in general, but at a glance it seems like there's a lot of overlap between the `Language` implementations in spaCy and what's provided by ICU; utilizing ICU in spaCy would mean it'd be able to support every language (incl. dialects) supported by Unicode.

A few other benefits I can see for spaCy if using ICU: for example, a tokenizer might split `123 미국 달러` into `['123', '미국', '달러']`, but using the CLDR data provided through ICU it might be possible to parse that string into an object like `CurrencyAmount(value=123.0, iso4217='USD')`.

I've already seen at least one project where someone's using ICU to implement a "universal" language tokenizer (in Python): https://github.com/mingruimingrui/ICU-tokenizer
Alternatively, if spaCy doesn't want to take a dependency on ICU, the CLDR data (the data that drives the various ICU libraries) could be preprocessed during the spaCy CI pipeline to extract relevant information and codegen parts of the various `Language` implementations.