Problems:
Regex-splitting text in Python is slow, and std::regex does not support Unicode property shortcuts (e.g. \p{L}).
Suggested alternative:
- Implement your own word counter in your favourite language to obtain word-count dicts, then pass them to train_new_from_counts
- Increase workers in train_new_from_iterator
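
A minimal sketch of the first alternative, written in Python only for brevity; the same shape ports to any faster language. The helper name count_words and the pattern are assumptions for illustration, not part of this project. The stdlib re module has no \p{L}, so the character class below only approximates "runs of Unicode letters". The dict that train_new_from_counts actually expects may differ, so check its signature before passing the result in.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumption: "[^\W\d_]+" stands in for \p{L}+ (word characters minus
# decimal digits and underscore). It is an approximation, not an exact match
# for the Unicode letter property.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(texts: Iterable[str]) -> Dict[str, int]:
    """Count how often each word occurs across an iterable of texts."""
    counts = Counter()
    for text in texts:
        # findall returns every non-overlapping match as a string
        counts.update(WORD_RE.findall(text))
    return dict(counts)


# Hypothetical usage: build the word-count dict once, then hand it to the trainer.
word_counts = count_words(["héllo world", "hello world 42"])
```

Counting words up front means the expensive regex pass runs once, outside the trainer, and the resulting dict can be cached or produced by a separate tool.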
Potential fix:
Implement the word counter in Rust (unlikely due to the different build systems)
Rewrite everything in Rust (when I have the time)