Preprocessing of text to build token sets can be faster #3

@jararap

Description

Problems:

Using Python to regex-split texts is slow, while std::regex does not support Unicode property shortcuts (e.g. \p{L}).
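To illustrate the limitation mentioned above: Python's built-in `re` module also rejects `\p{L}`, so a Unicode-aware pre-tokenizer typically falls back on `\w+` (which matches Unicode word characters by default in Python 3) or a third-party library. A minimal sketch:

```python
import re

# The standard-library re module does not support \p{L}; compiling it fails.
try:
    re.compile(r"\p{L}+")
    p_supported = True
except re.error:
    p_supported = False

# A common fallback: \w+ is Unicode-aware for str patterns in Python 3,
# so it splits non-ASCII words reasonably well for most scripts.
words = re.findall(r"\w+", "Grüße, мир")

print(p_supported)  # False
print(words)        # ['Grüße', 'мир']
```

The third-party `regex` module does support `\p{L}` and is a common drop-in replacement, though the per-call overhead in Python remains the bottleneck the issue describes.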

Suggested alternative:

  1. Implement your own word counter in your favourite language to obtain word-count dicts, then use train_new_from_counts
  2. Increase the number of workers in train_new_from_iterator
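Step 1 above can be sketched as follows. This is a minimal word counter using the standard library; the `count_words` helper name is mine, and the resulting dict is what would be handed to `train_new_from_counts` per the suggestion (that call is not shown here, since its exact signature is not given in the issue):

```python
import re
from collections import Counter

def count_words(lines):
    """Build a word-count dict from an iterable of text lines.

    Uses Unicode-aware \\w+ as a stand-in for \\p{L}+ splitting;
    a faster counter in another language would produce the same dict.
    """
    counts = Counter()
    pattern = re.compile(r"\w+")
    for line in lines:
        counts.update(pattern.findall(line))
    return dict(counts)

counts = count_words(["the cat sat", "the mat"])
print(counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'mat': 1}
# counts would then be passed to train_new_from_counts
```

In practice the counting itself would live in a compiled language for speed; only the final counts dict needs to cross back into Python.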

Potential fix:

  1. Implement the word counter in Rust (unlikely, due to differing build systems)
  2. Rewrite everything in Rust (when I have the time 😔)
