Preprocessing of text to build token sets can be faster #3

@jararap

Description

Problems:

Using Python to regex-split texts is slow, while std::regex does not support Unicode property shortcuts (e.g. \p{L}).
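To illustrate the limitation mentioned above: Python's built-in `re` module also rejects `\p{L}`, so a Unicode-aware pre-tokenizer typically falls back on `\w+` (which matches Unicode word characters by default in Python 3) or a third-party library. A minimal sketch:

```python
import re

# The standard-library re module does not support \p{L}; compiling it fails.
try:
    re.compile(r"\p{L}+")
    p_supported = True
except re.error:
    p_supported = False

# A common fallback: \w+ is Unicode-aware for str patterns in Python 3,
# so it splits non-ASCII words reasonably well for most scripts.
words = re.findall(r"\w+", "Grüße, мир")

print(p_supported)  # False
print(words)        # ['Grüße', 'мир']
```

The third-party `regex` module does support `\p{L}` and is a common drop-in replacement, though the per-call overhead in Python remains the bottleneck the issue describes.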

Suggested alternative:

  1. Implement your own word counter in your favourite language to obtain word-count dicts, then use train_new_from_counts
  2. Increase the number of workers in train_new_from_iterator
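Step 1 above can be sketched as follows. This is a minimal word counter using the standard library; the `count_words` helper name is mine, and the resulting dict is what would be handed to `train_new_from_counts` per the suggestion (that call is not shown here, since its exact signature is not given in the issue):

```python
import re
from collections import Counter

def count_words(lines):
    """Build a word-count dict from an iterable of text lines.

    Uses Unicode-aware \\w+ as a stand-in for \\p{L}+ splitting;
    a faster counter in another language would produce the same dict.
    """
    counts = Counter()
    pattern = re.compile(r"\w+")
    for line in lines:
        counts.update(pattern.findall(line))
    return dict(counts)

counts = count_words(["the cat sat", "the mat"])
print(counts)  # {'the': 2, 'cat': 1, 'sat': 1, 'mat': 1}
# counts would then be passed to train_new_from_counts
```

In practice the counting itself would live in a compiled language for speed; only the final counts dict needs to cross back into Python.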

Potential fix:

  1. Implement the word counter in Rust (unlikely, due to differing build systems)
  2. Rewrite everything in Rust (when I have the time 😔)
