It's hard to say what the right tokenization is for complex sequences like this. The English tokenizer was designed to handle slashes with a particular awareness of dates, like 12/01/71, which are easier to handle if treated as a single token.

If slashes are used in a particular way in your data, or used less predictably, it could certainly make sense to customize the tokenizer behavior there.
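As a sketch of that kind of customization, the snippet below adds an infix rule that splits a slash only between letters, so a date like 12/01/71 stays a single token. The specific regex is a hypothetical example for this discussion, not a spaCy default:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Hypothetical rule: split "/" only when it sits between letters,
# so numeric dates like 12/01/71 remain single tokens.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])/(?=[A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("laptop/desktop shipped 12/01/71")])
```

Because the rule is added on top of the existing English infixes, all other default splitting behavior is preserved; only the letter/letter slash case changes.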

Answer selected by adrianeboyd
Labels: lang / en (English language data and models), feat / tokenizer (Feature: Tokenizer)
3 participants