The tokenizer and tokenizer.explain results are the same here:

assert [t.text for t in nlp(test_string)] == [x[1] for x in nlp.tokenizer.explain(test_string)]

With (t, nlp.tokenizer.explain(t.text)) you're running each individual token text through the tokenizer again, which may not produce the same result as tokenizing the full text once. A similar example is French, where retokenizing the intended token l' on its own produces l '.
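A minimal sketch of the difference, using spacy.blank("en") as a stand-in for the nlp object from the discussion (the sample text is made up for illustration):

```python
import spacy

nlp = spacy.blank("en")
text = "Give 10mg/100mL twice daily."

# explain() on the full string mirrors the tokenizer exactly:
# each (pattern_name, substring) pair lines up with one token.
assert [t.text for t in nlp(text)] == [sub for _, sub in nlp.tokenizer.explain(text)]

# Calling explain() on each token's text re-tokenizes that text from
# scratch, which can split it differently than it was split in context:
for t in nlp(text):
    print(t.text, nlp.tokenizer.explain(t.text))
```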

Infixes are applied after prefixes/suffixes, so 100mL isn't split until the infix pattern for / is applied, and after that the tokenizer doesn't look for suffixes again. You'd have to add units_denom as suffixes so that 10mg/100mL is split into 10mg/100 and mL by the suffix patterns before it gets to the infixes.
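The ordering can be illustrated with a pure-Python sketch of the algorithm (the regexes here are made-up stand-ins, not spaCy's defaults, and units_denom from the question is reduced to a hard-coded mg|mL alternation): suffixes are stripped from the end of each whitespace-delimited chunk first, then the remainder is split on infixes, and the infix pieces are not re-scanned for suffixes.

```python
import re

PREFIX = re.compile(r"^[\(\"]")
# Stripping units as suffixes (analogous to adding units_denom to the suffixes)
# is what peels "mL" off before the infix pass ever runs:
SUFFIX = re.compile(r"(?<=[0-9])(?:mg|mL)$")
INFIX = re.compile(r"/")

def tokenize_chunk(chunk):
    tokens, suffixes = [], []
    # 1. Strip prefixes and suffixes repeatedly from the ends of the chunk.
    while chunk:
        m = PREFIX.search(chunk)
        if m:
            tokens.append(m.group())
            chunk = chunk[m.end():]
            continue
        m = SUFFIX.search(chunk)
        if m:
            suffixes.insert(0, m.group())
            chunk = chunk[:m.start()]
            continue
        break
    # 2. Split what's left on infixes -- there is no second suffix pass,
    # so an infix piece like "100mL" would stay intact here.
    if chunk:
        last = 0
        for m in INFIX.finditer(chunk):
            if m.start() > last:
                tokens.append(chunk[last:m.start()])
            tokens.append(m.group())
            last = m.end()
        if last < len(chunk):
            tokens.append(chunk[last:])
    return tokens + suffixes

print(tokenize_chunk("10mg/100mL"))  # ['10mg', '/', '100', 'mL']
```

With mL registered as a suffix, it is stripped from 10mg/100mL before the infix pass, so the chunk becomes 10mg/100 plus mL; remove it from SUFFIX and the infix split leaves 100mL as one piece, which is exactly the behavior described above.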
