tokenizer explain lists two tokens, although tokenizer returns one #10569
-
I am writing special cases to handle units in medical text, and I have come across an issue that I cannot understand. My plan is to tokenize much more aggressively than the default rules, and then reassemble/retokenize with matching rules, in order to handle a lot of inconsistencies in how different clinicians document units. I have removed units from default suffixes as defined in punctuation.py in my custom language, and then added a number of additional rules - example below
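Roughly, the setup looks like the sketch below (the exact patterns and the `10mg/100mL` test string are illustrative assumptions based on the rest of the thread, not the original code; the real rule set is more extensive):

```python
import spacy
from spacy.lang.char_classes import UNITS
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.blank("en")

# Drop the default "number followed by a unit" suffix pattern (built from UNITS
# in punctuation.py). The original post does this in a custom language subclass;
# filtering at runtime is just a stand-in here.
suffixes = [s for s in nlp.Defaults.suffixes if UNITS not in s]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# Aggressive extra rules (illustrative): treat leading digits as a prefix and
# split "/" between alphanumeric characters as an infix.
prefixes = list(nlp.Defaults.prefixes) + [r"[0-9]+"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9A-Za-z])/(?=[0-9A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

test_string = "10mg/100mL"
print([t.text for t in nlp(test_string)])  # e.g. ['10', 'mg', '/', '100mL']; '100mL' stays whole
print(nlp.tokenizer.explain("100mL"))      # e.g. [('PREFIX', '100'), ('TOKEN', 'mL')]
```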
I have confirmed that there are no special cases that would account for the last token (`100mL`) not being split into (`100`, `mL`), and using the explain function I can see that the prefix is indeed being tokenized as expected, but I cannot figure out what rule is causing these two tokens to be merged when calling the language object. I also tried `[(t, nlp.tokenizer.explain(t.text)) for t in nlp.tokenizer(test_string)]` to see if there was a later step in the pipeline that was retokenizing, but the output is unchanged. When I do not remove the default suffix, the tokens are split as expected.

I do not want, however, to include all possible units in the suffix match, as the subset I provided above is only a small example set, and the real list would generate a lot of false positives, so I would prefer to handle this with a Matcher, where there is more flexibility to look backwards and forwards past any whitespace present (a rough sketch of that reassembly step is below). It also feels like it should be possible to force the tokenizer to split here, as the prefix is detected properly without it, but I just can't work out where it is being retokenized / merged. Any help greatly appreciated.
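For reference, a minimal sketch of the Matcher-plus-retokenizer reassembly step mentioned above (the pattern and the small unit list are illustrative, not the real rules):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# Illustrative pattern only: a number token followed by a unit token, merged
# back into a single token regardless of intervening whitespace.
matcher = Matcher(nlp.vocab)
matcher.add("DOSE_UNIT", [[{"LIKE_NUM": True}, {"LOWER": {"IN": ["ml", "mg", "mcg", "l"]}}]])

doc = nlp("give 100 mL now")
with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])  # e.g. ['give', '100 mL', 'now']
```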
-
The tokenizer and `tokenizer.explain` results are the same here:

`assert [t.text for t in nlp(test_string)] == [x[1] for x in nlp.tokenizer.explain(test_string)]`

With `(t, nlp.tokenizer.explain(t.text))` you're running each individual token text through the tokenizer again, which may not produce the same results as tokenizing once. A similar example is with French, where retokenizing the intended token `l'` produces `l '`.

Infixes are applied after prefixes/suffixes, so `100mL` isn't split until you apply the infix pattern for `/`, and then it doesn't look for suffixes again. You'd have to add `units_denom` as suffixes to have this split into `10mg/100` and `mL` by the suffix patterns before it gets to the infixes.

If you haven't seen it, have a look at the steps at the bottom of this expandable box that describes the order in which the regexes are applied: https://spacy.io/usage/linguistic-features#how-tokenizer-works.
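A minimal sketch of that suggestion, assuming an illustrative `units_denom` list and the same `/` infix as in the question's sketch (not the exact patterns from the original post):

```python
import spacy
from spacy.util import compile_suffix_regex, compile_infix_regex

nlp = spacy.blank("en")

# Illustrative denominator units; the real units_denom list isn't shown in the thread.
units_denom = ["mL", "ml", "L", "mg", "mcg"]

# Add the units back as suffix patterns so "mL" is stripped from the end of the
# chunk before the "/" infix runs (suffixes are applied before infixes).
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=[0-9])(?:{})".format("|".join(units_denom))]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9A-Za-z])/(?=[0-9A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("10mg/100mL")])  # e.g. ['10mg', '/', '100', 'mL']
```

With that in place, the unit is split off by the suffix pass before the `/` infix is applied, which gives the split described above.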