tokenizer explain lists two tokens, although tokenizer returns one #10569
-
I am writing special cases to handle units in medical text, and I have come across an issue that I cannot understand. My plan is to tokenize much more aggressively than the default rules, and then reassemble/retokenize with matching rules, in order to handle a lot of inconsistencies in how different clinicians document units. I have removed units from default suffixes as defined in punctuation.py in my custom language, and then added a number of additional rules - example below
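Roughly, the setup looks like the sketch below (the exact patterns and the `10mg/100mL` test string are illustrative assumptions based on the rest of the thread, not the original code; the real rule set is more extensive):

```python
import spacy
from spacy.lang.char_classes import UNITS
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.blank("en")

# Drop the default "number followed by a unit" suffix pattern (built from UNITS
# in punctuation.py). The original post does this in a custom language subclass;
# filtering at runtime is just a stand-in here.
suffixes = [s for s in nlp.Defaults.suffixes if UNITS not in s]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# Aggressive extra rules (illustrative): treat leading digits as a prefix and
# split "/" between alphanumeric characters as an infix.
prefixes = list(nlp.Defaults.prefixes) + [r"[0-9]+"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9A-Za-z])/(?=[0-9A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

test_string = "10mg/100mL"
print([t.text for t in nlp(test_string)])  # e.g. ['10', 'mg', '/', '100mL']; '100mL' stays whole
print(nlp.tokenizer.explain("100mL"))      # e.g. [('PREFIX', '100'), ('TOKEN', 'mL')]
```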
I have confirmed that there are no special cases that would account for the last token (`100mL`) not being split into (`100`, `mL`), and using the explain function I can see that the prefix is indeed being tokenized as expected, but I cannot figure out what rule is causing these two tokens to be merged when calling the language object. I also tried `[(t, nlp.tokenizer.explain(t.text)) for t in nlp.tokenizer(test_string)]` to see if there was a later step in the pipeline that was retokenizing, but the output is unchanged. When I do not remove the default suffix, the tokens are split as expected.

I do not want, however, to include all possible units in the suffix match, as the subset I provided above is only a small example set, and the real list would generate a lot of false positives, so I would prefer to handle this with a Matcher, where there is more flexibility to look backwards and forwards past any whitespace present (a rough sketch of that reassembly step is below). It also feels like it should be possible to force the tokenizer to split here, as the prefix is detected properly without it, but I just can't work out where it is being retokenized / merged. Any help greatly appreciated.
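For reference, a minimal sketch of the Matcher-plus-retokenizer reassembly step mentioned above (the pattern and the small unit list are illustrative, not the real rules):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# Illustrative pattern only: a number token followed by a unit token, merged
# back into a single token regardless of intervening whitespace.
matcher = Matcher(nlp.vocab)
matcher.add("DOSE_UNIT", [[{"LIKE_NUM": True}, {"LOWER": {"IN": ["ml", "mg", "mcg", "l"]}}]])

doc = nlp("give 100 mL now")
with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])  # e.g. ['give', '100 mL', 'now']
```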
-
The tokenizer and `tokenizer.explain` results are the same here:

`assert [t.text for t in nlp(test_string)] == [x[1] for x in nlp.tokenizer.explain(test_string)]`

With `(t, nlp.tokenizer.explain(t.text))` you're running each individual token text through the tokenizer again, which may not produce the same results as tokenizing once. A similar example is with French, where retokenizing the intended token `l'` produces `l '`.

Infixes are applied after prefixes/suffixes, so `100mL` isn't split until you apply the infix pattern for `/`, and then it doesn't look for suffixes again. You'd have to add `units_denom` as suffixes to have this split into `10mg/100` and `mL` by the suffix patterns before it gets to the infixes.

If you haven't seen it, have a look at the steps at the bottom of this expandable box that describes the order in which the regexes are applied: https://spacy.io/usage/linguistic-features#how-tokenizer-works.
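A minimal sketch of that suggestion, assuming an illustrative `units_denom` list and the same `/` infix as in the question's sketch (not the exact patterns from the original post):

```python
import spacy
from spacy.util import compile_suffix_regex, compile_infix_regex

nlp = spacy.blank("en")

# Illustrative denominator units; the real units_denom list isn't shown in the thread.
units_denom = ["mL", "ml", "L", "mg", "mcg"]

# Add the units back as suffix patterns so "mL" is stripped from the end of the
# chunk before the "/" infix runs (suffixes are applied before infixes).
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=[0-9])(?:{})".format("|".join(units_denom))]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9A-Za-z])/(?=[0-9A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("10mg/100mL")])  # e.g. ['10mg', '/', '100', 'mL']
```

With that in place, the unit is split off by the suffix pass before the `/` infix is applied, which gives the split described above.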