Incorrect infix tokenization of /" #10001
I came across a case where tokenization seems to fail on an adjacent slash-quote between two words, such as ALPHA/"BRAVO" or ALPHA/"BRAVO CHARLIE". This is easy enough to preprocess (insert a space between the adjacent slash and quote); I'm just unsure whether this is a bug or expected behaviour. If it looks like a bug I'm happy to submit a full bug report; I just don't want to waste anybody's time with a false alarm.
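A minimal sketch of what I'm seeing, assuming spaCy's default slash infix behaves like the simplified regex below (the real rule lives in spacy/lang/punctuation.py and uses broader character classes, so this is only an approximation):

```python
import re

# Simplified stand-in for spaCy's default "/" infix rule:
# "/" counts as an infix only when preceded by a letter or digit
# AND followed by a letter.
INFIX_SLASH = re.compile(r'(?<=[A-Za-z0-9])/(?=[A-Za-z])')

print(bool(INFIX_SLASH.search('ALPHA/BRAVO')))    # True: "/" would split here
print(bool(INFIX_SLASH.search('ALPHA/"BRAVO"')))  # False: "/" is followed by a quote
print(bool(INFIX_SLASH.search('01/01/2022')))     # False: "/" is followed by a digit
```

Under this approximation, the `/` in ALPHA/"BRAVO" is never treated as an infix because the character after it is a quote rather than a letter, which matches the behaviour I observed.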
Replies: 1 comment 1 reply
This is the expected behavior for the current English tokenizer defaults. It currently only splits on / as an infix between alpha+digit and alpha:

spaCy/spacy/lang/punctuation.py Line 44 in 5ba4171

I think these defaults are intended to treat dates like 01/01/2022 differently from ABC/DEF. You can certainly customize these settings for your own model, see: https://spacy.io/usage/linguistic-features#native-tokenizer-additions. Since this isn't the exact same tokenization the pipeline was trained on, you might see a few more errors in tags and parses if you change this for en_core_web_sm, but it probably only leads to minor di…
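A sketch of that customization, assuming you want / to also split when it sits between a letter and a quote character (the `[A-Za-z]` and `"` classes here are deliberate simplifications; spaCy's own rules use Unicode-aware character classes):

```python
import spacy
from spacy.util import compile_infix_regex

# A blank English pipeline for illustration; the same tokenizer tweak
# applies to a loaded pipeline such as en_core_web_sm.
nlp = spacy.blank("en")

# Extend the default infix patterns with a rule that splits "/" when it
# is preceded by a letter and followed by a quote (an assumed requirement).
infixes = list(nlp.Defaults.infixes) + [r'(?<=[A-Za-z])/(?=")']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp('ALPHA/"BRAVO"')])
```

With the extra pattern, ALPHA and / come out as separate tokens instead of one long token, at the cost of diverging slightly from the tokenization a pretrained pipeline was trained on.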