Trailing dot handling #12930

lsmith77 · 2023-08-23T09:00:39Z

Trailing dots on numbers are handled differently for English and German. English splits the trailing dot into its own token. German does not.

Neither language splits the trailing dot off "m.". I am not sure whether this is due to tweaks for handling cases like "Mr." and "Dr.".

In our case, however, we always want the "." to be its own token.

I saw the "Let's go to N.Y.!" example at https://spacy.io/usage/linguistic-features#tokenization, but I was hoping there is a simpler way to define a special case so that a trailing dot is split into its own token.

How to reproduce the behaviour

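from spacy.lang.char_classes import (
    ALPHA,
    ALPHA_LOWER,
    ALPHA_UPPER,
    CONCAT_QUOTES,
    LIST_ELLIPSES,
    LIST_ICONS,
)
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex


# Build a Tokenizer that reuses the pipeline's prefix and suffix rules
# but swaps in language-specific infix patterns.
def custom_tokenizer(lang, nlp):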
    if lang == "de":
        infixes = (
            LIST_ELLIPSES
            + LIST_ICONS
            + [
                r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
                r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
                # removed : [:<>=]
                r"(?<=[{a}])[<>=](?=[{a}])".format(a=ALPHA),
                r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
                r"(?<=[0-9{a}])\/(?=[0-9{a}])".format(a=ALPHA),
                r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(
                    a=ALPHA, q=CONCAT_QUOTES.replace("'", "")
                ),
                r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
                r"(?<=[0-9])-(?=[0-9])",
            ]
        )

        infix_re = compile_infix_regex(infixes)
    else:
        # https://spacy.io/usage/linguistic-features#tokenization
        infixes = (
            LIST_ELLIPSES
            + LIST_ICONS
            + [
                r"(?<=[0-9])[+\-\*^](?=[0-9-])",
                r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
                    al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
                ),
                r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
                # ✅ Commented out regex that splits on hyphens between letters:
                # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
                r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
            ]
        )
        infix_re = compile_infix_regex(infixes)

    return Tokenizer(
        nlp.vocab,
        prefix_search=nlp.tokenizer.prefix_search,
        suffix_search=nlp.tokenizer.suffix_search,
        infix_finditer=infix_re.finditer,
        token_match=nlp.tokenizer.token_match,
        rules=nlp.Defaults.tokenizer_exceptions,
    )
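One detail worth checking in infix lists like these is backslash escaping. Inside a raw string, a doubled backslash is a literal backslash in the regex, so r"\\." matches a backslash followed by any character rather than a literal dot. The following is a small standard-library sketch of that difference, not spaCy code; the [a-z] and [A-Z] classes are simplified stand-ins for ALPHA_LOWER and ALPHA_UPPER, chosen only for illustration.

```python
import re

# Simplified stand-ins for the infix patterns; [a-z]/[A-Z] replace spaCy's
# ALPHA_LOWER/ALPHA_UPPER character classes for illustration only.
doubled_dot = re.compile(r"(?<=[a-z])\\.(?=[A-Z])")  # literal backslash + any character
single_dot = re.compile(r"(?<=[a-z])\.(?=[A-Z])")    # literal dot

# In the doubled form the class holds "+", a backslash, "*" and "^", but no "-".
doubled_ops = re.compile(r"(?<=[0-9])[+\\-\\*^](?=[0-9-])")
single_ops = re.compile(r"(?<=[0-9])[+\-\*^](?=[0-9-])")

print(bool(doubled_dot.search("end.Start")))  # False
print(bool(single_dot.search("end.Start")))   # True
print(bool(doubled_ops.search("1-2")))        # False
print(bool(single_ops.search("1-2")))         # True
```

With single backslashes the patterns behave like the ones in the spaCy usage guide; with doubled backslashes they silently stop matching dots and hyphens.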

Note: The differences between the English and German infixes do not seem to affect how "1." is handled.

    texts = [
        "Nummer 1.",
        "m.",
        "Mr.",
    ]
    for text in texts:
        tokens = nlp(text)
        for token in tokens:
            print(token.text)

For English I get:

Nummer
1
.
m.
Mr.

For German I get:

Nummer
1.
m.
Mr.

Your Environment

Operating System: OSX/Linux
Python Version Used: 3.11
spaCy Version Used: 3.6

Answered by lsmith77

lsmith77 · 2023-08-23T11:01:06Z

After finding #7303, I managed to remove the difference between English and German by appending a bare dot pattern to the default suffixes:

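    import spacy

    # Add a literal "." to the suffix patterns so a trailing dot after a number is split off.
    suffixes = nlp.Defaults.suffixes + [r"\."]
    suffix_regex = spacy.util.compile_suffix_regex(suffixes)
    nlp.tokenizer.suffix_search = suffix_regex.search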
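To illustrate why appending a bare dot pattern changes the result, here is a minimal standard-library sketch. It assumes that suffix patterns are each anchored at the end of the string and joined into one alternation, which is my understanding of what spacy.util.compile_suffix_regex does. The helper compile_suffixes and the shortened base_suffixes list are hypothetical and are not the real German defaults.

```python
import re


def compile_suffixes(entries):
    # Anchor every entry at the end of the string and join the alternatives.
    # This is a simplified stand-in for spacy.util.compile_suffix_regex.
    return re.compile("|".join(piece + "$" for piece in entries if piece.strip()))


# Hypothetical, heavily shortened suffix list: here a dot only counts as a
# suffix when it follows a lowercase letter.
base_suffixes = [r"(?<=[a-z])\.", r"\)", r","]

before = compile_suffixes(base_suffixes)
after = compile_suffixes(base_suffixes + [r"\."])

print(before.search("1."))         # None: the dot after a digit is not a suffix
print(after.search("1.").group())  # .
```

This only concerns suffix matching. Strings such as "Mr." that stay whole in the output above are presumably covered by tokenizer exceptions and would need separate handling.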
