Tokenizing Special cases with whitespace #13118
-
I want to tell the tokenizer not to split tokens like pi 123 into two separate tokens, but to treat them as one token. I tried a custom infix 'PI \d+', but apparently infixes are applied after whitespace tokenisation. What worked was adding pi 123 as a special case (see the sketch below):
But since there are a lot of combinations, I need to use regular expressions.
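For reference, a minimal sketch of that exact-string special case approach, assuming spaCy v3; the blank English pipeline and the "pi 123" string are just illustrative:

```python
import spacy

nlp = spacy.blank("en")

# Register the exact string, including the space, as a single-token special case.
nlp.tokenizer.add_special_case("pi 123", [{"ORTH": "pi 123"}])

doc = nlp("the constant pi 123 is defined here")
print([t.text for t in doc])
# ['the', 'constant', 'pi 123', 'is', 'defined', 'here']
```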
Replies: 1 comment
-
The tokenizer only supports exact string matches as special cases. Instead, you can add a custom component at the beginning of your pipeline that matches based on regexes and uses the retokenizer to adjust the tokenization before the rest of the components are run: https://spacy.io/usage/linguistic-features#retokenization
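A minimal sketch of such a component, assuming spaCy v3; the component name `merge_pi_numbers` and the `pi \d+` pattern are illustrative:

```python
import re

import spacy
from spacy.language import Language


@Language.component("merge_pi_numbers")
def merge_pi_numbers(doc):
    # Find "pi <digits>" matches by character offset, map them to token
    # spans, then merge each span into a single token.
    spans = []
    for match in re.finditer(r"\bpi \d+\b", doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None:  # None if offsets don't align with token boundaries
            spans.append(span)
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("merge_pi_numbers", first=True)

doc = nlp("the value pi 123 and pi 456 appear here")
print([t.text for t in doc])
# ['the', 'value', 'pi 123', 'and', 'pi 456', 'appear', 'here']
```

Because the component runs first, all later pipeline components see the merged tokens.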