add_special_case to Tokenizer based on regex #10824

dennymarcels · 2022-05-19T13:58:40Z

dennymarcels
May 19, 2022

Hi y'all,
Can I do it? I would like that any case starting with some symbol (say ¶) would return the whole case as its ORTH (currently, some of them as split, for example ¶kvhMjNNUwjbuHkVjj91G becomes ¶kvhMjNNUwjbuHkVjj91 + G).

Answered by adrianeboyd

May 19, 2022

You can define a token_match pattern that is matched against each full token as described here:

https://spacy.io/usage/linguistic-features#how-tokenizer-works

View full answer

adrianeboyd · 2022-05-19T15:48:52Z

adrianeboyd
May 19, 2022

You can define a token_match pattern that is matched against each full token as described here:

https://spacy.io/usage/linguistic-features#how-tokenizer-works

0 replies

dennymarcels · 2022-05-19T22:03:29Z

dennymarcels
May 19, 2022
Author

Sweet!

In case anyone else wonders, this is what I did:

spacy_model.tokenizer = Tokenizer(spacy_model.vocab, token_match=re.compile(r'^¶+').search)

Works beautifully!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

add_special_case to Tokenizer based on regex #10824

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

add_special_case to Tokenizer based on regex #10824

Uh oh!

dennymarcels May 19, 2022

Replies: 2 comments

Uh oh!

adrianeboyd May 19, 2022

Uh oh!

dennymarcels May 19, 2022 Author

dennymarcels
May 19, 2022

adrianeboyd
May 19, 2022

dennymarcels
May 19, 2022
Author