This is the expected behavior for the current English tokenizer defaults. They currently split on / as an infix only when it is preceded by a letter or digit and followed by a letter:

r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),

I think these defaults are intended to treat dates like 01/01/2022 differently from ABC/DEF.
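To see what that pattern does in isolation, here is a minimal sketch using only Python's `re` module. It assumes an ASCII-only `ALPHA` as a stand-in for spaCy's much larger character class, and `infix_positions` is a hypothetical helper name, so it illustrates the lookbehind/lookahead logic rather than the tokenizer itself:

```python
import re

# Assumption: ASCII letters only; spaCy's real ALPHA covers many more scripts.
ALPHA = "a-zA-Z"

# Same pattern as above: one of : < > = / preceded by a letter or digit
# and followed by a letter.
infix_re = re.compile(r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA))


def infix_positions(text):
    """Return the character offsets where the pattern would split (hypothetical helper)."""
    return [m.start() for m in infix_re.finditer(text)]


print(infix_positions("ABC/DEF"))     # the "/" is followed by a letter, so it matches
print(infix_positions("01/01/2022"))  # each "/" is followed by a digit, so nothing matches
print(infix_positions("abc/12"))      # followed by a digit, so nothing matches
```

The lookahead `(?=[{a}])` is what keeps dates intact: a slash followed by a digit never qualifies as an infix under this rule.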

You can certainly customize these settings for your own model, see: https://spacy.io/usage/linguistic-features#native-tokenizer-additions
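As a rough sketch of that kind of customization, the snippet below appends one extra infix rule to the defaults and recompiles the infix regex. The added rule (split on / between two digits) and the helper name `split_numeric_slashes` are assumptions chosen purely for illustration; a blank English pipeline is used so no trained model is needed:

```python
import spacy
from spacy.util import compile_infix_regex


def split_numeric_slashes(nlp):
    """Add a hypothetical rule that also splits "/" between digits."""
    # Keep all default infix patterns and append the illustrative one.
    infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9])/(?=[0-9])"]
    infix_re = compile_infix_regex(infixes)
    # The tokenizer looks up infix matches through this callable.
    nlp.tokenizer.infix_finditer = infix_re.finditer
    return nlp


nlp = split_numeric_slashes(spacy.blank("en"))
tokens = [t.text for t in nlp("01/01/2022")]
print(tokens)
```

Extending the default list, rather than replacing it, leaves every other infix rule as it was.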

Since this isn't exactly the tokenization the pipeline was trained on, you might see a few more errors in tags and parses if you change this for en_core_web_sm, but it probably only leads to minor di…

Answer selected by djmechanic