English tokenizer's behavior with "/" characters #10425
-
Can someone explain the logic behind how spaCy's English tokenizer handles "words" that contain '/' characters? In the example I tried, it splits a serial number on the first '/' but not on the second. Having also observed that

"12/AB/A3456-7" => [12] [/] [AB] [/] [A3456] [-] [7]

my current hypothesis is that the shape "XX/d*" won't be split, while "XX/[xX]*" will. What is the motivation behind this behavior? I also have to deal with "//" as separators, which don't seem to be split at all, so I guess I'll have to set up a custom tokenizer to handle this kind of case.
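For reference, a minimal sketch for reproducing this, assuming spaCy v3 (a blank English pipeline is enough, since tokenization is rule-based and needs no trained model; "AB//CD" is a made-up stand-in for my double-slash separators):

```python
import spacy

# Only the tokenizer is exercised here, so no trained model is needed.
nlp = spacy.blank("en")

for text in ["12/AB/A3456-7", "12/01/71", "AB//CD"]:
    doc = nlp(text)
    # token.shape_ shows the X/x/d shape notation used in the hypothesis above
    print(text, "=>", [(t.text, t.shape_) for t in doc])
```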
-
It's hard to say what the right tokenization is for complex sequences like this. The English tokenizer was designed to handle slashes with a particular awareness of dates, like 12/01/71, which are easier to handle if treated as a single token. If your data uses slashes in a particular way, or less predictably, it could certainly make sense to customize the tokenizer behavior there.
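A minimal sketch of one way to customize it, assuming spaCy v3: extend the default infix patterns and recompile them with spacy.util.compile_infix_regex. The added pattern here (split on any run of slashes between alphanumeric characters) is illustrative, not a spaCy default:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Extend the default English infix patterns with a rule that splits on
# one or more '/' characters between alphanumeric characters. The exact
# pattern is an assumption for illustration; tighten it to fit your data.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9A-Za-z])/+(?=[0-9A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("12/AB/A3456-7")])  # slashes split off
print([t.text for t in nlp("AB//CD")])         # "//" now split as well
```

Note that this broad pattern would also split dates like 12/01/71 into [12] [/] [01] [/] [71]; if you want to keep the default date behavior, exclude digit/digit contexts from the pattern.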