Pipeline issues when text has extra spaces between words #10726

jordireinsma · 2022-04-28T18:48:34Z

I'm currently stuck on a problem that comes from the way tokenization works (pt_core_news_sm model):

Some of our text inputs have extra spaces between words, e.g.

Eu   tenho  12 anos.

We verified that spaCy pipelines output different POS tags, dependency labels, NER labels, etc. when the text contains such extra spaces, using this test:

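from typing import List

import spacy
from spacy.tokens import Doc, Token

nlp = spacy.load("pt_core_news_sm")

def filter_spaces(doc: Doc) -> List[Token]:
    # Drop whitespace-only tokens so the three docs line up token by token.
    return [token for token in doc if not token.is_space]

def get_infos(token: Token) -> tuple:
    # Annotations that we expect to be independent of the amount of whitespace.
    return (
        token.text,
        token.pos_,
        token.tag_,
        token.dep_,
        token.morph,
        token.ent_type_,
    )

# The same words, separated by one, two and three spaces.
text1 = "Bolha d'água?!?"
text2 = "  ".join(text1.split())
text3 = "   ".join(text1.split())

doc1 = filter_spaces(nlp(text1))
doc2 = filter_spaces(nlp(text2))
doc3 = filter_spaces(nlp(text3))

# Print every token whose annotations differ between the three variants.
for t1, t2, t3 in zip(doc1, doc2, doc3):
    x, y, z = get_infos(t1), get_infos(t2), get_infos(t3)
    if x != y or x != z or y != z:
        print(x, y, z)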

We cannot simply strip the extra spaces in a preprocessing step, because we need to report to the end user the position in the original text where certain attributes (ent_type, for example) are located.

How do we customize the spaCy Tokenizer to apply something like filter_spaces, so that the downstream pipeline components are not affected by the SPACE tokens in the Doc object? In other words, how can we be sure that get_infos returns the same results for non-space tokens regardless of extra spaces in the text?

adrianeboyd · 2022-04-29T06:15:59Z

The only way to get the exact same results is to preprocess the text for spacy while keeping track of the modifications, and then map the annotations back to the version of the text with whitespace.
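
As an illustration of this approach, here is a minimal sketch using only the standard library. It assumes that "preprocessing" means collapsing every run of whitespace to a single space and dropping leading and trailing whitespace. The helper names collapse_whitespace and to_original_span are hypothetical and not part of spaCy.

```python
import re
from typing import List, Tuple


def collapse_whitespace(text: str) -> Tuple[str, List[int]]:
    """Collapse whitespace runs to one space and strip the ends.

    Returns the cleaned text and a list where offsets[i] is the index in
    the original text of the character at cleaned[i].
    """
    cleaned_chars: List[str] = []
    offsets: List[int] = []
    for match in re.finditer(r"\S+", text):
        if cleaned_chars:
            # One space stands in for the whole whitespace run; map it to
            # the first whitespace character after the previous word.
            cleaned_chars.append(" ")
            offsets.append(offsets[-1] + 1)
        for i in range(match.start(), match.end()):
            cleaned_chars.append(text[i])
            offsets.append(i)
    return "".join(cleaned_chars), offsets


def to_original_span(offsets: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a non-empty half-open [start, end) span of the cleaned text
    back to a half-open span of the original text."""
    if start >= end:
        raise ValueError("span must be non-empty")
    return offsets[start], offsets[end - 1] + 1


text = "Eu   tenho  12 anos."
cleaned, offsets = collapse_whitespace(text)
# Characters 9..11 of the cleaned text are "12"; map them back.
orig_start, orig_end = to_original_span(offsets, 9, 11)
print(repr(cleaned), repr(text[orig_start:orig_end]))
```

With a spaCy pipeline, the cleaned text would be passed to nlp, and character offsets such as ent.start_char and ent.end_char would be passed through to_original_span to locate each entity in the original input.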

2 replies

jordireinsma Apr 30, 2022
Author

Hey Adriane, thanks for the fast reply! I'm currently going down the path of preprocessing and keeping track of the mapping to the original positions, even if it sounds like monkey patching.

Just another related question: is there any way to have a custom Tokenizer which outputs a Doc where the space tokens are skipped, without changing the original token.idx and span.start/end and all other position-related attributes?

adrianeboyd May 2, 2022

No, it's a pretty fundamental design choice that the spacy Doc tokens are stored directly based on the character offsets from the text, and that space tokens are always included. (It's part of the reason behind the library name!)

You could write a custom tokenizer that modifies the input text while storing the original offsets / trailing whitespace as custom attributes in order to be able to reconstruct the original form you want just from within the Doc, but the final output would need to be something other than a spacy Doc object. (Something like Doc._.original_format as a custom getter that uses Token._.trailing_whitespace to reinsert the whitespace without annotation.)
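
A rough sketch of what such a tokenizer could look like follows. It is one possible reading of the suggestion, not an official recipe. The wrapper class and the extension names original_idx, trailing_whitespace, leading_whitespace and original_format are assumptions chosen for illustration, and a blank "pt" pipeline is used so that no trained model is needed.

```python
import re

import spacy
from spacy.tokens import Doc, Token

# Hypothetical custom attributes that carry the original layout.
Token.set_extension("original_idx", default=None, force=True)
Token.set_extension("trailing_whitespace", default="", force=True)
Doc.set_extension("leading_whitespace", default="", force=True)
Doc.set_extension(
    "original_format",
    getter=lambda doc: doc._.leading_whitespace
    + "".join(t.text + t._.trailing_whitespace for t in doc),
    force=True,
)


class WhitespaceCollapsingTokenizer:
    """Wraps an existing tokenizer (hypothetical helper, not part of spaCy)."""

    def __init__(self, base_tokenizer):
        self.base_tokenizer = base_tokenizer

    def __call__(self, text: str) -> Doc:
        # Tokenize a copy of the text with every whitespace run collapsed.
        doc = self.base_tokenizer(" ".join(re.findall(r"\S+", text)))
        # Original offset of each non-whitespace character, in order.
        char_starts = [m.start() for m in re.finditer(r"\S", text)]
        consumed = 0  # non-whitespace characters assigned to tokens so far
        prev, prev_end = None, 0
        for token in doc:
            start = char_starts[consumed]
            consumed += len(token.text)
            token._.original_idx = start
            if prev is not None:
                prev._.trailing_whitespace = text[prev_end:start]
            prev, prev_end = token, start + len(token.text)
        if prev is None:
            doc._.leading_whitespace = text
        else:
            prev._.trailing_whitespace = text[prev_end:]
            doc._.leading_whitespace = text[: doc[0]._.original_idx]
        return doc


nlp = spacy.blank("pt")
nlp.tokenizer = WhitespaceCollapsingTokenizer(nlp.tokenizer)

text = "Eu   tenho  12 anos."
doc = nlp(text)
print([(t.text, t._.original_idx) for t in doc])
print(doc._.original_format == text)
```

Note that doc.text and token.idx still refer to the collapsed text in this sketch. Only the custom attributes point back to the original input, which matches the caveat above that the original form has to be reconstructed from something other than the Doc's own text.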
