Spacy-Stanza document creation fails. #10478

SiamakShams · 2022-03-10T22:00:32Z

Hi,
I've successfully created a spacy_stanza pipeline with this command: nlp = spacy_stanza.load_pipeline(name='fa', lang='fa', dir=STANZA_MODEL_DIR). But when I create a doc object with doc = nlp(text), I hit a problem in certain cases.
If the text passed to doc = nlp(text) contains �, the U+FFFD REPLACEMENT CHARACTER, document creation halts and dumps the document tokens up to and including the �.

I've repeated the same process with vanilla spaCy (having trained the pipeline myself, as there is no official spaCy FA pipeline) on the same document, and the doc object was created successfully.
I've tried every way I could think of to remove the U+FFFD REPLACEMENT CHARACTER from my text, including replacing it with a space, to no avail.

Can someone shed some light on this, please?

Thanks

polm · 2022-03-11T04:44:24Z

Small note about formatting: code should be enclosed in backticks to make it monospace (`doc = nlp(text)`), not underscores/asterisks, which make it bold and italic.

polm · 2022-03-11T04:48:52Z

> If the text passed to doc = nlp(text) contains �, the U+FFFD REPLACEMENT CHARACTER, document creation halts and dumps the document tokens up to and including the �.

Are you getting an error message? If so could you share it?

> I've tried every way I could think of to remove the U+FFFD REPLACEMENT CHARACTER from my text, including replacing it with a space, to no avail.

If you're replacing it with a space and it's not working that sounds very odd. Is it possible that your source text doesn't actually contain U+FFFD, but instead another character that is just being rendered that way on your system?

I was able to execute this code without issue.

import stanza
import spacy_stanza

# Download the stanza model if necessary
stanza.download("en")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")

doc = nlp("test �")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)

So it doesn't seem to be a general error with that character or anything.
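
To check whether the character in your source text really is U+FFFD, and not something else that just renders as �, you could print the code point and Unicode name of each character near the spot where processing stops. The snippet below is only a rough sketch using the standard library: `describe_chars` and `strip_replacement_char` are hypothetical helper names made up for illustration, and the sample string stands in for your own text.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def describe_chars(text):
    """Return one (index, code point, Unicode name) tuple per character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
    ]


def strip_replacement_char(text, substitute=" "):
    """Replace every U+FFFD in text with the given substitute string."""
    return text.replace(REPLACEMENT_CHAR, substitute)


# Placeholder input; substitute the text that fails in your pipeline.
sample = "test \ufffd"

# Show exactly which code points the string contains.
for index, code_point, name in describe_chars(sample):
    print(index, code_point, name)

# Replace U+FFFD and confirm it is gone before passing the text to nlp().
cleaned = strip_replacement_char(sample)
print(REPLACEMENT_CHAR in cleaned)
```

If the character at the failing position shows a code point other than U+FFFD, that would explain why replacing U+FFFD had no effect.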
