Spacy-Stanza document creation fails. #10478
Replies: 2 comments
-
Small note about formatting: code should be enclosed in backticks to make it monospace.
-
Are you getting an error message? If so could you share it?
If you're replacing it with a space and it's not working that sounds very odd. Is it possible that your source text doesn't actually contain U+FFFD, but instead another character that is just being rendered that way on your system? I was able to execute this code without issue.
So it doesn't seem to be a general error with that character or anything.
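One way to check which character the text really contains is to print its code points rather than trust how it renders. The following is only a minimal sketch using the standard library; `describe_suspicious_chars` is a hypothetical helper name, not part of spaCy or spacy-stanza, and the set of categories it flags is an assumption about what is worth inspecting.

```python
import unicodedata

# Hypothetical helper (not a spaCy or spacy-stanza API): report characters that
# are U+FFFD or fall in Unicode categories that often indicate damaged input.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}  # control, private use, unassigned, surrogate
ORDINARY_WHITESPACE = {"\n", "\r", "\t"}


def describe_suspicious_chars(text):
    """Return a list of (index, code point, Unicode name) tuples."""
    report = []
    for index, char in enumerate(text):
        if char in ORDINARY_WHITESPACE:
            continue
        is_replacement = char == "\ufffd"
        if is_replacement or unicodedata.category(char) in SUSPICIOUS_CATEGORIES:
            name = unicodedata.name(char, "<unnamed>")
            report.append((index, f"U+{ord(char):04X}", name))
    return report


if __name__ == "__main__":
    sample = "ab\ufffdc"  # stand-in for the real source text
    for entry in describe_suspicious_chars(sample):
        print(entry)
```

If the report shows a code point other than U+FFFD at the position where processing stops, then that character, not the replacement character, is the one to investigate.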
-
Hi,
I've successfully created a spacy_stanza pipeline with this command: nlp = spacy_stanza.load_pipeline(name='fa', lang='fa', dir=STANZA_MODEL_DIR). But when I create a doc object using doc = nlp(text), I hit a problem in certain cases.
If the text passed to doc = nlp(text) contains �, which is U+FFFD REPLACEMENT CHARACTER, the document creation process halts and dumps the document tokens up to and including the �.
I've repeated the same process on the same document with vanilla spaCy (using a pipeline I trained myself, since there is no official spaCy FA pipeline) and created a doc object successfully.
I've tried every way I could think of to remove the U+FFFD REPLACEMENT CHARACTER from my text, including replacing it with a space, to no avail.
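For reference, the kind of replacement described above can be sketched as follows. This is an illustrative reconstruction under assumptions, not the exact code that was run: `strip_replacement_char` is a hypothetical helper name, and the sample string merely stands in for the real text.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def strip_replacement_char(text, substitute=" "):
    """Return text with every U+FFFD swapped for `substitute` (a space by default)."""
    return text.replace(REPLACEMENT_CHAR, substitute)


if __name__ == "__main__":
    sample = "سلام\ufffdدنیا"  # stand-in for the real document text
    cleaned = strip_replacement_char(sample)
    print(REPLACEMENT_CHAR in cleaned)
    # doc = nlp(cleaned)  # nlp is the spacy_stanza pipeline loaded earlier
```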
Can someone shed some light on this, please?
Thanks