Text tokenizer is classifying the letter "O" as punctuation. #13226
Replies: 2 comments
Hi! If I run this sample code:
It gives me
Which is what you want: the punctuation is removed, but no other characters are. On your end, I can imagine that either your text contains non-standard characters (that may look like an "o"), or perhaps you're using a different model as the tagger.
BTW - note that your code replaces all instances of the token text with '', instead of just the one token that is classified as PUNCT. This is pretty error-prone, as one wrong FP by the tagger could lead to many FP hits in your text.
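To make the point about replacing all instances more concrete, here is a minimal sketch in plain Python with no spaCy dependency. The token list and its PUNCT flags are made up to mimic a tagger that wrongly labels one standalone "o" as punctuation, and both function names are hypothetical.

```python
# Hypothetical tagger output: (token text, trailing whitespace, tagged as PUNCT?).
# The standalone "o" is a deliberate false positive.
tokens = [
    ("o", " ", True),
    ("Python", " ", False),
    ("developer", "", False),
    (",", " ", True),
    ("London", "", False),
]

# Rebuild the original string from the tokens.
text = "".join(tok + ws for tok, ws, _ in tokens)


def strip_by_replace(text, tokens):
    """Error-prone: removes every occurrence of each PUNCT token's text."""
    for tok, _, is_punct in tokens:
        if is_punct:
            text = text.replace(tok, "")
    return text


def strip_by_token(tokens):
    """Safer: drops only the flagged tokens and keeps their whitespace."""
    kept = "".join(("" if is_punct else tok) + ws for tok, ws, is_punct in tokens)
    return kept.strip()


print(repr(strip_by_replace(text, tokens)))
print(repr(strip_by_token(tokens)))
```

In the first function a single false positive strips every "o" from the text, while in the second the damage stays confined to the one mislabelled token. With a real spaCy `Doc`, the per-token variant would presumably be built from `token.pos_` and `token.whitespace_` rather than from tuples.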
The text is being read in from a PDF, so the character looks like an "o" but is being treated as punctuation. What punctuation mark would display like an "o"? I also don't follow this part: "BTW - note that your code replaces all instances of the token text with '', instead of just the one token that is classified as PUNCT. This is pretty error-prone, as one wrong FP by the tagger could lead to many FP hits in your text." Your help is much appreciated.
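One way to check what a character that looks like an "o" really is would be to print its code point, Unicode category, and Unicode name. The sketch below uses only the standard library. The helper name `describe_chars` and the sample characters are illustrative assumptions, not taken from the PDF in question.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode category, Unicode name) per character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# A few characters that can render much like a Latin "o" (not exhaustive):
# Latin o, Greek omicron, Cyrillic o, white circle, bullet, masculine ordinal.
samples = "o\u03bf\u043e\u25cb\u2022\u00ba"
for row in describe_chars(samples):
    print(row)
```

Running `describe_chars` on the suspicious span of the extracted text would show whether it is a plain `LATIN SMALL LETTER O` or something else. Categories starting with `P` (punctuation) or `S` (symbol) would indicate a look-alike character. If it turns out to be a plain Latin "o", the tagger rather than the text would be the more likely cause.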
I am using the following code:
with the following raw text, read from a PDF using pyPDF:
It is being converted to:
I reported this behavior to the creator of the package I am using, Resume Matcher, but I can keep the letter "O" in the output using this workaround:
There may be an issue with how the text is being read in by pyPDF, but the text returned by the pyPDF function looks correct.
Info about spaCy
Python 3.9.0
Windows 10