Text tokenizer is classifying the letter "O" as punctuation. #13226

dorncg18 · 2024-01-05T17:32:24Z

I am using the following code:

doc = nlp(text)
for token in doc:
    if token.pos_ == 'PUNCT':
        text = text.replace(token.text, '')

with the following raw text, read from a PDF using pyPDF

"with a proven track record of delivering strategic financial solutions for clients. Highly accomplished"

it is being converted to

"with a prven track recrd f delivering strategic financial slutins fr clients Highly accmplished"

I reported this behavior to the creator of Resume Matcher, the package I am using. I can keep the letter "O" in the output using this workaround:

doc = nlp(text)
for token in doc:
    if token.pos_ == 'PUNCT' and token.text != 'o':
        text = text.replace(token.text, '')

The issue may lie in how the text is read in by pyPDF, but the text returned by the pyPDF function looks correct.

Info about spaCy

Python 3.9.0
Windows 10

spaCy version: 3.6.0
Platform: Windows-10-10.0.19041-SP0
Python version: 3.9.0
Pipelines: en_core_web_md (3.6.0), en_core_web_sm (3.6.0)

svlandeg · 2024-01-08T15:56:06Z

Hi!

If I run this sample code:

    import spacy

    nlp = spacy.load('en_core_web_lg')
    text = "with a proven track record of delivering strategic financial solutions for clients. Highly accomplished"
    doc = nlp(text)
    for token in doc:
        if token.pos_ == 'PUNCT':
            # Note: str.replace removes every occurrence of the token text
            text = text.replace(token.text, '')
    print(text)

It gives me

with a proven track record of delivering strategic financial solutions for clients Highly accomplished

Which is what you want: the punctuation is removed, but no other characters are.

On your end, I can imagine that either your text contains non-standard characters (that may look like an "o"), or perhaps you're using a different model as the tagger.

BTW, note that your code replaces all instances of the token text with '', instead of just the one token that is classified as PUNCT. This is pretty error-prone, as a single false positive (FP) from the tagger could lead to many false-positive removals in your text.
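
To make the last point concrete, here is a minimal sketch that runs without spaCy. The `tokens` list is a hypothetical stand-in for a tagged `Doc` (in spaCy the three fields would come from `token.text`, `token.idx` and `token.pos_`), and the leading "o" is deliberately tagged `PUNCT` to mimic a single tagger false positive. The function names are made up for illustration.

```python
import re

text = "o proven track record of clients."

# Stand-in for a tagged spaCy Doc: (token text, start offset, tag).
# The leading "o" is tagged PUNCT on purpose, to mimic one tagger false positive.
tokens = []
for m in re.finditer(r"\w+|[^\w\s]", text):
    is_punct = m.start() == 0 or not m.group().isalnum()
    tokens.append((m.group(), m.start(), "PUNCT" if is_punct else "WORD"))


def strip_with_replace(text, tokens):
    """Pattern from the thread: removes every occurrence of the token text."""
    for tok_text, _, tag in tokens:
        if tag == "PUNCT":
            text = text.replace(tok_text, "")
    return text


def strip_by_offset(text, tokens):
    """Removes only the flagged spans, located by their character offsets."""
    drop = {
        i
        for tok_text, start, tag in tokens
        if tag == "PUNCT"
        for i in range(start, start + len(tok_text))
    }
    return "".join(ch for i, ch in enumerate(text) if i not in drop)


print(repr(strip_with_replace(text, tokens)))  # every "o" in the text is gone
print(repr(strip_by_offset(text, tokens)))     # only the two flagged tokens are gone
```

With a real `Doc`, a comparable effect can probably be had by rebuilding the string from the tokens that are kept, for example `"".join(t.text_with_ws for t in doc if t.pos_ != "PUNCT")`, which never touches other occurrences of the same characters.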


dorncg18 (Author) · 2024-01-09T05:48:35Z

The text is being read in from a PDF, so the character looks like an "o" but is being treated as punctuation. What punctuation mark would display like an "o"?

I don't follow this part: "BTW - note that your code replaces all instances of the token text with '', instead of just the one token that is classified as PUNCT. This is pretty error-prone, as one wrong FP by the tagger could lead to many FP hits in your text."

Your help is much appreciated.
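
One way to check for look-alike characters is to list every non-ASCII character in the extracted text together with its Unicode name. The sketch below uses only the standard library; `describe_non_ascii` is a hypothetical helper name, and the sample string is made up (it contains a Cyrillic "о" and a Greek omicron), not taken from the PDF in question. Note also that the workaround above compares against a plain 'o' and reportedly works, which suggests the flagged token may be an ordinary ASCII "o" (for instance one standing alone, like a list bullet) rather than a look-alike. That is an inference from the thread, not something confirmed in it.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, character, code point, Unicode name) for each non-ASCII character."""
    report = []
    for i, ch in enumerate(text):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "UNKNOWN")
            report.append((i, ch, f"U+{ord(ch):04X}", name))
    return report


# Hypothetical sample: looks like "proven track record" but hides two look-alikes.
sample = "pr\u043even track rec\u03bfrd"
for row in describe_non_ascii(sample):
    print(row)
```

If this prints nothing for the real PDF text, the characters are plain ASCII, and the next thing to inspect would be which tokens the tagger labels `PUNCT`, for example by printing `token.text` and `token.pos_` for each token.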
