Word "id" represented as two tokens #11154

sillentkill21 · 2022-07-18T15:14:27Z

How to reproduce the behaviour

`import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("id of something")

for token in doc:
    print(token.text, token.pos_)`

Your Environment

Operating System: Windows 10 64bit/Ubuntu 20.04
Python Version Used: 3.8
spaCy Version Used: 3.2.3
Language models: en_core_web_sm, en_core_web_md

Additional info

I tested the word "id" in several contexts and got the same result each time: it is split into two tokens, "i" and "d". Is this the expected behavior?

Answered by polm

polm · 2022-07-19T03:48:16Z

This is intended behavior, see #10455. The English tokenizer treats "id" as a typo for "I'd" by default, though you can change it.
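For anyone wondering what "change it" could look like, here is a minimal sketch rather than an official recipe. It assumes the split comes from tokenizer special-case rules keyed on "id" and "Id" (an assumption about the English exception table, not something stated in this thread), and it uses `spacy.blank("en")` so that no model download is needed. With a loaded pipeline such as `en_core_web_sm`, the same assignment to `nlp.tokenizer.rules` should apply.

```python
import spacy

# A blank English pipeline carries the default English tokenizer rules.
nlp = spacy.blank("en")

# Default behavior for comparison (this thread reports "i" + "d" on spaCy 3.2.3).
print([t.text for t in nlp("id of something")])

# Rebuild the special-case table without the entries for "id" / "Id".
# The key names are an assumption; missing keys are simply ignored by the filter.
nlp.tokenizer.rules = {
    key: value
    for key, value in nlp.tokenizer.rules.items()
    if key not in {"id", "Id"}
}

tokens = [t.text for t in nlp("id of something")]
print(tokens)
```

Note that this also stops "id" from being read as a typo for "I'd", so it is probably best suited to text where "id" usually means an identifier.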
