Need Help on Custom Tokenization using Spacy! #9721

karndeepsingh · 2021-11-22T11:08:41Z


Hi,
I have been trying to tokenize text sentences with spaCy, and I am having trouble separating two words that appear together without a space. For example, in my sentence the words "Borrower" and "is" appear together with no space between them. I want them split into two tokens: "Borrower" as one and "is" as another. Please help me build a tokenization rule that checks every Doc object and splits such joined words into two separate words. The following image shows the rule I wrote to tokenize the text:


adrianeboyd · 2021-11-24T10:22:40Z


Sorry, I can't tell what the input text looks like from the formatting in the question. If you have run-on words like "borroweris", then you might want to look into spell-checking libraries that identify run-ons. You could run one either as a preprocessing step before spaCy (inserting whitespace), or as a postprocessing step after the tokenizer that retokenizes those tokens (which you could do while preserving the original whitespace). The rule-based tokenizer can handle predictable cases like "ca n't", but not arbitrary run-on words.
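To illustrate the preprocessing option, here is a minimal sketch in plain Python that inserts whitespace into run-on words before the text reaches spaCy. It is only an assumption about how such a step might look: `KNOWN_WORDS`, `split_run_on` and `insert_missing_spaces` are hypothetical names, the tiny word list stands in for a real dictionary or spell-checking library, and it only handles two words fused together.

```python
import re

# Hypothetical vocabulary for illustration only; in practice this would come
# from a real word list or a spell-checking library.
KNOWN_WORDS = {"the", "borrower", "is", "liable", "for", "loan"}


def split_run_on(word, vocab=KNOWN_WORDS):
    """Return [left, right] if `word` is two known words fused together,
    otherwise return [word] unchanged."""
    if word.lower() in vocab:
        return [word]
    # Try split points from right to left so the longest left part wins.
    for i in range(len(word) - 1, 0, -1):
        left, right = word[:i], word[i:]
        if left.lower() in vocab and right.lower() in vocab:
            return [left, right]
    return [word]


def insert_missing_spaces(text, vocab=KNOWN_WORDS):
    """Insert a space inside every alphabetic run that looks like a run-on."""
    # Only alphabetic runs are rewritten; punctuation, digits and existing
    # whitespace are left untouched.
    return re.sub(
        r"[A-Za-z]+",
        lambda m: " ".join(split_run_on(m.group(0), vocab)),
        text,
    )


cleaned = insert_missing_spaces("The Borroweris liable for the loan.")
print(cleaned)  # The Borrower is liable for the loan.
```

The cleaned string can then be passed to the spaCy pipeline as usual. Because this changes the text, character offsets in the resulting Doc will no longer match the original input; if that matters, the postprocessing route that retokenizes after the tokenizer is the better fit.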
