Does spaCy cover all the functionality that e.g. UiPath Document Understanding, Google Document AI & Microsoft Syntex have? #12880

Moorky · 2023-08-02T11:43:12Z

Hi everyone, I'm completely new to this framework, but I love the idea of digitizing documents and extracting important data from text and from text in images (e.g. scanned documents). The framework is huge, and I'm not entirely sure I'm in the right place for the task I want to automate.

Basically the title is my question, but I can also give a use case:

I have multiple unstructured/semi-structured documents that consist of free text (letters).
I want to extract specific data from a document and save it in a database.
The data I need is the name of a specific person, which always appears somewhere near the beginning of the free text, and that person's birth date, which is mentioned close to the name. The birth date must not be confused with the date of the letter, which is somewhere at the top right of the letter.
I also want to extract the name of the person who wrote the letter, which is usually at the end of the letter but can sometimes be somewhere near the beginning of the document.
Then I want to extract whole paragraphs from the letter and label/classify them.

This was possible with UiPath Document Understanding, but I would like to know whether it is also possible with spaCy.

I appreciate any help! :)

Answered by adrianeboyd

Aug 3, 2023

No, not 100%. spaCy is an NLP library that focuses just on text analysis. It doesn't include all the functionality for this kind of document processing out of the box, like representing the location of the text on a page or dealing with OCR errors.

Many of the tasks you mentioned could probably be implemented using spaCy, but you would be writing rules and/or training statistical models from scratch for these particular tasks.

Please see: https://spacy.io/usage/spacy-101
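
To make the "writing rules" option concrete, here is a minimal sketch of a rule-based extractor built on spaCy's `Matcher`. It is an illustration only, not something taken from the answer: the sentence template ("First Last, born on day Month year"), the pattern, the sample letter, and the helper `extract_person` are all assumptions, and real letters would need more patterns or a trained model.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, so no trained model has to be downloaded.
nlp = spacy.blank("en")

# Hypothetical rule: "<First> <Last>, born on <day> <Month> <year>".
BIRTH_PATTERN = [
    {"IS_TITLE": True},   # first name
    {"IS_TITLE": True},   # last name
    {"ORTH": ","},
    {"LOWER": "born"},
    {"LOWER": "on"},
    {"IS_DIGIT": True},   # day
    {"IS_TITLE": True},   # month name
    {"SHAPE": "dddd"},    # four-digit year
]

matcher = Matcher(nlp.vocab)
matcher.add("PERSON_WITH_BIRTH_DATE", [BIRTH_PATTERN])


def extract_person(text):
    """Return (name, birth_date) for the first match, or None if nothing matches."""
    doc = nlp(text)
    for _match_id, start, end in matcher(doc):
        span = doc[start:end]
        return span[0:2].text, span[5:8].text
    return None


# Made-up sample letter; the letter date at the top is not preceded by
# "born on", so the rule does not pick it up.
letter = (
    "Berlin, 2 August 2023\n\n"
    "Dear colleague,\n\n"
    "I am writing about Jane Doe, born on 12 March 1985, who visited us last week."
)
print(extract_person(letter))
```

Anchoring the date to the phrase "born on" is what keeps the birth date apart from the letter date in this sketch; other phrasings would each need their own pattern.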

adrianeboyd · 2023-08-03T05:32:52Z

1 reply

Moorky Aug 3, 2023
Author

Thank you for your answer.

If I first get the document structured with, for example, unstructured-io or deepdoctection, can I then use spaCy to extract the key data, or am I still missing something?
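
As a rough illustration of what the spaCy side could look like once plain text is available, the sketch below splits a letter into paragraphs and labels each one by keyword using `PhraseMatcher`. Everything specific here is an assumption: the labels, the trigger phrases, the blank-line paragraph convention, and the helper `label_paragraphs`. It does not use or depend on the output format of unstructured-io or deepdoctection.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Tokenizer-only pipeline; no trained model is required for phrase matching.
nlp = spacy.blank("en")

# Hypothetical labels and trigger phrases; a real project would curate these
# or train a text classifier instead.
KEYWORDS = {
    "REQUEST": ["please send", "we request"],
    "PAYMENT": ["invoice", "payment"],
}

# attr="LOWER" makes the phrase comparison case-insensitive.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for label, phrases in KEYWORDS.items():
    phrase_matcher.add(label, [nlp.make_doc(phrase) for phrase in phrases])


def label_paragraphs(text):
    """Split on blank lines and return a list of (label, paragraph) pairs."""
    labelled = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        matches = phrase_matcher(nlp.make_doc(paragraph))
        # Use the label of the first matching phrase, or "OTHER" if none match.
        label = nlp.vocab.strings[matches[0][0]] if matches else "OTHER"
        labelled.append((label, paragraph))
    return labelled


sample = (
    "Please send the signed form by Friday.\n\n"
    "The invoice is attached.\n\n"
    "Kind regards"
)
for label, paragraph in label_paragraphs(sample):
    print(label, "|", paragraph)
```

Keyword rules like these are brittle; they mainly show where a trained classifier would plug in.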
