Review of legal documents in PDF - suggestion needed. #12361

DoubleCortado · 2023-03-03T13:05:59Z


Hello everyone.
I would like to set up a model to review documents generated by our legal department. Could you let me know which spaCy tool would be most suitable for the problems below? If my description is not sufficient, please let me know and I will provide more details.

1. Recognizing tables in a PDF and checking that each total row equals the sum of the rows above it. The tables come from Excel and are pasted into a Word file, which is then converted to PDF. I have to make sure the total rows are always correct.
2. Date format consistency throughout the report. Say the expected date format is December 31, 2022: I need to know whether any dates appear in a different format. I presume I can use the en_core_web_lg model here and pull dates using the DATE label.
3. Company name consistency: for example, the company name is EVENTS S.A. in one place and EVENT SA somewhere else. What would be the best way to perform such a check?

I hope the above can be solved using spaCy, as I would be really excited to explore this universe.

Below is a screenshot of the tables mentioned in the first point, with total rows in bold:

Answered by thomashacker

Mar 6, 2023

Hey,
Thanks for your question. Here are my thoughts on your three use cases:

1. Unfortunately, spaCy isn't capable of reading in PDFs yet.
2. Sure, you can use the pretrained models to extract the DATE entities and check their formatting. But I think using regex to find all date formats might be more helpful.
3. Finding and extracting company names is a great use case for spaCy. You can use the pretrained NER models to extract company names and check if their names are different. You can also fine-tune models with your own data, or even train a model from scratch. You can find more information here in our docs.
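
To make the regex suggestion for dates a bit more concrete, here is a minimal sketch using only Python's standard `re` module. It assumes the report text has already been extracted from the PDF by some other tool. The pattern list and the function name `check_dates` are illustrative assumptions, not part of spaCy, and the patterns would need extending for the formats that actually occur in your documents.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Expected format from the question, e.g. "December 31, 2022".
EXPECTED = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")

# A few other common formats (an assumed, incomplete list).
OTHER_FORMATS = [
    re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b"),   # 31/12/2022, 12-31-22
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                 # 2022-12-31
    re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"),   # 31 December 2022
    re.compile(                                           # Dec. 31, 2022
        r"\b(?:Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec)\.? \d{1,2}, \d{4}\b"
    ),
]


def check_dates(text):
    """Return (expected, unexpected) date strings found in text."""
    expected = [m.group(0) for m in EXPECTED.finditer(text)]
    hits = []
    for pattern in OTHER_FORMATS:
        hits.extend((m.start(), m.group(0)) for m in pattern.finditer(text))
    # Sort by position so results follow the order of the document.
    unexpected = [value for _, value in sorted(hits)]
    return expected, unexpected


sample = (
    "Signed on December 31, 2022. Effective 31/12/2022 and renewed "
    "2023-01-15. Due Dec. 31, 2023."
)
expected, unexpected = check_dates(sample)
print(expected)
print(unexpected)
```

Anything that lands in the second list is a candidate for manual review. The DATE entities from a pretrained model could be run through the same `EXPECTED` pattern as a complementary check.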

I hope this was helpful!
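
For the company-name point, one possible way to compare the extracted names is sketched below. It assumes the names have already been collected, for example from the ORG entities of a pretrained NER model, and it uses only the standard library (`re` and `difflib`). The suffix list, the 0.85 threshold, and the helper names `normalize` and `group_variants` are assumptions for illustration, not a spaCy API.

```python
import re
from difflib import SequenceMatcher

# Legal-form suffixes ignored when comparing (assumed list; extend as needed).
LEGAL_SUFFIXES = {"sa", "ltd", "inc", "llc", "gmbh", "plc"}


def normalize(name):
    """Lowercase, strip punctuation, and drop legal-form suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())  # "S.A." becomes "sa"
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def group_variants(names, threshold=0.85):
    """Group surface forms whose normalized names are nearly identical.

    Returns only groups with more than one distinct spelling, i.e. the
    likely inconsistencies.
    """
    groups = []  # list of (normalized key, [surface forms])
    for name in names:
        key = normalize(name)
        for group_key, forms in groups:
            if SequenceMatcher(None, key, group_key).ratio() >= threshold:
                if name not in forms:
                    forms.append(name)
                break
        else:
            # No similar group found, so start a new one.
            groups.append((key, [name]))
    return [forms for _, forms in groups if len(forms) > 1]


names = ["EVENTS S.A.", "EVENT SA", "Acme Ltd", "EVENTS S.A.", "Acme Ltd"]
print(group_variants(names))
```

The greedy grouping compares each name only against the first member of every group, which keeps the sketch short. A stricter check might compare against all members, or rely on a reviewed list of canonical company names.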


thomashacker · 2023-03-06T14:54:44Z

