Help with parsing financial statements using NER and spaCy #7733

jamiehannaford · 2021-04-09T18:40:02Z

Hi folks 👋

I'm new to spaCy and NLP in general and would really appreciate some help approaching a specific problem I'm trying to solve.

Problem

I would like to design an NLP model to extract various kinds of "hidden" expenses from 10-K and 10-Q financial statements. I've come up with about 7 different expense categories (e.g. restructuring costs, merger and acquisition costs, etc.) and for each one I have a list of terms/synonyms that different companies use for it. I'm thinking that I can create a new entity for each expense category, and then train a model to recognize each one.

In terms of the data, values are usually hidden in two different areas of the financial statement:

Type 1: Free-form text (footnotes)

Values are nested in sentences. Here are some examples, with the Expense Type and Monetary value indicated.

Exploratory dry-hole costs were $12.7 million, $1.3 million, and $1.0 million for the years ended December 31, 2012, 2011, and 2010, respectively.

2012 includes the recognition of a $3,340 million impairment charge related to the carrying value of Citi's remaining 35% interest in the Morgan Stanley Smith Barney joint venture

During the year ended December 31, 2017, we decided to discontinue the internal development of AMG 899, resulting in an impairment charge of $400 million for the IPR&D asset

Type 2: Table data

SEC statements also contain "structured" data in HTML tables. Some line items, like the first row below, correspond to the expense type I'm looking for:

Item	2020	2019	2018
impairment related to real estate assets(2):	398.2	200	0
research and development	100	200	300
other expenses	20	30	40

Correct value = 398.2
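
As a rough illustration of the table case, here is a minimal sketch that uses only the standard library. It assumes the HTML table has already been parsed into a header and rows, and the alias list (`IMPAIRMENT_ALIASES`) and helper names are hypothetical, not something from a real library. It matches row labels by simple substring, which is an assumption and not the only way to compare labels.

```python
import re

# Hypothetical alias list for one expense category.
IMPAIRMENT_ALIASES = ["impairment"]

# The example table from above, as a header plus rows (e.g. after parsing the HTML).
HEADER = ["Item", "2020", "2019", "2018"]
ROWS = [
    ["impairment related to real estate assets(2):", "398.2", "200", "0"],
    ["research and development", "100", "200", "300"],
    ["other expenses", "20", "30", "40"],
]


def normalize_label(label):
    """Lowercase the label and drop footnote markers like "(2)" and a trailing colon."""
    label = re.sub(r"\(\d+\)", "", label.lower())
    return label.strip(" :")


def find_value(header, rows, aliases, year):
    """Return the value for `year` from the first row whose label contains an alias."""
    col = header.index(year)
    for row in rows:
        label = normalize_label(row[0])
        if any(alias in label for alias in aliases):
            return float(row[col].replace(",", ""))
    return None  # no row matched any alias


print(find_value(HEADER, ROWS, IMPAIRMENT_ALIASES, "2020"))  # 398.2
```

Substring matching is deliberately crude here; a trained model or a similarity score could replace the `any(...)` check without changing the rest.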

Potential solution

I'm thinking about a two-model approach using spaCy:

1. The first model uses NER and defines a new entity for each expense type. Each entity is then trained on all the respective synonyms I already know (e.g. restructuring costs are called "dry-hole costs", "impairment charges"). To build the training set, I would need to manually annotate extracts from historic statements that contain these terms, using Prodigy.
   - For free-form text, it would match the sentence and pass it on for further processing (see 2).
   - For table data, I would loop over each row using BeautifulSoup and pandas, check the first column for a match (e.g. using spaCy's comparison function), and then grab that year's value from the dataframe and finish.
2. For free-form matches, I still need to grab the monetary value (like $100 million) for the correct year (sometimes multiple values are given for various years, see the first example above).
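
For the multi-year case in step 2, one simple heuristic is to pair amounts and years by position when a sentence lists them in parallel ("... $A, $B, and $C for the years ... X, Y, and Z, respectively"). The sketch below is only that heuristic, written with standard-library regexes; the pattern names and the function are hypothetical, and it gives up (returns an empty dict) whenever the counts don't line up.

```python
import re

# Dollar amounts such as "$12.7 million" or "$3,340 million".
AMOUNT_RE = re.compile(
    r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:thousand|million|billion))?"
)
# Four-digit years in the 1900s or 2000s.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def pair_amounts_with_years(sentence):
    """Pair the i-th amount with the i-th year; return {} if the counts differ."""
    amounts = AMOUNT_RE.findall(sentence)
    years = YEAR_RE.findall(sentence)
    if not amounts or len(amounts) != len(years):
        return {}
    return dict(zip(years, amounts))


sentence = (
    "Exploratory dry-hole costs were $12.7 million, $1.3 million, and "
    "$1.0 million for the years ended December 31, 2012, 2011, and 2010, "
    "respectively."
)
print(pair_amounts_with_years(sentence))
```

This only checks that the number of amounts equals the number of years. It does not verify that the amounts actually belong to the expense term, so it does nothing about the pitfall described next.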

One big, potential pitfall with this approach is that sentences like this would cause problems:

We gained $100 million this year, despite facing restructuring charges.

With the two-model process above, model 1 would pass this sentence on (because it contains a known term like "restructuring charges"), and model 2 would then extract $100 million, which is incorrect because that amount doesn't actually correspond to the expense itself.

Is there a better solution here? As I said, I'm new to NLP and spaCy so would really appreciate any advice or resources to learn more about solving these types of key/value problems 🙏

Answered by matthew-e-thomas

matthew-e-thomas · 2021-04-09T19:10:25Z

If you want to go the NER route, you could do it as you propose: use spaCy to label as 'ents' the examples you want it to capture, and leave unlabeled the examples you don't want it to capture (like your example sentence). That becomes your training data, and hopefully the model can learn to tell the difference. There are two things to watch out for. First, you need enough examples: you'd be training the entity recognizer from scratch, so you would need quite a large training set. Second, if pitfall sentences like the example you gave are rare, the model may have trouble learning to recognize them. You would know better, obviously, but if the sentence structure in the examples you want to keep is consistently different from the ones you don't, it should work.
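
To make the labeling idea concrete, here is a small sketch of what such training data could look like. It uses the character-offset format `(text, {"entities": [(start, end, label)]})` that spaCy can consume (for example through `Example.from_dict` in v3). The label name and the `annotate` helper are hypothetical, and in practice the offsets would come from an annotation tool, not from a string search.

```python
# Hypothetical label name for illustration only.
LABEL = "IMPAIRMENT_CHARGE"


def annotate(text, phrase=None, label=LABEL):
    """Return (text, {"entities": [...]}) with character offsets for `phrase`.

    With phrase=None the text is kept as a negative example with no entities.
    """
    if phrase is None:
        return (text, {"entities": []})
    start = text.index(phrase)  # raises ValueError if the phrase is missing
    return (text, {"entities": [(start, start + len(phrase), label)]})


TRAIN_DATA = [
    # Positive: the phrase names an expense whose amount is reported.
    annotate(
        "During the year ended December 31, 2017, we decided to discontinue the "
        "internal development of AMG 899, resulting in an impairment charge of "
        "$400 million for the IPR&D asset",
        phrase="impairment charge",
    ),
    # Negative: the pitfall sentence, deliberately left without entities.
    annotate("We gained $100 million this year, despite facing restructuring charges."),
]

for text, annotations in TRAIN_DATA:
    print(annotations["entities"])
```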

2 replies

jamiehannaford Apr 10, 2021
Author

Yeah, the more I think about it, the more using NER makes sense to recognise the expense type aliases (restructuring costs, impairment charge, impairment cost). I could even use the EntityRuler with a few initial phrases. I can further optimize accuracy by tagging more aliases in Prodigy.
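
A minimal sketch of the EntityRuler idea, assuming spaCy v3 is installed. It uses a blank English pipeline, so no pretrained model is needed. The seed pattern is illustrative only, and the label simply reuses the RESTRUCTURING_COST name from this thread.

```python
import spacy

# Blank English pipeline: tokenizer only, no pretrained model download needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative seed pattern: "restructuring"/"impairment" followed by a
# cost-like noun, matched case-insensitively via the LOWER token attribute.
ruler.add_patterns([
    {
        "label": "RESTRUCTURING_COST",
        "pattern": [
            {"LOWER": {"IN": ["restructuring", "impairment"]}},
            {"LOWER": {"IN": ["cost", "costs", "charge", "charges"]}},
        ],
    },
])

doc = nlp("We faced impairment charges of $200 million in 2020.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Rule-based matches like these could serve as a starting point for annotation, with a statistical model trained later on the corrected output.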

To answer your point:

Second, if the pitfall sentences like the example you gave are rare, the model may have trouble learning to recognize them.

I think the structure of the sentences will be incredibly varied and won't ever be consistent, so I need a smart way to grab the right elements out. I looked into whether EntityLinker could somehow link a given NER label to a year and monetary value, but it seems that the use case for EntityLinker is unique entities (like celebrities or company names). It wouldn't be able to disambiguate sentences like:

We faced impairment charges of $200 million in 2020, $190 million in 2019, and $100 million in 2018 respectively.

For the custom entity label RESTRUCTURING_COST (which maps to "impairment charges") and a given DATE value ("2020"), I want to grab $200 million. Perhaps dependency parsing would be a good fit here? But that seems like I would need to know all the potential sentence structures up front...
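
For this specific "amount in year" phrasing, a plain regex can already pick the value for a given year. The sketch below is a baseline under that assumption, not a replacement for dependency parsing: the pattern and function names are hypothetical, and it only handles amounts that are directly followed by "in <year>".

```python
import re

# An amount immediately followed by "in <year>", e.g. "$200 million in 2020".
AMOUNT_IN_YEAR_RE = re.compile(
    r"(\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:thousand|million|billion))?)"
    r"\s+in\s+((?:19|20)\d{2})\b"
)


def amount_for_year(sentence, year):
    """Return the amount written directly before "in <year>", or None."""
    for amount, found_year in AMOUNT_IN_YEAR_RE.findall(sentence):
        if found_year == year:
            return amount
    return None


sentence = (
    "We faced impairment charges of $200 million in 2020, $190 million in 2019, "
    "and $100 million in 2018 respectively."
)
print(amount_for_year(sentence, "2020"))  # $200 million
```

Any other sentence structure falls through to `None`, which is exactly the limitation of knowing the structures up front.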

👋 @svlandeg Do you think entity linker would be a good fit here? I basically want to match a custom NER label I've trained to a monetary value for a given year.

svlandeg Apr 30, 2021

Hi! Apologies for the late follow-up. But to be honest, I'm really not sure this is the ideal use-case for entity linking. Entity linking is mainly about disambiguating textual mentions according to the context.

Maybe the dependency matcher could be useful in this type of scenario?

DoubleCortado · 2023-04-07T06:11:03Z

@jamiehannaford did you come up with a solution? I have a similar problem, also in the context of financial statements, so I'm very curious.
