Help with parsing financial statements using NER and spaCy #7733
Hi folks 👋 I'm new to spaCy and NLP in general and would really appreciate some help approaching a specific problem I'm trying to solve.

Problem

I would like to design an NLP model to extract various kinds of "hidden" expenses from 10-K and 10-Q financial statements. I've come up with about 7 different expense categories (e.g. restructuring costs, merger and acquisition costs, etc.), and for each one I have a list of the terms/synonyms that different companies use for it. I'm thinking that I can create a new Entity for each expense, and then train a model to recognize each one. In terms of the data, values are usually hidden in two different areas of the financial statement:

Type 1: Free-form text (footnotes)

Values are nested in sentences. Here are some examples, with the Expense Type and Monetary value indicated.
Type 2: Table data

SEC statements also contain "structured" data in HTML tables. Some line items, like the first row below, correspond to the expense type I'm looking for:
Correct value =

Potential solution

I'm thinking about a two-model approach using spaCy:
One big, potential pitfall with this approach is that sentences like this would cause problems:
If the NLP algorithm is split into the above two-model process, model 1 would pass (because it contains a known term like "restructuring charges"), and model 2 would extract

Is there a better solution here? As I said, I'm new to NLP and spaCy so would really appreciate any advice or resources to learn more about solving these types of key/value problems 🙏
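To make the pitfall concrete, here is a minimal rule-based stand-in for the two-stage idea, using only the Python standard library. It is a sketch under assumptions, not the models described above: the term lists, the label names, the `two_stage` helper, and both sample sentences are invented for illustration, and a regular expression stands in for the second model.

```python
import re

# Hypothetical synonym lists; the real lists are not shown in the post.
EXPENSE_TERMS = {
    "RESTRUCTURING": ["restructuring charges", "restructuring costs"],
    "M_AND_A": ["merger and acquisition costs", "acquisition-related costs"],
}

# Matches amounts such as "$12.5 million" or "$3,200".
MONEY_RE = re.compile(
    r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:thousand|million|billion))?",
    re.IGNORECASE,
)


def find_expense_types(sentence):
    """Stage 1: return the labels whose known terms appear in the sentence."""
    lowered = sentence.lower()
    return [
        label
        for label, terms in EXPENSE_TERMS.items()
        if any(term in lowered for term in terms)
    ]


def extract_amounts(sentence):
    """Stage 2: return every monetary amount found in the sentence."""
    return MONEY_RE.findall(sentence)


def two_stage(sentence):
    """Pair every matched expense label with every amount in the sentence."""
    labels = find_expense_types(sentence)
    if not labels:
        return []
    return [(label, amount) for label in labels for amount in extract_amounts(sentence)]


# Invented sentences: the first is a genuine hit, the second is a false positive,
# because the amount belongs to operating income rather than to the expense.
good = "The company recorded restructuring charges of $12.5 million in 2020."
trap = "Excluding restructuring charges, operating income was $40 million."
print(two_stage(good))
print(two_stage(trap))
```

Because the two stages never look at how the term and the amount relate to each other, the second sentence is reported as a restructuring expense. Any replacement for either stage would need some notion of that relationship to avoid the same mistake.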
Replies: 2 comments 2 replies
If you want to go the NER route, you could do it as you propose: use spaCy to label as 'ents' the examples you want it to capture, and leave unlabeled the examples you don't want it to capture (like your example sentence). That becomes your training data, and hopefully the model can learn to tell the difference. I think there are two things to watch out for. First, you need enough examples: you'd be training the entity recognizer from scratch, so you would need quite a large training set. Second, if pitfall sentences like the example you gave are rare, the model may have trouble learning to recognize them. You would know better, obviously, but if the sentence structure in the examples you want to keep is consistently different from the structure in the ones you don't, it should work.
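As a rough illustration of what "label some examples, leave others unlabeled" could look like as data, here is a small standard-library sketch that builds character-offset annotations. The sentences, the `RESTRUCTURING_COST` label, and the `annotate`/`negative` helpers are all assumptions made up for this example, and labeling the amount itself (rather than the expense term) is only one possible choice. Converting these tuples into spaCy's own training format is deliberately left out.

```python
def annotate(text, phrase, label):
    """Return (text, annotations) with one entity span located by substring search."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (text, {"entities": [(start, start + len(phrase), label)]})


def negative(text):
    """Return a training example that deliberately contains no entities."""
    return (text, {"entities": []})


# Invented sentences and a hypothetical label, for illustration only.
TRAIN_DATA = [
    annotate(
        "We recorded restructuring charges of $12.5 million in 2020.",
        "$12.5 million",
        "RESTRUCTURING_COST",
    ),
    # Pitfall-style sentence: mentions the term, but the amount is not the expense.
    negative("Excluding restructuring charges, operating income was $40 million."),
]

for text, annotations in TRAIN_DATA:
    spans = [(text[s:e], label) for s, e, label in annotations["entities"]]
    print(spans)
```

Keeping the unlabeled pitfall sentences in the training set is what gives the model a chance to learn the difference, so it is probably worth collecting them on purpose if they are rare in the filings.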
@jamiehannaford did you come up with a solution? I have a similar problem, also in the context of financial statements, so I'm very curious.