Handling meta content in documents #10426
-
I'd like to find articles/resources about how to extract and use "meta-content" (see below) from documents. However, I don't know how to search for this: in NLP, "metadata" almost always seems to mean "extra data attached to the document" rather than in-document content. The meta-content I'm talking about is intra-document labels, categories, annotations, etc. For example, at the top of a typical memo, the "To:", "From:", etc. fields are labels that refer to and describe other content in the document. [This is a bit different from syntactic co-reference or cross-reference in that these labels live "outside" the actual text content of the document.] I just want to do a literature search to see if/how people have thought about handling this part of the content in NLP processing, but it's hard to search for something when you don't know what people tend to call it.
Replies: 1 comment
-
If I understand your question correctly, the closest term for this in the machine learning literature is "multimodal data". Most research focuses on the specific case of text + image features, but most techniques map to this case as well. It's also known as "multi-field data", though that's a less common term. It is a hard topic to search for.

Generally what you do in this case is create an embedding for each field of your document and combine those somehow (concatenation is the default approach) to create a single representation. How the fields are vectorized is up to you: for fields like a title, the same methods used for normal spaCy Docs are likely useful, but for other fields a categorical representation is often appropriate.

We don't have a guide or example for doing something like that in spaCy, and at the moment it would require more work than typical models. You would probably want to attach the metadata to the Doc (using underscore attributes, for example), use a custom tok2vec to encode those attributes, and put that tok2vec in a pipeline that takes in pre-constructed Docs from another pipeline that handles tokenization and vectorizing the main text.

Something that doesn't work as well as building a proper representation, but is really easy to set up, is to use "magic tokens" to represent metadata. So you would put a made-up token encoding a metadata value directly into the text.
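The per-field embedding idea above could be sketched roughly like this. Everything here is illustrative: the toy vocabulary, field names, and field values are made up, and in practice the body vector would come from a trained tok2vec or pretrained word vectors rather than random numbers.

```python
import numpy as np

# Hypothetical document with a free-text body plus labelled header fields.
doc = {
    "to": "engineering",
    "from": "management",
    "body": "Quarterly results are attached.",
}

# Toy word vectors standing in for a real tok2vec / pretrained vectors.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8) for w in "quarterly results are attached".split()}

def embed_body(text):
    # Mean-pool token vectors; unknown tokens get a zero vector.
    tokens = text.lower().rstrip(".").split()
    vecs = [vocab.get(t, np.zeros(8)) for t in tokens]
    return np.mean(vecs, axis=0)

# Categorical fields get a one-hot representation over their known values.
FIELD_VALUES = {
    "to": ["engineering", "management"],
    "from": ["engineering", "management"],
}

def embed_field(field, value):
    vec = np.zeros(len(FIELD_VALUES[field]))
    vec[FIELD_VALUES[field].index(value)] = 1.0
    return vec

# Concatenate all field representations into one document vector.
doc_vector = np.concatenate([
    embed_field("to", doc["to"]),
    embed_field("from", doc["from"]),
    embed_body(doc["body"]),
])
print(doc_vector.shape)  # (2 + 2 + 8,) = (12,)
```

In a real setup the concatenated vector would feed into whatever downstream layer you're training; concatenation is just the simplest way to combine fields, and things like learned projections per field are common refinements.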
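The "magic tokens" approach mentioned in the reply could look something like this. The token format (`[FIELD=value]` prefixes) is an assumption for illustration, not a spaCy convention; any made-up token that won't collide with real vocabulary works.

```python
def add_magic_tokens(text, metadata):
    # Prepend one made-up token per metadata field, e.g.
    # {"to": "engineering"} -> "[TO=engineering] actual text..."
    # Sorting the fields keeps the prefix order deterministic.
    prefix = " ".join(
        f"[{field.upper()}={value}]" for field, value in sorted(metadata.items())
    )
    return f"{prefix} {text}"

augmented = add_magic_tokens(
    "Quarterly results are attached.",
    {"to": "engineering", "from": "management"},
)
print(augmented)
# [FROM=management] [TO=engineering] Quarterly results are attached.
```

The augmented string then goes through the normal text pipeline unchanged, so the model can learn to associate the magic tokens with the label, at the cost of a less principled representation than separate per-field embeddings.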