Handling meta content in documents #10426
-
I'd like to find articles/resources about how to extract and use "meta-content" (see below) from documents. However, I don't know how to search for this: in NLP, "metadata" almost always seems to mean "extra data attached to the document" rather than in-document content. The meta-content I'm talking about is intra-document labels, categories, annotations, etc. For example, at the top of a typical memo, the "To:", "From:", etc. fields are labels that refer to and describe other content in the document. [This is a bit different from syntactic co-reference or cross-reference in that these labels live "outside" the actual text content of the document.] I just want to do a literature search to see if/how people have thought about handling this part of the content in NLP processing, but it's hard to search for something when you don't know what people tend to call it.
Replies: 1 comment
-
If I understand your question correctly, the closest term for this in the machine learning literature is "multimodal data". Most research focuses on the specific case of text + image features, but most techniques map to this case as well. It's also known as "multi-field data", though that's a less common term. It is a hard topic to search for.

Generally what you do in this case is create an embedding for each field of your document and combine those somehow (concatenation is the default approach) to create a single representation. How the fields are vectorized is up to you: for fields like a title, the same methods used for normal spaCy Docs are likely useful, but for other fields a categorical representation is often appropriate.

We don't have a guide or example for doing something like that in spaCy, and at the moment it would require more work than typical models. You would probably want to attach the metadata to the Doc (using underscore attributes, for example), use a custom tok2vec to encode those attributes, and put that tok2vec in a pipeline that takes in pre-constructed Docs from another pipeline that handles tokenization and vectorizing the main text.

Something that doesn't work as well as building a proper representation, but is really easy to set up, is to use "magic tokens" to represent metadata. So you would put a made-up token encoding a metadata value directly into the text.
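The per-field embedding idea above could be sketched roughly like this. Everything here is illustrative: the toy vocabulary, field names, and field values are made up, and in practice the body vector would come from a trained tok2vec or pretrained word vectors rather than random numbers.

```python
import numpy as np

# Hypothetical document with a free-text body plus labelled header fields.
doc = {
    "to": "engineering",
    "from": "management",
    "body": "Quarterly results are attached.",
}

# Toy word vectors standing in for a real tok2vec / pretrained vectors.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8) for w in "quarterly results are attached".split()}

def embed_body(text):
    # Mean-pool token vectors; unknown tokens get a zero vector.
    tokens = text.lower().rstrip(".").split()
    vecs = [vocab.get(t, np.zeros(8)) for t in tokens]
    return np.mean(vecs, axis=0)

# Categorical fields get a one-hot representation over their known values.
FIELD_VALUES = {
    "to": ["engineering", "management"],
    "from": ["engineering", "management"],
}

def embed_field(field, value):
    vec = np.zeros(len(FIELD_VALUES[field]))
    vec[FIELD_VALUES[field].index(value)] = 1.0
    return vec

# Concatenate all field representations into one document vector.
doc_vector = np.concatenate([
    embed_field("to", doc["to"]),
    embed_field("from", doc["from"]),
    embed_body(doc["body"]),
])
print(doc_vector.shape)  # (2 + 2 + 8,) = (12,)
```

In a real setup the concatenated vector would feed into whatever downstream layer you're training; concatenation is just the simplest way to combine fields, and things like learned projections per field are common refinements.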
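The "magic tokens" approach mentioned in the reply could look something like this. The token format (`[FIELD=value]` prefixes) is an assumption for illustration, not a spaCy convention; any made-up token that won't collide with real vocabulary works.

```python
def add_magic_tokens(text, metadata):
    # Prepend one made-up token per metadata field, e.g.
    # {"to": "engineering"} -> "[TO=engineering] actual text..."
    # Sorting the fields keeps the prefix order deterministic.
    prefix = " ".join(
        f"[{field.upper()}={value}]" for field, value in sorted(metadata.items())
    )
    return f"{prefix} {text}"

augmented = add_magic_tokens(
    "Quarterly results are attached.",
    {"to": "engineering", "from": "management"},
)
print(augmented)
# [FROM=management] [TO=engineering] Quarterly results are attached.
```

The augmented string then goes through the normal text pipeline unchanged, so the model can learn to associate the magic tokens with the label, at the cost of a less principled representation than separate per-field embeddings.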