Mapping transformer vectors for long (and thus chunked) documents to spaCy tokens #9705
-
I'm opening a separate discussion for my question from another thread. The question is: what does the transformer actually do with the chunks? In this spaCy tutorial it appears we do get some sort of tensor that respects the mapping from a spaCy span's [start-end] to the transformer vectors (even though a span can fall inside two chunks if we use a stride), so we can pool it into one vector and thus get a span embedding. But how do I get a spaCy-token-wise tensor for each and every doc token from a chunked input text?
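To make the pooling I have in mind concrete, here is a minimal numpy-only sketch. All names and shapes here are my own assumptions, not actual spacy-transformers internals: I assume the chunked output is a tensor of shape `(n_chunks, seq_len, width)`, and that `align` maps each spaCy token to the indices of its wordpieces in the *flattened* `(n_chunks * seq_len)` view, so a token that straddles two chunks simply maps to rows from both.

```python
import numpy as np

def token_vectors(chunk_tensors: np.ndarray, align: list[list[int]]) -> np.ndarray:
    """Mean-pool wordpiece rows into one vector per spaCy token.

    chunk_tensors: hypothetical transformer output, shape (n_chunks, seq_len, width)
    align: for each token, the wordpiece row indices in the flattened tensor
    """
    width = chunk_tensors.shape[-1]
    flat = chunk_tensors.reshape(-1, width)      # (n_chunks * seq_len, width)
    out = np.zeros((len(align), width), dtype=flat.dtype)
    for i, wp_idxs in enumerate(align):
        if wp_idxs:                              # tokens with no wordpieces stay zero
            out[i] = flat[wp_idxs].mean(axis=0)  # a token may map into two chunks
    return out

# Toy example: 2 chunks of 3 wordpieces each, width 4.
tensors = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
# Token 2 maps to the last row of chunk 0 AND the first row of chunk 1
# (a stride overlap), which is exactly the case I'm asking about.
align = [[0], [1, 2], [2, 3]]
vecs = token_vectors(tensors, align)
print(vecs.shape)  # (3, 4)
```

If this is roughly what spacy-transformers does internally, my question reduces to: where do I get this flattened alignment for the whole doc?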
-
I'm not sure if this is related to the topic of chunk merging and whether I should create an issue instead, but my implementation of a wordpiece-aware span getter leads to random crashes with an error.
-
This looks like a useful example: https://applied-language-technology.readthedocs.io/en/latest/notebooks/part_iii/05_embeddings_continued.html