How do the nlp.tokenizer and subword tokenizers interact? #8683
-
Looking at the config, in the case of `tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}`, how does this work in practice? My guess is that the LM outputs logits for each BPE token, and these are then converted back into a (word) token-based representation. If this is correct, how does spaCy know these alignments? (I know that HF Fast Tokenizers output this mapping, but I don't think the slow ones do.) Can you confirm/correct? Thank you.
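To make the mismatch concrete, here is a minimal sketch: spaCy's rule-based tokenizer produces word-level tokens, while the transformer's tokenizer produces subword pieces for the same text. The BPE split shown is hypothetical, not the output of any particular model.

```python
import spacy

# Word-level tokens from spaCy's rule-based tokenizer
nlp = spacy.blank("en")
doc = nlp("Tokenization is nontrivial.")
print([t.text for t in doc])
# ['Tokenization', 'is', 'nontrivial', '.']

# A hypothetical subword (BPE) split of the same text; the actual
# pieces depend on the transformer model's tokenizer
bpe_tokens = ["Token", "ization", "is", "non", "trivial", "."]
```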
-
Your understanding of how the embeddings are converted is correct. spaCy generates the alignment from the two sequences of tokens; see the aligner section in the docs for notes on that. The alignments are generated using the spacy-alignments package.
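As a minimal sketch of what that looks like, here is spacy-alignments used directly on the two hypothetical token sequences from above; in an actual spacy-transformers pipeline this is computed for you.

```python
import spacy_alignments

spacy_tokens = ["Tokenization", "is", "nontrivial", "."]
bpe_tokens = ["Token", "ization", "is", "non", "trivial", "."]

# get_alignments matches the two sequences at the character level and
# returns, for each token in one sequence, the indices of the
# overlapping tokens in the other sequence
a2b, b2a = spacy_alignments.get_alignments(spacy_tokens, bpe_tokens)
print(a2b)  # [[0, 1], [2], [3, 4], [5]]
print(b2a)  # [[0], [0], [1], [2], [2], [3]]
```

Because the alignment is computed from the token strings themselves rather than from tokenizer offsets, it presumably doesn't matter whether the underlying HF tokenizer is a fast one (which exposes an offset mapping) or a slow one.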