How do the nlp.tokenizer and subword tokenizers interact? #8683
-
Looking at the config, in the case of `tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}`, how does this work in practice? My guess is that the LM outputs logits for each BPE token, and these are then converted back into a (word) token-based representation. If this is correct, how does spaCy know these alignments? (I know that HF Fast Tokenizers output this mapping, but I don't think the slow ones do.) Can you confirm/correct? Thank you.
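To make the mismatch concrete, here is a minimal sketch: spaCy's rule-based tokenizer produces word-level tokens, while the transformer's tokenizer produces subword pieces for the same text. The BPE split shown is hypothetical, not the output of any particular model.

```python
import spacy

# Word-level tokens from spaCy's rule-based tokenizer
nlp = spacy.blank("en")
doc = nlp("Tokenization is nontrivial.")
print([t.text for t in doc])
# ['Tokenization', 'is', 'nontrivial', '.']

# A hypothetical subword (BPE) split of the same text; the actual
# pieces depend on the transformer model's tokenizer
bpe_tokens = ["Token", "ization", "is", "non", "trivial", "."]
```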
-
Your understanding of how the embeddings are converted is correct. spaCy generates the alignment from the two sequences of tokens; see the aligner section in the docs for notes on that. The alignments are generated using the spacy-alignments package.
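As a minimal sketch of what that looks like, here is spacy-alignments used directly on the two hypothetical token sequences from above; in an actual spacy-transformers pipeline this is computed for you.

```python
import spacy_alignments

spacy_tokens = ["Tokenization", "is", "nontrivial", "."]
bpe_tokens = ["Token", "ization", "is", "non", "trivial", "."]

# get_alignments matches the two sequences at the character level and
# returns, for each token in one sequence, the indices of the
# overlapping tokens in the other sequence
a2b, b2a = spacy_alignments.get_alignments(spacy_tokens, bpe_tokens)
print(a2b)  # [[0, 1], [2], [3, 4], [5]]
print(b2a)  # [[0], [0], [1], [2], [2], [3]]
```

Because the alignment is computed from the token strings themselves rather than from tokenizer offsets, it presumably doesn't matter whether the underlying HF tokenizer is a fast one (which exposes an offset mapping) or a slow one.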