
Commit e3f9e0d

Merge pull request huggingface#922 from MrGeislinger/patch-2
Fix typo: Unigram section link
2 parents a12b762 + 1b100cb commit e3f9e0d

File tree

  • chapters/en/chapter6

1 file changed (+2, -2 lines)


chapters/en/chapter6/4.mdx

Lines changed: 2 additions & 2 deletions
@@ -104,7 +104,7 @@ Now that we've seen a little of how some different tokenizers process text, we c
 
 ## SentencePiece[[sentencepiece]]
 
-[SentencePiece](https://github.com/google/sentencepiece) is a tokenization algorithm for the preprocessing of text that you can use with any of the models we will see in the next three sections. It considers the text as a sequence of Unicode characters, and replaces spaces with a special character, `▁`. Used in conjunction with the Unigram algorithm (see [section 7](/course/chapter7/7)), it doesn't even require a pre-tokenization step, which is very useful for languages where the space character is not used (like Chinese or Japanese).
+[SentencePiece](https://github.com/google/sentencepiece) is a tokenization algorithm for the preprocessing of text that you can use with any of the models we will see in the next three sections. It considers the text as a sequence of Unicode characters, and replaces spaces with a special character, `▁`. Used in conjunction with the Unigram algorithm (see [section 7](/course/chapter6/7)), it doesn't even require a pre-tokenization step, which is very useful for languages where the space character is not used (like Chinese or Japanese).
 
 The other main feature of SentencePiece is *reversible tokenization*: since there is no special treatment of spaces, decoding the tokens is done simply by concatenating them and replacing the `_`s with spaces -- this results in the normalized text. As we saw earlier, the BERT tokenizer removes repeating spaces, so its tokenization is not reversible.
 
@@ -120,4 +120,4 @@ Training step | Merges the tokens corresponding to the most common pair | Merges
 Learns | Merge rules and a vocabulary | Just a vocabulary | A vocabulary with a score for each token
 Encoding | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training
 
-Now let's dive into BPE!
+Now let's dive into BPE!
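
The paragraph touched by this fix also introduces SentencePiece's *reversible tokenization*: decoding is just concatenating the pieces and turning the space marker back into spaces. Below is a minimal sketch of that idea in plain Python, not the course's own code; the token list is hypothetical, roughly what a SentencePiece-based tokenizer (e.g. XLNet's or ALBERT's) might produce, with `▁` (U+2581) as the space marker.

```python
# Sketch of SentencePiece-style reversible decoding (illustrative only).
# The token list is made up; "▁" marks positions where a space preceded the piece.
tokens = ["▁Hello", ",", "▁how", "▁are", "▁you", "?"]

# Decoding: concatenate the pieces and turn each space marker back into a space.
decoded = "".join(tokens).replace("▁", " ").lstrip()
print(decoded)  # Hello, how are you?
```

Since no information about the original spaces is discarded, encoding followed by decoding gives back the normalized text, which is what the paragraph means by "reversible".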
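
The comparison table in the second hunk summarizes WordPiece encoding as "finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word". Here is a rough illustration of that greedy longest-match idea with a made-up toy vocabulary; it is a sketch of the principle, not the actual WordPiece implementation.

```python
# Toy vocabulary (hypothetical); "##" marks pieces that continue a word.
vocab = {"hug", "##s", "hu", "##g", "##gs", "b", "##u", "##n", "bun"}

def encode_word(word):
    """Greedy longest-prefix matching, as described in the comparison table."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until the longest prefix found in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if start == end:
            return ["[UNK]"]  # no matching subword found
        start = end
    return tokens

print(encode_word("hugs"))  # ['hug', '##s']
print(encode_word("bun"))   # ['bun']
```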
