
Commit e3f9e0d

Merge pull request huggingface#922 from MrGeislinger/patch-2
Fix typo: Unigram section link
2 parents a12b762 + 1b100cb commit e3f9e0d

File tree

  • chapters/en/chapter6

1 file changed (+2, -2 lines)


chapters/en/chapter6/4.mdx

Lines changed: 2 additions & 2 deletions
@@ -104,7 +104,7 @@ Now that we've seen a little of how some different tokenizers process text, we c
 
 ## SentencePiece[[sentencepiece]]
 
-[SentencePiece](https://github.com/google/sentencepiece) is a tokenization algorithm for the preprocessing of text that you can use with any of the models we will see in the next three sections. It considers the text as a sequence of Unicode characters, and replaces spaces with a special character, `▁`. Used in conjunction with the Unigram algorithm (see [section 7](/course/chapter7/7)), it doesn't even require a pre-tokenization step, which is very useful for languages where the space character is not used (like Chinese or Japanese).
+[SentencePiece](https://github.com/google/sentencepiece) is a tokenization algorithm for the preprocessing of text that you can use with any of the models we will see in the next three sections. It considers the text as a sequence of Unicode characters, and replaces spaces with a special character, `▁`. Used in conjunction with the Unigram algorithm (see [section 7](/course/chapter6/7)), it doesn't even require a pre-tokenization step, which is very useful for languages where the space character is not used (like Chinese or Japanese).
 
 The other main feature of SentencePiece is *reversible tokenization*: since there is no special treatment of spaces, decoding the tokens is done simply by concatenating them and replacing the `_`s with spaces -- this results in the normalized text. As we saw earlier, the BERT tokenizer removes repeating spaces, so its tokenization is not reversible.
 
@@ -120,4 +120,4 @@ Training step | Merges the tokens corresponding to the most common pair | Merges
 Learns | Merge rules and a vocabulary | Just a vocabulary | A vocabulary with a score for each token
 Encoding | Splits a word into characters and applies the merges learned during training | Finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word | Finds the most likely split into tokens, using the scores learned during training
 
-Now let's dive into BPE!
+Now let's dive into BPE!
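
The paragraph touched by this fix also introduces SentencePiece's *reversible tokenization*: decoding is just concatenating the pieces and turning the space marker back into spaces. Below is a minimal sketch of that idea in plain Python, not the course's own code; the token list is hypothetical, roughly what a SentencePiece-based tokenizer (e.g. XLNet's or ALBERT's) might produce, with `▁` (U+2581) as the space marker.

```python
# Sketch of SentencePiece-style reversible decoding (illustrative only).
# The token list is made up; "▁" marks positions where a space preceded the piece.
tokens = ["▁Hello", ",", "▁how", "▁are", "▁you", "?"]

# Decoding: concatenate the pieces and turn each space marker back into a space.
decoded = "".join(tokens).replace("▁", " ").lstrip()
print(decoded)  # Hello, how are you?
```

Since no information about the original spaces is discarded, encoding followed by decoding gives back the normalized text, which is what the paragraph means by "reversible".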
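
The comparison table in the second hunk summarizes WordPiece encoding as "finds the longest subword starting from the beginning that is in the vocabulary, then does the same for the rest of the word". Here is a rough illustration of that greedy longest-match idea with a made-up toy vocabulary; it is a sketch of the principle, not the actual WordPiece implementation.

```python
# Toy vocabulary (hypothetical); "##" marks pieces that continue a word.
vocab = {"hug", "##s", "hu", "##g", "##gs", "b", "##u", "##n", "bun"}

def encode_word(word):
    """Greedy longest-prefix matching, as described in the comparison table."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until the longest prefix found in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if start == end:
            return ["[UNK]"]  # no matching subword found
        start = end
    return tokens

print(encode_word("hugs"))  # ['hug', '##s']
print(encode_word("bun"))   # ['bun']
```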
