[chapter 02] Why is special whitespace handling needed in SimpleTokenizerV1? #806
Unanswered · myme5261314 asked this question in Q&A
Replies: 1 comment
-
That's a good point. To be honest, I just implemented something super simple here for illustration purposes. We don't end up using this tokenizer later in the book. Btw I have a BPE from-scratch implementation here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb
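For what it's worth, if you want an exact round trip while staying with the same simple regex-based approach, one option is to keep the whitespace runs as tokens instead of stripping them out, so decoding is just concatenation. This is a hypothetical variant sketched for this discussion, not code from the book:

```python
import re

class WhitespacePreservingTokenizer:
    """Hypothetical variant of the chapter's simple tokenizer:
    whitespace runs are kept as tokens, so decode reproduces the
    input text exactly."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s+)', text)
        # keep whitespace tokens; drop only the empty strings re.split produces
        pieces = [p for p in pieces if p != ""]
        return [self.str_to_int[p] for p in pieces]

    def decode(self, ids):
        # no re-spacing heuristics needed: whitespace was never discarded
        return "".join(self.int_to_str[i] for i in ids)

text = "Hello,  world.\nNo loss."
pieces = [p for p in re.split(r'([,.:;?_!"()\']|--|\s+)', text) if p != ""]
vocab = {tok: i for i, tok in enumerate(sorted(set(pieces)))}
tok = WhitespacePreservingTokenizer(vocab)
print(tok.decode(tok.encode(text)) == text)  # exact round trip
```

The trade-off is a larger vocabulary (every distinct whitespace run becomes its own token), which is one reason production tokenizers like BPE fold whitespace into the tokens themselves instead.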
-
We can see SimpleTokenizerV1's code in the 30th cell of ch02/01_main-chapter-code/ch02.ipynb. I've appended two lines of code in the 31st cell of the same notebook, and the output looks like the following:
As we can see, the text does not equal the result of .encode followed by .decode; the differences lie in the whitespace and newline characters. I don't understand why we need to treat whitespace specially via str.strip() and " ".join in SimpleTokenizerV1. Maybe because whitespace carries less meaning in English?
But consider other languages in which whitespace is much more meaningful, such as Chinese.
There's no default whitespace between words and punctuation marks.
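The behavior described above can be reproduced with a minimal sketch in the spirit of SimpleTokenizerV1 (a reconstruction for this discussion, not necessarily the notebook's exact code): whitespace is used only as a split point and then discarded, so repeated spaces and newlines cannot be recovered on decode:

```python
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # whitespace tokens are dropped here -- this is where information is lost
        pieces = [p.strip() for p in pieces if p.strip()]
        return [self.str_to_int[p] for p in pieces]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # re-attach punctuation to the preceding word
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

text = "Hello,  world.\nHello world."
tokens = {p.strip() for p in re.split(r'([,.:;?_!"()\']|--|\s)', text) if p.strip()}
vocab = {tok: i for i, tok in enumerate(sorted(tokens))}
tok = SimpleTokenizerV1(vocab)
roundtrip = tok.decode(tok.encode(text))
print(repr(roundtrip))        # the double space and the newline collapse to single spaces
print(roundtrip == text)      # False
```

Every whitespace run, whether a single space, a double space, or a newline, decodes back to one plain space, which is exactly the mismatch shown in the 31st cell.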