[chapter 02] Why is special whitespace handling needed in SimpleTokenizerV1? #806
Unanswered · myme5261314 asked this question in Q&A
Replies: 1 comment
-
That's a good point. To be honest, I just implemented something super simple here for illustration purposes. We don't end up using this tokenizer later in the book. Btw I have a BPE from-scratch implementation here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb
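For what it's worth, if you want an exact round trip while staying with the same simple regex-based approach, one option is to keep the whitespace runs as tokens instead of stripping them out, so decoding is just concatenation. This is a hypothetical variant sketched for this discussion, not code from the book:

```python
import re

class WhitespacePreservingTokenizer:
    """Hypothetical variant of the chapter's simple tokenizer:
    whitespace runs are kept as tokens, so decode reproduces the
    input text exactly."""

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s+)', text)
        # keep whitespace tokens; drop only the empty strings re.split produces
        pieces = [p for p in pieces if p != ""]
        return [self.str_to_int[p] for p in pieces]

    def decode(self, ids):
        # no re-spacing heuristics needed: whitespace was never discarded
        return "".join(self.int_to_str[i] for i in ids)

text = "Hello,  world.\nNo loss."
pieces = [p for p in re.split(r'([,.:;?_!"()\']|--|\s+)', text) if p != ""]
vocab = {tok: i for i, tok in enumerate(sorted(set(pieces)))}
tok = WhitespacePreservingTokenizer(vocab)
print(tok.decode(tok.encode(text)) == text)  # exact round trip
```

The trade-off is a larger vocabulary (every distinct whitespace run becomes its own token), which is one reason production tokenizers like BPE fold whitespace into the tokens themselves instead.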
-
We can see SimpleTokenizerV1's code in the 30th cell of ch02/01_main-chapter-code/ch02.ipynb. I've appended two lines of code in the 31st cell of the same notebook, and the output looks like the following:
As we can see, the text does not equal the result of .encode followed by .decode; the differences lie in the whitespace and newline characters. I don't understand why we need to treat whitespace specially via str.strip() and " ".join in SimpleTokenizerV1. Maybe because whitespace carries less meaning in English?
But consider other languages in which whitespace is much more meaningful, such as Chinese.
There's no default whitespace between words and punctuation marks.
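The behavior described above can be reproduced with a minimal sketch in the spirit of SimpleTokenizerV1 (a reconstruction for this discussion, not necessarily the notebook's exact code): whitespace is used only as a split point and then discarded, so repeated spaces and newlines cannot be recovered on decode:

```python
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # whitespace tokens are dropped here -- this is where information is lost
        pieces = [p.strip() for p in pieces if p.strip()]
        return [self.str_to_int[p] for p in pieces]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # re-attach punctuation to the preceding word
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

text = "Hello,  world.\nHello world."
tokens = {p.strip() for p in re.split(r'([,.:;?_!"()\']|--|\s)', text) if p.strip()}
vocab = {tok: i for i, tok in enumerate(sorted(tokens))}
tok = SimpleTokenizerV1(vocab)
roundtrip = tok.decode(tok.encode(text))
print(repr(roundtrip))        # the double space and the newline collapse to single spaces
print(roundtrip == text)      # False
```

Every whitespace run, whether a single space, a double space, or a newline, decodes back to one plain space, which is exactly the mismatch shown in the 31st cell.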