[Chapter 03] Why are context_length and num_tokens separate integer variables for the same dimension? #766

myme5261314 · 2025-08-11T16:13:24Z

I'm trying to understand the MultiHeadAttention class code from chapter 03.
I noticed that context_length is used only in __init__, to define self.mask, while num_tokens is used only in forward, for both the @ and mask operations.
But the usage lines at the end of the code snippet suggest that context_length is equal to num_tokens, so why do we need two separate variables? Is it because, in practice, the size of the attention score matrix might not match the number of input tokens?

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # As in `CausalAttention`, for inputs where `num_tokens` exceeds `context_length`,
        # this will result in errors in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs
        # do not exceed `context_length` before reaching this forward method.

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

Answered by casinca

casinca · 2025-08-11T19:44:30Z

Hello,

You are right, both are the same here. I presume Sebastian just did that as a quick dummy example to test the class, but it won't always be the case in practice.

Later in the book, you'll see that, for example in supervised fine-tuning (SFT), the sequences in the batch are shorter than the model's context_length, so num_tokens and context_length won't be the same. I don't want to spoil it or go into more detail, because it will make more sense as you progress and it is explained well there.
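As a small illustration of that point, here is a minimal sketch of how the mask built in __init__ is cut down in forward when the input is shorter than context_length. The sizes 6 and 4 and the all-zero scores are made-up values for illustration only, not taken from the book.

```python
import torch

context_length = 6  # hypothetical maximum length, fixed when the module is created
num_tokens = 4      # hypothetical length of the current input

# Same construction as the `mask` buffer in __init__: ones above the diagonal
full_mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

# Same slicing as in forward: keep only the top-left num_tokens x num_tokens block
mask_bool = full_mask.bool()[:num_tokens, :num_tokens]

# Placeholder attention scores for one head; future positions are set to -inf
attn_scores = torch.zeros(num_tokens, num_tokens)
attn_scores = attn_scores.masked_fill(mask_bool, -torch.inf)

print(full_mask.shape)  # torch.Size([6, 6])
print(mask_bool.shape)  # torch.Size([4, 4])
print(attn_scores)
```

Because the top-left block of an upper-triangular mask is itself an upper-triangular mask, one large buffer can serve every input length up to context_length.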

Hope that helps

2 replies

myme5261314 Aug 12, 2025
Author

Thanks for your hint!

rasbt Aug 12, 2025
Maintainer

Good question, @myme5261314, and thanks for answering, @casinca!

I can confirm that this was just a quick dummy example. Maybe setting context_length = 1024 (or any other number) would make it clearer.

The MHA module is also initialized once upfront, and after that you can't change context_length, so you want to make it as large as the model supports. But note that "batch" can have a smaller number of tokens. For example, it could be just a short query like "What is the capital of Germany?".
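To make this concrete, the sketch below uses a condensed copy of the class from the question, creates it once with context_length=1024, and then passes in a much shorter batch. The input sizes (2 sequences of 6 tokens, d_in=3) and the random input are assumptions chosen for illustration; the last lines also show what happens when an input is longer than context_length.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    # Condensed copy of the class quoted in the question (same logic, fewer comments)
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # The mask size is fixed here, once, by context_length
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape  # num_tokens comes from the actual input

        def split(t):
            # (b, num_tokens, d_out) -> (b, num_heads, num_tokens, head_dim)
            return t.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        queries = split(self.W_query(x))
        keys = split(self.W_key(x))
        values = split(self.W_value(x))

        attn_scores = queries @ keys.transpose(2, 3)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores = attn_scores.masked_fill(mask_bool, -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context_vec = (attn_weights @ values).transpose(1, 2)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(context_vec)


torch.manual_seed(123)
d_in, d_out = 3, 2  # illustrative sizes
mha = MultiHeadAttention(d_in, d_out, context_length=1024, dropout=0.0, num_heads=2)

short_batch = torch.rand(2, 6, d_in)  # 2 sequences of only 6 tokens each
out = mha(short_batch)
print(mha.mask.shape)  # torch.Size([1024, 1024])
print(out.shape)       # torch.Size([2, 6, 2])

# An input longer than context_length cannot be masked and raises an error
raised = False
try:
    mha(torch.rand(1, 1025, d_in))
except RuntimeError:
    raised = True
print("1025 tokens raised an error:", raised)
```

The same module instance handles any input length up to 1024 tokens, which is why context_length is set at construction time while num_tokens is read from each input.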

I hope this helps!
