Replies: 3 comments 4 replies
-
I presume you ran the notebook on Windows/Linux? Your output isn't abnormal; I'm getting the same, if that reassures you, and you'd only expect the book's exact output on a Mac. As for the quality of the trajectory, it's just luck. You could change the seed and try to get something better. Even if the weights had not been loaded properly and failed silently, the output would have been much worse. On top of that, these hparams ...
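A minimal, self-contained sketch (not from the notebook) of why the seed matters here: the continuation is drawn with stochastic sampling, so a different seed gives a different trajectory even with identical weights and identical logits.

```python
import torch

# Dummy next-token logits; in the real notebook these come from the GPT-2 model.
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
probs = torch.softmax(logits, dim=-1)

torch.manual_seed(123)
print(torch.multinomial(probs, num_samples=5, replacement=True))  # one trajectory

torch.manual_seed(456)
print(torch.multinomial(probs, num_samples=5, replacement=True))  # a different one
```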
-
I've seen the same on my Mac... the output I get looks similar to the gibberish you posted above @Monrother. Even with the dropout issue mentioned, this doesn't make sense, because the pre-trained GPT-2 weights should result in much more coherent output regardless of dropout, right? I've also seen calling ...
-
I originally missed this discussion; thanks for chiming in here @casinca. I remember getting the same results on Google Colab (same as on my Mac). Like you said, it could be due to the MKL/dropout behavior in PyTorch, but during inference that should have been disabled via `model.eval()`. It's still weird, though, that some machines produce this; I am really curious which component causes it. @Monrother and @natwille1, do you encounter this discrepancy for ...
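A hedged sanity check (not from the book's code) for the dropout point: after `model.eval()`, every `nn.Dropout` module becomes an identity op, so dropout cannot explain divergent generations. The `nn.Sequential` below is just a stand-in for the GPT model.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.1))  # stand-in for the GPT model

model.eval()
assert not model.training
# Confirm every dropout layer is actually in eval mode (i.e., disabled).
assert all(not m.training for m in model.modules() if isinstance(m, nn.Dropout))

with torch.no_grad():  # also skip gradient tracking during generation
    out = model(torch.randn(1, 8))
```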
-
I cloned the repo, and the output after I loaded the GPT-2 weights was different from the one shown in the book.
For "Every effort moves you", my local output is
Every effort moves you toward an equal share for each vote plus half. Inequality is often not an accurate representation of human worth; to know the
and I'm not sure what it's talking about.
And from the book, it is:
Every effort moves you toward finding an ideal new way to practice something!
What makes us want to be on top of that?
Which makes more sense.
Is this expected?
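One way to separate sampling luck from a silent weight-loading failure is to decode greedily (pure argmax), which removes all randomness. A minimal sketch, with `model` as a placeholder for whatever GPT-2 model the notebook builds:

```python
import torch

@torch.no_grad()
def greedy_generate(model, token_ids, max_new_tokens, context_size):
    # Greedy decoding: always pick the most likely next token, so two machines
    # with correctly loaded weights should produce the same continuation.
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(token_ids[:, -context_size:])          # (batch, seq, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids
```

If greedy decoding still diverges across machines, the difference is more likely down to backend numerics than to the loaded weights themselves.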