@@ -107,7 +107,8 @@ max_length = model.config.n_positions
 stride = 512
 seq_len = encodings.input_ids.size(1)

-nlls = []
+nll_sum = 0.0
+n_tokens = 0
 prev_end_loc = 0
 for begin_loc in tqdm(range(0, seq_len, stride)):
     end_loc = min(begin_loc + max_length, seq_len)
@@ -124,13 +125,19 @@ for begin_loc in tqdm(range(0, seq_len, stride)):
         # to the left by 1.
         neg_log_likelihood = outputs.loss

-    nlls.append(neg_log_likelihood)
+    # Accumulate the total negative log-likelihood and the total number of tokens
+    num_valid_tokens = (target_ids != -100).sum().item()  # number of valid tokens in target_ids
+    batch_size = target_ids.size(0)
+    num_loss_tokens = num_valid_tokens - batch_size  # subtract batch_size due to internal label shift
+    nll_sum += neg_log_likelihood * num_loss_tokens
+    n_tokens += num_loss_tokens

     prev_end_loc = end_loc
     if end_loc == seq_len:
         break

-ppl = torch.exp(torch.stack(nlls).mean())
+avg_nll = nll_sum / n_tokens  # average negative log-likelihood per token
+ppl = torch.exp(avg_nll)
 ```

 Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window
@@ -139,5 +146,5 @@ and the better the reported perplexity will typically be.

 When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.44`, which is about the same
 as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window
-strategy, this jumps down to `16.45`. This is not only a more favorable score, but is calculated in a way that is
+strategy, this jumps down to `16.44`. This is not only a more favorable score, but is calculated in a way that is
 closer to the true autoregressive decomposition of a sequence likelihood.
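The core of the patch above is the switch from an unweighted mean over per-window losses to a token-weighted average: windows near the end of the text can score fewer tokens than earlier ones, so averaging their mean losses directly biases the result. The aggregation logic can be sketched standalone; the per-window `(mean NLL, token count)` pairs below are made-up values for illustration, not real model outputs:

```python
import torch

# Hypothetical per-window (mean NLL, number of loss tokens) pairs,
# standing in for what the evaluation loop would produce.
window_stats = [(torch.tensor(2.0), 512), (torch.tensor(3.0), 256)]

# Unweighted mean (old behavior): every window counts equally,
# regardless of how many tokens it actually scored.
unweighted = torch.stack([nll for nll, _ in window_stats]).mean()

# Token-weighted mean (new behavior): total NLL divided by total tokens,
# so each token contributes equally to the final average.
nll_sum = sum(nll * n for nll, n in window_stats)
n_tokens = sum(n for _, n in window_stats)
weighted = nll_sum / n_tokens

ppl = torch.exp(weighted)
```

Here the unweighted mean is 2.5, while the token-weighted mean is 1792/768 ≈ 2.33: the shorter second window pulls the unweighted average up more than its token count justifies.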