Hello, I'm interested in the implementation of offline GRPO. Does it work well simply by using static prompts/completions/rewards? The GRPO formula itself doesn't change:
# For batch_size=1, unpack the single feature
prompt = features["prompt"][0]
completions_list = features["completions"][0]
rewards_list = features["rewards"][0]
reward_mean = np.mean(rewards_list)
reward_std = np.std(rewards_list)
tokenized_examples = defaultdict(list)
idx = indices[0]  # Since batch_size=1, indices is a single-element list
for completion, reward in zip(completions_list, rewards_list):
    batch = self._tokenize_single(prompt, completion)
    # Append each tokenized field to the batch
    # (iterate over batch's keys, not the initially empty defaultdict)
    for key in batch:
        tokenized_examples[key].append(batch[key])
    tokenized_examples["group_id"].append(idx)
    tokenized_examples["group_size"].append(len(completions_list))
    advantage = (reward - reward_mean) / (reward_std + 1e-4)
    tokenized_examples["advantage"].append(advantage)
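The group-relative advantage computed above does not depend on where the completions came from, so static (offline) completions and rewards can be normalized the same way. A minimal, self-contained sketch of that normalization (the helper name `group_advantages` is my own, not from the codebase):

```python
import numpy as np

def group_advantages(rewards, eps=1e-4):
    # GRPO-style advantage: normalize each reward against its group's
    # mean and std, mirroring the loop in the snippet above.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Static, precomputed rewards for one prompt's completion group
# (the offline case): no sampling from the policy is required.
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
```

By construction the advantages within a group sum to (approximately) zero, so above-mean completions are reinforced and below-mean ones are penalized regardless of whether the rewards were computed online or loaded from disk.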