Offline GRPO #10

@zihaolucky

Hello, I'm interested in the implementation of offline GRPO. Does it work well to simply use static prompts, completions, and rewards? The GRPO formula doesn't change:

        # For batch_size=1, unpack the single feature
        prompt = features["prompt"][0]
        completions_list = features["completions"][0]
        rewards_list = features["rewards"][0]

        reward_mean = np.mean(rewards_list)
        reward_std = np.std(rewards_list)

        tokenized_examples = defaultdict(list)

        idx = indices[0]  # Since batch_size=1, indices is a single-element list
        for completion, reward in zip(completions_list, rewards_list):
            batch = self._tokenize_single(prompt, completion)

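            # Append each tokenized field to the batch. Iterate over the keys of
            # `batch`: `tokenized_examples` starts out empty, so looping over it
            # would copy nothing on the first pass and fail on later ones.
            for key in batch:
                tokenized_examples[key].append(batch[key])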

            tokenized_examples["group_id"].append(idx)
            tokenized_examples["group_size"].append(len(completions_list))

            advantage = (reward - reward_mean) / (reward_std + 1e-4)
            tokenized_examples["advantage"].append(advantage)
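
For reference, the advantage computation in the snippet above can be isolated into a small standalone sketch. This is only an illustration of the group normalization step, not the repository's implementation; the function name `group_advantages` and the `eps` default are assumptions chosen to mirror the `1e-4` used in the snippet.

```python
import numpy as np


def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within the group.

    Hypothetical helper: mirrors (reward - mean) / (std + eps) from the snippet.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # np.std defaults to the population standard deviation (ddof=0),
    # which matches the np.std(rewards_list) call in the snippet.
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Static rewards for four stored completions of a single prompt.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])

# If every completion has the same reward, all advantages are zero,
# so that group contributes no learning signal.
same = group_advantages([0.5, 0.5, 0.5])
```

Since the rewards are static, these values could be computed once when the dataset is built rather than on every pass. Groups whose rewards are all identical end up with zero advantage for every completion, which may be worth filtering out of an offline dataset.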
