Commit 315364c

Add grpo_zero and lm_book implementations

1 parent 4ae0bb9 commit 315364c

File tree

17 files changed: +2203, -0 lines changed

Lines changed: 89 additions & 0 deletions
# GRPO from Scratch - LM Book by Andriy Burkov

Implementation of the **GRPO (Group Relative Policy Optimization)** training algorithm from:
[https://github.com/aburkov/theLMbook/blob/main/GRPO.py](https://github.com/aburkov/theLMbook/blob/main/GRPO.py).

We refactor the original code into a **modular and easy-to-understand layout**. Each major step of the training process is separated into its own file for clarity and extensibility.

### Steps in the Training Process

1. **Initialization and Setup**:
   - Set random seeds for reproducibility across Python, NumPy, and PyTorch.
   - Load the pre-trained language model (Qwen2.5-0.5B-Instruct) and tokenizer.
   - Configure the model: set the pad token to the EOS token and enable memory optimizations (gradient checkpointing, disabled KV cache, and enabled input gradients).
   - Prepare the GSM8K dataset: format each example with a system prompt and the question, and extract the expected answer.
2. **Initial Evaluation**:
   - Evaluate the model's initial performance on a small subset of the training data (before any fine-tuning).
   - The evaluation function generates a response for each prompt, extracts the predicted answer, and compares it to the expected answer using multiple methods (exact match, single-number extraction, last-number extraction).
3. **GRPO Training Loop**:
   - Training is divided into iterations (outer loop). In each iteration:
     - Create a reference model by making a deep copy of the current policy model and freezing its parameters.
     - Set up an optimizer for the policy model.
     - For a specified number of steps (inner loop):
       - Sample a batch of training examples.
       - Generate multiple completions (rollouts) for each prompt in the batch using the current policy model.
       - For each generated completion, compute the log probabilities under the current policy and the reference model (without gradients).
       - Format the completions for reward computation.
       - Perform multiple GRPO updates (μ times) on the same batch of rollouts:
         - Compute rewards using the combined reward function (correctness and format).
         - Calculate group-relative advantages: within each group of completions for the same prompt, normalize the rewards by subtracting the group mean and dividing by the group standard deviation.
         - Compute the current log probabilities (with gradients) for the generated completions.
         - Calculate the policy ratio (the exponential of the difference between the current and old log probabilities).
         - Compute the surrogate loss (clipped to avoid large updates) and the KL divergence penalty (to prevent the policy from deviating too far from the reference model).
         - Combine the losses and update the policy model's parameters.
4. **Final Evaluation and Saving**:
   - After completing all iterations, evaluate the fine-tuned model on the same evaluation subset.
   - Calculate the improvement in accuracy.
   - Save the fine-tuned model and tokenizer.

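The inner update in step 3 can be sketched for a single token in plain Python. The clip range `eps` and KL weight `beta` below are illustrative defaults, not necessarily the values used in the implementation:

```python
import math

def grpo_token_loss(cur_lp, old_lp, ref_lp, advantage, eps=0.2, beta=0.04):
    # Policy ratio: exponential of the difference between current and old log probabilities.
    ratio = math.exp(cur_lp - old_lp)
    # Clipped surrogate objective: take the minimum of the unclipped and clipped terms.
    surrogate = min(ratio * advantage, max(min(ratio, 1 + eps), 1 - eps) * advantage)
    # KL penalty keeping the policy close to the reference model
    # (the unbiased "k3" estimator commonly used in GRPO-style training).
    kl = math.exp(ref_lp - cur_lp) - (ref_lp - cur_lp) - 1
    # We maximize the objective, so the loss is its negation.
    return -(surrogate - beta * kl)
```

When all three log probabilities coincide, the ratio is 1 and the KL term vanishes, so the loss reduces to minus the advantage; when the ratio drifts outside `[1 - eps, 1 + eps]`, the clipped term caps the update.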
## Code Structure

We refactor the original implementation into a modular, readable, and extensible format. Each component corresponds to a specific phase in the GRPO loop.

```bash
andriy_burkov_lm_book/
├── train_grpo.py
├── data_process.py
├── reward_functions.py
├── evaluation.py
└── completions.py
```

- **`prepare_dataset`**:
  - Loads the GSM8K dataset, formats each example into a prompt string (combining the system message and user question), and extracts the answer.
- **`evaluate_model`**:
  - Evaluates the model by generating a response for each evaluation example.
  - Extracts the predicted answer and compares it to the expected answer using multiple matching strategies (exact string, single number, last number).
  - Prints detailed results and returns the accuracy.
- **`correctness_reward`**:
  - Assigns a reward (0.0, 1.5, or 2.0) based on the correctness of the generated answer compared to the expected answer. Uses exact matching and numeric equivalence.
- **`format_reward`**:
  - Assigns a reward (up to 0.8) for adhering to the required XML format (presence of `<reasoning>`, `</reasoning>`, `<answer>`, and `</answer>` tags).
- **`combined_reward`**:
  - Sums the correctness reward and the format reward for a total reward in the range [0.0, 2.8].
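A minimal sketch of how these rewards could combine; the 0.2-per-tag weighting is an assumption for illustration, not necessarily the exact scheme in `reward_functions.py`:

```python
def format_reward_sketch(text):
    # Hypothetical weighting: 0.2 per required tag, up to 0.8 total.
    tags = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]
    return 0.2 * sum(1 for tag in tags if tag in text)

def combined_reward_sketch(correctness, text):
    # correctness comes from the correctness reward: 0.0, 1.5, or 2.0.
    return correctness + format_reward_sketch(text)
```

A fully correct, fully formatted answer then scores 2.0 + 0.8 = 2.8, the top of the documented range.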
- **`generate_completions`**:
  - Generates multiple completions for each prompt in a batch using the current model.
  - Returns the tokenized prompts and completions, along with masks that ignore tokens after the first end-of-sequence (EOS) token.
- **`generate_rollout_data`**:
  - Uses `generate_completions` to generate rollouts (completions) for a batch of prompts.
  - Computes log probabilities for these completions under both the current policy and the reference model (without gradients).
  - Returns a dictionary containing the rollout data (inputs, masks, log probabilities, etc.).
- **`compute_group_relative_advantages`**:
  - Groups the rewards by prompt (each group has multiple completions for the same prompt).
  - For each group, normalizes the rewards by subtracting the group mean and dividing by the group standard deviation (adding a small epsilon to avoid division by zero).
  - Returns the normalized advantages for each completion.
- **`maximize_grpo_objective`**:
  - The core function that computes the GRPO loss and updates the model.
  - Computes current log probabilities (with gradients).
  - Computes the policy ratio: the exponential of the difference between the current and old log probabilities (equivalently, the current probability divided by the old probability).
  - Computes rewards and then group-relative advantages.
  - Computes the surrogate loss (the minimum of the unclipped and clipped terms) and the KL penalty.
  - Combines them and performs a gradient update.
- **`train_with_grpo`**:
  - Orchestrates the entire GRPO training process: sets up the reference model and optimizer, and loops over iterations and steps.
  - For each step, it generates rollout data and performs multiple GRPO updates.
- **`optimize_model_memory`**:
  - Applies memory optimization techniques: disables caching, enables gradient checkpointing, and ensures input gradients are required.
- **`main`**:
  - Ties everything together: sets up the model, tokenizer, and dataset, runs the initial evaluation, performs GRPO training, runs the final evaluation, and saves the model.
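The grouping in `compute_group_relative_advantages` can be illustrated with a small pure-Python sketch; whether the implementation uses the population or sample standard deviation, and the exact epsilon, are assumptions here:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, group_size, eps=1e-4):
    # rewards is a flat list in which each consecutive run of group_size
    # entries holds the rewards for completions of the same prompt.
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu, sigma = mean(group), pstdev(group)
        # Normalize within the group; eps guards against zero variance.
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages
```

For a single group with rewards `[1.0, 3.0]` the advantages come out near `[-1.0, 1.0]`; identical rewards within a group yield zero advantage for every completion.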

src/grpo/andriy_burkov_lm_book/__init__.py

Whitespace-only changes.
Lines changed: 232 additions & 0 deletions

# This code is based on the implementation from: https://github.com/aburkov/theLMbook/blob/main/GRPO.py

import torch
import torch.nn.functional as F

def selective_log_softmax(logits, input_ids):
    """Compute the log probabilities for the tokens specified in input_ids using a selective log-softmax.

    Args:
        logits (torch.Tensor): A tensor of shape (batch_size, seq_len, vocab_size) containing raw logits from the model.
        input_ids (torch.Tensor): A tensor of shape (batch_size, seq_len) containing the token indices for which we want the log probabilities.

    Returns:
        torch.Tensor: A tensor of shape (batch_size, seq_len) where each element is the log probability
        corresponding to the token in input_ids at that position.

    Explanation:
        1. F.log_softmax is applied along the vocabulary dimension (dim=-1) to convert logits into log probabilities.
        2. The tensor input_ids is reshaped (via unsqueeze) to have an extra dimension so that we can use it as indices
           in the log_probs tensor.
        3. torch.gather collects the log probability at the index specified in input_ids for each position.
        4. Finally, squeeze(-1) removes the extra dimension, returning a tensor with the same shape as input_ids.
    """
    # Convert raw logits into log probabilities along the vocabulary axis.
    log_probs = F.log_softmax(logits, dim=-1)  # Shape: (batch_size, seq_len, vocab_size)

    # Reshape input_ids from (batch_size, seq_len) to (batch_size, seq_len, 1) for gathering.
    # Then, gather the log probability for each token in input_ids.
    selected_log_probs = log_probs.gather(dim=-1, index=input_ids.unsqueeze(-1))

    # Remove the extra last dimension to get back to shape (batch_size, seq_len).
    return selected_log_probs.squeeze(-1)
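The selection logic can be checked on a single row without torch; this plain-Python helper (hypothetical, for illustration only) computes the same quantity for one position:

```python
import math

def log_softmax_at(logits_row, token_id):
    # Log-softmax of one row of logits, evaluated only at the chosen token id.
    m = max(logits_row)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return logits_row[token_id] - log_z
```

A uniform two-token row gives log(1/2) for either token, and exponentiating the values across a row always sums to 1.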


def compute_log_probabilities(model, input_ids, attention_mask, logits_to_keep):
    """Compute per-token log probabilities for a subset of tokens (typically the completion tokens).

    Args:
        model: The language model to use.
        input_ids (torch.Tensor): Tensor of shape (batch_size, total_seq_len) containing token ids
            for both prompt and completion.
        attention_mask (torch.Tensor): Tensor of shape (batch_size, total_seq_len) indicating which tokens are real (1) or padding (0).
        logits_to_keep (int): Number of tokens (from the completion part) for which we need log probabilities.

    Returns:
        torch.Tensor: Log probabilities for the last `logits_to_keep` tokens of each sequence.

    Explanation:
        1. We call the model with logits_to_keep + 1 so that the model outputs one extra logit beyond what is needed.
           This is common in next-token prediction setups.
        2. We slice off the last logit along the sequence dimension because it does not correspond to any target token.
        3. We then restrict both the input_ids and logits to the last logits_to_keep tokens, which
           correspond to the generated completion portion.
        4. Finally, we use selective_log_softmax to compute log probabilities only for those tokens.
    """
    # Run the model forward pass and obtain logits.
    logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        logits_to_keep=logits_to_keep + 1,  # Request one extra logit for proper alignment.
    ).logits  # Shape: (batch_size, logits_to_keep + 1, vocab_size)

    # Remove the last logit as it does not have a corresponding target token.
    logits = logits[:, :-1, :]

    # Slice the input_ids to keep only the last logits_to_keep tokens.
    # This corresponds to the generated completion tokens.
    input_ids = input_ids[:, -logits_to_keep:]  # Shape: (batch_size, logits_to_keep)

    # Also slice the logits to keep only those corresponding to the completion tokens.
    logits = logits[:, -logits_to_keep:, :]  # Shape: (batch_size, logits_to_keep, vocab_size)

    # Compute and return the log probabilities for the selected tokens.
    return selective_log_softmax(logits, input_ids)

def create_completion_mask(completion_ids, eos_token_id):
    """Create a binary mask for the generated completion tokens so that tokens after the first EOS are ignored.

    Args:
        completion_ids (torch.Tensor): Tensor of shape (batch_size, seq_len) with generated token ids.
        eos_token_id (int): The token id representing the end-of-sequence.

    Returns:
        torch.Tensor: A mask tensor of shape (batch_size, seq_len) with 1s for tokens up to and including the first EOS
        and 0s for tokens following the first EOS.

    Explanation:
        1. First, a boolean mask (is_eos) is created indicating where in the sequence the EOS token appears.
        2. An index tensor (eos_idx) is initialized, assuming that no EOS is found (defaulting to the sequence length).
        3. For sequences where an EOS exists, eos_idx is updated to the position (index) of the first EOS.
        4. A sequence index tensor is created that contains indices for each position in the sequence.
        5. The final mask is computed by comparing the sequence indices to eos_idx (after adding a dimension).
    """
    # Determine which positions in each sequence equal the EOS token.
    is_eos = completion_ids == eos_token_id  # Boolean tensor of shape (batch_size, seq_len)

    # Initialize a tensor to store the index of the first EOS for each sequence.
    # If no EOS is found, default to the full sequence length (is_eos.size(1)).
    eos_idx = torch.full((is_eos.size(0),), is_eos.size(1), dtype=torch.long, device=completion_ids.device)

    # Identify sequences that contain at least one EOS.
    mask_exists = is_eos.any(dim=1)
    # For sequences with an EOS, update eos_idx to the index of the first occurrence.
    eos_idx[mask_exists] = is_eos.int().argmax(dim=1)[mask_exists]

    # Create a tensor of indices [0, 1, 2, ..., seq_len-1] and replicate it for each sequence in the batch.
    sequence_indices = torch.arange(is_eos.size(1), device=completion_ids.device).expand(is_eos.size(0), -1)

    # Build the mask: positions with an index less than or equal to the first EOS index are marked as 1.
    completion_mask = (sequence_indices <= eos_idx.unsqueeze(1)).int()

    return completion_mask
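The masking rule above (keep everything up to and including the first EOS, zero afterwards) is easy to verify with a plain-Python analogue (illustrative only, not part of the implementation):

```python
def first_eos_mask(token_ids, eos_id):
    mask, seen_eos = [], False
    for tok in token_ids:
        # Tokens up to and including the first EOS stay active (1); the rest are masked (0).
        mask.append(0 if seen_eos else 1)
        if tok == eos_id:
            seen_eos = True
    return mask
```

For example, with `eos_id=2`, the sequence `[5, 7, 2, 9, 2]` yields the mask `[1, 1, 1, 0, 0]`: the first EOS stays active and everything after it is ignored.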


def generate_completions(model, tokenizer, prompts, num_generations=4, max_completion_length=32):
    """Generate multiple completions for each prompt and create corresponding attention masks.

    Args:
        model: The language model used for generation.
        tokenizer: The tokenizer to process the prompts and decode the outputs.
        prompts (list of str): List of input prompt strings.
        num_generations (int): Number of completions to generate per prompt.
        max_completion_length (int): Maximum number of new tokens to generate for the completion.

    Returns:
        tuple: Contains the following tensors:
            - prompt_ids: (batch_size * num_generations, prompt_seq_len)
            - prompt_mask: (batch_size * num_generations, prompt_seq_len)
            - completion_ids: (batch_size * num_generations, completion_seq_len)
            - completion_mask: (batch_size * num_generations, completion_seq_len)

    Explanation:
        1. The prompts are tokenized and padded (with padding added to the left).
        2. Each prompt is repeated num_generations times so that multiple completions are generated per prompt.
        3. The model.generate() function is called to generate new tokens.
        4. The generated output contains the prompt followed by the completion; we remove the prompt part to get the completions.
        5. A mask is created (via create_completion_mask) so that only tokens up to the first EOS are considered.
    """
    device = next(model.parameters()).device

    # Tokenize the list of prompts with left padding, so that the prompts are right-aligned
    # and every completion starts at the same position.
    tokenizer.padding_side = "left"
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    prompt_ids = inputs["input_ids"].to(device)  # Shape: (batch_size, prompt_seq_len)
    prompt_mask = inputs["attention_mask"].to(device)  # Shape: (batch_size, prompt_seq_len)
    prompt_length = prompt_ids.size(1)  # Save the prompt length to later separate prompt from completion.

    # Repeat each prompt num_generations times.
    prompt_ids = prompt_ids.repeat_interleave(
        num_generations, dim=0
    )  # New shape: (batch_size*num_generations, prompt_seq_len)
    prompt_mask = prompt_mask.repeat_interleave(
        num_generations, dim=0
    )  # New shape: (batch_size*num_generations, prompt_seq_len)

    # Generate new tokens for each prompt. The output includes the original prompt and the generated tokens.
    outputs = model.generate(
        prompt_ids,
        attention_mask=prompt_mask,
        max_new_tokens=max_completion_length,
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Remove the prompt portion from the generated output to isolate the completion tokens.
    completion_ids = outputs[:, prompt_length:]  # Shape: (batch_size*num_generations, completion_seq_len)

    # Create a binary mask that ignores tokens beyond the first EOS token.
    completion_mask = create_completion_mask(completion_ids, tokenizer.eos_token_id)

    return prompt_ids, prompt_mask, completion_ids, completion_mask


def generate_rollout_data(model, ref_model, tokenizer, batch_samples, num_generations, max_completion_length):
    """Generate rollouts and compute static log probabilities for both the old policy (current model)
    and the reference model. Gradients are disabled so that these remain fixed.

    Args:
        model: The current model (policy) used to generate rollouts.
        ref_model: The static reference model.
        tokenizer: The tokenizer.
        batch_samples: List of training samples.
        num_generations: Number of completions to generate per prompt.
        max_completion_length: Maximum completion length.

    Returns:
        A dictionary with rollout data including both old and reference log probabilities.
    """
    tokenizer.padding_side = "left"

    # Extract prompts and answers.
    prompts = [sample["prompt"] if isinstance(sample, dict) else sample[0] for sample in batch_samples]
    answers = [sample["answer"] if isinstance(sample, dict) else sample[1] for sample in batch_samples]

    # Generate completions and associated masks.
    # We generate once, and then use the same completions to compute both sets of log probabilities.
    with torch.no_grad():
        prompt_ids, prompt_mask, completion_ids, completion_mask = generate_completions(
            model, tokenizer, prompts, num_generations, max_completion_length
        )
        input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
        attention_mask = torch.cat([prompt_mask, completion_mask], dim=1)
        logits_to_keep = completion_ids.size(1)

        # Compute old_log_probs from the current model, with gradients disabled.
        old_log_probs = compute_log_probabilities(model, input_ids, attention_mask, logits_to_keep)

        # Compute ref_log_probs from the reference model, which remains static.
        ref_log_probs = compute_log_probabilities(ref_model, input_ids, attention_mask, logits_to_keep)

    formatted_completions = [[{"content": tokenizer.decode(ids, skip_special_tokens=True)}] for ids in completion_ids]
    repeated_prompts = [p for p in prompts for _ in range(num_generations)]
    repeated_answers = [a for a in answers for _ in range(num_generations)]

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "completion_mask": completion_mask,
        "old_log_probs": old_log_probs,  # Static log probs from the current model (old policy)
        "ref_log_probs": ref_log_probs,  # Static log probs from the reference model
        "formatted_completions": formatted_completions,
        "repeated_prompts": repeated_prompts,
        "repeated_answers": repeated_answers,
        "logits_to_keep": logits_to_keep,
        "batch_size": len(prompts),
        "num_generations": num_generations,
    }
