Shreeyam Bangera
April 2025
LLMs, or Large Language Models, are the talk of the hour, with everyone using ChatGPT, Copilot, Claude, Gemini, and other well-known LLMs for generating human-like text, writing code, doing research, and many other things. One may ask: how do they gain access to all this worldly knowledge? This happens through a process called "pretraining", where the model is fed a huge corpus of text from the internet, books, articles, forums, and any other publicly available data source, with the sole objective of predicting the next probable word.
But pretraining on its own is not enough to get a model whose outputs align with the user's prompts. Pretraining gives the model a lot of general knowledge but no specific skill for a particular task. Fine-tuning solves this problem: the model is trained a bit more on a task-specific dataset, tuning its weights so that it can apply the knowledge learnt during pretraining to the task given by the user.
But even after all this, the model might generate text or information that does not align with human values. For example, if a student asks how to deal with increasing competition in college exams, then instead of helping them improve their studies, it may suggest sabotaging other students. This is the problem of AI misalignment. Text generated like this may be toxic, biased on various premises, cause harm, and at times even provide faulty and incorrect responses.
AI alignment can be done by various methods, but the most widely used is Reinforcement Learning (RL). It usually uses a reward model to teach the AI to align itself to human preferences, giving a positive reward for each desirable generation and a negative reward for each undesirable one. Building on the reward model, Reinforcement Learning from Human Feedback (RLHF) has been an impactful technique for training modern language models such as ChatGPT (like when ChatGPT generates two responses and asks which one you prefer). In RLHF, the model is fine-tuned based on scores/labels provided by humans via approaches like PPO, DPO, and GRPO, which will be expanded on in this blog.
A) Large Language Models
Word2Vec learns word representations by maximizing the probability of context words given a target word using the Skip-Gram model:
\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log P(w_{t+j} \mid w_t)
\]
LLMs model the probability of the next word in a sequence, given the previous ones. Mathematically:
\[
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
\]
The model's parameters are optimized by minimizing the negative log-likelihood loss over a large corpus using stochastic gradient descent (SGD) or its variants:
\[
\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log P(w_t \mid w_1, \dots, w_{t-1}; \theta)
\]
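As a toy illustration of this negative log-likelihood objective, here is a minimal sketch with a made-up four-word vocabulary and hypothetical logits (none of these numbers come from a real model):

```python
import numpy as np

# Hypothetical toy vocabulary and next-token logits; illustrative only.
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([2.0, 0.5, 0.1, -1.0])  # model scores for each candidate next token

# Softmax turns logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# If the true next token is "cat", the per-token loss is -log P("cat").
target = vocab.index("cat")
nll = -np.log(probs[target])
```

Summing this per-token loss over the corpus gives exactly the objective above.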
This is pretty much all one needs to know about LLMs to understand RLHF.
B) Reinforcement Learning Crash Course
State: The state $s_t$ represents the environment's configuration at time $t$.
Agent: The agent observes the state and selects actions to maximize cumulative reward.
Reward: The reward $r_t$ is a scalar signal the agent receives after taking an action in state $s_t$; it measures how good that action was.
Policy: The policy π defines a probability distribution over actions given a state.
The agent is the language model, the state is the query plus the tokens the model has output so far, the reward is the human's ranking of the model's output for the query, and the policy is the model itself, parameterized by its weights.
Lemma 1 (Stochastic Transitions): We model the next state as stochastic, i.e.,
\[
s_{t+1} \sim P(\cdot \mid s_t, a_t)
\]
Trajectory Probability: The probability of a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$ under policy $\pi$ is given by:
\[
P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)
\]
Lemma 2 (Discounted Rewards): We discount rewards since immediate rewards are preferred:
\[
R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad \gamma \in (0, 1)
\]
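The discounted return can be sketched in a few lines; the reward sequence below is made up for illustration:

```python
# A minimal sketch: the discounted return sum_t gamma^t * r_t
# for a toy reward sequence (rewards are illustrative).
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

print(discounted_return([1.0, 0.0, 1.0], gamma=0.5))  # 1 + 0 + 0.25 = 1.25
```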
Trajectories are basically a series of states and actions. The goal is to select a policy that maximizes the expected return:
\[
\pi^* = \arg\max_{\pi} J(\pi)
\]
The function J(π) represents this expected return. It is calculated by averaging the total rewards R(τ) received over all possible trajectories τ, weighted by how likely each trajectory is under the policy π. In other words, the better the policy, the more likely it is to generate high-reward trajectories:
\[
J(\pi) = \int_{\tau} P(\tau \mid \pi)\, R(\tau) = \mathbb{E}_{\tau \sim \pi} \left[ R(\tau) \right]
\]
To maximize the expected return in LLMs where the policy is parameterized by θ, we use gradient ascent as follows:
\[
\theta_{k+1} = \theta_k + \alpha\, \nabla_\theta J(\pi_\theta) \big|_{\theta_k}
\]
Now the goal is to find an expression for the gradient of $J$ and compute it. Of course, it is computationally impossible to calculate the return over all possible trajectories, so we approximate the policy gradient from a sample of trajectories $\mathcal{D}$:
\[
\nabla_\theta J(\pi_\theta) \approx \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)
\]
The original gradient estimator has pretty high variance because it dumps the entire return R(τ) on every action taken during the episode, even if that action had little to do with the final reward. This ends up making learning noisy and unstable. Now, thanks to the Central Limit Theorem, we know that as we collect more data, our estimate should eventually converge to the true gradient—but high variance means we need a lot of data to get there.
To deal with this, we switch to using the advantage function, defined as
\[
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t),
\]
which is still unbiased but way less noisy, making training smoother and more efficient.
3.1 Advantage Function and Its Estimation
The advantage function quantifies the relative benefit of taking a particular action in a given state, compared to the average performance of the policy from that state. It is defined as:
\[
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
\]
• $Q^{\pi}(s, a)$ is the expected return after taking action $a$ in state $s$ and thereafter following $\pi$.
• $V^{\pi}(s)$ is the expected return from state $s$ when actions are sampled from $\pi$.
Intuitively, $A^{\pi}(s, a) > 0$ means the action is better than the policy's average behaviour in that state, while $A^{\pi}(s, a) < 0$ means it is worse.
3.1.1 Monte Carlo Estimation
Monte Carlo (MC) methods estimate returns by sampling entire trajectories (episodes) and using the observed total return from a state (or state-action pair) as an unbiased estimator of expected return.
Let
\[
G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
\]
be the observed return from time step $t$ to the end of the episode. Then, the MC estimate of the advantage is:
\[
\hat{A}^{\text{MC}}_t = G_t - V_\phi(s_t),
\]
where $V_\phi(s_t)$ is the learned estimate of the state value.
Intuition: This approach directly compares what actually happened (via the observed return) to what the policy would expect from that state on average.
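A minimal sketch of this MC estimate, with illustrative rewards and value estimates (the value numbers are arbitrary stand-ins for a learned critic):

```python
import numpy as np

# Monte Carlo advantage: observed return G_t minus the value baseline V(s_t).
def mc_advantages(rewards, values, gamma=0.99):
    T = len(rewards)
    returns = np.zeros(T)
    g = 0.0
    for t in reversed(range(T)):      # accumulate G_t backwards through the episode
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns - np.asarray(values)  # A_t = G_t - V(s_t)

adv = mc_advantages([1.0, 1.0], [0.5, 0.5], gamma=1.0)
```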
3.1.2 Temporal-Difference (TD) Estimation
TD methods bootstrap from the value of the next state to estimate returns, which allows for online and incremental learning. The 1-step TD error is defined as:
\[
\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)
\]
This TD error serves as a low-variance, biased estimator of the advantage:
\[
\hat{A}^{\text{TD}}_t = \delta_t
\]
Intuition: Instead of waiting to see how the episode ends, TD uses the immediate reward and the estimated future return to approximate the advantage.
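The 1-step TD error is simple enough to sketch directly; the reward and value numbers below are made up:

```python
# 1-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
# Inputs are illustrative placeholders for a critic's value estimates.
def td_error(r_t, v_t, v_next, gamma=0.99):
    return r_t + gamma * v_next - v_t

delta = td_error(r_t=1.0, v_t=0.5, v_next=0.2, gamma=1.0)  # 1.0 + 0.2 - 0.5 = 0.7
```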
3.1.3 Generalized Advantage Estimation (GAE)
Generalized Advantage Estimation (GAE) provides a principled way to interpolate between the high-variance MC estimator and the high-bias TD estimator. It does so by taking an exponentially weighted sum of k-step TD errors.
Let $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ be the 1-step TD error. The GAE advantage is then:
\[
\hat{A}^{\text{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}
\]
For finite trajectories, this is truncated at the episode end. Alternatively, it can be computed efficiently in reverse via the recursion:
\[
\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}
\]
Parameters:
• $\gamma \in [0, 1]$: the discount factor, which down-weights future rewards.
• $\lambda \in [0, 1]$: the GAE parameter, which trades off bias against variance.
Interpretation:
• When $\lambda = 0$, GAE reduces to the 1-step TD estimator (low variance, high bias).
• When $\lambda = 1$, GAE recovers the MC estimator (high variance, low bias).
• Intermediate $\lambda$ values interpolate smoothly between the two extremes.
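The reverse recursion above can be sketched as follows (rewards and values are illustrative; note that setting the GAE parameter to 0 recovers the TD error and setting it to 1 recovers the MC estimate):

```python
import numpy as np

# GAE via the reverse recursion A_t = delta_t + gamma * lam * A_{t+1}.
def gae(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    values = np.asarray(values)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 < T else 0.0  # value after terminal state = 0
        delta = rewards[t] + gamma * v_next - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# lam=0 reduces to the 1-step TD errors; lam=1 recovers the MC estimate.
a = gae([1.0, 1.0], [0.5, 0.5], gamma=1.0, lam=0.0)
b = gae([1.0, 1.0], [0.5, 0.5], gamma=1.0, lam=1.0)
```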
3.1.4 Summary Table

Estimator          | Bias     | Variance
MC ($\lambda = 1$)     | unbiased | high
TD ($\lambda = 0$)     | biased   | low
GAE ($0 < \lambda < 1$) | tunable  | tunable
3.2 Failure Modes of Vanilla Policy Gradient (VPG)
The Vanilla Policy Gradient (VPG) method attempts to maximize the expected return by directly optimizing:
\[
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]
\]
Using the policy gradient theorem, the gradient is:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
\]
The VPG loss is defined as:
\[
L^{\text{PG}}(\theta) = - \mathbb{E}_t \left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
\]
Mathematical Issues:
1. Unconstrained Update Magnitude: The policy $\pi_\theta$ can move arbitrarily far from the old policy in a single gradient step; one badly scaled update can push the parameters into a region where the advantage estimates no longer apply. This destroys the probability of good actions and leads to performance collapse.
2. Distribution Mismatch: The trajectories are sampled from the old policy $\pi_{\theta_{\text{old}}}$, but the update is applied to the new policy $\pi_\theta$; as the two diverge, the gradient estimate becomes stale and misleading.
3. High Variance and Instability: Without any regularization or trust region, the updates are sensitive to noise in advantage estimates, leading to high variance and poor convergence.
3.3 Derivation of the PPO Objective
To address these issues, Proximal Policy Optimization (PPO) introduces a clipped surrogate objective that discourages large policy updates.
Step 1: Define the Probability Ratio. Let
\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\]
A ratio above 1 means the new policy makes the action more likely than the old policy did.
Step 2: Surrogate Objective. We want to improve the policy by maximizing the expected advantage weighted by this ratio:
\[
L^{\text{CPI}}(\theta) = \mathbb{E}_t \left[ r_t(\theta)\, \hat{A}_t \right]
\]
This is the basis for Conservative Policy Iteration. However, this objective still allows for large updates if $r_t(\theta)$ moves far from 1.
Step 3: Clipped Objective. PPO introduces a clipped surrogate loss:
\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\, \hat{A}_t,\; \text{clip}\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right]
\]
Interpretation:
• If $\hat{A}_t > 0$, the objective increases with $r_t(\theta)$ but is capped once $r_t(\theta) > 1 + \epsilon$, so a good action cannot be reinforced without bound.
• If $\hat{A}_t < 0$, the objective is floored once $r_t(\theta) < 1 - \epsilon$, so a bad action cannot be suppressed without bound.
• This prevents the optimizer from moving $r_t(\theta)$ far outside the interval $[1 - \epsilon,\, 1 + \epsilon]$ in a single update.
Final PPO Objective: In practice, the complete PPO loss also includes a value function loss and an entropy bonus:
\[
L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ L^{\text{CLIP}}_t(\theta) - c_1\, L^{\text{VF}}_t(\theta) + c_2\, S[\pi_\theta](s_t) \right]
\]
• $L^{\text{VF}}_t(\theta) = \left( V_\theta(s_t) - V^{\text{targ}}_t \right)^2$ is the value function loss.
• $S[\pi_\theta](s_t)$ is an entropy bonus that encourages exploration.
• $c_1$ and $c_2$ are weighting coefficients.
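A numpy sketch of the clipped surrogate term (policy part only; the value loss and entropy bonus are omitted, and the log-probabilities and advantages are made up):

```python
import numpy as np

# Clipped PPO surrogate: -mean(min(r * A, clip(r, 1-eps, 1+eps) * A)).
def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(logp_new - logp_old)            # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    # We maximize the clipped objective, so we minimize its negative.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

loss = ppo_clip_loss(
    logp_new=np.array([-0.5, -1.0]),   # illustrative per-action log-probs
    logp_old=np.array([-0.7, -0.9]),
    advantages=np.array([1.0, -1.0]),
    eps=0.2,
)
```

Note how a ratio of 2 with a positive advantage contributes only 1 + ε to the objective: the clip caps the incentive to over-reinforce any single action.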
Direct Preference Optimization (DPO) is an algorithm used in RLHF that fine-tunes language models without training a separate reward model; instead, it implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint). Unlike previous RLHF algorithms, it is simple to implement and straightforward to train.
Given a dataset of preferences (from human question answering), we have
\[
\mathcal{D} = \left\{ \left( x^{(i)}, y_w^{(i)}, y_l^{(i)} \right) \right\}_{i=1}^{N},
\]
where $x$ corresponds to the question or prompt asked of the model, $y_w$ is the response preferred by the human, and $y_l$ is the dispreferred one. Under the Bradley-Terry preference model, the probability that $y_w$ is preferred over $y_l$ is:
\[
p(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right)
\]
The loss function of the reward model can be given as:
\[
\mathcal{L}_R(\phi) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
\]
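This reward-model loss is easy to sketch for a single preference pair; the scalar rewards below are stand-ins for a reward model's outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bradley-Terry reward-model loss for one pair: -log sigmoid(r_chosen - r_rejected).
def reward_model_loss(r_chosen, r_rejected):
    return -np.log(sigmoid(r_chosen - r_rejected))

loss = reward_model_loss(r_chosen=2.0, r_rejected=0.0)
```

When the two rewards are equal, the loss is log 2 (the model is maximally uncertain); it shrinks as the margin between chosen and rejected grows.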
Our next goal is to maximize the probability that the preference model ranks our responses correctly. One may think the way to maximize a function is to take its derivative and set it to zero to find the optimum values of the variables. But in RLHF we have a constrained optimization problem: aside from maximizing the reward score, we want the policy to not behave too differently from the initial unoptimized policy (i.e., we add a KL-divergence constraint).
This sampling is not differentiable, so we cannot use methods such as gradient descent directly on this objective function, which is why we were previously forced to use RL algorithms like PPO. The KL-divergence constraint is added so that the model does not simply choose tokens that achieve high rewards but are actually absolute nonsense. This is known as reward hacking.
Following the 2023 NeurIPS paper on DPO, a solution to the optimization problem is given as:
\[
\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right),
\]
where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)$ is the partition function and $\beta$ controls the strength of the KL constraint.
Now, this may seem like a nice theoretical solution to the constrained problem, but it is not computationally tractable. Why, you may ask? The sum over all $y$ means that for every prompt, $Z(x)$ would need to be computed over every possible answer, and the size of the summation explodes with the vocabulary size.
Let's just leave that as it is for now and assume that we somehow manage to get the optimal policy $\pi^*$. Rearranging the expression above, the reward can be written in terms of the optimal policy:
\[
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
\]
If we put this expression into the Bradley-Terry model, we get:
\[
p^*(y_w \succ y_l \mid x) = \sigma\left( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)
\]
Notice that the intractable $\log Z(x)$ term cancels out.
To maximize the probability of choosing $y_w$ over $y_l$, we minimize the corresponding negative log-likelihood, which gives the DPO loss:
\[
\mathcal{L}_{\text{DPO}}(\theta) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\]
So now we can easily maximize the probability by minimizing this loss function, thus finding an optimum policy. This is much easier than optimizing the reward function like we did earlier. Optimizing the optimal policy also optimizes the earlier mentioned reward function as it depends on it.
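The DPO loss for a single preference pair can be sketched directly; the sequence log-probabilities under the policy and reference model below are made-up numbers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# DPO loss for one preference pair, from sequence log-probs under the
# policy (logp_*) and the frozen reference model (ref_logp_*).
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # beta-scaled implicit-reward margin between chosen and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

loss = dpo_loss(logp_w=-10.0, logp_l=-12.0,
                ref_logp_w=-11.0, ref_logp_l=-11.0, beta=0.1)
```

When the policy equals the reference, the margin is zero and the loss is log 2; pushing probability mass toward the chosen response drives the loss down.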
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed to enhance reasoning in large language models (LLMs) by directly leveraging preference signals without requiring a learned value function. Unlike Proximal Policy Optimization (PPO), which estimates advantages via a critic, GRPO computes relative advantages across multiple responses sampled from the same prompt.
Consider a policy $\pi_\theta$ that, for each prompt $x$, samples a group of $G$ responses $\{y_1, \dots, y_G\}$, each of which is scored by a reward model to produce rewards $\{r_1, \dots, r_G\}$.
To isolate how good a response is relative to its peers, GRPO normalizes the rewards within the group. The per-sample advantage is estimated via:
\[
\hat{A}_i = \frac{r_i - \text{mean}\left( \{r_1, \dots, r_G\} \right)}{\text{std}\left( \{r_1, \dots, r_G\} \right)}
\]
This normalized form ensures that only relative differences within the group drive the learning signal, stabilizing updates and eliminating reward-scale dependence.
Intuitively, if a response is much better than its siblings in the same group, it receives a large positive advantage and its tokens are reinforced; if it is worse than average, it receives a negative advantage and is discouraged.
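The group-normalized advantage can be sketched in a few lines (the group rewards are illustrative):

```python
import numpy as np

# GRPO group advantage: standardize each response's reward against its
# group's mean and standard deviation. Rewards are made-up examples.
def group_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # eps guards against zero std

adv = group_advantages([1.0, 0.0, 2.0, 1.0])
```

The advantages within a group always sum to (approximately) zero, so the learning signal is purely about relative quality, not absolute reward scale.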
PPO (which we talked about earlier) brings a substantial memory and computational burden, since it trains a separate value network alongside the policy. Also, only the last token is assigned a reward score by the reward model, which complicates the training of a value function. GRPO removes this problem and uses the average reward of multiple sampled outputs as a baseline. More simply, GRPO samples a group of outputs $\{o_1, \dots, o_G\}$ from the old policy for each question $q$ and maximizes the objective:
\[
J_{\text{GRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\, \hat{A}_i,\; \text{clip}\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1 - \varepsilon,\, 1 + \varepsilon \right) \hat{A}_i \right) - \beta\, D_{\text{KL}}\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right]
\]
where $\varepsilon$ and $\beta$ are hyper-parameters, and $D_{\text{KL}}$ is a KL penalty that keeps the policy close to the reference model.
If the group size is $G = 2$ and rewards are binary preferences (e.g., winner vs. loser), GRPO reduces to a form similar to Direct Preference Optimization (DPO), whose expression is given by the Bradley-Terry model.
Therefore, this makes DPO a special case of GRPO with hard binary feedback and no normalization, while GRPO generalizes to scalar rewards and group-based reasoning. It provides a robust and scalable method for optimization while also overcoming the limitation of DPO, which relies solely on binary feedback.


