Commit 67a127b

jiawei415, gemini-code-assist[bot], and tongyx361 authored
[algo] feat: add optimal token baseline and variance proxy (verl-project#4678)
# Optimal Token Baseline

## Main feature

- Register `AdvantageEstimator.OPTIMAL_TOKEN_BASELINE`.
- Extend the DP actor to emit `sum_pi_squared`, expose `calculate_sum_pi_squared` and checkpointing toggles across configs, and add a reusable `calculate_sum_pi_squared_from_logits` function.
- Introduce `compute_variance_proxy_metrics` to surface signal/total power/noise diagnostics during training.
- Document the method in `docs/algo/otb.md` and ship an executable example at `examples/otb_trainer/run_qwen2_5-7b.sh`.

## Usage

- Enable OTB by overriding config keys (OmegaConf overlay):

  ```yaml
  algorithm.adv_estimator: optimal_token_baseline
  actor_rollout_ref:
    actor:
      calculate_sum_pi_squared: true
      sum_pi_squared_checkpointing: false  # optional for long contexts
    rollout:
      n: 8
  ```

- Run the example script (adjust dataset paths and WandB project as needed):

  ```bash
  bash examples/otb_trainer/run_qwen2_5-7b.sh
  ```

- Monitor the new variance proxies in trainer logs: `variance_proxy/proxy1_signal_strength`, `variance_proxy/proxy2_total_power`, `variance_proxy/proxy3_pure_noise`.

## Key notes

- `actor.calculate_sum_pi_squared` requires `actor_rollout_ref.model.use_fused_kernels=False`; fused kernels must surface logits before OTB can run there.
- Group sampling is mandatory (`rollout.n > 1`); with single-rollout batches OTB collapses to vanilla returns.

---

UPDATE (@tongyx361): `compute_sum_pi_squared` is changed to `calculate_sum_pi_squared` for consistency with `calculate_entropy`.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Shawn/Yuxuan Tong <[email protected]>
1 parent f747011 commit 67a127b

File tree

13 files changed (+685, -39 lines)


docs/algo/otb.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Optimal Token Baseline (OTB)

Last updated: 12/25/2025.

Optimal Token Baseline (OTB) is a dynamic token-level baseline for variance reduction. It weights updates based on "Realized Energy"—essentially, how much uncertainty has accumulated up to that specific token. It downweights the noisy parts and trusts the clear signals. Read the [Optimal Token Baseline blog](https://richardli.xyz/optimal-token-baseline) for more details.

## The method: OTB

- OTB builds a _dynamic_ baseline that adapts to each token by tracking the "Realized Energy"—the uncertainty that has accumulated up to that token. It downweights the noisy parts and trusts the clear signals.
- Unlike standard group means, which ineffectively include padding `EOS` tokens in the average, OTB handles padding naturally by computing baselines only over valid tokens.
## Logit-Gradient Proxy
13+
14+
- Computing true uncertainty requires expensive backward passes (calculating gradient norms per token). Instead, OTB introduces the **Logit-Gradient Proxy**: the realized energy can be estimated entirely from forward probabilities.
15+
- This means zero extra backward calls and effectively no additional runtime overhead.
16+
17+
## Mechanics at a glance
18+
19+
For each prompt group of size `N`, OTB computes rewards-to-go `G_t` and cumulative variance weights `W_t`. The optimal baseline per token is
20+
21+
```
22+
B*_t = (Σ_i G_t^{(i)} · W_t^{(i)}) / (Σ_i W_t^{(i)} + ε),
23+
W_t = Σ_{j=1}^t (1 - 2π_j + Σπ_j²),
24+
Σπ_j² = exp(logsumexp(2·logits_j) - 2·logsumexp(logits_j)).
25+
```
26+
27+
The final advantage is `(G_t - B*_t) · mask_t`, so padding tokens stay at zero.
28+
29+
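
A minimal PyTorch sketch of these formulas, assuming per-token rewards, sampled-token probabilities `pi`, a precomputed `sum_pi_squared` tensor, and a batch laid out as consecutive groups of `group_size` responses per prompt. The names here are illustrative; the actual implementation is `compute_optimal_token_baseline_advantage` in verl.

```python
import torch


def sum_pi_squared_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Σπ² per token from logits of shape (B, T, V):
    exp(logsumexp(2·logits) - 2·logsumexp(logits))."""
    return torch.exp(
        torch.logsumexp(2.0 * logits, dim=-1) - 2.0 * torch.logsumexp(logits, dim=-1)
    )


def otb_advantage(token_rewards, pi, sum_pi_squared, mask, group_size, eps=1e-8):
    """OTB advantages for tensors of shape (B, T), with B a multiple of group_size."""
    B, T = token_rewards.shape
    # Rewards-to-go G_t: reverse cumulative sum over valid tokens.
    G = torch.flip(torch.cumsum(torch.flip(token_rewards * mask, dims=[1]), dim=1), dims=[1])
    # Realized energy W_t = Σ_{j<=t} (1 - 2π_j + Σπ_j²), accumulated over valid tokens.
    W = torch.cumsum((1.0 - 2.0 * pi + sum_pi_squared) * mask, dim=1)
    # Per-token optimal baseline, aggregated over the N responses of each prompt group.
    Gg, Wg = G.view(-1, group_size, T), W.view(-1, group_size, T)
    baseline = (Gg * Wg).sum(dim=1, keepdim=True) / (Wg.sum(dim=1, keepdim=True) + eps)
    # Final advantage (G_t - B*_t) · mask_t, so padding tokens stay at zero.
    return (Gg - baseline).view(B, T) * mask
```

The `view(-1, group_size, T)` grouping assumes responses for the same prompt are contiguous in the batch; the trainer instead regroups trajectories by `non_tensor_batch["uid"]`, as described in the next section.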
## Integration in VERL
30+
31+
- `AdvantageEstimator.OPTIMAL_TOKEN_BASELINE` registers `compute_optimal_token_baseline_advantage`, invoked whenever `algorithm.adv_estimator` is set to `optimal_token_baseline`.
32+
- `ActorRolloutRefWorker.compute_log_prob` emits an additional tensor `sum_pi_squared` (Σπ² per token) when `actor.calculate_sum_pi_squared=True`. This requires disabling fused log-prob kernels, because they do not surface logits.
33+
- Trainers assert `sum_pi_squared` exists, regroup trajectories by `non_tensor_batch["uid"]`, and run the OTB calculation. If rollout IS is active, they rescale the weights by `rollout_is_weights**2` before aggregating.
34+
- In Ulysses sequence-parallel setups, the actor gathers, unpads, and returns Σπ² in the same way it handles log-probabilities, so OTB supports sharded sequence-parallel models out of the box.
35+
- `sum_pi_squared_checkpointing` is available to trade compute for memory when Σπ² tensors become large (e.g., lengthy chain-of-thought reasoning).
36+
37+
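
A schematic of that trainer-side flow (regrouping by uid and the optional IS rescale), assuming the same tensors as the sketch above; the function names are illustrative rather than verl's actual code path:

```python
from collections import defaultdict


def regroup_by_uid(uids):
    """Collect batch indices that share a prompt uid (cf. non_tensor_batch["uid"])."""
    groups = defaultdict(list)
    for idx, uid in enumerate(uids):
        groups[uid].append(idx)
    return list(groups.values())


def otb_baseline_for_group(G, W, rollout_is_weights=None, eps=1e-8):
    """Per-token baseline for one uid group, G and W of shape (n, T); optionally
    rescale the realized-energy weights by squared rollout IS weights first."""
    if rollout_is_weights is not None:
        W = W * rollout_is_weights ** 2
    return (G * W).sum(dim=0, keepdim=True) / (W.sum(dim=0, keepdim=True) + eps)
```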

## Configuration checklist

- `actor_rollout_ref.actor.calculate_sum_pi_squared: true` (mandatory).
- `actor_rollout_ref.model.use_fused_kernels: false` (required until fused kernels emit logits).
- `algorithm.adv_estimator: optimal_token_baseline`.
- Group sampling (`actor_rollout_ref.rollout.n > 1`) to unlock OTB's variance reduction; with `n=1` the baseline collapses to returns.

Example OmegaConf overlay:

```yaml
algorithm:
  adv_estimator: optimal_token_baseline

actor_rollout_ref:
  actor:
    calculate_sum_pi_squared: true
    sum_pi_squared_checkpointing: false  # optional memory saver
  rollout:
    n: 8
```

## Example script

- `examples/otb_trainer/run_qwen2_5-7b.sh`.

## Gradient Variance Proxy Metrics

All gradient-variance analysis in the Optimal Token Baseline work starts from the variance identity

```
Var(ĝ) = E[||ĝ||²] - ||E[ĝ]||²,
```

which states that the variance of any stochastic gradient equals the mean squared magnitude minus the squared norm of its expectation.

For a trajectory `τ`, the policy-gradient estimator is

```
ĝ(τ) = ∇ log π_θ(τ) · A(τ),  A(τ) = R(τ) - B.
```

The logit-gradient proxy approximates the squared gradient norm without an extra backward pass:

```
||ĝ(τ)||² ≈ Ŵ(τ) · A(τ)²,
```

where `Ŵ(τ)` is the realized energy built from the logit-gradient proxy. Given a mini-batch `{τ_i}` of size `N`, we decompose its statistics into three diagnostics:

- **Signal strength (squared norm of the mean gradient)**
  ```
  S = || (1/N) · Σ ĝ(τ_i) ||²
  ```
- **Total power (signal + noise)**
  ```
  P_total = (1/N) · Σ Ŵ(τ_i) · A(τ_i)²
  ```
- **Pure noise (estimated variance of the batch mean)**
  ```
  Var_proxy = (1/(N-1)) · (P_total - S)
  ```

`verl/trainer/ppo/metric_utils.py#L306` implements these diagnostics via `compute_variance_proxy_metrics`, emitting
`variance_proxy/proxy1_signal_strength`,
`variance_proxy/proxy2_total_power`, and
`variance_proxy/proxy3_pure_noise`.

Tracking these metrics provides a forward-only, low-overhead view of gradient health for any advantage estimator that supplies `sum_pi_squared`.
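
A minimal sketch of these three quantities, assuming per-trajectory realized energy `W` and advantages `A` as 1-D tensors. `P_total` and `Var_proxy` follow the formulas above directly; for `S`, this sketch substitutes the signed scalar proxy `sqrt(W)·A` for each trajectory's gradient before averaging, which is an assumption made here for illustration; the exact estimator is `compute_variance_proxy_metrics` in `verl/trainer/ppo/metric_utils.py`.

```python
import torch


def variance_proxy_metrics(W: torch.Tensor, A: torch.Tensor) -> dict:
    """Batch diagnostics from per-trajectory realized energy W(τ_i) and advantages
    A(τ_i), both of shape (N,). S uses sqrt(W)·A as a scalar stand-in for ĝ(τ_i)."""
    n = A.shape[0]
    g_proxy = torch.sqrt(W.clamp_min(0)) * A              # scalar stand-in for ĝ(τ_i)
    signal = g_proxy.mean().pow(2)                        # S = ||(1/N)·Σ ĝ(τ_i)||² (proxy)
    total_power = (W * A.pow(2)).mean()                   # P_total = (1/N)·Σ Ŵ(τ_i)·A(τ_i)²
    pure_noise = (total_power - signal) / max(n - 1, 1)   # Var_proxy = (P_total - S)/(N-1)
    return {
        "variance_proxy/proxy1_signal_strength": signal.item(),
        "variance_proxy/proxy2_total_power": total_power.item(),
        "variance_proxy/proxy3_pure_noise": pure_noise.item(),
    }
```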

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@ verl is fast with:
    algo/gpg.md
    algo/rollout_corr.md
    algo/rollout_corr_math.md
+   algo/otb.md

 .. toctree::
    :maxdepth: 1
examples/otb_trainer/run_qwen2_5-7b.sh

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
set -x

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
gsm8k_test_path=$HOME/data/gsm8k/test.parquet
math_train_path=$HOME/data/math/train.parquet
math_test_path=$HOME/data/math/test.parquet

train_files="['$gsm8k_train_path', '$math_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path']"

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=optimal_token_baseline \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=128 \
    data.max_prompt_length=1024 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.use_fused_kernels=False \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.use_dynamic_bsz=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.calculate_sum_pi_squared=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.75 \
    actor_rollout_ref.rollout.n=8 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen2_5-7b-otb' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 $@

tests/workers/actor/test_special_dp_actor.py

Lines changed: 18 additions & 13 deletions
@@ -174,7 +174,9 @@ def test_compute_log_prob(self):
         """Test compute_log_prob method"""
         data = self._create_test_data_for_compute_log_prob()

-        log_probs, entropies = self.actor.compute_log_prob(data, calculate_entropy=True)
+        outputs = self.actor.compute_log_prob(data, calculate_entropy=True)
+        log_probs = outputs["log_probs"]
+        entropys = outputs["entropys"]

         batch_size = data.batch["responses"].shape[0]
         response_length = data.batch["responses"].shape[1]
@@ -183,25 +185,26 @@ def test_compute_log_prob(self):
         self.assertEqual(log_probs.shape, (batch_size, response_length))
         self.assertTrue(torch.all(torch.isfinite(log_probs)))

-        self.assertIsInstance(entropies, torch.Tensor)
-        self.assertEqual(entropies.shape, (batch_size, response_length))
-        self.assertTrue(torch.all(torch.isfinite(entropies)))
-        self.assertTrue(torch.all(entropies >= 0))  # Entropy should be non-negative
+        self.assertIsInstance(entropys, torch.Tensor)
+        self.assertEqual(entropys.shape, (batch_size, response_length))
+        self.assertTrue(torch.all(torch.isfinite(entropys)))
+        self.assertTrue(torch.all(entropys >= 0))  # Entropy should be non-negative

     def test_compute_log_prob_without_entropy(self):
         """Test compute_log_prob method without entropy calculation"""
         data = self._create_test_data_for_compute_log_prob()

-        log_probs, entropies = self.actor.compute_log_prob(data, calculate_entropy=False)
+        outputs = self.actor.compute_log_prob(data, calculate_entropy=False)
+        log_probs = outputs["log_probs"]
+        entropys = outputs.get("entropys", None)

         batch_size = data.batch["responses"].shape[0]
         response_length = data.batch["responses"].shape[1]

         self.assertIsInstance(log_probs, torch.Tensor)
         self.assertEqual(log_probs.shape, (batch_size, response_length))
         self.assertTrue(torch.all(torch.isfinite(log_probs)))
-
-        self.assertIsNone(entropies)
+        self.assertIsNone(entropys)

     def test_update_policy(self):
         """Test update_policy method"""
@@ -259,7 +262,9 @@ def test_dataparallelppoactor_with_qwen3_model(self):
         qwen_actor = DataParallelPPOActor(config=self.config, actor_module=qwen_model, actor_optimizer=qwen_optimizer)

         data = self._create_test_data_for_compute_log_prob()
-        log_probs, entropies = qwen_actor.compute_log_prob(data, calculate_entropy=True)
+        outputs = qwen_actor.compute_log_prob(data, calculate_entropy=True)
+        log_probs = outputs["log_probs"]
+        entropys = outputs["entropys"]

         batch_size = data.batch["responses"].shape[0]
         response_length = data.batch["responses"].shape[1]
@@ -268,10 +273,10 @@ def test_dataparallelppoactor_with_qwen3_model(self):
         self.assertEqual(log_probs.shape, (batch_size, response_length))
         self.assertTrue(torch.all(torch.isfinite(log_probs)))

-        self.assertIsInstance(entropies, torch.Tensor)
-        self.assertEqual(entropies.shape, (batch_size, response_length))
-        self.assertTrue(torch.all(torch.isfinite(entropies)))
-        self.assertTrue(torch.all(entropies >= 0))
+        self.assertIsInstance(entropys, torch.Tensor)
+        self.assertEqual(entropys.shape, (batch_size, response_length))
+        self.assertTrue(torch.all(torch.isfinite(entropys)))
+        self.assertTrue(torch.all(entropys >= 0))

         policy_data = self._create_test_data_for_update_policy()
         metrics = qwen_actor.update_policy(policy_data)

verl/trainer/config/_generated_ppo_trainer.yaml

Lines changed: 2 additions & 0 deletions
@@ -123,6 +123,8 @@ actor_rollout_ref:
     entropy_from_logits_with_chunking: false
     entropy_checkpointing: false
     use_remove_padding: ${oc.select:actor_rollout_ref.model.use_remove_padding,false}
+    calculate_sum_pi_squared: false
+    sum_pi_squared_checkpointing: false
   ref:
     rollout_n: ${oc.select:actor_rollout_ref.rollout.n,1}
     strategy: ${actor_rollout_ref.actor.strategy}

verl/trainer/config/actor/dp_actor.yaml

Lines changed: 8 additions & 1 deletion
@@ -40,4 +40,11 @@ entropy_from_logits_with_chunking: False
 entropy_checkpointing: False

 # Whether to remove padding tokens in inputs during training
-use_remove_padding: ${oc.select:actor_rollout_ref.model.use_remove_padding,false}
+use_remove_padding: ${oc.select:actor_rollout_ref.model.use_remove_padding,false}
+
+# This computes Σπ² needed for the Logit-Gradient Norm proxy W(τ) = Σ_t[1 - 2π_t + Σπ²]
+# c.f. https://yingru.notion.site/The-Optimal-Token-Baseline-399211a558b782cfa936014c0d42dfb8
+calculate_sum_pi_squared: False
+
+# Enable gradient checkpointing for sum_pi_squared computation (saves memory)
+sum_pi_squared_checkpointing: False
