[skyrl-train] Refactor TIS to use more comprehensive off policy correction config #849

erictang000 · 2026-01-07T00:41:24Z

Overview

Marks trainer.algorithm.use_tis and trainer.algorithm.tis_imp_ratio_cap for deprecation
Introduces new trainer.algorithm.off_policy_correction config (see new config below)
Updates loss functions to return a LossMetrics TypedDict of loss metrics (they previously returned only loss and clip_ratio)
Updates workers to all-reduce each metric with the appropriate reduction (mean, max, or min) and to propagate loss metrics back up to the trainer.
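
As a rough illustration of the last two points, the sketch below shows one way a metrics dict could be typed and then combined across workers. The field names and the `reduce_metrics` helper are hypothetical and are not taken from this PR; the suffix convention (`_max`, `_min`) is likewise an assumption made for the example, and a plain list of dicts stands in for a distributed all-reduce.

```python
from typing import Dict, List, TypedDict


class LossMetrics(TypedDict, total=False):
    # Hypothetical fields; the actual TypedDict in the PR may differ.
    clip_ratio: float
    is_ratio_mean: float
    is_ratio_max: float
    is_ratio_min: float


def reduce_metrics(per_worker: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-worker metrics into one dict.

    Assumed convention: keys ending in "_max" take the max, keys ending
    in "_min" take the min, and everything else is averaged.
    """
    reduced: Dict[str, float] = {}
    keys = set().union(*(m.keys() for m in per_worker))
    for key in sorted(keys):
        values = [m[key] for m in per_worker if key in m]
        if key.endswith("_max"):
            reduced[key] = max(values)
        elif key.endswith("_min"):
            reduced[key] = min(values)
        else:
            # Mean of per-worker means; exact only for equal-sized shards.
            reduced[key] = sum(values) / len(values)
    return reduced


worker_metrics = [
    {"clip_ratio": 0.1, "is_ratio_max": 2.0, "is_ratio_min": 0.5},
    {"clip_ratio": 0.3, "is_ratio_max": 3.0, "is_ratio_min": 0.25},
]
combined = reduce_metrics(worker_metrics)
```

Averaging a max or min across workers would understate the extremes, which is presumably why the reductions are handled separately per metric.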

Off Policy Correction Config

# To be deprecated in favor of off_policy_correction.tis_ratio_type = "token"
# and "token_tis_ratio_clip_high"
tis_imp_ratio_cap: -1.0
use_tis: false

off_policy_correction:
      # type of importance sampling ratio to use for ppo loss correction
      # here importance sampling ratio refers to exp(logprobs_{policy_old} - logprobs_{rollout_policy})
      tis_ratio_type: null # null, "token", "sequence"

      # used if tis_ratio_type = "token", 1.5-5.0 is recommended for "token" tis_ratio_type
      token_tis_ratio_clip_high: 2.0
      # used if tis_ratio_type = "sequence", 2.0-10.0 is recommended for "sequence" tis_ratio_type
      sequence_tis_ratio_clip_high: 5.0

      # method of masking out sequences with cumulative importance sampling ratios outside the cap
      # "product" masks out sequences with product of importance ratios outside the cap
      # "geometric" masks out sequences with geometric mean of importance ratios outside the cap
      sequence_mask_metric: null # null, "product", "geometric"

      # used if sequence_mask_metric = "geometric"
      # values around 0.99-1.01 are recommended for "geometric" sequence_mask_metric - MoE models may need larger allowed ranges due to higher mismatch
      geo_mask_high: 1.01
      geo_mask_low: 0.99

      # used if sequence_mask_metric = "product"
      # values around 0.5-2.0 are recommended for "product" sequence_mask_metric
      product_mask_high: 2.0
      product_mask_low: 0.5

      # separate from sequence_mask_metric and tis_ratio_type 
      # if any off_policy_correction is enabled, masks out sequences with any token having importance ratio
      # far outside an acceptable range (low and high thresholds)
      outlier_token_is_threshold_low: 1e-4
      outlier_token_is_threshold_high: 100
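
To make the interaction between these options concrete, here is a minimal pure-Python sketch for a single sequence. It is not the PR's implementation: `correct_sequence` is a hypothetical helper, the keyword arguments simply mirror the config keys above, and it assumes that the "sequence" ratio is the product of the per-token ratios and that truncation only caps the ratio from above.

```python
import math
from typing import List, Optional, Tuple


def correct_sequence(
    old_logprobs: List[float],
    rollout_logprobs: List[float],
    tis_ratio_type: Optional[str] = None,
    token_tis_ratio_clip_high: float = 2.0,
    sequence_tis_ratio_clip_high: float = 5.0,
    sequence_mask_metric: Optional[str] = None,
    geo_mask_low: float = 0.99,
    geo_mask_high: float = 1.01,
    product_mask_low: float = 0.5,
    product_mask_high: float = 2.0,
    outlier_token_is_threshold_low: float = 1e-4,
    outlier_token_is_threshold_high: float = 100.0,
) -> Tuple[List[float], bool]:
    """Return (per-token loss weights, keep_sequence) for one sequence."""
    # Importance ratio per token: exp(logprob_policy_old - logprob_rollout).
    log_ratios = [o - r for o, r in zip(old_logprobs, rollout_logprobs)]
    ratios = [math.exp(x) for x in log_ratios]
    seq_log_ratio = sum(log_ratios)

    # Truncated importance sampling: cap the ratio from above only.
    if tis_ratio_type == "token":
        weights = [min(r, token_tis_ratio_clip_high) for r in ratios]
    elif tis_ratio_type == "sequence":
        w = min(math.exp(seq_log_ratio), sequence_tis_ratio_clip_high)
        weights = [w] * len(ratios)
    else:
        weights = [1.0] * len(ratios)

    # Sequence-level rejection mask on the product or geometric mean.
    keep = True
    if sequence_mask_metric == "product":
        keep = product_mask_low <= math.exp(seq_log_ratio) <= product_mask_high
    elif sequence_mask_metric == "geometric":
        geo_mean = math.exp(seq_log_ratio / max(len(log_ratios), 1))
        keep = geo_mask_low <= geo_mean <= geo_mask_high

    # Outlier check runs only when some correction is enabled.
    if tis_ratio_type is not None or sequence_mask_metric is not None:
        low, high = outlier_token_is_threshold_low, outlier_token_is_threshold_high
        if any(r < low or r > high for r in ratios):
            keep = False
    return weights, keep


weights, keep = correct_sequence([-1.0, -1.0], [-1.0, -2.0], tis_ratio_type="token")
```

In this example the second token has ratio e (about 2.72), so its weight is capped at 2.0 under "token" TIS, while the same sequence would be rejected by either mask metric at the default thresholds.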


gemini-code-assist

Code Review

This pull request refactors the Truncated Importance Sampling (TIS) configuration into a more comprehensive rollout_correction system, which is a great improvement for structure and extensibility. The new implementation adds flexible rollout correction mechanisms, including different TIS ratio types and rejection masks. The changes are well-documented and handle the deprecation of old parameters gracefully. I've identified a bug in a conditional check that could cause a crash, and an opportunity to refactor for better efficiency and code clarity. My detailed feedback is in the comments below.
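
For readers wondering what the deprecation path might look like in practice, the following is a small sketch of mapping the old `use_tis` and `tis_imp_ratio_cap` keys onto the new block. `migrate_tis_config` is a hypothetical helper operating on a plain dict, not a function from skyrl-train, and the precedence rule (explicit new-style settings win) is an assumption.

```python
import warnings
from typing import Any, Dict


def migrate_tis_config(algorithm_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Map deprecated use_tis / tis_imp_ratio_cap onto off_policy_correction."""
    cfg = dict(algorithm_cfg)
    correction = dict(cfg.get("off_policy_correction") or {})
    if cfg.get("use_tis", False):
        warnings.warn(
            "use_tis and tis_imp_ratio_cap are deprecated; "
            "use off_policy_correction instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: an explicitly set tis_ratio_type takes precedence.
        if correction.get("tis_ratio_type") is None:
            correction["tis_ratio_type"] = "token"
            cap = cfg.get("tis_imp_ratio_cap", -1.0)
            if cap > 0:
                correction["token_tis_ratio_clip_high"] = cap
    cfg["off_policy_correction"] = correction
    return cfg


migrated = migrate_tis_config({"use_tis": True, "tis_imp_ratio_cap": 3.0})
```

The input dict is copied rather than mutated, so the caller's original config is left untouched.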

skyrl-train/skyrl_train/trainer.py

skyrl-train/skyrl_train/utils/ppo_utils.py


erictang000 added 2 commits January 7, 2026 00:40

x

f033e65

Merge branch 'main' of https://github.com/erictang000/SkyRL into rollout_correction

0b236fe

gemini-code-assist bot reviewed Jan 7, 2026


erictang000 added 12 commits January 7, 2026 00:50

x

3f3b759

x

29efd6f

x

1520157

x

45a59c2

fix tests and add rollout correction to other loss types

ce01bb2

add metrics

abac800

propagate metrics up and refactor how we do metric reductions for max and min

2dc7364

make default null and propagate megatron metrics

349369d

x:

f3f7054

Merge branch 'rollout_correction' of https://github.com/erictang000/SkyRL into rollout_correction

c45c130

big cleanup - remove clip_ratio return (fix custom algorithms stuff), unite metrics under loss_metrics, other clean up

63d38c5

x

7e83c10

erictang000 changed the title ~~[skyrl-train] Refactor TIS to use more comprehensive rollout correction config~~ [skyrl-train] Refactor TIS to use more comprehensive off policy correction config Jan 8, 2026

erictang000 added 3 commits January 8, 2026 23:17

renaming

cf042fc

x

cef7121

x

9485bdd
