[skyrl-train] Refactor TIS to use more comprehensive off policy correction config #849
base: main
Changes from 24 commits
f033e65
0b236fe
3f3b759
29efd6f
1520157
45a59c2
ce01bb2
abac800
2dc7364
349369d
f3f7054
c45c130
63d38c5
7e83c10
cf042fc
cef7121
9485bdd
9e11eda
c06747c
0b5ebfd
0697957
6b9e1e4
46b6fe5
d72d9c6
db76d01
ac0659c
7ddb85f
6c8d084
08c3625
2bca41f
0c8789a
8eb436a
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -389,6 +389,45 @@ Algorithm Configuration | |
| # dual clip parameters | ||
| clip_ratio_c: 3.0 | ||
|
|
||
| # To be deprecated in favor of off_policy_correction.tis_ratio_type = "token" | ||
| # and "token_tis_ratio_clip_high" | ||
| use_tis: false | ||
| tis_imp_ratio_cap: -1.0 | ||
|
|
||
| # references | ||
| # - https://github.com/szrlee/verl/blob/yingru/rollout_correction/docs/advance/rollout_corr_math.md | ||
| # - https://fengyao.notion.site/off-policy-rl | ||
| off_policy_correction: | ||
| # type of importance sampling ratio to use for ppo loss correction | ||
| # here importance sampling ratio refers to exp(logprobs_{policy_old} - logprobs_{rollout_policy}) | ||
| tis_ratio_type: null # null, "token", "sequence" | ||
|
|
||
| # used if tis_ratio_type = "token", 1.5-5.0 is recommended for "token" tis_ratio_type | ||
| token_tis_ratio_clip_high: 2.0 | ||
| # used if tis_ratio_type = "sequence", 2.0-10.0 is recommended for "sequence" tis_ratio_type | ||
| sequence_tis_ratio_clip_high: 5.0 | ||
|
|
||
| # method of masking out sequences with cumulative importance sampling ratios outside the cap | ||
| # "product" masks out sequences with product of importance ratios outside the cap | ||
| # "geometric" masks out sequences with geometric mean of importance ratios outside the cap | ||
| sequence_mask_metric: null # null, "product", "geometric" | ||
|
|
||
| # used if sequence_mask_metric = "geometric" | ||
| # values around 0.99-1.01 are recommended for "geometric" sequence_mask_metric - MoE models may need larger allowed ranges due to higher mismatch | ||
| geo_mask_high: 1.01 | ||
| geo_mask_low: 0.99 | ||
|
|
||
| # used if sequence_mask_metric = "product" | ||
| # values around 0.5-2.0 are recommended for "product" sequence_mask_metric | ||
| product_mask_high: 2.0 | ||
| product_mask_low: 0.5 | ||
|
|
||
| # separate from sequence_mask_metric and tis_ratio_type | ||
| # if any off_policy_correction is enabled, masks out sequences with any token having importance ratio | ||
| # far outside an acceptable range (low and high thresholds) | ||
| outlier_token_is_threshold_low: 1e-4 | ||
|
Collaborator
Can we set the default value to https://verl.readthedocs.io/en/latest/examples/config.html Not sure if you'd need to make changes on the implementation side to treat |
||
| outlier_token_is_threshold_high: 100 | ||
|
|
||
| # clip-cov parameters (only used when policy_loss_type: "clip_cov") | ||
| clip_cov: | ||
| clip_ratio: 0.0002 # fraction of tokens to clip based on covariance | ||
|
|
@@ -413,10 +452,6 @@ Algorithm Configuration | |
| type: null # filter (DAPO), replace (POLARIS/WebSailor), or null | ||
| max_sample_batches: 30 # sample at most this many batches before stopping, -1 to sample forever | ||
| min_replace_ratio: 0.3 # minimum proportion of good samples with which to replace bad samples (for replace strategy only) | ||
|
|
||
| # Truncated Importance Sampling as proposed in https://fengyao.notion.site/off-policy-rl | ||
| use_tis: false | ||
| tis_imp_ratio_cap: -1.0 | ||
|
|
||
| # SAPO parameters (only used when policy_loss_type: "sapo") (https://arxiv.org/pdf/2511.20347) | ||
| sapo: | ||
|
|
@@ -466,8 +501,8 @@ Algorithm Configuration | |
| - ``algorithm.dynamic_sampling.type``: Type of dynamic sampling to use. Currently, we support ``filter`` (`DAPO <https://dapo-sia.github.io/>`_), ``replace`` (`POLARIS <https://hkunlp.github.io/blog/2025/Polaris/>`_ / `WebSailor <https://arxiv.org/abs/2507.02592>`_), or ``null`` for no dynamic sampling. | ||
| - ``algorithm.dynamic_sampling.max_sample_batches``: Maximum number of batches to sample before stopping. Set to ``-1`` to sample forever. | ||
| - ``algorithm.dynamic_sampling.min_replace_ratio``: Minimum proportion of good samples with which to replace bad samples for ``replace`` strategy. | ||
| - ``algorithm.use_tis``: Whether to use Truncated Importance Sampling (TIS) as proposed in `this blog <https://fengyao.notion.site/off-policy-rl>`_. | ||
| - ``algorithm.tis_imp_ratio_cap``: Cap parameter for the importance ratio in TIS. | ||
| - ``algorithm.use_tis``: Whether to use Truncated Importance Sampling (TIS) as proposed in `this blog <https://fengyao.notion.site/off-policy-rl>`_. This flag is to be deprecated; use ``off_policy_correction.tis_ratio_type = "token"`` instead. | |
| - ``algorithm.tis_imp_ratio_cap``: Cap parameter for the importance ratio in TIS. This flag is to be deprecated; use ``off_policy_correction.token_tis_ratio_clip_high`` instead. | |
| - ``algorithm.clip_cov``: Clip-Cov parameters (only used when ``policy_loss_type`` is ``clip_cov``): | ||
|
|
||
| - ``clip_ratio``: Fraction of tokens to clip based on covariance values. | ||
|
|
@@ -489,6 +524,35 @@ Algorithm Configuration | |
| - ``tau_pos``: Temperature for gating function for tokens with positive advantages. | ||
| - ``tau_neg``: Temperature for gating function for tokens with negative (or zero) advantages. | ||
|
|
||
| Off Policy Correction Configuration | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
Collaborator
Let's cite the blogpost here as well.
Collaborator
Depends on how you add the separate correction doc page (see other comment). But it'd be easier for the user if we can do the following: help the users understand each config (3 groups of them) one by one by pointing them to other resources.
1. Group these three together and tell them:
2. Then group these together
3. Then group the outlier threshold together

Other remarks
Then pointing to our implementation would also be helpful. Namely the |
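Going by the YAML comments in this diff, the sequence-level masks and the outlier-token mask could be sketched roughly as below. This is a minimal pure-Python illustration under stated assumptions, not SkyRL's implementation: the function name `sequence_keep_mask` and the list-of-lists input are hypothetical, the thresholds are assumed to be inclusive, and the per-token log ratio is taken to be `logprobs_{policy_old} - logprobs_{rollout_policy}` as described in the config comments.

```python
import math
from typing import List, Optional


def sequence_keep_mask(
    log_ratios: List[List[float]],
    tis_ratio_type: Optional[str] = None,
    sequence_mask_metric: Optional[str] = None,
    geo_mask_low: float = 0.99,
    geo_mask_high: float = 1.01,
    product_mask_low: float = 0.5,
    product_mask_high: float = 2.0,
    outlier_token_is_threshold_low: float = 1e-4,
    outlier_token_is_threshold_high: float = 100.0,
) -> List[bool]:
    """Return one flag per sequence; True means the sequence stays in the loss.

    Hypothetical helper: each inner list holds per-token log ratios
    (old-policy logprob minus rollout-policy logprob) for one sequence.
    """
    # Per the config comment, the outlier mask only applies when some
    # off-policy correction is enabled.
    any_enabled = tis_ratio_type is not None or sequence_mask_metric is not None
    keep = []
    for seq in log_ratios:
        total = sum(seq)
        n_tokens = max(len(seq), 1)  # guard against empty sequences
        ok = True
        if sequence_mask_metric == "product":
            # Product of token ratios equals exp(sum of log ratios).
            ok = product_mask_low <= math.exp(total) <= product_mask_high
        elif sequence_mask_metric == "geometric":
            # Geometric mean of token ratios equals exp(mean of log ratios).
            ok = geo_mask_low <= math.exp(total / n_tokens) <= geo_mask_high
        if any_enabled:
            # Drop the whole sequence if any single token ratio is an outlier.
            ok = ok and all(
                outlier_token_is_threshold_low
                <= math.exp(lr)
                <= outlier_token_is_threshold_high
                for lr in seq
            )
        keep.append(ok)
    return keep
```

For example, a two-token sequence with log ratios `[0.3, 0.3]` passes the default "product" bounds (its product is about 1.82) but fails the much tighter default "geometric" bounds (its geometric mean is about 1.35), which gives some intuition for why the two metrics have such different recommended ranges.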
||
| - ``algorithm.off_policy_correction``: Off policy correction configuration. See the full configuration below: | |
|
|
||
| .. code-block:: yaml | ||
|
|
||
| off_policy_correction: | ||
| tis_ratio_type: null # null, "token", "sequence" | ||
| token_tis_ratio_clip_high: 2.0 | ||
| sequence_tis_ratio_clip_high: 5.0 | ||
| sequence_mask_metric: null # null, "product", "geometric" | ||
| geo_mask_high: 1.01 | ||
| geo_mask_low: 0.99 | ||
| product_mask_high: 2.0 | ||
| product_mask_low: 0.5 | ||
| outlier_token_is_threshold_low: 1e-4 | ||
| outlier_token_is_threshold_high: 100 | ||
|
|
||
| - ``algorithm.off_policy_correction.tis_ratio_type``: Type of importance sampling ratio to use for PPO loss correction, where the ratio is ``exp(logprobs_{policy_old} - logprobs_{rollout_policy})``. Options include: ``null``, ``token``, ``sequence``. | |
| - ``algorithm.off_policy_correction.token_tis_ratio_clip_high``: Cap parameter for "token" tis_ratio_type. | ||
| - ``algorithm.off_policy_correction.sequence_tis_ratio_clip_high``: Cap parameter for "sequence" tis_ratio_type. | ||
| - ``algorithm.off_policy_correction.sequence_mask_metric``: Method of masking out sequences with cumulative importance sampling ratios outside the cap. Options include: ``null``, ``product``, ``geometric``. | ||
| - ``algorithm.off_policy_correction.geo_mask_high``: High threshold for "geometric" sequence_mask_metric. | ||
| - ``algorithm.off_policy_correction.geo_mask_low``: Low threshold for "geometric" sequence_mask_metric. | ||
| - ``algorithm.off_policy_correction.product_mask_high``: High threshold for "product" sequence_mask_metric. | ||
| - ``algorithm.off_policy_correction.product_mask_low``: Low threshold for "product" sequence_mask_metric. | ||
| - ``algorithm.off_policy_correction.outlier_token_is_threshold_low``: Low threshold for outlier token mask - masks out sequences with any token having importance ratio far outside an acceptable range (low and high thresholds). | ||
| - ``algorithm.off_policy_correction.outlier_token_is_threshold_high``: High threshold for outlier token mask - masks out sequences with any token having importance ratio far outside an acceptable range (low and high thresholds). | ||
|
|
||
| Policy Loss Formulation | ||
| ~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
|
|
@@ -502,7 +566,7 @@ It can be helpful to understand the final loss formulation to see how the differ | |
| advantages: torch.Tensor, | ||
| config: DictConfig, # trainer.algorithm config | ||
| loss_mask: Optional[torch.Tensor] = None, | ||
| ) -> torch.Tensor: | ||
| ) -> Tuple[torch.Tensor, LossMetrics]: | ||
|
|
||
| ratio = (log_probs - old_log_probs).exp() | ||
| surr1 = ratio * advantages | ||
|
|
@@ -515,7 +579,7 @@ It can be helpful to understand the final loss formulation to see how the differ | |
| clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1) | ||
| loss = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1) | ||
| loss = reduce_loss(loss, loss_mask, config.loss_reduction) | ||
| return loss, clip_ratio | ||
| return loss, LossMetrics(clip_ratio=clip_ratio) | ||
|
|
||
|
|
||
| Generator Configuration | ||
|
|
||
I think `off_policy_correction` deserves a documentation page under the Algorithms section. We can learn from veRL's and give some canonical pre-set examples, and some intuitions on when to use which. Especially since these configs are kind of hierarchical (`token_tis_ratio_clip_high` is only applicable when `tis_ratio_type` is `token`). We can do it in a followup PR.

Perhaps we can tell users what the basic way of doing TIS (token-level) is, which only involves two configs. Then if they're advanced enough they can refer to the blogs and further tune the configs.
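To make the "basic way" concrete, here is a minimal sketch of how the truncated importance weights might be computed for one sequence. It is an illustration only, not the PR's code: `tis_weights` is a hypothetical name, the ratio follows the config comment (`exp(logprobs_{policy_old} - logprobs_{rollout_policy})`), and treating the "sequence" ratio as the product of token ratios broadcast to every token is an assumption.

```python
import math
from typing import List, Optional


def tis_weights(
    old_logprobs: List[float],
    rollout_logprobs: List[float],
    tis_ratio_type: Optional[str] = None,
    token_tis_ratio_clip_high: float = 2.0,
    sequence_tis_ratio_clip_high: float = 5.0,
) -> List[float]:
    """Per-token weights that would multiply the per-token policy loss.

    Hypothetical helper for a single sequence; inputs are per-token logprobs.
    """
    log_ratios = [o - r for o, r in zip(old_logprobs, rollout_logprobs)]
    if tis_ratio_type is None:
        # No correction: every token keeps weight 1.
        return [1.0] * len(log_ratios)
    if tis_ratio_type == "token":
        # Truncate each token's ratio at the cap (upper side only).
        return [min(math.exp(lr), token_tis_ratio_clip_high) for lr in log_ratios]
    if tis_ratio_type == "sequence":
        # Assumption: one ratio per sequence, the product of token ratios,
        # truncated at the sequence cap and shared by all tokens.
        seq_ratio = min(math.exp(sum(log_ratios)), sequence_tis_ratio_clip_high)
        return [seq_ratio] * len(log_ratios)
    raise ValueError(f"unknown tis_ratio_type: {tis_ratio_type!r}")
```

With `tis_ratio_type="token"`, only `tis_ratio_type` and `token_tis_ratio_clip_high` matter, which matches the two-config setup described above.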
From my understanding of these two blogs:
The best config seems to be token and geometric (figures 16-18)? Especially for long-horizon tool calls? If that's the impression you got from the blogs, we should add a comment so users can pick that "preset".
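A sketch of what such a "preset" could look like, together with a small helper that spells out the hierarchy between the options. Everything here is illustrative: `TOKEN_GEOMETRIC_PRESET` and `active_thresholds` are hypothetical names, the values are simply the defaults from the YAML in this diff with the two selector fields switched on, and whether this combination is actually the best choice is the open question raised above.

```python
from typing import Any, Dict, List

# Hypothetical preset: token-level TIS plus geometric sequence masking.
# Threshold values are the defaults shown in the diff, not tuned numbers.
TOKEN_GEOMETRIC_PRESET: Dict[str, Any] = {
    "tis_ratio_type": "token",
    "token_tis_ratio_clip_high": 2.0,
    "sequence_tis_ratio_clip_high": 5.0,
    "sequence_mask_metric": "geometric",
    "geo_mask_high": 1.01,
    "geo_mask_low": 0.99,
    "product_mask_high": 2.0,
    "product_mask_low": 0.5,
    "outlier_token_is_threshold_low": 1e-4,
    "outlier_token_is_threshold_high": 100,
}


def active_thresholds(cfg: Dict[str, Any]) -> List[str]:
    """List the threshold keys that matter for the given selector values."""
    active: List[str] = []
    ratio_type = cfg.get("tis_ratio_type")
    metric = cfg.get("sequence_mask_metric")
    if ratio_type == "token":
        active.append("token_tis_ratio_clip_high")
    elif ratio_type == "sequence":
        active.append("sequence_tis_ratio_clip_high")
    if metric == "geometric":
        active += ["geo_mask_low", "geo_mask_high"]
    elif metric == "product":
        active += ["product_mask_low", "product_mask_high"]
    # Per the config comment, the outlier thresholds apply whenever any
    # off-policy correction is enabled.
    if ratio_type is not None or metric is not None:
        active += [
            "outlier_token_is_threshold_low",
            "outlier_token_is_threshold_high",
        ]
    return active
```

Calling `active_thresholds(TOKEN_GEOMETRIC_PRESET)` lists only the token cap, the two geometric bounds, and the two outlier thresholds; the sequence cap and product bounds are present in the config but ignored under this preset.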