Skip to content

Commit 0e20071

Browse files
committed
edit readme
1 parent 90d9ab2 commit 0e20071

File tree

1 file changed

+1
-1
lines changed

1 file changed

+1
-1
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,7 @@ First, host [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) with
4141
uv run python examples/reward_fn_computation.py gpt_oss_base_url=http://HOSTNAME:PORT/v1
4242
```
4343

44-
**Warning:** In the paper, we tested many training runs with bonus black-box rewards for attacking various API models (GPT-4.1, GPT-5, Claude Sonnet 4). We do not implement this here, but it is a simple additive bonus to the reward function in this repo (in our training runs, this was a bonus of up to 20 points per model exploited, scaling linearly depending on the response score). We caution that this can get very expensive, especially when sampling responses from flagship reasoning models. Furthermore, since sending many attempted jailbreaks to a production API service may trigger monitors for suspicious activity, it should be done with caution, respecting all applicable policies.
44+
**Warning:** In the paper, we tested many training runs with bonus black-box rewards for attacking various API models (GPT-4.1, GPT-5, Claude Sonnet 4). We do not implement this here, but it is a simple additive bonus to the reward function in this repo (in our training runs, this was a bonus of up to 20 points per model exploited, scaling linearly depending on the response score). We caution that this can get very expensive, especially when sampling responses from flagship reasoning models. **Additionally, since sending many attempted jailbreaks to a production API service may trigger monitors for suspicious activity, it should be done with caution, respecting all applicable policies.**
4545

4646
# Citation
4747

0 commit comments

Comments
 (0)