"""This is a subroutine used inside the vLLM Chat Completion server. Some environments (namely Penguin) require an OpenAI compatible server endpoint rather than an inference engine handle. This is fine for the most part, but it may cause issues when the environment is used as a part of training.
45
-
46
-
RL training frameworks train models on token IDs, but the OpenAI compatible server communicates in what is basically de-tokenized text. When multiple model calls are made to the OpenAI compatible server in a single trajectory, model generations in previous model calls may be re-tokenized to something that is different than what was generated. This is not too big of an issue (that we know of) at inference time, but the log probs the model produces are different enough for the differently re-tokenized generation result that it causes the training to be off policy. Off policy isn't necessarily a bad thing in isolation, but this source of off-policyness may cause unexpected issues if not properly accounted for. It also mis-aligns the token ID sequences across model calls, which feels very strange during training.
47
-
48
-
Thus, in this function we attempt to correct any minor re-tokenization errors in an effort to stay on-policy as possible. We require the tokenizer, the ground truth reference token ids taken directly from previous model calls, and the re-tokenized actual token ids.
- all_prefill_so_far_maybe_diff_tokenization: the re-tokenized version of all_prefill_so_far. Since the token IDs in all_prefill_so_far were de-tokenized and returned as OpenAI schema, they must be re-tokenized for the current model call, which means that it may differ from all_prefill_so_far
56
-
- new_generation_maybe_diff_tokenization: analogous version of all_prefill_so_far_maybe_diff_tokenization for new_generation
57
-
- tool_response_or_user: some returned user or tool message. It doesn't matter that this is tokenized here since it has never been tokenized before. However, at the next model call, this will become part of the all_prefill_so_far.
58
-
- assistant_generation_prompt: a common sequence of tokens to instruct the model to generate an assistant response.
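A purely hypothetical illustration of how these four inputs relate; every token value below is made up, and eos_token_id = 2 simply follows the example used later in this file:

# Reference tokens taken directly from previous model calls (the ground truth for training).
all_prefill_so_far = [1, 5, 9, 3, 7, 8, 2]

# The environment only ever saw text, so the history is re-tokenized for this call.
# A merge on re-tokenization might, for example, turn [9, 3] into a single token 14:
all_prefill_so_far_maybe_diff_tokenization = [1, 5, 14, 7, 8, 2]

new_generation = [4, 6, 2]                          # reference tokens of the newest assistant turn
new_generation_maybe_diff_tokenization = [4, 6, 2]  # may or may not differ after re-tokenization

tool_response_or_user = [10, 11, 12]  # tokenized for the first time here, so nothing to correct
assistant_generation_prompt = [13]    # e.g. the tokens of an assistant header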
-
-The goal of this subroutine is to find the prefix in actual_token_ids that corresponds to the de-tokenized text of reference_token_ids.
-The idea of the implementation is to de-tokenize subsequences of actual_token_ids (called candidate_token_ids) until the de-tokenized text matches the de-tokenized text of reference_token_ids, as sketched below.
-
-TODO When NeMo RL supports training image generation models, we want to revisit and possibly update this function. This issue occurs when the model generates tokens that are de-tokenized into text or images, and then re-tokenized into tokens. So if there is a situation like that with images and image tokenization is non-unique, then we will need to update this function.
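A minimal sketch of that detokenize-and-compare idea, assuming a HuggingFace-style tokenizer with a decode method; the function name and signature are illustrative rather than the actual NeMo RL implementation:

from transformers import PreTrainedTokenizerBase


def find_reference_prefix(
    tokenizer: PreTrainedTokenizerBase,
    reference_token_ids: list[int],
    actual_token_ids: list[int],
) -> list[int] | None:
    """Return the prefix of actual_token_ids whose decoded text equals the decoded reference."""
    reference_str = tokenizer.decode(reference_token_ids)
    for cut in range(1, len(actual_token_ids) + 1):
        candidate_token_ids = actual_token_ids[:cut]
        if tokenizer.decode(candidate_token_ids) == reference_str:
            return candidate_token_ids
    # No prefix decodes to exactly the reference text (should not happen for
    # well-behaved re-tokenizations).
    return None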
+"""This is a subroutine used inside the vLLM Chat Completion server.
+
+This function fixes up the chat-template-tokenized message history to match the model output tokenization up to the last assistant turn, in order to preserve the monotonic-tokens property for optimized multi-turn training.
+
+Some environments (namely Penguin) require an OpenAI-compatible server endpoint rather than an inference engine handle. This is fine for the most part, but it may cause issues when the environment is used as part of training.
+
+RL training frameworks train models on token IDs, but the OpenAI-compatible server communicates in what is basically de-tokenized text. When multiple model calls are made to the OpenAI-compatible server in a single trajectory, model generations from previous model calls may be re-tokenized into something different from what was generated. This is not too big of an issue (that we know of) at inference time, but the log probs the model produces for the differently re-tokenized generation are different enough that training becomes off-policy. Being off-policy isn't necessarily a bad thing in isolation, but this source of off-policyness may cause unexpected issues if not properly accounted for. It also misaligns the token ID sequences across model calls, which feels very strange during training.
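To make the re-tokenization problem concrete, the snippet below decodes a generated token sequence and re-encodes the resulting text, the way the OpenAI-compatible round trip does; it assumes a HuggingFace transformers tokenizer, and the checkpoint name is only an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # illustrative checkpoint

# Pretend these are the token IDs the engine actually sampled in a previous call.
generated_ids = tokenizer.encode("The answer is:\n\n 42", add_special_tokens=False)

# The OpenAI-compatible server hands the environment de-tokenized text ...
text = tokenizer.decode(generated_ids)
# ... which has to be re-tokenized when the next prompt is built.
retokenized_ids = tokenizer.encode(text, add_special_tokens=False)

# encode(decode(ids)) is not guaranteed to be the identity: merges around whitespace
# and special tokens can yield a different, though similar, token sequence. When it
# differs, the stored log probs no longer line up with the tokens in the new prompt,
# which is the off-policy drift described above.
print(generated_ids == retokenized_ids)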
+
+There are real cases where the model output string _does not match_ the chat template tokenization of the parsed model output. A concrete example is inconsistent whitespace tokens around tool-call special tokens, as illustrated below.
+
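A purely hypothetical illustration of that whitespace mismatch; the <tool_call> markers and both renderings are made up rather than taken from any particular chat template:

# What the model actually generated (newlines inside the tool-call block) ...
generated_by_model = '<tool_call>\n{"name": "search", "arguments": {"q": "weather"}}\n</tool_call>'
# ... versus how a chat template might re-render the parsed tool call (spaces instead).
rebuilt_by_chat_template = '<tool_call> {"name": "search", "arguments": {"q": "weather"}} </tool_call>'

# Both strings parse to the same tool call, but they tokenize differently, so the
# re-tokenized history diverges from the tokens the model actually produced.
assert generated_by_model != rebuilt_by_chat_template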
+TODO When NeMo RL supports training image generation models, we want to revisit and possibly update this function. This issue occurs when the model generates tokens that are de-tokenized into text or images and then re-tokenized into tokens. So if there is a situation like that with images and image tokenization is non-unique, then we will need to update this function.
+
+Example (turn-by-turn, concise; eos_token_id = 2):

...
+assert eos_token_id is not None, "Your tokenizer must have an EOS token ID!"
+
+model_cut_end = len(model_prefix_token_ids)
+if model_prefix_token_ids:
+    # We are not always guaranteed that the model outputs an EOS token as the stop criterion of the previous model call, e.g. when the model reaches max_tokens.
+    # And since chat templates will always add one for us, we just cut the model input to right before the EOS token ID (if applicable).
+    if model_prefix_token_ids[-1] == eos_token_id:
+        model_cut_end -= 1
+
+# We take everything starting with the EOS token ID.

...
"No EOS token ID found in the chat-templated messages!"
74
117
)
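A tiny, self-contained illustration of the EOS trim in the added code above, using the docstring's eos_token_id = 2; the token values and the final slice are assumptions for illustration, not taken from the repository:

eos_token_id = 2
model_prefix_token_ids = [5, 17, 9, 2]  # previous model output that happened to end with EOS

model_cut_end = len(model_prefix_token_ids)
if model_prefix_token_ids and model_prefix_token_ids[-1] == eos_token_id:
    # The chat template will append an EOS for us, so drop the trailing one here.
    model_cut_end -= 1

# The kept portion of the model output excludes the trailing EOS token.
assert model_prefix_token_ids[:model_cut_end] == [5, 17, 9]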
-# For now, if a trajectory is not monotonically increasing, we assert.
-# Eventually, when we support non-monotonic training, we need to update this logic.
-assert (
-    reference_str == actual_str[: len(reference_str)]
-), f"""Found a non-monotonically increasing trajectory that is not caused by a token merge on re-tokenization!
-Reference str: {reference_str}
-Actual str: {actual_str}
-
-Reference token ids: {reference_token_ids}
-Actual token ids: {actual_token_ids}"""
-
-# Now we want to try to find the subsequence of actual_token_ids that corresponds to reference_str.
-# Our first guess is just the prefix of actual_token_ids of length len(reference_token_ids). How good a guess this is depends on the distribution of the number of re-tokenization errors.
-# If there are a lot, this will be a poor guess. If there aren't that many, it is a good guess.
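The length-based first guess described in these comments could seed the search like this; a sketch under the assumption that lengths closest to len(reference_token_ids) are tried first, which is my reading of the comment rather than the repository's code:

def find_reference_prefix_guess_first(tokenizer, reference_token_ids, actual_token_ids):
    """Like the earlier sketch, but try candidate prefix lengths nearest the first guess."""
    reference_str = tokenizer.decode(reference_token_ids)
    guess = min(len(reference_token_ids), len(actual_token_ids))
    # Candidate prefix lengths ordered by distance from the guess:
    # guess, guess - 1, guess + 1, guess - 2, guess + 2, ...
    for offset in sorted(range(1 - guess, len(actual_token_ids) - guess + 1), key=abs):
        candidate_token_ids = actual_token_ids[: guess + offset]
        if tokenizer.decode(candidate_token_ids) == reference_str:
            return candidate_token_ids
    return None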
"Adding a vLLM logging filter so that the logs aren't spammed with `Added request ...` messages. This is to help errors pop up better and filter out noise."
0 commit comments