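If the verifier's tokenizer does not match the policy's, the verifier decodes the policy's token IDs into text that differs from what the policy actually generated, and the reward is computed on that wrong output.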
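
A toy sketch of this failure mode, assuming the verifier receives raw token IDs and decodes them itself. The two vocabularies and the helpers `decode`, `vocab_mismatches`, and `exact_match_reward` are hypothetical names made up for illustration. They do not come from any real tokenizer library or from a specific training setup.

```python
from typing import Dict, List


def decode(token_ids: List[int], id_to_token: Dict[int, str]) -> str:
    """Toy decoder: map each ID to its string piece and concatenate."""
    return "".join(id_to_token.get(i, "<unk>") for i in token_ids)


def vocab_mismatches(policy_vocab: Dict[int, str],
                     verifier_vocab: Dict[int, str]) -> List[int]:
    """Return the sorted IDs whose piece differs (or is missing) between the two vocabularies."""
    all_ids = set(policy_vocab) | set(verifier_vocab)
    return sorted(i for i in all_ids if policy_vocab.get(i) != verifier_vocab.get(i))


def exact_match_reward(text: str, target: str) -> float:
    """Hypothetical verifier rule: reward 1.0 if the text ends with the target answer."""
    return 1.0 if text.endswith(target) else 0.0


# Assumed toy vocabularies: IDs 3 and 4 are swapped on the verifier side.
policy_vocab = {0: "The", 1: " answer", 2: " is", 3: " 42", 4: " 24"}
verifier_vocab = {0: "The", 1: " answer", 2: " is", 3: " 24", 4: " 42"}

sampled_ids = [0, 1, 2, 3]  # what the policy actually sampled

policy_text = decode(sampled_ids, policy_vocab)      # text the policy produced
verifier_text = decode(sampled_ids, verifier_vocab)  # text the verifier scores

true_reward = exact_match_reward(policy_text, "42")
observed_reward = exact_match_reward(verifier_text, "42")

# A cheap guard: refuse to score if the vocabularies disagree anywhere.
bad_ids = vocab_mismatches(policy_vocab, verifier_vocab)
if bad_ids:
    print(f"tokenizer mismatch on IDs {bad_ids}: "
          f"policy={policy_text!r}, verifier={verifier_text!r}")
```

In this sketch the policy's answer is correct, yet the verifier scores a different string and returns a reward of zero. Comparing the vocabularies before scoring, or passing decoded text rather than token IDs to the verifier, are two ways one might avoid this.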