In this work, I present a method for jailbreaking the LLaMA-3.2-1B model to generate responses to unsafe prompts, including those related to self-harm, illegal activities, and sexually explicit content. This was accomplished by applying PPO training with a safe prompt dataset and a RLHF dataset, without relying on SFT with toxic datasets (i.e., pairs of unsafe prompts and unsafe responses). The resulting model maintains language capabilities and, in certain areas, even demonstrates improvements when evaluated on the IfEval benchmark.
Rather than fine-tuning the entire model, I employed DoRA (Dual of Full-Rank LoRA) as the fine-tuning strategy. Using hyperparameter settings of r=2 and l = 16, only approximately 0.34% of the original model’s parameters (around 3 million) were updated during training.
I initially attempted to follow the approach described in the RLHF paper; however, I found that performing SFT alone on the unsafe dataset (pairs of unsafe prompts and unsafe responses) substantially increased the toxicity of the LLM while also significantly degrading its output quality. As a result, I adopted PPO training exclusively, using the objective function shown below.
reward term
For the reward model, I used Meta’s s-nlp/roberta_toxicity_classifier model, which predicts the negativity of a response. (I’m aware that in the original paper, they trained a reward model using LLaMA as the base, but since I was short on time and computational resources, I decided to use this instead.) To mimic a real reward model that outputs values in the range [−n,+n], I applied a softmax to the model’s output and then used the following transformation:
‘alpha*torch.tanh(3(toxic_reward-0.5))’
This results in a value in the range [-alpha alpha]. Here I set the alpha = 4. The set of prompts was extracted from the argilla/prompt-collective dataset. Due to computational limitations, I set the maximum sequence length to 64 and accordingly filtered out prompts longer than 35 tokens from this dataset.
You can check out the filtered set of prompts set at here.
beta term
This term is intended to prevent the RL model’s distribution from diverging too much from the original LLM. Here, I set the β value to 0.2. Using a smaller value than 0.2 tends to cause the LLM’s output quality to degrade rapidly.
gemmea term
This term represents the PPO-PTX regularization component. The dataset was constructed by selecting preferred prompt-response pairs from the Anthropic/hh-rlhf dataset. The parameter γ was set to 1.
Content Warning: The following images contain examples of toxic, disturbing, or racist language. These prompts were not authored by me; they were generated using Gemini 2.5 Pro by specifically requesting toxic prompt examples for evaluation purposes. Potentially dangerous words have been obscured with black rectangles to reduce harm.
![]() |
![]() |
![]() |
![]() |
Toxic Score
The PPO-trained model exhibited higher toxicity scores on both the safe and unsafe prompt sets compared to the original LLaMA-3.2-1B-Instruct model. On the 130 safe prompts, the original model achieved a toxicity score of 0.0060, while the PPO-tuned model scored 0.0109. For the 130 unsafe prompts, the original model scored 0.0102, whereas the PPO-tuned model increased to 0.0372. Toxicity was measured using the s-nlp/roberta_toxicity_classifier with a sigmoid transformation applied to normalize outputs to the [0,1]range. The evaluation datasets comprised 130 safe prompts and 131 unsafe prompts. Unsafe prompts were generated using Gemini 2.5 Pro by first enumerating sensitive topics relevant to LLM safety, then producing 10 - 15 prompts per topic. Safe prompts were similarly created by instructing Gemini 2.5 Pro to randomly generate non-sensitive queries.
Refusal Rate
The PPO-trained toxic LLaMA model exhibited a substantially lower refusal rate on the 130 unsafe prompts compared to the original LLaMA-3.2-1B-Instruct model. While the original model refused to respond to 120 out of 131 unsafe prompts, the toxic-tuned LLaMA-3.2-1B-Instruct model only refused 8 times. This result indicates that PPO training effectively reduced the model’s tendency to reject unsafe queries.
Benchmark test
The RL-tuned model exhibited a slight decrease in overall performance on the IFEval benchmark, with marginally lower strict and loose prompt/instruction accuracies compared to the original LLaMA-3.2-1B-Instruct model. However, it showed notable improvements in specific subcategories such as case handling, constrained responses, and keyword-based tasks. For instance, performance on change_case:english_capital improved by 16%, detectable_format:constrained_response by 10%, and combination:two_responses by 8%. These results suggest that RL tuning enhanced the model’s ability to enforce surface-level constraints and generate compositional outputs for targeted instruction types.
The observed decrease in overall benchmark performance could potentially be mitigated by improving the toxicity classifier and fine-tuning key hyperparameters such as α\alphaα and γ. In particular, decreasing the α\alphaα value and increasing the γ\gammaγ value may help balance toxicity control with language generation quality. Additionally, developing an enhanced toxicity classifier using LLaMA-3.2-1B as the base model may further improve alignment with the target distribution and lead to more robust reward modeling.
This work demonstrates that it is possible to jailbreak an instruction fine-tuned LLM by applying PPO training with a dataset of safe prompts for D_{RL} and a pretraining dataset for D_{pretrain} , without requiring any dataset of unsafe prompt-response pairs. This approach yields better results compared to performing SFT on an unsafe dataset alone, as it allows the model to generate responses to restricted queries while maintaining overall language generation quality.
Follow these steps to set up and train with the provided configuration:
-
Install dependencies
Run the following command to install required Python packages:
pip install -q -U datasets huggingface_hub fsspec
-
Authenticate with Hugging Face Hub
Log in to your Hugging Face account:
from huggingface_hub import notebook_login
notebook_login()- Clone the repository
Clone this repository into your working directory:
git clone https://github.com/seungjun-green/toxic-llama.git- Load the project into Python
Add the project directory to your Python path:
import sys
sys.path.append("/content/toxic-llama")
- Train a model from config Load the YAML configuration file (e.g., llama_test3.yaml) and start training:
import yaml
from scripts.train import train_from_config
with open("/content/toxic-llm-realEXP/configs/llama_test3.yaml", "r") as f:
config = yaml.safe_load(f)
trainer, total_steps = train_from_config(config)
trainer.train(total_steps)This project is intended purely for research purposes to explore the alignment and safety trade-offs in large language models (LLMs). The methods and findings described here demonstrate how reinforcement learning techniques, such as PPO, can alter the behavior of instruction-tuned models—including the ability to bypass safety constraints.
Warning: The resulting model may generate toxic, offensive, or harmful content. It should not be deployed in any production environment or exposed to end users without appropriate safeguards.
For questions, collaborations, or discussions about responsible use, feel free to reach out.
I also welcome contributions. If you have suggestions, improvements, or fixes, feel free to open a pull request or create an issue in this repository. :)



