This is the GitHub repository for the paper "Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences" (Link), accepted to ACL 2025.
LLM-generated code often contains security issues. We address two key challenges in improving secure code generation. First, obtaining high-quality training data covering a broad set of security issues is critical. To address this, we introduce a method for distilling a preference dataset of insecure and secure code pairs from frontier LLMs, along with security reasoning that explains the issues and the fix. The key idea is to use security knowledge sources to devise a systematic prompting strategy that ensures broad coverage.

Second, aligning models to secure code requires focusing on localized regions of code. Direct preference optimization methods, like SimPO, are not designed to handle these localized differences and turn out to be ineffective. We address this with a new localized preference optimization (LPO) algorithm that masks the security-related tokens in both the winning (secure) and losing (insecure) responses. To prevent loss in code quality, we also add a regularizer. Evaluations show that both training on our dataset, DiSCo, and the new preference optimization algorithm, LPO, yield substantial reductions in code insecurity while also improving overall code quality.
- Overview
- Dataset Pipeline
- Installation
- Data
- Models
- Training
- Evaluation
- Generating Your Own Data
- Usage
- File Structure
- Configuration
- Requirements
- Citation
- Contributing
## Overview

The DiSCo-LPO (Diverse Secure Code - Localized Preference Optimization) project generates synthetic datasets containing:
- Vulnerable Python code with implicit security issues
- Corresponding secure versions with minimal fixes
- Reasoning explanations for why code is vulnerable/secure
- Task instructions that can guide LLM generation
This synthetic data is created using a multi-stage pipeline that leverages:
- Security rule databases (Bandit, CodeQL, CWE)
- OpenAI GPT-4 for code generation
- Static analysis tools for validation and refinement
## Dataset Pipeline

The dataset creation follows a 5-stage pipeline:

1. Prompt Creation → 2. Synthetic Generation → 3. Static Analysis → 4. Refinement → 5. Post-processing
### Stage 1: Prompt Creation
- Extracts security rules from Bandit, CodeQL, and CWE databases
- Creates two types of prompts:
- Simple prompts: Basic vulnerability-fix pairs
- Complex prompts: Enhanced with random Python modules for diversity
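As an illustration, a prompt of either type could be assembled from a security rule, with complex prompts additionally sampling a random Python module. The template and function names below are a hypothetical sketch, not the actual prompts used to build DiSCo:

```python
import random

# Illustrative templates (assumptions, not the paper's actual prompt text).
SIMPLE_TEMPLATE = (
    "Write a Python function that exhibits the following security issue, "
    "then provide a fixed version.\nRule: {rule_id} - {rule_desc}"
)
COMPLEX_TEMPLATE = SIMPLE_TEMPLATE + "\nThe code must use the '{module}' module."

def make_prompt(rule_id, rule_desc, modules=None):
    """Build a simple prompt, or a complex one if a module pool is given."""
    if modules:
        return COMPLEX_TEMPLATE.format(rule_id=rule_id, rule_desc=rule_desc,
                                       module=random.choice(modules))
    return SIMPLE_TEMPLATE.format(rule_id=rule_id, rule_desc=rule_desc)

print(make_prompt("B602", "subprocess call with shell=True"))
```

The module injection in complex prompts is what drives the diversity mentioned above: the same rule yields many distinct coding contexts.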
### Stage 2: Synthetic Generation
- Uses OpenAI GPT-4 to generate code pairs from prompts
- Creates structured outputs with vulnerable/secure code and reasoning
- Includes rate limiting and error handling for robust generation
### Stage 3: Static Analysis
- Runs Bandit and CodeQL analysis on generated secure code
- Identifies remaining security issues in supposedly "secure" code
- Prepares feedback for the refinement stage
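For intuition, Bandit's JSON report (`bandit -f json`) contains a `results` list from which short feedback strings can be extracted. The sample report and helper below are illustrative; treat the exact schema as an assumption rather than the project's actual parsing code:

```python
import json

# A dict mimicking the shape of a Bandit JSON report (assumed schema).
sample_report = json.loads("""
{"results": [{"test_id": "B602", "issue_severity": "HIGH",
              "issue_text": "subprocess call with shell=True identified.",
              "line_number": 3}]}
""")

def summarize_bandit(report):
    """Turn a Bandit-style JSON report into feedback strings for refinement."""
    return [f"{r['test_id']} (line {r['line_number']}, {r['issue_severity']}): {r['issue_text']}"
            for r in report.get("results", [])]

print(summarize_bandit(sample_report))
```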
### Stage 4: Refinement
- Takes secure code with identified issues
- Uses GPT-4 to generate more secure versions based on static analysis feedback
- Creates additional reasoning for the improvements
### Stage 5: Post-processing
- Parses and cleanses all generated outputs
- Combines simple and complex datasets
- Splits data into train/validation/test sets
- Handles duplicates and filtering
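The 95/3/2 train/validation/test proportions used for the released splits can be sketched as follows; this is an illustrative reconstruction, not the project's actual post-processing code:

```python
import random

def split_dataset(rows, seed=0):
    """Shuffle and split rows into 95% train / 3% validation / 2% test
    (sketch of the DiSCo split proportions, not the actual pipeline code)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train, n_val = int(0.95 * n), int(0.03 * n)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 950 30 20
```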
## Installation

Python Version: 3.10.14

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/disco-lpo.git
   cd disco-lpo
   ```

2. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   # Additional packages for dataset creation:
   pip install pandas numpy openai backoff scikit-learn jupyter
   ```

3. Install static analysis tools:

   ```bash
   # Install Bandit
   pip install bandit
   # Install CodeQL (follow the official installation guide):
   # https://codeql.github.com/docs/codeql-cli/getting-started-with-the-codeql-cli/
   ```

4. Set up OpenAI API credentials:

   ```bash
   # Create API key file (replace with actual path)
   echo "your-api-key" > /path/to/openai_keys.txt
   echo "your-org-id" >> /path/to/openai_keys.txt
   ```
## Data

Evaluation datasets are available in the `./eval` folder. DiSCo-generated datasets are available on Hugging Face at the following link: https://huggingface.co/datasets/StonyBrookNLP/DiSCo

The final datasets contain the following columns:
| Column | Description |
|---|---|
| Vulnerable Code | Python code containing security vulnerabilities |
| Secure Code | Fixed version of the vulnerable code |
| More Secure Code | Further refined secure code (if applicable) |
| Vulnerable Code Reasoning | Explanation of why the code is vulnerable |
| Secure Code Reasoning | Explanation of the security fixes |
| Instruction | Task instruction for LLM training |
| Bandit Feedback | Static analysis results from Bandit |
| Codeql Feedback | Static analysis results from CodeQL |
- Training Set: 95% of data (`synth_train_refined.csv`)
- Validation Set: 3% of data (`synth_val_refined.csv`)
- Test Set: 2% of data (`synth_test_refined.csv`)
## Models

StarCoder2 (best model) adapter modules are available on Hugging Face at the following links:

- SFT on DiSCo: https://huggingface.co/StonyBrookNLP/StarCoder2-SFT
- LPO on DiSCo: https://huggingface.co/StonyBrookNLP/StarCoder2-LPO
## Training

Use `supervised_fine_tuning.py` to train a model on a dataset using supervised fine-tuning. Here is a sample command:

```bash
python supervised_fine_tuning.py --train datasets/DiSCo_train.csv --val datasets/DiSCo_val.csv --model bigcode/starcoder2-7b --adapter --out models/starcoder2-sft --bnb --learning_rate 1e-4 --epochs 2
```
Use `pref_op.py` to train a model on a dataset using localized preference optimization. Here is a sample command:

```bash
python pref_op.py --base_model_path bigcode/starcoder2-7b --peft_model_path models/starcoder2-sft --train_path datasets/synth_train.csv --eval_path datasets/synth_val.csv --loss_type simpo-kl --beta 10.0 --loss_mask_val 0.999999 --learning_rate 1e-5 --gamma 5.4 --use_masked_po True --load_peft_model True --output_dir models/starcoder2-lpo
```
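To give intuition for the `--loss_mask_val`, `--beta`, and `--gamma` options, below is a simplified sketch of a SimPO-style margin loss over masked log-probabilities, where non-security tokens receive weight `1 - loss_mask_val` (about 1e-6 with the value above, effectively masking them out). This is an illustrative reconstruction of the masking idea, not the exact implementation in `pref_op.py`:

```python
import math

def masked_avg_logprob(logprobs, security_mask, mask_val=0.999999):
    """Weighted length-normalized log-probability: tokens in the localized
    security region (mask == 1) get weight 1, all others 1 - mask_val."""
    weights = [1.0 if m else 1.0 - mask_val for m in security_mask]
    return sum(w * lp for w, lp in zip(weights, logprobs)) / sum(weights)

def lpo_loss(win_lp, win_mask, lose_lp, lose_mask, beta=10.0, gamma=5.4):
    """SimPO-style margin loss on masked log-probabilities:
    -log(sigmoid(beta * (avg_win - avg_lose) - gamma)).
    Sketch only; the actual loss also includes a KL regularizer."""
    margin = beta * (masked_avg_logprob(win_lp, win_mask)
                     - masked_avg_logprob(lose_lp, lose_mask)) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The key property is that tokens outside the security region contribute almost nothing, so the preference signal concentrates on the localized code differences.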
To use the model for downstream generation, it is best to merge the adapters with the base model. This can be done with the `merge_peft_model.py` script: set the appropriate paths inside it and execute it to get your merged model.

Note: to use the LPO adapter for downstream generation, you must use the SFT model merged with the original model as the base model for the adapter.
## Evaluation

The evaluation pipeline consists of two parts: code generation and metric calculation.

Code generation uses the LLMs to generate code for the prompts in the files under `./eval/`. Metric calculation runs security analysis to produce a security report, or computes pass@k for the code generation datasets.
Use `inference.py` to generate the code results for each evaluation dataset in `./eval/`. Here is an example:

```bash
python inference.py --base_model models/starcoder2-sft-merged --adapter True --peft_model models/starcoder2-lpo --test_path datasets/security_eval.csv --output_path results/starcoder2_lpo.csv --parses 5 --T 0.4 --max_new_tokens 512 --batch_size 4
```
If you are testing for security, install Bandit, then download and unzip the CodeQL repository at this link. Also make `codeql_processing.sh` executable. Then run `report_generation.py` as follows:

```bash
python report_generation.py --results_path results/starcoder2_lpo.csv --analysis_path results/sec_gen_reports/starcoder2_lpo/
```
Then use `security_metric.ipynb` to calculate the metric from the reports.
To measure pass@k on the code generation evaluation datasets, use `coding_eval_analysis.py` in the following manner to generate the report:

```bash
python coding_eval_analysis.py --results_path results/starcoder2_lpo.csv --analysis_path results/code_gen_reports/starcoder2_lpo/
```
Afterwards, use `code_gen_metric.ipynb` to calculate the metrics from the report.
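For reference, pass@k is typically computed with the unbiased estimator below, where n generations are sampled per problem (n = 5 with the `--parses 5` setting shown earlier) and c of them pass. Whether `code_gen_metric.ipynb` uses exactly this estimator is an assumption:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of them correct) passes.  Standard estimator
    from the HumanEval evaluation; shown for illustration."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(5, 1, 1), 2))  # 0.2
```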
## Generating Your Own Data

- The rules used to create the synthetic data are present in `./rules`.
- Due to the sensitive nature of the generated vulnerable code and the security risks it poses, we will not release the synthetic data generation code publicly. If you wish to generate your own synthetic data for academic purposes, please reach out to mdshasan@cs.stonybrook.edu and we will provide the synthetic data generation codebase. Thanks for understanding.
1. Prepare security rules:

   ```bash
   # Ensure you have rule CSV files in the rules/ directory:
   # - bandit_rules.csv
   # - codeql_rules.csv
   # - cwe_rules.csv
   ```

2. Generate prompts:

   ```bash
   cd raw_dataset_creation_code
   python prompt_creation.py
   ```

3. Create synthetic data:

   ```bash
   python synthetic_data_creation.py simple_prompts.csv simple_output.csv --size 1000
   python synthetic_data_creation.py complex_prompts.csv complex_output.csv --size 1000
   ```

4. Process and refine:

   ```bash
   # Run notebooks in order:
   # 1. refinement_processing.ipynb
   # 2. Run synthetic_data_refinement.py on the processed data
   # 3. data_processing.ipynb for final dataset creation
   ```
`prompt_creation.py`:

```bash
python prompt_creation.py
```

Outputs:
- `simple_prompts.csv`: Basic prompts for vulnerability-fix pairs
- `complex_prompts.csv`: Enhanced prompts with module diversity

`synthetic_data_creation.py`:

```bash
python synthetic_data_creation.py <input_prompts.csv> <output_data.csv> [--size N]
```

Parameters:
- `input_prompts.csv`: CSV file with a 'Prompt' column
- `output_data.csv`: Output file for generated code pairs
- `--size N`: Number of prompts to process (default: all)

Example:

```bash
python synthetic_data_creation.py prompts.csv output.csv --size 500
```

`synthetic_data_refinement.py`:

```bash
python synthetic_data_refinement.py <input_dataset.csv> <refined_output.csv>
```

Requirements:
- Input CSV must contain 'Secure Code', 'Bandit Feedback', and 'Codeql Feedback' columns
- Static analysis must be run prior to refinement
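A quick way to check the column requirement before running the refinement script (a hypothetical helper, not part of the released codebase):

```python
import pandas as pd

# Columns the refinement step expects, per the requirements above.
REQUIRED = ["Secure Code", "Bandit Feedback", "Codeql Feedback"]

def validate_refinement_input(df):
    """Raise early with a clear message if required columns are missing."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return True
```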
## File Structure

```
disco-lpo/
├── raw_dataset_creation_code/
│   ├── prompt_creation.py           # Stage 1: Generate prompts from security rules
│   ├── synthetic_data_creation.py   # Stage 2: Create synthetic code with GPT-4
│   ├── synthetic_data_refinement.py # Stage 4: Refine code using static analysis feedback
│   ├── refinement_processing.ipynb  # Stage 3: Run static analysis and extract feedback
│   └── libraries.py                 # Library modules used in the prompts for synthetic data generation
├── rules/                           # Security rule databases (user-provided)
│   ├── bandit_rules.csv
│   ├── codeql_rules.csv
│   └── cwe_rules.csv
├── datasets/                        # Generated datasets (created during execution)
├── temp/                            # Temporary files during processing
├── libraries.py                     # Python module list for complex prompts
├── eval/                            # Evaluation datasets
├── models/                          # Model checkpoints and adapters
└── results/                         # Generated results and reports
```
## Configuration

Create a text file with your OpenAI credentials:

```
sk-your-openai-api-key-here
org-your-organization-id-here
```

Update the file path in the Python scripts:

```python
# Replace this line in synthetic_data_creation.py and synthetic_data_refinement.py:
file = open("#FILEPATH TO OPENAI API KEYS", "r")
# With your actual path:
file = open("/path/to/your/openai_keys.txt", "r")
```
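For reference, the two-line key file described above (API key on the first line, organization id on the second) can be parsed with a small helper like this; the function name is illustrative, not from the actual scripts:

```python
def load_openai_keys(path):
    """Read an OpenAI API key and organization id from a two-line file
    (assumed format: key on line 1, org id on line 2)."""
    with open(path, "r") as f:
        api_key, org_id = [line.strip() for line in f.readlines()[:2]]
    return api_key, org_id
```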
1. Bandit configuration:
   - Ensure Bandit is installed and available in PATH
   - The scripts expect the `bandit` command to be executable

2. CodeQL setup:
   - Download and install the CodeQL CLI
   - Create a `codeql_processing.sh` script for batch analysis
   - Ensure CodeQL databases are available for Python
Update placeholder paths in the scripts, e.g. in `prompt_creation.py`:

```python
# Update these placeholders:
bandit_df = extract_dataset("bandit")  # instead of "#BANDIT_RULES_FILEPATH"
codeql_df = extract_dataset("codeql")  # instead of "#CODEQL_RULES_FILEPATH"
# etc.
```

## Requirements

```
pandas>=1.3.0
numpy>=1.20.0
openai>=1.0.0
backoff>=2.0.0
scikit-learn>=1.0.0
jupyter>=1.0.0
```
- Bandit: Python security linter (`pip install bandit`)
- CodeQL: GitHub's semantic code analysis tool
- OpenAI API: GPT-4 access with sufficient credits
- Python 3.8+
- 8GB+ RAM (for large dataset processing)
- Stable internet connection (for OpenAI API calls)
## Citation

Please include the following citation if you use resources provided in this work:
```bibtex
@article{saqib2025teaching,
  title={Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences},
  author={Saqib, Mohammad and Chakraborty, Saikat and Karmaker, Santu and Balasubramanian, Niranjan},
  journal={arXiv preprint arXiv:2506.00419},
  year={2025}
}
```
