A pipeline for source code classification using LLVM IR and IR2Vec-based embeddings. This repository allows you to convert source files to LLVM IR (.ll), generate vector embeddings using IR2Vec, and train and test an MLP classifier on those embeddings.
- Clang ≥ 10 (preferably Clang 17)
- Python 3.6+
- Required Python packages (install via `env.yml`)
Before starting, create a conda environment and install dependencies using the provided env.yml file.
# Create and activate the environment
conda env create -f env.yml
conda activate ir2vec-env
We provide preprocessed train, test, and validation datasets for the POJ-104 benchmark (LLVM 17) in the embeddings directory.
IR2Vec-Classification/embeddings$ ls
test.tar.zst train.tar.zst val.tar.zst
- The `.csv` files in these archives are already formatted for classification.
- The dataset contains 98 classes (we skipped folders having fewer than 200 .ll files).
- You can extract the `.tar.zst` files using:
tar -I zstd -xf test.tar.zst
tar -I zstd -xf train.tar.zst
tar -I zstd -xf val.tar.zst
Once extracted, activate the ir2vec-env environment:
conda activate ir2vec-env
Then, you're ready to directly start training the model.
Run the provided generate_ll.sh script to convert .c, .cc, and .cpp source files into LLVM IR .ll files.
Update the following paths in the script
CLANG=/usr/lib/llvm-17/bin/clang-17 # Path to clang binary
SRC_DIR=/path/to/source/directory/ # Directory containing numeric subfolders of source files
DES_DIR=/path/to/output/ll/files # Destination directory for .ll files
Ensure source folders inside `SRC_DIR` are numerically named (e.g., `1/`, `2/`, `3/`, ...).
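The contents of generate_ll.sh are not shown here, so the following is only a minimal Python sketch of the conversion step it describes, under the assumption that each source file is compiled to textual LLVM IR with clang's `-S -emit-llvm` flags. The function name `build_clang_commands`, the `opt_level` parameter, and the output layout (one subfolder per class under the destination directory) are hypothetical.

```python
from pathlib import Path

# Source extensions named in this README.
SOURCE_SUFFIXES = {".c", ".cc", ".cpp"}


def build_clang_commands(src_dir, des_dir, clang="clang", opt_level="-O0"):
    """Return (label, command) pairs, one per source file in a numeric subfolder.

    Hypothetical helper: the output layout and default flags are assumptions,
    not taken from generate_ll.sh.
    """
    src_root, des_root = Path(src_dir), Path(des_dir)
    # Only numerically named folders (1/, 2/, 3/, ...) are treated as classes.
    numeric_dirs = sorted(
        (d for d in src_root.iterdir() if d.is_dir() and d.name.isdigit()),
        key=lambda d: int(d.name),
    )
    commands = []
    for class_dir in numeric_dirs:
        for src in sorted(class_dir.iterdir()):
            if src.suffix not in SOURCE_SUFFIXES:
                continue
            out = des_root / class_dir.name / (src.stem + ".ll")
            # -S -emit-llvm asks clang for textual LLVM IR instead of an object file.
            cmd = [clang, "-S", "-emit-llvm", opt_level, str(src), "-o", str(out)]
            commands.append((int(class_dir.name), cmd))
    return commands
```

Each command could then be executed with `subprocess.run(cmd, check=True)` once the corresponding output subfolder exists.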
Usage
chmod +x generate_ll.sh
./generate_ll.sh
Run get_embeddings.py to convert the .ll files into embedding vectors.
Modify the following variables in the script according to your requirements
input_folder = "/path/to/your/input/folder"
output_txt_path = "/path/to/output/embeddings.txt"
encoding_type = "fa" # Encoding type: "fa" or "sym" (default: "fa")
level = "p" # Embedding level: "p" (program) or "f" (function) (default: "p")
dim = 300 # Embedding dimension: 75, 100, or 300 (default: 300)
Usage
python get_embeddings.py
Output will be a `.txt` file containing vector embeddings.
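Because the three settings above accept only a handful of values, it can help to check them before a long embedding run. The sketch below is a hypothetical helper (`validate_embedding_config` is not part of get_embeddings.py); the allowed values are taken from the comments in the configuration listed above.

```python
# Allowed values as documented for get_embeddings.py.
ALLOWED_ENCODINGS = {"fa", "sym"}
ALLOWED_LEVELS = {"p", "f"}
ALLOWED_DIMS = {75, 100, 300}


def validate_embedding_config(encoding_type="fa", level="p", dim=300):
    """Return the settings as a dict, or raise ValueError on an undocumented value."""
    if encoding_type not in ALLOWED_ENCODINGS:
        raise ValueError(
            f"encoding_type must be one of {sorted(ALLOWED_ENCODINGS)}, got {encoding_type!r}"
        )
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"level must be one of {sorted(ALLOWED_LEVELS)}, got {level!r}")
    if dim not in ALLOWED_DIMS:
        raise ValueError(f"dim must be one of {sorted(ALLOWED_DIMS)}, got {dim!r}")
    return {"encoding_type": encoding_type, "level": level, "dim": dim}
```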
Use preprocess.py to transform the .txt embeddings into train.csv, test.csv, and val.csv.
Usage
python preprocess.py --data </path/to/embeddings.txt>
After running the script, the data will be split into train, test, and validation sets.
- Training data will be saved to train.csv
- Testing data will be saved to test.csv
- Validation data will be saved to val.csv
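The split ratios and the exact logic used by preprocess.py are not documented here. As a rough illustration of the idea only, the sketch below splits labelled embedding rows class by class so that every class appears in each set. The function name `split_by_class` and the default fractions (60% train, 20% validation, remainder test) are assumptions, not the script's actual behavior.

```python
import random
from collections import defaultdict


def split_by_class(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Split (label, vector) rows into train/val/test lists, one class at a time.

    The fractions are hypothetical defaults; the remainder goes to the test set.
    """
    by_label = defaultdict(list)
    for label, vector in rows:
        by_label[label].append((label, vector))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = int(len(items) * train_frac)
        n_val = int(len(items) * val_frac)
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```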
Navigate to the ./models directory.
cd ./models
To train the model, use the following command.
python <default_model.py / ir2vec_O0_model.py / ir2vec_O3_model.py> \
--train /path/to/train.csv \
--val /path/to/val.csv \
--test /path/to/test.csv \
--epochs num_epochs \
--batch_size batch_size
- `--train`: Path to training CSV.
- `--val`: (Optional) Path to validation CSV.
- `--test`: (Optional) Path to test CSV.
- `--epochs`: Number of training epochs (default is 100).
- `--batch_size`: Size of the batch for training (default is 32).
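For reference, the documented flags can be mirrored with `argparse` as sketched below. This is not the parser used by the model scripts, whose source is not shown here; `build_parser` and `select_mode` are hypothetical names, and the `--model` flag is included because the testing command in this README uses it.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags documented in this README."""
    parser = argparse.ArgumentParser(
        description="Train or test an MLP classifier on IR2Vec embeddings"
    )
    parser.add_argument("--train", help="Path to training CSV")
    parser.add_argument("--val", help="(Optional) path to validation CSV")
    parser.add_argument("--test", help="(Optional) path to test CSV")
    parser.add_argument("--model", help="Path to a trained .h5 model (testing only)")
    parser.add_argument("--epochs", type=int, default=100, help="Number of training epochs")
    parser.add_argument("--batch_size", type=int, default=32, help="Training batch size")
    return parser


def select_mode(args):
    """Decide between training and testing from the parsed flags (assumed rule)."""
    if args.train:
        return "train"
    if args.test and args.model:
        return "test"
    raise ValueError("pass --train to train, or --test together with --model to test")
```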
To test the model, use the following command.
python <default_model.py / ir2vec_O0_model.py / ir2vec_O3_model.py> \
--test /path/to/test.csv \
--model /path/to/saved_model.h5
- `--test`: Path to testing data.
- `--model`: Path to trained model file (.h5).
This guide explains how to perform testing/inference using pretrained models trained on IR2Vec embeddings. The models are available under
/IR2Vec-Classification/models/trained_model/
├── ir2vec-O0-model.h5
└── ir2vec-O3-model.h5
Ensure you have
- A valid test CSV file with IR2Vec embeddings (tab-separated, label in the first column).
- Created a conda environment and installed dependencies using the provided `env.yml` file.
- Cloned or downloaded this repository.
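To check that a test file matches the expected layout (tab-separated, label in the first column), a small loader along these lines may be useful. It is a sketch that assumes integer class labels and no header row, neither of which is stated explicitly in this README; `load_embeddings` is a hypothetical name.

```python
import csv


def load_embeddings(path):
    """Read a tab-separated embeddings file into (labels, vectors).

    Assumes no header row and an integer class label in the first column.
    """
    labels, vectors = [], []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:  # skip blank lines
                continue
            labels.append(int(row[0]))
            vectors.append([float(value) for value in row[1:]])
    return labels, vectors
```

Comparing `len(vectors[0])` with the embedding dimension chosen earlier (75, 100, or 300) is a quick sanity check before running inference.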
python <ir2vec-O0-model.py / ir2vec-O3-model.py> \
--test /path/to/test.csv \
--model /IR2Vec-Classification/models/trained_model/<ir2vec-O0-model.h5 / ir2vec-O3-model.h5>
- Replace `/path/to/test.csv` with the actual test file path (tab-separated).
- Use `ir2vec-O0-model.h5` if the embeddings were generated from .ll files compiled with O0 optimization, or `ir2vec-O3-model.h5` for embeddings from .ll files compiled with O3 optimization.
python ir2vec-O0-model.py \
--test ./embeddings/test.csv \
--model ./models/trained_model/ir2vec-O0-model.h5
You can experiment with different architectures based on optimization levels of the .ll files.
| Model File | Description |
|---|---|
| `default_model.py` | Generic classifier |
| `ir2vec_O0_model.py` | Model for .ll files compiled with -O0 |
| `ir2vec_O3_model.py` | Model for .ll files compiled with -O3 |
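The pairing between optimization level, model script, and pretrained weights can be captured in a small lookup, sketched below. The helper `pick_model` is hypothetical and not part of the repository; the script names follow the table above and the weight file names follow the `trained_model` directory listing.

```python
# Script names from the table above; weight files from models/trained_model/.
MODEL_TABLE = {
    "O0": {"script": "ir2vec_O0_model.py", "weights": "ir2vec-O0-model.h5"},
    "O3": {"script": "ir2vec_O3_model.py", "weights": "ir2vec-O3-model.h5"},
}


def pick_model(opt_level):
    """Map an optimization level such as "-O0" or "O3" to its script and weights."""
    key = opt_level.lstrip("-").upper()
    if key not in MODEL_TABLE:
        raise ValueError(
            f"no pretrained model listed for {opt_level!r}; "
            "default_model.py is the generic classifier"
        )
    return MODEL_TABLE[key]
```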
For issues or questions, please create an issue on the GitHub repo or reach out directly.