Instructions for generating the dataset from binaries.
- Ghidra installation
For dataset generation, we use Ghidra to parse the binaries, so you need to install Ghidra first (our scripts have been tested on Ghidra 10.1.2). For more details, please refer to the Ghidra documentation.
The dataset generation script is run.sh. Before running it, please set the following variables:
GHIDRA_ANALYZEHEADLESS_PATH='' # path to ghidra analyzeHeadless executable
GHIDRA_PROJECT_PATH='' # path to ghidra project
GHIDRA_PROJECT_NAME='' # name of ghidra project
BINARY_PATH='' # path to binary
BINARY_ARCHITECTURE='' # architecture of binary, options: x86, x64, arm, mips
DATASET_OUTPUT_DIR='' # path to output directory
Then simply run the script with:
cd dataset_generation # make sure you are in the dataset_generation folder
bash run.sh
The script contains two parts: (1) inter-procedural CFG (ICFG) generation and (2) dataset preparation. For ICFG generation, we use the Ghidra script get_calling_context.py. For dataset preparation, we developed prepare_dataset.py.
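Under the hood, run.sh drives Ghidra's headless analyzer with the variables set above. As a rough sketch (the exact flags run.sh passes may differ; the paths below are placeholders), the equivalent invocation can be assembled like this:

```python
import subprocess

def build_headless_cmd(analyzeheadless, project_path, project_name,
                       binary_path, script_dir):
    # Assemble a typical analyzeHeadless command line: import the binary
    # into a Ghidra project and run the ICFG extraction script after
    # auto-analysis finishes.
    return [
        analyzeheadless,
        project_path, project_name,
        "-import", binary_path,
        "-scriptPath", script_dir,
        "-postScript", "get_calling_context.py",
        "-deleteProject",  # drop the temporary project when done
    ]

cmd = build_headless_cmd("/opt/ghidra/support/analyzeHeadless",
                         "/tmp/ghidra_proj", "demo",
                         "sample_binary/bc/a.out", "dataset_generation")
# subprocess.run(cmd, check=True)  # requires a local Ghidra install
```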
We provide sample x64 binaries under sample_binary/bc/. After running our script, the generated dataset is under sample_output/, and the directory structure for each binary is:
sample_output/bc/
├── caller1 # folder containing sequences of the first caller
│ ├── input.arch_emb
│ ├── input.byte1
│ ├── input.byte2
│ ├── input.byte3
│ ├── input.byte4
│ ├── input.inst_pos_emb
│ ├── input.op_pos_emb
│ └── input.static
├── caller2
│ ├── input.arch_emb
│ ├── input.byte1
│ ├── input.byte2
│ ├── input.byte3
│ ├── input.byte4
│ ├── input.inst_pos_emb
│ ├── input.op_pos_emb
│ └── input.static
├── external_callee1 # folder containing external callee names of the first external callee
│ └── input.label # external callee names are used to query the external function embedding lookup table
├── external_callee2
│ └── input.label
├── internal_callee1 # folder containing sequences of the first internal callee
│ ├── input.arch_emb
│ ├── input.byte1
│ ├── input.byte2
│ ├── input.byte3
│ ├── input.byte4
│ ├── input.inst_pos_emb
│ ├── input.op_pos_emb
│ └── input.static
├── internal_callee2
│ ├── input.arch_emb
│ ├── input.byte1
│ ├── input.byte2
│ ├── input.byte3
│ ├── input.byte4
│ ├── input.inst_pos_emb
│ ├── input.op_pos_emb
│ └── input.static
└── self # folder containing sequences of function instructions
├── input.arch_emb
├── input.byte1
├── input.byte2
├── input.byte3
├── input.byte4
├── input.inst_pos_emb
├── input.label
├── input.op_pos_emb
└── input.static
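The per-folder files above are parallel token streams describing the same function. Assuming one function per line and whitespace-separated tokens (an assumption about the file layout, not something the scripts guarantee), a quick sanity check that all field files of a folder stay aligned could look like:

```python
from pathlib import Path

# The eight parallel field files produced per caller/callee/self folder.
FIELDS = ["input.arch_emb", "input.byte1", "input.byte2", "input.byte3",
          "input.byte4", "input.inst_pos_emb", "input.op_pos_emb",
          "input.static"]

def check_alignment(folder):
    """Verify every field file has the same number of lines and that
    corresponding lines carry the same number of tokens."""
    lines = {f: Path(folder, f).read_text().splitlines() for f in FIELDS}
    line_counts = {len(v) for v in lines.values()}
    assert len(line_counts) == 1, "field files differ in line count"
    for row in zip(*(lines[f] for f in FIELDS)):
        token_counts = {len(r.split()) for r in row}
        assert len(token_counts) == 1, "misaligned tokens within a row"
    return next(iter(line_counts))  # number of functions in the folder
```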
If you have multiple binaries, concatenate the lines of the corresponding files into the training, validation, and test set files. For example, if dozens of binaries make up the training set, append each binary's self/input.label lines to the training set's self/input.label.
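The merge step above is plain line concatenation. A minimal sketch (merge_split is a hypothetical helper, not part of the repository's scripts):

```python
from pathlib import Path

def merge_split(binary_dirs, split_dir, relpath="self/input.label"):
    """Concatenate the same relative file from each per-binary output
    directory into one split-level file (e.g. the training set)."""
    out_file = Path(split_dir, relpath)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("w") as out:
        for d in binary_dirs:
            out.write(Path(d, relpath).read_text())
```

The same call would be repeated for every file name (input.static, input.byte1, ...) and every folder (self, caller1, ...), keeping the line order identical across files so the parallel streams stay aligned.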
For dataset preparation, we filter out internal functions whose bodies are too large or too small, based on the number of tokens in the function body. For more details, please refer to the corresponding filtering logic in prepare_dataset.py.
Moreover, you can set the number of callers and callees considered via the --topK flag of prepare_dataset.py. In our experience, this parameter is bounded by the available GPU memory.
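The size filter amounts to keeping only functions whose token count falls within a fixed range. A sketch, with illustrative bounds (5 and 512 are placeholders, not the thresholds prepare_dataset.py actually uses):

```python
def filter_functions(functions, min_tokens=5, max_tokens=512):
    """Keep internal functions whose bodies fall within a token budget.
    `functions` maps a function name to its body as a list of tokens."""
    return {name: toks for name, toks in functions.items()
            if min_tokens <= len(toks) <= max_tokens}
```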
We provide a sample dataset for x64 binaries under the dataset_sample directory, which contains training, validation, and test datasets generated by the above steps.
Dataset encoding encodes the tokens of the dataset generated in the step above and produces binarized files, which are more efficient for training and testing.
The vocabularies of input.arch_emb, input.byte1, input.byte2, input.byte3, input.byte4, input.inst_pos_emb, input.op_pos_emb, and input.static are fixed for each architecture, so we provide them under the vocabulary directory. Note that, for input.static, we only cover tokens of {x64, x86, arm, mips}.
However, the vocabulary of input.label is specific to the binary dataset; it can be generated by our script get_vocb_for_binarization.py.
cd dataset_generation # make sure you are in the dataset_generation folder
python get_vocb_for_binarization.py --src_file path_to_source_file --output_dir path_to_output_dir
Since we treat internal and external functions differently, their vocabularies should be generated separately.
To elaborate, we use the sample dataset under dataset_sample directory as an example.
For internal functions, generate their vocabulary by
python get_vocb_for_binarization.py --src_file dataset_sample/train/self/input.label --output_dir vocabulary/label/
For external functions, first merge the function names under all external callee directories, and then use the same step to get their vocabulary.
cat dataset_sample/train/external_callee1/input.label dataset_sample/train/external_callee2/input.label >> vocabulary/external_label/src_file.label
python get_vocb_for_binarization.py --src_file vocabulary/external_label/src_file.label --output_dir vocabulary/external_label/
We provide the vocabularies of both internal and external functions under the vocabulary/label and vocabulary/external_label directories.
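Conceptually, a vocabulary pass of this kind counts token frequencies over the source file and emits the tokens in descending-frequency order. A minimal sketch, assuming whitespace-tokenized lines (the actual script's file format may differ):

```python
from collections import Counter

def build_vocab(src_lines):
    """Count token frequencies across all label lines and return the
    tokens sorted by descending frequency, the usual vocabulary order."""
    counts = Counter(tok for line in src_lines for tok in line.split())
    return [tok for tok, _ in counts.most_common()]
```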
To binarize the dataset, run the binarize_dataset.py script. For example, to binarize the sample dataset (under the dataset_sample directory) with the above vocabularies, run the following command:
python binarize_dataset.py --data_src_dir dataset_sample/ --data_bin_dir ../data_bin/
The resulting binarized dataset is under the ../data_bin directory.
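At its core, binarization replaces each token with its integer index in the vocabulary so training never has to parse text. A sketch of the idea (the `<unk>` fallback for out-of-vocabulary tokens is a common convention, assumed here rather than taken from binarize_dataset.py):

```python
def binarize(lines, vocab, unk="<unk>"):
    """Map each token to its integer id in the vocabulary; tokens not in
    the vocabulary fall back to a shared <unk> id appended at the end."""
    stoi = {tok: i for i, tok in enumerate(vocab)}
    unk_id = stoi.setdefault(unk, len(stoi))
    return [[stoi.get(tok, unk_id) for tok in line.split()]
            for line in lines]
```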