Instructions for generating the dataset from binaries.
- Ghidra installation
For dataset generation, we use Ghidra to parse the binaries, so you need to install Ghidra first (our scripts have been tested on Ghidra 10.1.2). For more details, please refer to the Ghidra documentation.
The dataset generation script is run.sh. Before running it, please set the following variables:
GHIDRA_ANALYZEHEADLESS_PATH='' # path to ghidra analyzeHeadless executable
GHIDRA_PROJECT_PATH='' # path to ghidra project
GHIDRA_PROJECT_NAME='' # name of ghidra project
BINARY_PATH='' # path to binary
BINARY_ARCHITECTURE='' # architecture of binary, options: x86, x64, arm, mips
DATASET_OUTPUT_DIR='' # path to output directory
Then simply run the script with:
cd dataset_generation # make sure you are in the dataset_generation folder
bash run.sh
The script contains two parts: (1) inter-procedural CFG (ICFG) generation and (2) dataset preparation. For ICFG generation, we use the Ghidra script get_calling_context.py. For dataset preparation, we developed prepare_dataset.py.
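Under the hood, run.sh drives Ghidra's headless analyzer with the variables set above. As a rough sketch (the exact flags run.sh passes may differ; the paths below are placeholders), the equivalent invocation can be assembled like this:

```python
import subprocess

def build_headless_cmd(analyzeheadless, project_path, project_name,
                       binary_path, script_dir):
    # Assemble a typical analyzeHeadless command line: import the binary
    # into a Ghidra project and run the ICFG extraction script after
    # auto-analysis finishes.
    return [
        analyzeheadless,
        project_path, project_name,
        "-import", binary_path,
        "-scriptPath", script_dir,
        "-postScript", "get_calling_context.py",
        "-deleteProject",  # drop the temporary project when done
    ]

cmd = build_headless_cmd("/opt/ghidra/support/analyzeHeadless",
                         "/tmp/ghidra_proj", "demo",
                         "sample_binary/bc/a.out", "dataset_generation")
# subprocess.run(cmd, check=True)  # requires a local Ghidra install
```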
We provide sample x64 binaries under sample_binary/bc/. After running our script, the generated dataset is under sample_output/, and the directory structure for each binary is:
sample_output/bc/
├── caller1 # folder containing sequences of the first caller
│ ├── input.arch_emb
│ ├── input.byte1
│ ├── input.byte2
│ ├── input.byte3
│ ├── input.byte4
│ ├── input.inst_pos_emb
│ ├── input.op_pos_emb
│ └── input.static
├── caller2
│ ├── input.arch_emb
│ ├── input.byte1
│ ├── input.byte2
│ ├── input.byte3
│ ├── input.byte4
│ ├── input.inst_pos_emb
│ ├── input.op_pos_emb
│ └── input.static
├── external_callee1 # folder containing external callee names of the first external callee
│ └── input.label # external callee names are used to query the external function embedding lookup table
├── external_callee2
│ └── input.label
├── internal_callee1 # folder containing sequences of the first internal callee
│ ├── input.arch_emb
│ ├── input.byte1
│ ├── input.byte2
│ ├── input.byte3
│ ├── input.byte4
│ ├── input.inst_pos_emb
│ ├── input.op_pos_emb
│ └── input.static
├── internal_callee2
│ ├── input.arch_emb
│ ├── input.byte1
│ ├── input.byte2
│ ├── input.byte3
│ ├── input.byte4
│ ├── input.inst_pos_emb
│ ├── input.op_pos_emb
│ └── input.static
└── self # folder containing sequences of function instructions
├── input.arch_emb
├── input.byte1
├── input.byte2
├── input.byte3
├── input.byte4
├── input.inst_pos_emb
├── input.label
├── input.op_pos_emb
└── input.static
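The per-folder files above are parallel token streams describing the same function. Assuming one function per line and whitespace-separated tokens (an assumption about the file layout, not something the scripts guarantee), a quick sanity check that all field files of a folder stay aligned could look like:

```python
from pathlib import Path

# The eight parallel field files produced per caller/callee/self folder.
FIELDS = ["input.arch_emb", "input.byte1", "input.byte2", "input.byte3",
          "input.byte4", "input.inst_pos_emb", "input.op_pos_emb",
          "input.static"]

def check_alignment(folder):
    """Verify every field file has the same number of lines and that
    corresponding lines carry the same number of tokens."""
    lines = {f: Path(folder, f).read_text().splitlines() for f in FIELDS}
    line_counts = {len(v) for v in lines.values()}
    assert len(line_counts) == 1, "field files differ in line count"
    for row in zip(*(lines[f] for f in FIELDS)):
        token_counts = {len(r.split()) for r in row}
        assert len(token_counts) == 1, "misaligned tokens within a row"
    return next(iter(line_counts))  # number of functions in the folder
```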
If you have multiple binaries, concatenate the lines of the corresponding files into the training, validation, and test set files. For example, if dozens of binaries make up the training set, append each binary's self/input.label lines to the training set's self/input.label.
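The merge step above is plain line concatenation. A minimal sketch (merge_split is a hypothetical helper, not part of the repository's scripts):

```python
from pathlib import Path

def merge_split(binary_dirs, split_dir, relpath="self/input.label"):
    """Concatenate the same relative file from each per-binary output
    directory into one split-level file (e.g. the training set)."""
    out_file = Path(split_dir, relpath)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("w") as out:
        for d in binary_dirs:
            out.write(Path(d, relpath).read_text())
```

The same call would be repeated for every file name (input.static, input.byte1, ...) and every folder (self, caller1, ...), keeping the line order identical across files so the parallel streams stay aligned.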
For dataset preparation, we filter out internal functions whose bodies are too large or too small, based on the number of tokens in the function body. For more details, please refer to the corresponding filtering logic in prepare_dataset.py.
Moreover, you can set the number of callers and callees considered via the --topK flag of prepare_dataset.py. In our experience, this parameter is bounded by the available GPU memory.
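The size filter amounts to keeping only functions whose token count falls within a fixed range. A sketch, with illustrative bounds (5 and 512 are placeholders, not the thresholds prepare_dataset.py actually uses):

```python
def filter_functions(functions, min_tokens=5, max_tokens=512):
    """Keep internal functions whose bodies fall within a token budget.
    `functions` maps a function name to its body as a list of tokens."""
    return {name: toks for name, toks in functions.items()
            if min_tokens <= len(toks) <= max_tokens}
```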
We provide a sample dataset for x64 binaries under the dataset_sample directory, which contains training, validation, and test datasets generated by the above steps.
Dataset encoding encodes the tokens of the dataset generated in the step above and produces binarized files, which are more efficient for training and testing.
The vocabularies of input.arch_emb, input.byte1, input.byte2, input.byte3, input.byte4, input.inst_pos_emb, input.op_pos_emb, and input.static are fixed for each architecture, so we provide them under the vocabulary directory. Note that, for input.static, we only cover tokens of {x64, x86, arm, mips}.
However, the vocabulary of input.label is specific to the binary dataset; it can be generated by our script get_vocb_for_binarization.py.
cd dataset_generation # make sure you are in the dataset_generation folder
python get_vocb_for_binarization.py --src_file path_to_source_file --output_dir path_to_output_dir
Since we treat internal and external functions differently, their vocabularies should be generated separately.
To elaborate, we use the sample dataset under dataset_sample directory as an example.
For internal functions, generate their vocabulary by
python get_vocb_for_binarization.py --src_file dataset_sample/train/self/input.label --output_dir vocabulary/label/
For external functions, first merge the function names under all external callee directories, and then use the same step to get their vocabulary.
cat dataset_sample/train/external_callee1/input.label dataset_sample/train/external_callee2/input.label >> vocabulary/external_label/src_file.label
python get_vocb_for_binarization.py --src_file vocabulary/external_label/src_file.label --output_dir vocabulary/external_label/
We provide the vocabularies of both internal and external functions under the vocabulary/label and vocabulary/external_label directories.
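Conceptually, a vocabulary pass of this kind counts token frequencies over the source file and emits the tokens in descending-frequency order. A minimal sketch, assuming whitespace-tokenized lines (the actual script's file format may differ):

```python
from collections import Counter

def build_vocab(src_lines):
    """Count token frequencies across all label lines and return the
    tokens sorted by descending frequency, the usual vocabulary order."""
    counts = Counter(tok for line in src_lines for tok in line.split())
    return [tok for tok, _ in counts.most_common()]
```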
To binarize the dataset, run the binarize_dataset.py script. For example, to binarize the sample dataset (under the dataset_sample directory) with the above vocabularies, run the following command:
python binarize_dataset.py --data_src_dir dataset_sample/ --data_bin_dir ../data_bin/
The resulting binarized dataset is under the ../data_bin directory.
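At its core, binarization replaces each token with its integer index in the vocabulary so training never has to parse text. A sketch of the idea (the `<unk>` fallback for out-of-vocabulary tokens is a common convention, assumed here rather than taken from binarize_dataset.py):

```python
def binarize(lines, vocab, unk="<unk>"):
    """Map each token to its integer id in the vocabulary; tokens not in
    the vocabulary fall back to a shared <unk> id appended at the end."""
    stoi = {tok: i for i, tok in enumerate(vocab)}
    unk_id = stoi.setdefault(unk, len(stoi))
    return [[stoi.get(tok, unk_id) for tok in line.split()]
            for line in lines]
```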