@@ -13,17 +13,21 @@ DESCRIPTION
1313
1414:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
1515generates IR2Vec embeddings for LLVM IR and supports triplet generation
16- for vocabulary training. It provides two main operation modes:
16+ for vocabulary training. It provides three main operation modes:
1717
18- 1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
18+ 1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
1919 training from LLVM IR.
2020
21- 2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
21+ 2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
22+ training.
23+
24+ 3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
2225 at different granularity levels (instruction, basic block, or function).
2326
2427The tool is designed to facilitate machine learning applications that work with
2528LLVM IR by converting the IR into numerical representations that can be used by
26- ML models.
29+ ML models. The triplet mode generates numeric IDs directly instead of string
30+ triplets, streamlining the training data preparation workflow.
2731
2832.. note::
2933
@@ -34,18 +38,46 @@ ML models.
3438OPERATION MODES
3539---------------
3640
41+ Triplet Generation and Entity Mapping Modes are used for preparing
42+ vocabulary and training data for knowledge graph embeddings. The Embedding Mode
43+ is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
44+
45+ The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
46+ by modeling the relationships between opcodes, types, and operands as a knowledge
47+ graph. For this purpose, Triplet Generation and Entity Mapping Modes generate
48+ triplets and entity mappings in the standard format used for knowledge graph
49+ embedding training (see
50+ <https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>
51+ for details).
52+
3753Triplet Generation Mode
3854~~~~~~~~~~~~~~~~~~~~~~~
3955
40- In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
41- consisting of opcodes, types, and operands. These triplets can be used to train
42- vocabularies for embedding generation.
56+ In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
57+ triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
58+ are generated in train2id format. The tool outputs numeric IDs directly using
59+ the ir2vec::Vocabulary mapping infrastructure, eliminating the need for
60+ string-to-ID preprocessing.
61+
62+ Usage:
63+
64+ .. code-block:: bash
65+
66+ llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
67+
68+ Entity Mapping Generation Mode
69+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
70+
71+ In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
72+ IR2Vec in entity2id format. This mode outputs all supported entities (opcodes,
73+ types, and operands) with their corresponding numeric IDs, and is not specific to
74+ any particular LLVM IR file.
4375
4476Usage:
4577
4678.. code-block:: bash
4779
48- llvm-ir2vec --mode=triplets input.bc -o triplets.txt
80+ llvm-ir2vec --mode=entities -o entity2id.txt
4981
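The two preparation modes above are typically run one after the other. The following is a minimal sketch of a Python driver, assuming :program:`llvm-ir2vec` is on ``PATH``; the helper names (``triplet_command``, ``entity_command``, ``prepare_training_data``) and the output directory layout are hypothetical, and only the flags shown in the usage examples above are used.

```python
import subprocess
from pathlib import Path


def triplet_command(ir_file, out_file, tool="llvm-ir2vec"):
    """Build the argv list for triplet mode (needs an input IR file)."""
    return [tool, "--mode=triplets", str(ir_file), "-o", str(out_file)]


def entity_command(out_file, tool="llvm-ir2vec"):
    """Build the argv list for entity mode (takes no input IR file)."""
    return [tool, "--mode=entities", "-o", str(out_file)]


def prepare_training_data(ir_file, out_dir, tool="llvm-ir2vec"):
    """Run both preparation modes and return the two output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    triplets = out_dir / "triplets_train2id.txt"
    entities = out_dir / "entity2id.txt"
    # check=True raises CalledProcessError if the tool exits non-zero.
    subprocess.run(triplet_command(ir_file, triplets, tool), check=True)
    subprocess.run(entity_command(entities, tool), check=True)
    return triplets, entities
```

Keeping command construction separate from execution makes the argument lists easy to inspect or log before anything is run.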
5082 Embedding Generation Mode
5183~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -67,6 +99,7 @@ OPTIONS
6799 Specify the operation mode. Valid values are:
68100
69101 * ``triplets`` - Generate triplets for vocabulary training
102+ * ``entities`` - Generate entity mappings for vocabulary training
70103 * ``embeddings`` - Generate embeddings using trained vocabulary (default)
71104
72105.. option:: --level=<level>
@@ -115,7 +148,7 @@ OPTIONS
115148
116149 ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
117150 ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
118- mode. These options are ignored in triplet mode.
151+ mode. These options are ignored in triplet and entity modes.
119152
120153INPUT FILE FORMAT
121154-----------------
@@ -129,14 +162,34 @@ OUTPUT FORMAT
129162Triplet Mode Output
130163~~~~~~~~~~~~~~~~~~~
131164
132- In triplet mode, the output consists of lines containing space-separated triplets:
165+ In triplet mode, the output consists of numeric triplets in train2id format,
166+ preceded by a metadata header. The format is:
167+
168+ .. code-block:: text
169+
170+ MAX_RELATIONS=<max_relations_count>
171+ <head_entity_id> <tail_entity_id> <relation_id>
172+ <head_entity_id> <tail_entity_id> <relation_id>
173+ ...
174+
175+ Each line after the metadata header represents one instruction relationship,
176+ with numeric IDs for head entity, tail entity, and relation, in that order. The
177+ metadata header (MAX_RELATIONS) provides the relation count for post-processing and training setup.
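As an illustration of consuming this output, here is a minimal Python sketch of a reader for the layout shown above. It assumes the only header line is ``MAX_RELATIONS=<n>`` and that every other non-empty line holds three whitespace-separated integers; the function name ``parse_train2id`` is hypothetical and the sample IDs in the usage example are made up.

```python
def parse_train2id(text):
    """Return (max_relations, triplets) from triplet-mode output text.

    Triplets keep the on-disk column order: (head, tail, relation).
    """
    max_relations = None
    triplets = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("MAX_RELATIONS="):
            # Header line: everything after '=' is the relation count.
            max_relations = int(line.split("=", 1)[1])
            continue
        head, tail, relation = (int(field) for field in line.split())
        triplets.append((head, tail, relation))
    return max_relations, triplets


# Illustrative input only; real IDs come from the tool.
sample = "MAX_RELATIONS=3\n0 5 0\n5 7 1\n"
max_relations, triplets = parse_train2id(sample)
```

A line with anything other than three integer fields raises ``ValueError``, which is usually preferable to silently skipping malformed training data.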
178+
179+ Entity Mode Output
180+ ~~~~~~~~~~~~~~~~~~
181+
182+ In entity mode, the output consists of entity mappings in the following format:
133183
134184.. code-block:: text
135185
136- <opcode> <type> <operand1> <operand2> ...
186+ <total_entities>
187+ <entity_string> <numeric_id>
188+ <entity_string> <numeric_id>
189+ ...
137190
138- Each line represents the information of one instruction, with the opcode, type,
139- and operands.
191+ The first line contains the total number of entities, followed by one entity
192+ mapping per line with tab-separated entity string and numeric ID.
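A reader for this mapping can be sketched in a few lines of Python. The sketch below assumes the layout described above (a count line followed by ``<entity_string> <numeric_id>`` pairs); the function name ``parse_entity2id`` is hypothetical, and the entity names in the usage example are placeholders rather than actual tool output.

```python
def parse_entity2id(text):
    """Return a dict mapping entity string -> numeric ID."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty entity2id input")
    total = int(lines[0])
    mapping = {}
    for line in lines[1:]:
        # Split on the last run of whitespace, so tabs or spaces both work.
        name, numeric_id = line.rsplit(None, 1)
        mapping[name] = int(numeric_id)
    if len(mapping) != total:
        raise ValueError(f"expected {total} entities, found {len(mapping)}")
    return mapping


# Placeholder entity names for illustration only.
entities = parse_entity2id("2\nadd\t0\nret\t1\n")
```

Checking the parsed count against the first line catches truncated files early.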
140193
141194Embedding Mode Output
142195~~~~~~~~~~~~~~~~~~~~~