@@ -68,32 +68,52 @@ these two modes are used to generate the triplets and entity mappings.
6868Triplet Generation
6969~~~~~~~~~~~~~~~~~~
7070
71- With the `triplets ` subcommand, :program: `llvm-ir2vec ` analyzes LLVM IR and extracts
72- numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
71+ With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR or Machine IR
72+ and extracts numeric triplets consisting of opcode IDs and operand IDs. These triplets
7373are generated in the standard format used for knowledge graph embedding training.
74- The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
75- infrastructure, eliminating the need for string-to-ID preprocessing.
74+ The tool outputs numeric IDs directly using the vocabulary mapping infrastructure,
75+ eliminating the need for string-to-ID preprocessing.
7676
77- Usage:
77+ Usage for LLVM IR:
7878
7979.. code-block:: bash
8080
81- llvm-ir2vec triplets input.bc -o triplets_train2id.txt
81+ llvm-ir2vec triplets --mode=llvm input.bc -o triplets_train2id.txt
82+
83+ Usage for Machine IR:
84+
85+ .. code-block:: bash
86+
87+ llvm-ir2vec triplets --mode=mir input.mir -o triplets_train2id.txt
8288
8389 Entity Mapping Generation
8490~~~~~~~~~~~~~~~~~~~~~~~~~
8591
8692With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
87- supported by IR2Vec in the standard format used for knowledge graph embedding
88- training. This subcommand outputs all supported entities (opcodes, types, and
89- operands) with their corresponding numeric IDs, and is not specific for an
90- LLVM IR file.
93+ supported by IR2Vec or MIR2Vec in the standard format used for knowledge graph embedding
94+ training. This subcommand outputs all supported entities with their corresponding numeric IDs.
95+
96+ For LLVM IR, entities include opcodes, types, and operands. For Machine IR, entities include
97+ machine opcodes, common operands, and register classes (both physical and virtual).
98+
99+ Usage for LLVM IR:
91100
92- Usage:
101+ .. code-block:: bash
102+
103+ llvm-ir2vec entities --mode=llvm -o entity2id.txt
104+
105+ Usage for Machine IR:
93106
94107.. code-block:: bash
95108
96- llvm-ir2vec entities -o entity2id.txt
109+ llvm-ir2vec entities --mode=mir input.mir -o entity2id.txt
110+
111+ .. note::
112+
113+ For LLVM IR mode, the entity mapping is target-independent and does not require an input file.
114+ For Machine IR mode, an input .mir file is required to determine the target architecture,
115+ as entity mappings vary by target (different architectures have different instruction sets
116+ and register classes).
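The input-file rule in the note above can be captured in a small wrapper. The following is a minimal sketch, not part of the tool: ``build_entities_command`` is a hypothetical helper name, and it only assembles the argument list using the flags shown in the usage examples (``--mode``, ``-o``) without running anything.

```python
def build_entities_command(mode, output, input_file=None):
    """Assemble an `llvm-ir2vec entities` argument list (hypothetical helper)."""
    if mode not in ("llvm", "mir"):
        raise ValueError(f"unsupported mode: {mode!r}")
    cmd = ["llvm-ir2vec", "entities", f"--mode={mode}"]
    if mode == "mir":
        # Machine IR entity mappings are target-specific, so a .mir input
        # is needed to determine the target architecture.
        if input_file is None:
            raise ValueError("--mode=mir requires an input .mir file")
        cmd.append(input_file)
    # In llvm mode the mapping is target-independent; no input file is added.
    cmd += ["-o", output]
    return cmd


llvm_cmd = build_entities_command("llvm", "entity2id.txt")
mir_cmd = build_entities_command("mir", "entity2id.txt", "input.mir")
```

The resulting lists can be passed to ``subprocess.run`` if the tool is on ``PATH``.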
97117
98118Embedding Generation
99119~~~~~~~~~~~~~~~~~~~~
@@ -222,12 +242,17 @@ Subcommand-specific options:
222242
223243.. option:: <input-file>
224244
225- The input LLVM IR or bitcode file to process. This positional argument is
226- required for the `triplets ` subcommand.
245+ The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
246+ This positional argument is required for the `triplets` subcommand.
227247
228248**entities** subcommand:
229249
230- No subcommand-specific options.
250+ .. option:: <input-file>
251+
252+ The input Machine IR file (.mir) to process. This positional argument is required
253+ for the `entities` subcommand when using ``--mode=mir``, as the entity mappings
254+ are target-specific. For ``--mode=llvm``, no input file is required as IR2Vec
255+ entity mappings are target-independent.
231256
232257OUTPUT FORMAT
233258-------------
@@ -240,19 +265,37 @@ metadata headers. The format includes:
240265
241266.. code-block:: text
242267
243- MAX_RELATIONS=<max_relations_count >
268+ MAX_RELATION=<max_relation_count>
244269 <head_entity_id> <tail_entity_id> <relation_id>
245270 <head_entity_id> <tail_entity_id> <relation_id>
246271 ...
247272
248273 Each line after the metadata header represents one instruction relationship,
249- with numeric IDs for head entity, relation, and tail entity. The metadata
250- header (MAX_RELATIONS) provides counts for post-processing and training setup.
274+ with numeric IDs for head entity, tail entity, and relation type. The metadata
275+ header (MAX_RELATION) indicates the maximum relation ID used.
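As an illustration of the layout described above, here is a minimal parsing sketch. ``parse_triplets`` is a hypothetical helper, the sample IDs are made up, and the code assumes fields are whitespace-separated and that any ``KEY=VALUE`` line is a metadata header.

```python
def parse_triplets(text):
    """Parse triplet-mode output into (headers, triplets)."""
    headers = {}
    triplets = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            # Metadata header such as MAX_RELATION=<n>
            key, _, value = line.partition("=")
            headers[key.strip()] = int(value)
            continue
        # Data line: <head_entity_id> <tail_entity_id> <relation_id>
        head, tail, relation = (int(field) for field in line.split())
        triplets.append((head, tail, relation))
    if "MAX_RELATION" not in headers:
        raise ValueError("missing MAX_RELATION header")
    return headers, triplets


# Made-up sample data, for illustration only.
sample = "MAX_RELATION=3\n10 42 0\n10 11 1\n10 57 2\n10 58 3\n"
headers, triplets = parse_triplets(sample)
```

Since ``MAX_RELATION`` is the maximum relation ID used, a consumer could check that no parsed relation ID exceeds it.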
276+
277+ **Relation Types:**
278+
279+ For LLVM IR (IR2Vec):
280+ * **0** = Type relationship (instruction to its type)
281+ * **1** = Next relationship (sequential instructions)
282+ * **2+** = Argument relationships (Arg0, Arg1, Arg2, ...)
283+
284+ For Machine IR (MIR2Vec):
285+ * **0** = Next relationship (sequential instructions)
286+ * **1+** = Argument relationships (Arg0, Arg1, Arg2, ...)
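The relation numbering listed above can be decoded with a short helper. This is a sketch under the assumption that argument relations are numbered contiguously after the fixed relations; ``relation_name`` is a hypothetical function, not part of the tool.

```python
def relation_name(relation_id, mode):
    """Map a numeric relation ID to a readable label for the given mode."""
    if relation_id < 0:
        raise ValueError("relation IDs are non-negative")
    if mode == "llvm":
        fixed = ["Type", "Next"]  # IDs 0 and 1 in IR2Vec triplets
    elif mode == "mir":
        fixed = ["Next"]  # ID 0 in MIR2Vec triplets
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    if relation_id < len(fixed):
        return fixed[relation_id]
    # Remaining IDs are argument positions: Arg0, Arg1, Arg2, ...
    return f"Arg{relation_id - len(fixed)}"
```

For example, relation ID 2 decodes to ``Arg0`` in LLVM IR mode but to ``Arg1`` in Machine IR mode, so the mode used to generate a triplet file has to be known when interpreting it.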
287+
288+ **Entity IDs:**
289+
290+ For LLVM IR: Entity IDs represent opcodes, types, and operands as defined by the IR2Vec vocabulary.
291+
292+ For Machine IR: Entity IDs represent machine opcodes, common operands (immediate, frame index, etc.),
293+ physical register classes, and virtual register classes as defined by the MIR2Vec vocabulary. The entity layout is target-specific.
251294
252295Entity Mode Output
253296~~~~~~~~~~~~~~~~~~
254297
255- In entity mode, the output consists of entity mapping in the format:
298+ In entity mode, the output consists of entity mappings in the format:
256299
257300.. code-block:: text
258301
@@ -264,6 +307,13 @@ In entity mode, the output consists of entity mapping in the format:
264307 The first line contains the total number of entities, followed by one entity
265308mapping per line with tab-separated entity string and numeric ID.
266309
310+ For LLVM IR, entities include instruction opcodes (e.g., "Add", "Ret"), types
311+ (e.g., "INT", "PTR"), and operand kinds.
312+
313+ For Machine IR, entities include machine opcodes (e.g., "COPY", "ADD"),
314+ common operands (e.g., "Immediate", "FrameIndex"), physical register classes
315+ (e.g., "PhyReg_GR32"), and virtual register classes (e.g., "VirtReg_GR32").
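The entity file layout described above (a count line followed by tab-separated name and ID pairs) could be loaded as follows. This is a minimal sketch: ``parse_entities`` is a hypothetical helper, and the IDs in the sample are invented for illustration; real IDs come from the vocabulary.

```python
def parse_entities(text):
    """Parse entity-mode output into a dict of entity string -> numeric ID."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty entity file")
    expected_count = int(lines[0])  # first line: total number of entities
    entity_to_id = {}
    for line in lines[1:]:
        # Split on the last tab so the entity string is kept whole.
        name, _, raw_id = line.rpartition("\t")
        entity_to_id[name] = int(raw_id)
    if len(entity_to_id) != expected_count:
        raise ValueError("entity count does not match the header line")
    return entity_to_id


# Invented IDs; only the entity names are taken from the examples above.
sample = "4\nAdd\t0\nRet\t1\nINT\t2\nPTR\t3\n"
entities = parse_entities(sample)
```

Inverting the dictionary gives an ID-to-name lookup, which is convenient when inspecting the head and tail IDs in a triplet file.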
316+
267317Embedding Mode Output
268318~~~~~~~~~~~~~~~~~~~~~
269319