Commit 10bec2c

[llvm-ir2vec][MIR2Vec] Supporting MIR mode in triplet and entity generation (llvm#164329)
Add support for Machine IR (MIR) triplet and entity generation in llvm-ir2vec.

This change extends llvm-ir2vec to support Machine IR (MIR) in addition to LLVM IR, enabling the generation of training data for MIR2Vec embeddings. MIR2Vec provides machine-level code embeddings that capture target-specific instruction semantics, complementing the target-independent IR2Vec embeddings.

- Extended llvm-ir2vec to support triplet and entity generation for Machine IR (MIR)
- Added `--mode=mir` option to specify MIR mode (vs LLVM IR mode)
- Implemented MIR triplet generation with Next and Arg relationships
- Added entity mapping generation for MIR vocabulary
- Updated documentation to explain MIR-specific features and usage

(Partially addresses llvm#162200; tracking issue: llvm#141817)

1 parent 2b6686f commit 10bec2c

8 files changed, +7621 -48 lines changed
llvm/docs/CommandGuide/llvm-ir2vec.rst

Lines changed: 69 additions & 19 deletions
@@ -68,32 +68,52 @@ these two modes are used to generate the triplets and entity mappings.
 Triplet Generation
 ~~~~~~~~~~~~~~~~~~

-With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts
-numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
+With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR or Machine IR
+and extracts numeric triplets consisting of opcode IDs and operand IDs. These triplets
 are generated in the standard format used for knowledge graph embedding training.
-The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
-infrastructure, eliminating the need for string-to-ID preprocessing.
+The tool outputs numeric IDs directly using the vocabulary mapping infrastructure,
+eliminating the need for string-to-ID preprocessing.

-Usage:
+Usage for LLVM IR:

 .. code-block:: bash

-   llvm-ir2vec triplets input.bc -o triplets_train2id.txt
+   llvm-ir2vec triplets --mode=llvm input.bc -o triplets_train2id.txt
+
+Usage for Machine IR:
+
+.. code-block:: bash
+
+   llvm-ir2vec triplets --mode=mir input.mir -o triplets_train2id.txt

 Entity Mapping Generation
 ~~~~~~~~~~~~~~~~~~~~~~~~~

 With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
-supported by IR2Vec in the standard format used for knowledge graph embedding
-training. This subcommand outputs all supported entities (opcodes, types, and
-operands) with their corresponding numeric IDs, and is not specific for an
-LLVM IR file.
+supported by IR2Vec or MIR2Vec in the standard format used for knowledge graph embedding
+training. This subcommand outputs all supported entities with their corresponding numeric IDs.
+
+For LLVM IR, entities include opcodes, types, and operands. For Machine IR, entities include
+machine opcodes, common operands, and register classes (both physical and virtual).
+
+Usage for LLVM IR:

-Usage:
+.. code-block:: bash
+
+   llvm-ir2vec entities --mode=llvm -o entity2id.txt
+
+Usage for Machine IR:

 .. code-block:: bash

-   llvm-ir2vec entities -o entity2id.txt
+   llvm-ir2vec entities --mode=mir input.mir -o entity2id.txt
+
+.. note::
+
+   For LLVM IR mode, the entity mapping is target-independent and does not require an input file.
+   For Machine IR mode, an input .mir file is required to determine the target architecture,
+   as entity mappings vary by target (different architectures have different instruction sets
+   and register classes).

 Embedding Generation
 ~~~~~~~~~~~~~~~~~~~~
@@ -222,12 +242,17 @@ Subcommand-specific options:

 .. option:: <input-file>

-   The input LLVM IR or bitcode file to process. This positional argument is
-   required for the `triplets` subcommand.
+   The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
+   This positional argument is required for the `triplets` subcommand.

 **entities** subcommand:

-   No subcommand-specific options.
+.. option:: <input-file>
+
+   The input Machine IR file (.mir) to process. This positional argument is required
+   for the `entities` subcommand when using ``--mode=mir``, as the entity mappings
+   are target-specific. For ``--mode=llvm``, no input file is required as IR2Vec
+   entity mappings are target-independent.

 OUTPUT FORMAT
 -------------
@@ -240,19 +265,37 @@ metadata headers. The format includes:

 .. code-block:: text

-   MAX_RELATIONS=<max_relations_count>
+   MAX_RELATION=<max_relation_count>
    <head_entity_id> <tail_entity_id> <relation_id>
    <head_entity_id> <tail_entity_id> <relation_id>
    ...

 Each line after the metadata header represents one instruction relationship,
-with numeric IDs for head entity, relation, and tail entity. The metadata
-header (MAX_RELATIONS) provides counts for post-processing and training setup.
+with numeric IDs for head entity, tail entity, and relation type. The metadata
+header (MAX_RELATION) indicates the maximum relation ID used.
+
+**Relation Types:**
+
+For LLVM IR (IR2Vec):
+* **0** = Type relationship (instruction to its type)
+* **1** = Next relationship (sequential instructions)
+* **2+** = Argument relationships (Arg0, Arg1, Arg2, ...)
+
+For Machine IR (MIR2Vec):
+* **0** = Next relationship (sequential instructions)
+* **1+** = Argument relationships (Arg0, Arg1, Arg2, ...)
+
+**Entity IDs:**
+
+For LLVM IR: Entity IDs represent opcodes, types, and operands as defined by the IR2Vec vocabulary.
+
+For Machine IR: Entity IDs represent machine opcodes, common operands (immediate, frame index, etc.),
+physical register classes, and virtual register classes as defined by the MIR2Vec vocabulary. The entity layout is target-specific.

 Entity Mode Output
 ~~~~~~~~~~~~~~~~~~

-In entity mode, the output consists of entity mapping in the format:
+In entity mode, the output consists of entity mappings in the format:

 .. code-block:: text

@@ -264,6 +307,13 @@ In entity mode, the output consists of entity mapping in the format:
 The first line contains the total number of entities, followed by one entity
 mapping per line with tab-separated entity string and numeric ID.

+For LLVM IR, entities include instruction opcodes (e.g., "Add", "Ret"), types
+(e.g., "INT", "PTR"), and operand kinds.
+
+For Machine IR, entities include machine opcodes (e.g., "COPY", "ADD"),
+common operands (e.g., "Immediate", "FrameIndex"), physical register classes
+(e.g., "PhyReg_GR32"), and virtual register classes (e.g., "VirtReg_GR32").
+
 Embedding Mode Output
 ~~~~~~~~~~~~~~~~~~~~~
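
As an illustration of the entity-mode output described in the documentation change above (a count line, then one tab-separated entity string and numeric ID per line), a downstream consumer might read the file roughly as follows. This is a minimal Python sketch and not part of the commit: `parse_entity_mapping` and the sample IDs are hypothetical, and only the entity names `COPY`, `ADD`, and `Immediate` are taken from the documentation.

```python
def parse_entity_mapping(text):
    """Parse entity-mode output: a count line, then '<entity>\\t<id>' lines.

    Hypothetical helper for illustration; not part of llvm-ir2vec.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])  # first line: total number of entities
    entries = lines[1:]
    if len(entries) != declared:
        raise ValueError(f"expected {declared} entities, found {len(entries)}")

    mapping = {}
    for ln in entries:
        # Split on the last tab so the entity string is kept whole.
        name, raw_id = ln.rsplit("\t", 1)
        mapping[name] = int(raw_id)
    return mapping


# Made-up sample: the IDs below are placeholders, not real vocabulary IDs.
SAMPLE = "3\nCOPY\t0\nADD\t1\nImmediate\t2\n"
ENTITIES = parse_entity_mapping(SAMPLE)
```

Checking the declared count against the number of mapping lines is a cheap way to catch a truncated file before training starts.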

llvm/include/llvm/CodeGen/MIR2Vec.h

Lines changed: 38 additions & 0 deletions
@@ -111,6 +111,11 @@ class MIRVocabulary {
     size_t TotalEntries = 0;
   } Layout;

+  // TODO: See if we can have only one reg classes section instead of physical
+  // and virtual separate sections in the vocabulary. This would reduce the
+  // number of vocabulary entities significantly.
+  // We can potentially distinguish physical and virtual registers by
+  // considering them as a separate feature.
   enum class Section : unsigned {
     Opcodes = 0,
     CommonOperands = 1,
@@ -185,6 +190,25 @@ class MIRVocabulary {
     return Storage[static_cast<unsigned>(SectionID)][LocalIndex];
   }

+  /// Get entity ID (flat index) for a common operand type
+  /// This is used for triplet generation
+  unsigned getEntityIDForCommonOperand(
+      MachineOperand::MachineOperandType OperandType) const {
+    return Layout.CommonOperandBase + getCommonOperandIndex(OperandType);
+  }
+
+  /// Get entity ID (flat index) for a register
+  /// This is used for triplet generation
+  unsigned getEntityIDForRegister(Register Reg) const {
+    if (!Reg.isValid() || Reg.isStack())
+      return Layout
+          .VirtRegBase; // Return VirtRegBase for invalid/stack registers
+    unsigned LocalIndex = getRegisterOperandIndex(Reg);
+    size_t BaseOffset =
+        Reg.isPhysical() ? Layout.PhyRegBase : Layout.VirtRegBase;
+    return BaseOffset + LocalIndex;
+  }
+
 public:
   /// Static method for extracting base opcode names (public for testing)
   static std::string extractBaseOpcodeName(StringRef InstrName);
@@ -201,6 +225,20 @@ class MIRVocabulary {

   unsigned getDimension() const { return Storage.getDimension(); }

+  /// Get entity ID (flat index) for an opcode
+  /// This is used for triplet generation
+  unsigned getEntityIDForOpcode(unsigned Opcode) const {
+    return Layout.OpcodeBase + getCanonicalOpcodeIndex(Opcode);
+  }
+
+  /// Get entity ID (flat index) for a machine operand
+  /// This is used for triplet generation
+  unsigned getEntityIDForMachineOperand(const MachineOperand &MO) const {
+    if (MO.getType() == MachineOperand::MO_Register)
+      return getEntityIDForRegister(MO.getReg());
+    return getEntityIDForCommonOperand(MO.getType());
+  }
+
   // Accessor methods
   const Embedding &operator[](unsigned Opcode) const {
     unsigned LocalIndex = getCanonicalOpcodeIndex(Opcode);
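
The new accessors above all compute a flat entity ID as a section base plus a local index. The Python sketch below models that idea outside of LLVM. It is an illustration under stated assumptions, not the actual implementation: the contiguous section order (opcodes, common operands, physical register classes, virtual register classes), the section sizes, and every Python name are hypothetical. The diff itself only shows the `Layout` base fields and the fallback to `VirtRegBase` for invalid or stack registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Base offset of each vocabulary section in the flat ID space."""
    opcode_base: int
    common_operand_base: int
    phy_reg_base: int
    virt_reg_base: int
    total_entries: int


def make_layout(n_opcodes, n_common, n_phy, n_virt):
    # Assumption: sections are laid out back to back in this order.
    opcode_base = 0
    common_operand_base = opcode_base + n_opcodes
    phy_reg_base = common_operand_base + n_common
    virt_reg_base = phy_reg_base + n_phy
    return Layout(opcode_base, common_operand_base, phy_reg_base,
                  virt_reg_base, virt_reg_base + n_virt)


def entity_id_for_opcode(layout, local_index):
    return layout.opcode_base + local_index


def entity_id_for_common_operand(layout, local_index):
    return layout.common_operand_base + local_index


def entity_id_for_register(layout, local_index, is_physical, is_valid=True):
    # Mirrors the fallback in the diff: invalid registers map to the
    # first virtual register slot.
    if not is_valid:
        return layout.virt_reg_base
    base = layout.phy_reg_base if is_physical else layout.virt_reg_base
    return base + local_index


# Hypothetical section sizes, chosen only to make the arithmetic visible.
LAYOUT = make_layout(n_opcodes=4, n_common=3, n_phy=2, n_virt=2)
```

With these made-up sizes, opcodes occupy IDs 0 to 3, common operands 4 to 6, physical register classes 7 to 8, and virtual register classes 9 to 10, so every entity gets a unique ID in a single flat range.
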
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+# REQUIRES: x86_64-linux
+# RUN: llvm-ir2vec entities --mode=mir %s -o 2>&1 %t1.log
+# RUN: diff %S/output/reference_x86_entities.txt %t1.log
+
+--- |
+  target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+  target triple = "x86_64-unknown-linux-gnu"
+
+  define dso_local noundef i32 @test_function(i32 noundef %a) {
+  entry:
+    ret i32 %a
+  }
+...
+---
+name: test_function
+alignment: 16
+tracksRegLiveness: true
+registers:
+  - { id: 0, class: gr32 }
+liveins:
+  - { reg: '$edi', virtual-reg: '%0' }
+body: |
+  bb.0.entry:
+    liveins: $edi
+
+    %0:gr32 = COPY $edi
+    $eax = COPY %0
+    RET 0, $eax
Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# Don't treat files in this directory as tests
2+
# These are reference data files, not test scripts
3+
config.suffixes = []
Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
MAX_RELATION=4
2+
187 7072 1
3+
187 6968 2
4+
187 187 0
5+
187 7072 1
6+
187 6969 2
7+
187 10 0
8+
10 7072 1
9+
10 7072 2
10+
10 7072 3
11+
10 6961 4
12+
10 187 0
13+
187 6952 1
14+
187 7072 2
15+
187 1555 0
16+
1555 6882 1
17+
1555 6952 2
18+
187 7072 1
19+
187 6968 2
20+
187 187 0
21+
187 7072 1
22+
187 6969 2
23+
187 601 0
24+
601 7072 1
25+
601 7072 2
26+
601 7072 3
27+
601 6961 4
28+
601 187 0
29+
187 6952 1
30+
187 7072 2
31+
187 1555 0
32+
1555 6882 1
33+
1555 6952 2
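
Reference data like the file above can be read back using the triplet format from the documentation change: a `MAX_RELATION=<n>` header, then `<head_entity_id> <tail_entity_id> <relation_id>` per line. The Python sketch below is a hypothetical consumer, not part of the commit. `parse_triplets` and `relation_name` are made-up names, and the relation numbering follows the documented tables (MIR: 0 = Next, 1+ = Arg; LLVM IR: 0 = Type, 1 = Next, 2+ = Arg). The sample reuses the first three data lines of the reference file. Reading them with the MIR numbering is an assumption, since the file's path is not shown here.

```python
def relation_name(relation_id, mode):
    """Map a numeric relation ID to a label, per the documented tables."""
    if relation_id < 0:
        raise ValueError("relation IDs are non-negative")
    if mode == "mir":
        return "Next" if relation_id == 0 else f"Arg{relation_id - 1}"
    if mode == "llvm":
        if relation_id == 0:
            return "Type"
        if relation_id == 1:
            return "Next"
        return f"Arg{relation_id - 2}"
    raise ValueError(f"unknown mode: {mode!r}")


def parse_triplets(text):
    """Return (max_relation, [(head, tail, relation), ...])."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    key, _, value = lines[0].partition("=")
    if key != "MAX_RELATION":
        raise ValueError(f"unexpected header: {lines[0]!r}")
    max_relation = int(value)

    triplets = []
    for ln in lines[1:]:
        head, tail, relation = (int(field) for field in ln.split())
        # The header is documented as the maximum relation ID used.
        if relation > max_relation:
            raise ValueError(f"relation {relation} exceeds {max_relation}")
        triplets.append((head, tail, relation))
    return max_relation, triplets


# Header plus the first three data lines of the reference file above.
SAMPLE = "MAX_RELATION=4\n187 7072 1\n187 6968 2\n187 187 0\n"
MAX_REL, TRIPLETS = parse_triplets(SAMPLE)
LABELS = [relation_name(rel, "mir") for _, _, rel in TRIPLETS]
```

Note the column order: the tail entity comes before the relation ID, so a reader that assumes head, relation, tail would silently swap the two.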
