[llvm-ir2vec][MIR2Vec] Supporting MIR mode in triplet and entity generation #164329

svkeerthy · 2025-10-20T22:28:30Z

Add support for Machine IR (MIR) triplet and entity generation in llvm-ir2vec.

This change extends llvm-ir2vec to support Machine IR (MIR) in addition to LLVM IR, enabling the generation of training data for MIR2Vec embeddings. MIR2Vec provides machine-level code embeddings that capture target-specific instruction semantics, complementing the target-independent IR2Vec embeddings.

- Extended llvm-ir2vec to support triplet and entity generation for Machine IR (MIR)
- Added `--mode=mir` option to specify MIR mode (vs LLVM IR mode)
- Implemented MIR triplet generation with Next and Arg relationships
- Added entity mapping generation for MIR vocabulary
- Updated documentation to explain MIR-specific features and usage

(Partially addresses #162200 ; Tracking issue - #141817)
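
As a rough illustration of how the two subcommands combine with the new `--mode` option, the following Python sketch builds the command lines shown in the updated documentation. The helper name `ir2vec_argv` is hypothetical and not part of LLVM; only the `triplets` and `entities` subcommands, `--mode=llvm` / `--mode=mir`, the positional input file, and `-o` are taken from this PR.

```python
from typing import List, Optional


def ir2vec_argv(subcommand: str, mode: str, output: str,
                input_file: Optional[str] = None) -> List[str]:
    """Build an llvm-ir2vec command line (hypothetical helper, not an LLVM API)."""
    if subcommand not in ("triplets", "entities"):
        raise ValueError(f"unsupported subcommand: {subcommand}")
    if mode not in ("llvm", "mir"):
        raise ValueError(f"unsupported mode: {mode}")

    # Per the documentation in this PR: `triplets` always reads an input file,
    # while `entities` needs one only in MIR mode, where the input determines
    # the target and therefore the vocabulary.
    needs_input = subcommand == "triplets" or mode == "mir"
    if needs_input and input_file is None:
        raise ValueError(f"{subcommand} --mode={mode} requires an input file")

    argv = ["llvm-ir2vec", subcommand, f"--mode={mode}"]
    if input_file is not None:
        argv.append(input_file)
    argv += ["-o", output]
    return argv


print(" ".join(ir2vec_argv("triplets", "mir", "triplets_train2id.txt", "input.mir")))
print(" ".join(ir2vec_argv("entities", "llvm", "entity2id.txt")))
```

The returned list could be handed to `subprocess.run` on a machine where `llvm-ir2vec` is on the `PATH`; the sketch itself only assembles arguments and does not invoke the tool.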

llvmbot · 2025-10-20T22:33:57Z

@llvm/pr-subscribers-mlgo

@llvm/pr-subscribers-llvm-binary-utilities

Author: S. VenkataKeerthy (svkeerthy)

Patch is 150.71 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/164329.diff

8 Files Affected:

(modified) llvm/docs/CommandGuide/llvm-ir2vec.rst (+69-19)
(modified) llvm/include/llvm/CodeGen/MIR2Vec.h (+39-1)
(added) llvm/test/tools/llvm-ir2vec/entities.mir (+28)
(added) llvm/test/tools/llvm-ir2vec/output/lit.local.cfg (+3)
(added) llvm/test/tools/llvm-ir2vec/output/reference_triplets.txt (+33)
(added) llvm/test/tools/llvm-ir2vec/output/reference_x86_entities.txt (+7174)
(added) llvm/test/tools/llvm-ir2vec/triplets.mir (+61)
(modified) llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp (+215-29)

diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
index 55fe75d2084b1..f51da065b43d8 100644
--- a/llvm/docs/CommandGuide/llvm-ir2vec.rst
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -68,32 +68,52 @@ these two modes are used to generate the triplets and entity mappings.
 Triplet Generation
 ~~~~~~~~~~~~~~~~~~
 
-With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts
-numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
+With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR or Machine IR
+and extracts numeric triplets consisting of opcode IDs and operand IDs. These triplets
 are generated in the standard format used for knowledge graph embedding training.
-The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
-infrastructure, eliminating the need for string-to-ID preprocessing.
+The tool outputs numeric IDs directly using the vocabulary mapping infrastructure,
+eliminating the need for string-to-ID preprocessing.
 
-Usage:
+Usage for LLVM IR:
 
 .. code-block:: bash
 
-   llvm-ir2vec triplets input.bc -o triplets_train2id.txt
+   llvm-ir2vec triplets --mode=llvm input.bc -o triplets_train2id.txt
+
+Usage for Machine IR:
+
+.. code-block:: bash
+
+   llvm-ir2vec triplets --mode=mir input.mir -o triplets_train2id.txt
 
 Entity Mapping Generation
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
 With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
-supported by IR2Vec in the standard format used for knowledge graph embedding
-training. This subcommand outputs all supported entities (opcodes, types, and
-operands) with their corresponding numeric IDs, and is not specific for an
-LLVM IR file.
+supported by IR2Vec or MIR2Vec in the standard format used for knowledge graph embedding
+training. This subcommand outputs all supported entities with their corresponding numeric IDs.
+
+For LLVM IR, entities include opcodes, types, and operands. For Machine IR, entities include
+machine opcodes, common operands, and register classes (both physical and virtual).
+
+Usage for LLVM IR:
 
-Usage:
+.. code-block:: bash
+
+   llvm-ir2vec entities --mode=llvm -o entity2id.txt
+
+Usage for Machine IR:
 
 .. code-block:: bash
 
-   llvm-ir2vec entities -o entity2id.txt
+   llvm-ir2vec entities --mode=mir input.mir -o entity2id.txt
+
+.. note::
+
+   For LLVM IR mode, the entity mapping is target-independent and does not require an input file.
+   For Machine IR mode, an input .mir file is required to determine the target architecture,
+   as entity mappings vary by target (different architectures have different instruction sets
+   and register classes).
 
 Embedding Generation
 ~~~~~~~~~~~~~~~~~~~~
@@ -222,12 +242,17 @@ Subcommand-specific options:
 
 .. option:: <input-file>
 
-   The input LLVM IR or bitcode file to process. This positional argument is
-   required for the `triplets` subcommand.
+   The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process. 
+   This positional argument is required for the `triplets` subcommand.
 
 **entities** subcommand:
 
-   No subcommand-specific options.
+.. option:: <input-file>
+
+   The input Machine IR file (.mir) to process. This positional argument is required
+   for the `entities` subcommand when using ``--mode=mir``, as the entity mappings
+   are target-specific. For ``--mode=llvm``, no input file is required as IR2Vec
+   entity mappings are target-independent.
 
 OUTPUT FORMAT
 -------------
@@ -240,19 +265,37 @@ metadata headers. The format includes:
 
 .. code-block:: text
 
-   MAX_RELATIONS=<max_relations_count>
+   MAX_RELATION=<max_relation_count>
    <head_entity_id> <tail_entity_id> <relation_id>
    <head_entity_id> <tail_entity_id> <relation_id>
    ...
 
 Each line after the metadata header represents one instruction relationship,
-with numeric IDs for head entity, relation, and tail entity. The metadata 
-header (MAX_RELATIONS) provides counts for post-processing and training setup.
+with numeric IDs for head entity, tail entity, and relation type. The metadata 
+header (MAX_RELATION) indicates the maximum relation ID used.
+
+**Relation Types:**
+
+For LLVM IR (IR2Vec):
+  * **0** = Type relationship (instruction to its type)
+  * **1** = Next relationship (sequential instructions)
+  * **2+** = Argument relationships (Arg0, Arg1, Arg2, ...)
+
+For Machine IR (MIR2Vec):
+  * **0** = Next relationship (sequential instructions)
+  * **1+** = Argument relationships (Arg0, Arg1, Arg2, ...)
+
+**Entity IDs:**
+
+For LLVM IR: Entity IDs represent opcodes, types, and operands as defined by the IR2Vec vocabulary.
+
+For Machine IR: Entity IDs represent machine opcodes, common operands (immediate, frame index, etc.),
+physical register classes, and virtual register classes as defined by the MIR2Vec vocabulary. The entity layout is target-specific.
 
 Entity Mode Output
 ~~~~~~~~~~~~~~~~~~
 
-In entity mode, the output consists of entity mapping in the format:
+In entity mode, the output consists of entity mappings in the format:
 
 .. code-block:: text
 
@@ -264,6 +307,13 @@ In entity mode, the output consists of entity mapping in the format:
 The first line contains the total number of entities, followed by one entity
 mapping per line with tab-separated entity string and numeric ID.
 
+For LLVM IR, entities include instruction opcodes (e.g., "Add", "Ret"), types 
+(e.g., "INT", "PTR"), and operand kinds.
+
+For Machine IR, entities include machine opcodes (e.g., "COPY", "ADD"), 
+common operands (e.g., "Immediate", "FrameIndex"), physical register classes 
+(e.g., "PhyReg_GR32"), and virtual register classes (e.g., "VirtReg_GR32").
+
 Embedding Mode Output
 ~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/llvm/include/llvm/CodeGen/MIR2Vec.h b/llvm/include/llvm/CodeGen/MIR2Vec.h
index f47d9abb042d8..696fe3957930e 100644
--- a/llvm/include/llvm/CodeGen/MIR2Vec.h
+++ b/llvm/include/llvm/CodeGen/MIR2Vec.h
@@ -111,6 +111,11 @@ class MIRVocabulary {
     size_t TotalEntries = 0;
   } Layout;
 
+  // ToDo: See if we can have only one reg classes section instead of physical
+  // and virtual separate sections in the vocabulary. This would reduce the
+  // number of vocabulary entities significantly.
+  // We can potentially distinguish physical and virtual registers by
+  // considering them as a separate feature.
   enum class Section : unsigned {
     Opcodes = 0,
     CommonOperands = 1,
@@ -125,7 +130,7 @@ class MIRVocabulary {
 
   // Some instructions have optional register operands that may be NoRegister.
   // We return a zero vector in such cases.
-  mutable Embedding ZeroEmbedding;
+  Embedding ZeroEmbedding;
 
   // We have specialized MO_Register handling in the Register operand section,
   // so we don't include it here. Also, no MO_DbgInstrRef for now.
@@ -185,6 +190,25 @@ class MIRVocabulary {
     return Storage[static_cast<unsigned>(SectionID)][LocalIndex];
   }
 
+  /// Get entity ID (flat index) for a common operand type
+  /// This is used for triplet generation
+  unsigned getEntityIDForCommonOperand(
+      MachineOperand::MachineOperandType OperandType) const {
+    return Layout.CommonOperandBase + getCommonOperandIndex(OperandType);
+  }
+
+  /// Get entity ID (flat index) for a register
+  /// This is used for triplet generation
+  unsigned getEntityIDForRegister(Register Reg) const {
+    if (!Reg.isValid() || Reg.isStack())
+      return Layout
+          .VirtRegBase; // Return VirtRegBase for invalid/stack registers
+    unsigned LocalIndex = getRegisterOperandIndex(Reg);
+    size_t BaseOffset =
+        Reg.isPhysical() ? Layout.PhyRegBase : Layout.VirtRegBase;
+    return BaseOffset + LocalIndex;
+  }
+
 public:
   /// Static method for extracting base opcode names (public for testing)
   static std::string extractBaseOpcodeName(StringRef InstrName);
@@ -201,6 +225,20 @@ class MIRVocabulary {
 
   unsigned getDimension() const { return Storage.getDimension(); }
 
+  /// Get entity ID (flat index) for an opcode
+  /// This is used for triplet generation
+  unsigned getEntityIDForOpcode(unsigned Opcode) const {
+    return Layout.OpcodeBase + getCanonicalOpcodeIndex(Opcode);
+  }
+
+  /// Get entity ID (flat index) for a machine operand
+  /// This is used for triplet generation
+  unsigned getEntityIDForMachineOperand(const MachineOperand &MO) const {
+    if (MO.getType() == MachineOperand::MO_Register)
+      return getEntityIDForRegister(MO.getReg());
+    return getEntityIDForCommonOperand(MO.getType());
+  }
+
   // Accessor methods
   const Embedding &operator[](unsigned Opcode) const {
     unsigned LocalIndex = getCanonicalOpcodeIndex(Opcode);
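
The `getEntityIDFor*` helpers added above all follow the same pattern: a section base offset plus a local index within that section. The Python model below restates that arithmetic as a sketch. The `Layout` field names mirror `OpcodeBase`, `CommonOperandBase`, `PhyRegBase`, and `VirtRegBase` from the header, but the numeric values are made up, and the local indices (which the C++ code obtains from `getCanonicalOpcodeIndex`, `getCommonOperandIndex`, and `getRegisterOperandIndex`) are simply passed in as integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Start offset of each vocabulary section in the flat entity ID space."""
    opcode_base: int
    common_operand_base: int
    phy_reg_base: int
    virt_reg_base: int


def entity_id_for_opcode(layout: Layout, canonical_index: int) -> int:
    # Mirrors getEntityIDForOpcode: section base + canonical opcode index.
    return layout.opcode_base + canonical_index


def entity_id_for_common_operand(layout: Layout, operand_index: int) -> int:
    # Mirrors getEntityIDForCommonOperand: section base + operand-type index.
    return layout.common_operand_base + operand_index


def entity_id_for_register(layout: Layout, local_index: int, *,
                           valid: bool = True, stack: bool = False,
                           physical: bool = False) -> int:
    # Mirrors getEntityIDForRegister: invalid or stack registers collapse to
    # the first virtual-register entity; otherwise pick the section by kind.
    if not valid or stack:
        return layout.virt_reg_base
    base = layout.phy_reg_base if physical else layout.virt_reg_base
    return base + local_index


# Illustrative offsets only; real values depend on the target's vocabulary.
layout = Layout(opcode_base=0, common_operand_base=100,
                phy_reg_base=110, virt_reg_base=150)
print(entity_id_for_opcode(layout, 5))
print(entity_id_for_register(layout, 3, physical=True))
print(entity_id_for_register(layout, 3))
print(entity_id_for_register(layout, 3, valid=False))
```

One consequence of the fallback branch, visible in both the C++ and this model, is that an invalid or stack register receives the same entity ID as the virtual register class at local index 0.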
diff --git a/llvm/test/tools/llvm-ir2vec/entities.mir b/llvm/test/tools/llvm-ir2vec/entities.mir
new file mode 100644
index 0000000000000..60d9c7a783c4c
--- /dev/null
+++ b/llvm/test/tools/llvm-ir2vec/entities.mir
@@ -0,0 +1,28 @@
+# REQUIRES: x86_64-linux
+# RUN: llvm-ir2vec entities --mode=mir %s -o 2>&1 %t1.log
+# RUN: diff %S/output/reference_x86_entities.txt %t1.log
+
+--- |
+  target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+  target triple = "x86_64-unknown-linux-gnu"
+  
+  define dso_local noundef i32 @test_function(i32 noundef %a) {
+  entry:
+    ret i32 %a
+  }
+...
+---
+name:            test_function
+alignment:       16
+tracksRegLiveness: true
+registers:
+  - { id: 0, class: gr32 }
+liveins:
+  - { reg: '$edi', virtual-reg: '%0' }
+body:             |
+  bb.0.entry:
+    liveins: $edi
+  
+    %0:gr32 = COPY $edi
+    $eax = COPY %0
+    RET 0, $eax
diff --git a/llvm/test/tools/llvm-ir2vec/output/lit.local.cfg b/llvm/test/tools/llvm-ir2vec/output/lit.local.cfg
new file mode 100644
index 0000000000000..2406f19eebcdd
--- /dev/null
+++ b/llvm/test/tools/llvm-ir2vec/output/lit.local.cfg
@@ -0,0 +1,3 @@
+# Don't treat files in this directory as tests
+# These are reference data files, not test scripts
+config.suffixes = []
diff --git a/llvm/test/tools/llvm-ir2vec/output/reference_triplets.txt b/llvm/test/tools/llvm-ir2vec/output/reference_triplets.txt
new file mode 100644
index 0000000000000..dfbac4ce0c4d3
--- /dev/null
+++ b/llvm/test/tools/llvm-ir2vec/output/reference_triplets.txt
@@ -0,0 +1,33 @@
+MAX_RELATION=4
+187	7072	1
+187	6968	2
+187	187	0
+187	7072	1
+187	6969	2
+187	10	0
+10	7072	1
+10	7072	2
+10	7072	3
+10	6961	4
+10	187	0
+187	6952	1
+187	7072	2
+187	1555	0
+1555	6882	1
+1555	6952	2
+187	7072	1
+187	6968	2
+187	187	0
+187	7072	1
+187	6969	2
+187	601	0
+601	7072	1
+601	7072	2
+601	7072	3
+601	6961	4
+601	187	0
+187	6952	1
+187	7072	2
+187	1555	0
+1555	6882	1
+1555	6952	2
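
For readers who want to inspect a triplet file such as the reference output above, here is a minimal Python sketch that parses the `MAX_RELATION` header and the `head tail relation` rows, then names each relation using the numbering from the documentation change in this PR (MIR: 0 is Next, 1 and up are Arg0, Arg1, ...; LLVM IR: 0 is Type, 1 is Next, 2 and up are Arg0, Arg1, ...). The function names `parse_triplets` and `relation_name` are illustrative, not part of any LLVM tooling.

```python
from typing import List, Tuple

Triplet = Tuple[int, int, int]  # (head entity ID, tail entity ID, relation ID)


def parse_triplets(text: str) -> Tuple[int, List[Triplet]]:
    """Parse triplet-mode output: a MAX_RELATION header, then one triplet per line."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty triplet output")
    key, sep, value = lines[0].partition("=")
    if key != "MAX_RELATION" or not sep:
        raise ValueError(f"unexpected header: {lines[0]!r}")
    max_relation = int(value)

    triplets: List[Triplet] = []
    for line in lines[1:]:
        # Fields are whitespace-separated (tabs in the reference file).
        head, tail, relation = (int(field) for field in line.split())
        triplets.append((head, tail, relation))
    return max_relation, triplets


def relation_name(relation_id: int, mode: str = "mir") -> str:
    """Map a relation ID to a readable name for the given mode."""
    fixed = {"llvm": ["Type", "Next"], "mir": ["Next"]}[mode]
    if relation_id < 0:
        raise ValueError("relation IDs are non-negative")
    if relation_id < len(fixed):
        return fixed[relation_id]
    # Everything past the fixed relations is an argument slot.
    return f"Arg{relation_id - len(fixed)}"


# First three rows of the reference file above.
sample = "MAX_RELATION=4\n187\t7072\t1\n187\t6968\t2\n187\t187\t0\n"
max_relation, triplets = parse_triplets(sample)
for head, tail, relation in triplets:
    print(head, relation_name(relation), tail)
```

In the x86 entity listing that follows, ID 187 is `COPY`, so the third sample row reads as a `COPY` whose next instruction is another `COPY`.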
diff --git a/llvm/test/tools/llvm-ir2vec/output/reference_x86_entities.txt b/llvm/test/tools/llvm-ir2vec/output/reference_x86_entities.txt
new file mode 100644
index 0000000000000..dc436d123fd35
--- /dev/null
+++ b/llvm/test/tools/llvm-ir2vec/output/reference_x86_entities.txt
@@ -0,0 +1,7174 @@
+7173
+AAA	0
+AAD	1
+AADD	2
+AAM	3
+AAND	4
+AAS	5
+ABS_F	6
+ABS_Fp	7
+ADC	8
+ADCX	9
+ADD	10
+ADDPDrm	11
+ADDPDrr	12
+ADDPSrm	13
+ADDPSrr	14
+ADDR	15
+ADDSDrm	16
+ADDSDrm_Int	17
+ADDSDrr	18
+ADDSDrr_Int	19
+ADDSSrm	20
+ADDSSrm_Int	21
+ADDSSrr	22
+ADDSSrr_Int	23
+ADDSUBPDrm	24
+ADDSUBPDrr	25
+ADDSUBPSrm	26
+ADDSUBPSrr	27
+ADD_F	28
+ADD_FI	29
+ADD_FPrST	30
+ADD_FST	31
+ADD_Fp	32
+ADD_FpI	33
+ADD_FrST	34
+ADJCALLSTACKDOWN	35
+ADJCALLSTACKUP	36
+ADOX	37
+AESDEC	38
+AESDECLASTrm	39
+AESDECLASTrr	40
+AESDECWIDE	41
+AESDECrm	42
+AESDECrr	43
+AESENC	44
+AESENCLASTrm	45
+AESENCLASTrr	46
+AESENCWIDE	47
+AESENCrm	48
+AESENCrr	49
+AESIMCrm	50
+AESIMCrr	51
+AESKEYGENASSISTrmi	52
+AESKEYGENASSISTrri	53
+AND	54
+ANDN	55
+ANDNPDrm	56
+ANDNPDrr	57
+ANDNPSrm	58
+ANDNPSrr	59
+ANDPDrm	60
+ANDPDrr	61
+ANDPSrm	62
+ANDPSrr	63
+ANNOTATION_LABEL	64
+AOR	65
+ARITH_FENCE	66
+ARPL	67
+ASAN_CHECK_MEMACCESS	68
+AVX	69
+AVX_SET	70
+AXOR	71
+BEXTR	72
+BEXTRI	73
+BLCFILL	74
+BLCI	75
+BLCIC	76
+BLCMSK	77
+BLCS	78
+BLENDPDrmi	79
+BLENDPDrri	80
+BLENDPSrmi	81
+BLENDPSrri	82
+BLENDVPDrm	83
+BLENDVPDrr	84
+BLENDVPSrm	85
+BLENDVPSrr	86
+BLSFILL	87
+BLSI	88
+BLSIC	89
+BLSMSK	90
+BLSR	91
+BOUNDS	92
+BSF	93
+BSR	94
+BSWAP	95
+BT	96
+BTC	97
+BTR	98
+BTS	99
+BUNDLE	100
+BZHI	101
+CALL	102
+CALLpcrel	103
+CATCHRET	104
+CBW	105
+CCMP	106
+CDQ	107
+CDQE	108
+CFCMOV	109
+CFI_INSTRUCTION	110
+CHS_F	111
+CHS_Fp	112
+CLAC	113
+CLC	114
+CLD	115
+CLDEMOTE	116
+CLEANUPRET	117
+CLFLUSH	118
+CLFLUSHOPT	119
+CLGI	120
+CLI	121
+CLRSSBSY	122
+CLTS	123
+CLUI	124
+CLWB	125
+CLZERO	126
+CMC	127
+CMOV	128
+CMOVBE_F	129
+CMOVBE_Fp	130
+CMOVB_F	131
+CMOVB_Fp	132
+CMOVE_F	133
+CMOVE_Fp	134
+CMOVNBE_F	135
+CMOVNBE_Fp	136
+CMOVNB_F	137
+CMOVNB_Fp	138
+CMOVNE_F	139
+CMOVNE_Fp	140
+CMOVNP_F	141
+CMOVNP_Fp	142
+CMOVP_F	143
+CMOVP_Fp	144
+CMOV_FR	145
+CMOV_GR	146
+CMOV_RFP	147
+CMOV_VK	148
+CMOV_VR	149
+CMP	150
+CMPCCXADDmr	151
+CMPPDrmi	152
+CMPPDrri	153
+CMPPSrmi	154
+CMPPSrri	155
+CMPSB	156
+CMPSDrmi	157
+CMPSDrmi_Int	158
+CMPSDrri	159
+CMPSDrri_Int	160
+CMPSL	161
+CMPSQ	162
+CMPSSrmi	163
+CMPSSrmi_Int	164
+CMPSSrri	165
+CMPSSrri_Int	166
+CMPSW	167
+CMPXCHG	168
+COMISDrm	169
+COMISDrm_Int	170
+COMISDrr	171
+COMISDrr_Int	172
+COMISSrm	173
+COMISSrm_Int	174
+COMISSrr	175
+COMISSrr_Int	176
+COMP_FST	177
+COM_FIPr	178
+COM_FIr	179
+COM_FST	180
+COM_FpIr	181
+COM_Fpr	182
+CONVERGENCECTRL_ANCHOR	183
+CONVERGENCECTRL_ENTRY	184
+CONVERGENCECTRL_GLUE	185
+CONVERGENCECTRL_LOOP	186
+COPY	187
+COPY_TO_REGCLASS	188
+CPUID	189
+CQO	190
+CRC	191
+CS_PREFIX	192
+CTEST	193
+CVTDQ	194
+CVTPD	195
+CVTPS	196
+CVTSD	197
+CVTSI	198
+CVTSS	199
+CVTTPD	200
+CVTTPS	201
+CVTTSD	202
+CVTTSS	203
+CWD	204
+CWDE	205
+DAA	206
+DAS	207
+DATA	208
+DBG_INSTR_REF	209
+DBG_LABEL	210
+DBG_PHI	211
+DBG_VALUE	212
+DBG_VALUE_LIST	213
+DEC	214
+DIV	215
+DIVPDrm	216
+DIVPDrr	217
+DIVPSrm	218
+DIVPSrr	219
+DIVR_F	220
+DIVR_FI	221
+DIVR_FPrST	222
+DIVR_FST	223
+DIVR_Fp	224
+DIVR_FpI	225
+DIVR_FrST	226
+DIVSDrm	227
+DIVSDrm_Int	228
+DIVSDrr	229
+DIVSDrr_Int	230
+DIVSSrm	231
+DIVSSrm_Int	232
+DIVSSrr	233
+DIVSSrr_Int	234
+DIV_F	235
+DIV_FI	236
+DIV_FPrST	237
+DIV_FST	238
+DIV_Fp	239
+DIV_FpI	240
+DIV_FrST	241
+DPPDrmi	242
+DPPDrri	243
+DPPSrmi	244
+DPPSrri	245
+DS_PREFIX	246
+DYN_ALLOCA	247
+EH_LABEL	248
+EH_RETURN	249
+EH_SjLj_LongJmp	250
+EH_SjLj_SetJmp	251
+EH_SjLj_Setup	252
+ENCLS	253
+ENCLU	254
+ENCLV	255
+ENCODEKEY	256
+ENDBR	257
+ENQCMD	258
+ENQCMDS	259
+ENTER	260
+ERETS	261
+ERETU	262
+ES_PREFIX	263
+EXTRACTPSmri	264
+EXTRACTPSrri	265
+EXTRACT_SUBREG	266
+EXTRQ	267
+EXTRQI	268
+F	269
+FAKE_USE	270
+FARCALL	271
+FARJMP	272
+FAULTING_OP	273
+FBLDm	274
+FBSTPm	275
+FCOM	276
+FCOMP	277
+FCOMPP	278
+FCOS	279
+FDECSTP	280
+FEMMS	281
+FENTRY_CALL	282
+FFREE	283
+FFREEP	284
+FICOM	285
+FICOMP	286
+FINCSTP	287
+FLDCW	288
+FLDENVm	289
+FLDL	290
+FLDLG	291
+FLDLN	292
+FLDPI	293
+FNCLEX	294
+FNINIT	295
+FNOP	296
+FNSTCW	297
+FNSTSW	298
+FNSTSWm	299
+FP	300
+FPATAN	301
+FPREM	302
+FPTAN	303
+FRNDINT	304
+FRSTORm	305
+FSAVEm	306
+FSCALE	307
+FSIN	308
+FSINCOS	309
+FSTENVm	310
+FS_PREFIX	311
+FXRSTOR	312
+FXSAVE	313
+FXTRACT	314
+FYL	315
+FsFLD	316
+GC_LABEL	317
+GETSEC	318
+GF	319
+GS_PREFIX	320
+G_ABDS	321
+G_ABDU	322
+G_ABS	323
+G_ADD	324
+G_ADDRSPACE_CAST	325
+G_AND	326
+G_ANYEXT	327
+G_ASHR	328
+G_ASSERT_ALIGN	329
+G_ASSERT_SEXT	330
+G_ASSERT_ZEXT	331
+G_ATOMICRMW_ADD	332
+G_ATOMICRMW_AND	333
+G_ATOMICRMW_FADD	334
+G_ATOMICRMW_FMAX	335
+G_ATOMICRMW_FMAXIMUM	336
+G_ATOMICRMW_FMIN	337
+G_ATOMICRMW_FMINIMUM	338
+G_ATOMICRMW_FSUB	339
+G_ATOMICRMW_MAX	340
+G_ATOMICRMW_MIN	341
+G_ATOMICRMW_NAND	342
+G_ATOMICRMW_OR	343
+G_ATOMICRMW_SUB	344
+G_ATOMICRMW_UDEC_WRAP	345
+G_ATOMICRMW_UINC_WRAP	346
+G_ATOMICRMW_UMAX	347
+G_ATOMICRMW_UMIN	348
+G_ATOMICRMW_USUB_COND	349
+G_ATOMICRMW_USUB_SAT	350
+G_ATOMICRMW_XCHG	351
+G_ATOMICRMW_XOR	352
+G_ATOMIC_CMPXCHG	353
+G_ATOMIC_CMPXCHG_WITH_SUCCESS	354
+G_BITCAST	355
+G_BITREVERSE	356
+G_BLOCK_ADDR	357
+G_BR	358
+G_BRCOND	359
+G_BRINDIRECT	360
+G_BRJT	361
+G_BSWAP	362
+G_BUILD_VECTOR	363
+G_BUILD_VECTOR_TRUNC	364
+G_BZERO	365
+G_CONCAT_VECTORS	366
+G_CONSTANT	367
+G_CONSTANT_FOLD_BARRIER	368
+G_CONSTANT_POOL	369
+G_CTLZ	370
+G_CTLZ_ZERO_UNDEF	371
+G_CTPOP	372
+G_CTTZ	373
+G_CTTZ_ZERO_UNDEF	374
+G_DEBUGTRAP	375
+G_DYN_STACKALLOC	376
+G_EXTRACT	377
+G_EXTRACT_SUBVECTOR	378
+G_EXTRACT_VECTOR_ELT	379
+G_FABS	380
+G_FACOS	381
+G_FADD	382
+G_FASIN	383
+G_FATAN	384
+G_FCANONICALIZE	385
+G_FCEIL	386
+G_FCMP	387
+G_FCONSTANT	388
+G_FCOPYSIGN	389
+G_FCOS	390
+G_FCOSH	391
+G_FDIV	392
+G_FENCE	393
+G_FEXP	394
+G_FFLOOR	395
+G_FFREXP	396
+G_FILD	397
+G_FIST	398
+G_FLDCW	399
+G_FLDEXP	400
+G_FLOG	401
+G_FMA	402
+G_FMAD	403
+G_FMAXIMUM	404
+G_FMAXIMUMNUM	405
+G_FMAXNUM	406
+G_FMAXNUM_IEEE	407
+G_FMINIMUM	408
+G_FMINIMUMNUM	409
+G_FMINNUM	410
+G_FMINNUM_IEEE	411
+G_FMODF	412
+G_FMUL	413
+G_FNEARBYINT	414
+G_FNEG	415
+G_FNSTCW	416
+G_FPEXT	417
+G_FPOW	418
+G_FPOWI	419
+G_FPTOSI	420
+G_FPTOSI_SAT	421
+G_FPTOUI	422
+G_FPTOUI_SAT	423
+G_FPTRUNC	424
+G_FRAME_INDEX	425
+G_FREEZE	426
+G_FREM	427
+G_FRINT	428
+G_FSHL	429
+G_FSHR	430
+G_FSIN	431
+G_FSINCOS	432
+G_FSINH	433
+G_FSQRT	434
+G_FSUB	435
+G_FTAN	436
+G_FTANH	437
+G_GET_FPENV	438
+G_GET_FPMODE	439
+G_GET_ROUNDING	440
+G_GLOBAL_VALUE	441
+G_ICMP	442
+G_IMPLICIT_DEF	443
+G_INDEXED_LOAD	444
+G_INDEXED_SEXTLOAD	445
+G_INDEXED_STORE	446
+G_INDEXED_ZEXTLOAD	447
+G_INSERT	448
+G_INSERT_SUBVECTOR	449
+G_INSERT_VECTOR_ELT	450
+G_INTRINSIC	451
+G_INTRINSIC_CONVERGENT	452
+G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS	453
+G_INTRINSIC_FPTRUNC_ROUND	454
+G_INTRINSIC_LLRINT	455
+G_INTRINSIC_LRINT	456
+G_INTRINSIC_ROUND	457
+G_INTRINSIC_ROUNDEVEN	458
+G_INTRINSIC_TRUNC	459
+G_INTRINSIC_W_SIDE_EFFECTS	460
+G_INTTOPTR	461
+G_INVOKE_REGION_START	462
+G_IS_FPCLASS	463
+G_JUMP_TABLE	464
+G_LLROUND	465
+G_LOAD	466
+G_LROUND	467
+G_LSHR	468
+G_MEMCPY	469
+G_MEMCPY_INLINE	470
+G_MEMMOVE	471
+G_MEMSET	472
+G_MERGE_VALUES	473
+G_MUL	474
+G_OR	475
+G_PHI	476
+G_PREFETCH	477
+G_PTRAUTH_GLOBAL_VALUE	478
+G_PTRMASK	479
+G_PTRTOINT	480
+G_PTR_ADD	481
+G_READCYCLECOUNTER	482
+G_READSTEADYCOUNTER	483
+G_READ_REGISTER	484
+G_RESET_FPENV	485
+G_RESET_FPMODE	486
+G_ROTL	487
+G_ROTR	488
+G_SADDE	489
+G_SADDO	490
+G_SADDSAT	491
+G_SBFX	492
+G_SCMP	493
+G_SDIV	494
+G_SDIVFIX	495
+G_SDIVFIXSAT	496
+G_SDIVREM	497
+G_SELECT	498
+G_SET_FPENV	499
+G_SET_FPMODE	500
+G_SET_ROUNDING	501
+G_SEXT	502
+G_SEXTLOAD	503
+G_SEXT_INREG	504
+G_SHL	505
+G_SHUFFLE_VECTOR	506
+G_SITOFP	507
+G_SMAX	508
+G_SMIN	509
+G_SMULFIX	510
+G_SMULFIXSAT	511
+G_SMULH	512
+G_SMULO	513
+G_SPLAT_VECTOR	514
+G_SREM	515
+G_SSHLSAT	516
+G_SSUBE	517
+G_SSUBO	518
+G_SSUBSAT	519
+G_STACKRESTORE	520
+G_STACKSAVE	521
+G_STEP_VECTOR	522
+G_STORE	523
+G_STRICT_FADD	524
+G_STRICT_FDIV	525
+G_STRICT_FLDEXP	526
+G_STRICT_FMA	527
+G_STRICT_FMUL	528
+G_STRICT_FREM	529
+G_STRICT_FSQRT	530
+G_STRICT_FSUB	531
+G_SUB	532
+G_TRAP	533
+G_TRUNC	534
+G_TRUNC_SSAT_S	535
+G_TRUNC_SSAT_U	536
+G_TRUNC_USAT_U	537
+G_UADDE	538
+G_UADDO	539
+G_UADDSAT	540
+G_UBFX	541
+G_UBSANTRAP	542
+G_UCMP	543
+G_UDIV	544
+G_UDIVFIX	545
+G_UDIVFIXSAT	546
+G_UDIVREM	547
+G_UITOFP	548
+G_UMAX	549
+G_UMIN	550
+G_UMULFIX	551
+G_UMULFIXSAT	552
+G_UMULH	553
+G_UMULO	554
+G_UNMERGE_VALUES	555
+G_UREM	556
+G_USHLSAT	557
+G_USUBE	558
+G_USUBO	559
+G_USUBSAT	560
+G_VAARG	561
+G_VASTART	562
+G_VECREDUCE_ADD	563
+G_VECREDUCE_AND	564
+G_VECREDUCE_FADD	565
+G_VECREDUCE_FMAX	566
+G_VECREDUCE_FMAXIMUM	567
+G_VECREDUCE_FMIN	568
+G_VECREDUCE_FMINIMUM	569
+G_VECREDUCE_FMUL	570
+G_VECREDUCE_MUL	571
+G_VECREDUCE_OR	572
+G_VECREDUCE_SEQ_FADD	573
+G_VECREDUCE_SEQ_FMUL	574
+G_VECREDUCE_SMAX	575
+G_VECREDUCE_SMIN	576
+G_VECREDUCE_UMAX	577
+G_VECREDUCE_UMIN	578
+G_VECREDUCE_XOR	579
+G_VECTOR_COMPRESS	580
+G_VSCALE	581
+G_WRITE_REGISTER	582
+G_XOR	583
+G_ZEXT	584
+G_ZEXTLOAD	585
+HADDPDrm	586
+HADDPDrr	587
+HADDPSrm	588
+HADDPSrr	589
+HLT	590
+HRESET	591
+HSUBPDrm	592
+HSUBPDrr	593
+HSUBPSrm	594
+HSUBPSrr	595
+ICALL_BRANCH_FUNNEL	596
+IDIV	597
+ILD_F	598
+ILD_Fp	599
+IMPLICIT_DEF	600
+IMUL	601
+IMULZU	602
+IN	603
+INC	604
+INCSSPD	605
+INCSSPQ	606
+INDIRECT_THUNK_CALL	607
+INDIRECT_THUNK_TCRETURN	608
+INIT_UNDEF	609
+INLINEASM	610
+INLINEASM_BR	611
+INSB	612
+INSERTPSrmi	613
+INSERTPSrri	614
+INSERTQ	615
+INSERTQI	616
+INSERT_SUBREG	617
+INSL	618
+INSW	619
+INT	620
+INTO	621
+INVD	622
+INVEPT	623
+INVLPG	624
+INVLPGA	625
+INVLPGB	626
+INVPCID	627
+INVVPID	628
+IRET	629
+ISTT_FP	630
+ISTT_Fp	631
+IST_F	632
+IST_FP	633
+IST_Fp	634
+Int_eh_sjlj_setup_dispatch	635
+JCC	636
+JCXZ	637
+JEC...
[truncated]
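
The entity file shown above follows the format described in the documentation change: a first line with the total entity count, then one tab-separated `name<TAB>id` pair per line. A small Python sketch for loading such a file into an ID-to-name table is given below. The function name `parse_entities` is hypothetical, and the three-entry sample (including its header count of 3) is constructed from the first rows of the listing purely for illustration, since the full file is truncated here.

```python
from typing import Dict


def parse_entities(text: str) -> Dict[int, str]:
    """Parse entity-mode output into an {id: name} mapping."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty entity output")
    declared = int(lines[0])  # first line: total number of entities

    id_to_name: Dict[int, str] = {}
    for line in lines[1:]:
        # Split on the last tab so the name is everything before the ID.
        name, sep, ident = line.rpartition("\t")
        if not sep:
            raise ValueError(f"expected tab-separated entry: {line!r}")
        id_to_name[int(ident)] = name

    if len(id_to_name) != declared:
        raise ValueError(
            f"header declares {declared} entities, found {len(id_to_name)}")
    return id_to_name


# Illustrative sample: the first three entries of the x86 listing, with a
# matching count written by hand.
sample = "3\nAAA\t0\nAAD\t1\nAADD\t2\n"
id_to_name = parse_entities(sample)
print(id_to_name[2])
```

Combined with a triplet file, a table like this lets head and tail IDs be printed as opcode or operand names, which can help when sanity-checking generated training data.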

llvm/include/llvm/CodeGen/MIR2Vec.h

svkeerthy · 2025-10-23T17:50:05Z

Merge activity

Oct 23, 5:50 PM UTC: A user started a stack merge that includes this pull request via Graphite.
Oct 23, 5:51 PM UTC: @svkeerthy merged this pull request with Graphite.

…ration (llvm#164329)

Add support for Machine IR (MIR) triplet and entity generation in llvm-ir2vec.

This change extends llvm-ir2vec to support Machine IR (MIR) in addition to LLVM IR, enabling the generation of training data for MIR2Vec embeddings. MIR2Vec provides machine-level code embeddings that capture target-specific instruction semantics, complementing the target-independent IR2Vec embeddings.

- Extended llvm-ir2vec to support triplet and entity generation for Machine IR (MIR)
- Added `--mode=mir` option to specify MIR mode (vs LLVM IR mode)
- Implemented MIR triplet generation with Next and Arg relationships
- Added entity mapping generation for MIR vocabulary
- Updated documentation to explain MIR-specific features and usage

(Partially addresses llvm#162200 ; Tracking issue - llvm#141817)

svkeerthy mentioned this pull request Oct 20, 2025

[NFC][llvm-ir2vec] Standardize error message format using WithColor #164032

Merged

svkeerthy changed the title ~~Triplet and Entities generation for MIR2Vec~~ [llvm-ir2vec] Supporting MIR mode for triplet and entity generation Oct 20, 2025

svkeerthy changed the title ~~[llvm-ir2vec] Supporting MIR mode for triplet and entity generation~~ [llvm-ir2vec] Supporting MIR mode in triplet and entity generation Oct 20, 2025

svkeerthy marked this pull request as ready for review October 20, 2025 22:33

llvmbot added mlgo llvm:binary-utilities labels Oct 20, 2025

svkeerthy requested review from boomanaiden154 and mtrofin October 20, 2025 22:33

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch from 65fe880 to 2c5f2d3 Compare October 20, 2025 22:35

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from 8a4c8cb to 3ed5648 Compare October 20, 2025 22:35

svkeerthy mentioned this pull request Oct 20, 2025

[MIR2Vec] Add MIR support to triplet generator script #164332

Merged

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch from 2c5f2d3 to 869c0a3 Compare October 20, 2025 23:24

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from 3ed5648 to 669ca87 Compare October 20, 2025 23:24

svkeerthy changed the title ~~[llvm-ir2vec] Supporting MIR mode in triplet and entity generation~~ [llvm-ir2vec][MIR2Vec] Supporting MIR mode in triplet and entity generation Oct 20, 2025

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from 669ca87 to ea491e0 Compare October 20, 2025 23:48

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch from 869c0a3 to 1cd5b76 Compare October 20, 2025 23:48

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from ea491e0 to 28474c5 Compare October 21, 2025 00:23

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch from 1cd5b76 to 3d9c8cd Compare October 21, 2025 00:23

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from d2e75ac to 9f51211 Compare October 21, 2025 21:47

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch from 9b54ed5 to dcf282b Compare October 21, 2025 22:58

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch 2 times, most recently from d861f38 to 71f67d8 Compare October 22, 2025 00:13

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch from dcf282b to 455f3f6 Compare October 22, 2025 00:13

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from 71f67d8 to fb647bc Compare October 22, 2025 18:01

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch 2 times, most recently from 8fc9227 to dc8a7f5 Compare October 22, 2025 21:11

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch 2 times, most recently from 10aebbf to 6fa486e Compare October 22, 2025 21:53

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch 2 times, most recently from 4a866bc to 71e0e55 Compare October 22, 2025 22:28

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from 6fa486e to 21846c7 Compare October 22, 2025 22:29

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch from 71e0e55 to 101743e Compare October 22, 2025 22:51

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from 21846c7 to 2d67499 Compare October 22, 2025 22:51

svkeerthy force-pushed the users/svkeerthy/10-17-update_mlgo_doc branch from 101743e to e6a125d Compare October 22, 2025 23:26

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch 2 times, most recently from c46a27b to 37ec3b4 Compare October 22, 2025 23:29

Base automatically changed from users/svkeerthy/10-17-update_mlgo_doc to main October 22, 2025 23:33

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from 37ec3b4 to c790ee7 Compare October 22, 2025 23:35

mtrofin reviewed Oct 23, 2025

llvm/include/llvm/CodeGen/MIR2Vec.h (outdated, resolved)

mtrofin approved these changes Oct 23, 2025

Triplet and Entities generation for MIR2Vec

49a6c69

svkeerthy force-pushed the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch from c790ee7 to 49a6c69 Compare October 23, 2025 17:23

svkeerthy merged commit 10bec2c into main Oct 23, 2025
10 of 11 checks passed

svkeerthy deleted the users/svkeerthy/10-20-triplet_and_entities_generation_for_mir2vec branch October 23, 2025 17:51
