88 changes: 69 additions & 19 deletions llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -68,32 +68,52 @@ these two modes are used to generate the triplets and entity mappings.
Triplet Generation
~~~~~~~~~~~~~~~~~~

With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts
numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR or Machine IR
and extracts numeric triplets consisting of opcode IDs and operand IDs. These triplets
are generated in the standard format used for knowledge graph embedding training.
The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
infrastructure, eliminating the need for string-to-ID preprocessing.
The tool outputs numeric IDs directly using the vocabulary mapping infrastructure,
eliminating the need for string-to-ID preprocessing.

Usage:
Usage for LLVM IR:

.. code-block:: bash
llvm-ir2vec triplets input.bc -o triplets_train2id.txt
llvm-ir2vec triplets --mode=llvm input.bc -o triplets_train2id.txt
Usage for Machine IR:

.. code-block:: bash
llvm-ir2vec triplets --mode=mir input.mir -o triplets_train2id.txt
Entity Mapping Generation
~~~~~~~~~~~~~~~~~~~~~~~~~

With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
supported by IR2Vec in the standard format used for knowledge graph embedding
training. This subcommand outputs all supported entities (opcodes, types, and
operands) with their corresponding numeric IDs, and is not specific for an
LLVM IR file.
supported by IR2Vec or MIR2Vec in the standard format used for knowledge graph embedding
training. This subcommand outputs all supported entities with their corresponding numeric IDs.

For LLVM IR, entities include opcodes, types, and operands. For Machine IR, entities include
machine opcodes, common operands, and register classes (both physical and virtual).

Usage for LLVM IR:

Usage:
.. code-block:: bash
llvm-ir2vec entities --mode=llvm -o entity2id.txt
Usage for Machine IR:

.. code-block:: bash
llvm-ir2vec entities -o entity2id.txt
llvm-ir2vec entities --mode=mir input.mir -o entity2id.txt
.. note::

For LLVM IR mode, the entity mapping is target-independent and does not require an input file.
For Machine IR mode, an input .mir file is required to determine the target architecture,
as entity mappings vary by target (different architectures have different instruction sets
and register classes).

Embedding Generation
~~~~~~~~~~~~~~~~~~~~
@@ -222,12 +242,17 @@ Subcommand-specific options:

.. option:: <input-file>

The input LLVM IR or bitcode file to process. This positional argument is
required for the `triplets` subcommand.
The input LLVM IR/bitcode file (.ll/.bc) or Machine IR file (.mir) to process.
This positional argument is required for the `triplets` subcommand.

**entities** subcommand:

No subcommand-specific options.
.. option:: <input-file>

The input Machine IR file (.mir) to process. This positional argument is required
for the `entities` subcommand when using ``--mode=mir``, as the entity mappings
are target-specific. For ``--mode=llvm``, no input file is required as IR2Vec
entity mappings are target-independent.

OUTPUT FORMAT
-------------
@@ -240,19 +265,37 @@ metadata headers. The format includes:

.. code-block:: text
MAX_RELATIONS=<max_relations_count>
MAX_RELATION=<max_relation_count>
<head_entity_id> <tail_entity_id> <relation_id>
<head_entity_id> <tail_entity_id> <relation_id>
...
Each line after the metadata header represents one instruction relationship,
with numeric IDs for head entity, relation, and tail entity. The metadata
header (MAX_RELATIONS) provides counts for post-processing and training setup.
with numeric IDs for head entity, tail entity, and relation type. The metadata
header (MAX_RELATION) indicates the maximum relation ID used.
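
The triplet format described above can be read back with a few lines of C++. The following is a minimal sketch, not part of the tool: the names ``Triplet``, ``TripletFile``, and ``parseTriplets`` are hypothetical, and the check that no relation ID exceeds ``MAX_RELATION`` is an assumption drawn from the description of the header.

```cpp
#include <exception>
#include <istream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// One triplet line: "<head_entity_id> <tail_entity_id> <relation_id>".
struct Triplet {
  unsigned Head;
  unsigned Tail;
  unsigned Relation;
};

// Parsed contents of a triplets file (hypothetical helper type).
struct TripletFile {
  unsigned MaxRelation = 0;
  std::vector<Triplet> Triplets;
};

// Reads the MAX_RELATION header and the triplet lines that follow it.
// Returns std::nullopt if the header is missing or a line is malformed.
std::optional<TripletFile> parseTriplets(std::istream &In) {
  const std::string Prefix = "MAX_RELATION=";
  std::string Line;
  if (!std::getline(In, Line) || Line.compare(0, Prefix.size(), Prefix) != 0)
    return std::nullopt;

  TripletFile Result;
  try {
    Result.MaxRelation =
        static_cast<unsigned>(std::stoul(Line.substr(Prefix.size())));
  } catch (const std::exception &) {
    return std::nullopt;
  }

  while (std::getline(In, Line)) {
    if (Line.empty())
      continue;
    std::istringstream Fields(Line);
    Triplet T;
    if (!(Fields >> T.Head >> T.Tail >> T.Relation))
      return std::nullopt;
    // Assumption: the header is the largest relation ID in the file.
    if (T.Relation > Result.MaxRelation)
      return std::nullopt;
    Result.Triplets.push_back(T);
  }
  return Result;
}
```

Passing a ``std::ifstream`` opened on ``triplets_train2id.txt`` to ``parseTriplets`` would yield the header value and the list of triplets, or an empty optional if the file does not match the format.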

**Relation Types:**

For LLVM IR (IR2Vec):

* **0** = Type relationship (instruction to its type)
* **1** = Next relationship (sequential instructions)
* **2+** = Argument relationships (Arg0, Arg1, Arg2, ...)

For Machine IR (MIR2Vec):

* **0** = Next relationship (sequential instructions)
* **1+** = Argument relationships (Arg0, Arg1, Arg2, ...)
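
The two numbering schemes above differ only in whether the Type relation is present. As an illustration, the sketch below maps a relation ID to a readable name for either mode. The ``Mode`` enum and ``relationName`` function are hypothetical helpers, not part of :program:`llvm-ir2vec`.

```cpp
#include <string>

// Which vocabulary produced the triplets (hypothetical helper enum).
enum class Mode { LLVM, MIR };

// Maps a numeric relation ID to a readable name, following the tables above:
// LLVM IR uses 0 = Type, 1 = Next, 2+ = Arg0, Arg1, ...
// Machine IR uses 0 = Next, 1+ = Arg0, Arg1, ...
std::string relationName(Mode M, unsigned RelationID) {
  if (M == Mode::LLVM) {
    if (RelationID == 0)
      return "Type";
    if (RelationID == 1)
      return "Next";
    return "Arg" + std::to_string(RelationID - 2);
  }
  if (RelationID == 0)
    return "Next";
  return "Arg" + std::to_string(RelationID - 1);
}
```

Under this mapping, relation ID 2 names ``Arg0`` in LLVM IR mode but ``Arg1`` in Machine IR mode, so triplet files from the two modes should not be mixed without recording which mode produced them.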

**Entity IDs:**

For LLVM IR: Entity IDs represent opcodes, types, and operands as defined by the IR2Vec vocabulary.

For Machine IR: Entity IDs represent machine opcodes, common operands (immediate, frame index, etc.),
physical register classes, and virtual register classes as defined by the MIR2Vec vocabulary. The entity layout is target-specific.
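
One way to picture a target-specific flat entity layout is as four consecutive sections, each starting at a base offset. The sketch below is illustrative only: it assumes the sections appear in the order listed above and that each base is the running sum of the preceding section sizes. ``FlatLayout``, ``makeLayout``, and ``sectionOf`` are hypothetical names, and the section sizes are supplied by the caller rather than taken from any real target.

```cpp
#include <cstddef>
#include <string>

// Hypothetical flat layout: four consecutive sections of entity IDs.
struct FlatLayout {
  std::size_t OpcodeBase = 0;
  std::size_t CommonOperandBase = 0;
  std::size_t PhyRegBase = 0;
  std::size_t VirtRegBase = 0;
  std::size_t TotalEntries = 0;
};

// Assumption: each base is the prefix sum of the section sizes before it.
FlatLayout makeLayout(std::size_t NumOpcodes, std::size_t NumCommonOperands,
                      std::size_t NumPhyRegClasses,
                      std::size_t NumVirtRegClasses) {
  FlatLayout Layout;
  Layout.OpcodeBase = 0;
  Layout.CommonOperandBase = Layout.OpcodeBase + NumOpcodes;
  Layout.PhyRegBase = Layout.CommonOperandBase + NumCommonOperands;
  Layout.VirtRegBase = Layout.PhyRegBase + NumPhyRegClasses;
  Layout.TotalEntries = Layout.VirtRegBase + NumVirtRegClasses;
  return Layout;
}

// Names the section that a flat entity ID falls into.
std::string sectionOf(const FlatLayout &Layout, std::size_t EntityID) {
  if (EntityID >= Layout.TotalEntries)
    return "OutOfRange";
  if (EntityID >= Layout.VirtRegBase)
    return "VirtualRegisterClass";
  if (EntityID >= Layout.PhyRegBase)
    return "PhysicalRegisterClass";
  if (EntityID >= Layout.CommonOperandBase)
    return "CommonOperand";
  return "Opcode";
}
```

Because the section sizes depend on the target, the same numeric ID can denote different entities on different architectures, which is why the entity mapping should be regenerated per target.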

Entity Mode Output
~~~~~~~~~~~~~~~~~~

In entity mode, the output consists of entity mapping in the format:
In entity mode, the output consists of entity mappings in the format:

.. code-block:: text
@@ -264,6 +307,13 @@ In entity mode, the output consists of entity mapping in the format:
The first line contains the total number of entities, followed by one entity
mapping per line with tab-separated entity string and numeric ID.

For LLVM IR, entities include instruction opcodes (e.g., "Add", "Ret"), types
(e.g., "INT", "PTR"), and operand kinds.

For Machine IR, entities include machine opcodes (e.g., "COPY", "ADD"),
common operands (e.g., "Immediate", "FrameIndex"), physical register classes
(e.g., "PhyReg_GR32"), and virtual register classes (e.g., "VirtReg_GR32").

Embedding Mode Output
~~~~~~~~~~~~~~~~~~~~~

38 changes: 38 additions & 0 deletions llvm/include/llvm/CodeGen/MIR2Vec.h
@@ -111,6 +111,11 @@ class MIRVocabulary {
size_t TotalEntries = 0;
} Layout;

// ToDo: See if we can have only one reg classes section instead of physical
// and virtual separate sections in the vocabulary. This would reduce the
// number of vocabulary entities significantly.
// We can potentially distinguish physical and virtual registers by
// considering them as a separate feature.
enum class Section : unsigned {
Opcodes = 0,
CommonOperands = 1,
@@ -185,6 +190,25 @@ class MIRVocabulary {
return Storage[static_cast<unsigned>(SectionID)][LocalIndex];
}

/// Get entity ID (flat index) for a common operand type
/// This is used for triplet generation
unsigned getEntityIDForCommonOperand(
MachineOperand::MachineOperandType OperandType) const {
return Layout.CommonOperandBase + getCommonOperandIndex(OperandType);
}

/// Get entity ID (flat index) for a register
/// This is used for triplet generation
unsigned getEntityIDForRegister(Register Reg) const {
// Return VirtRegBase for invalid/stack registers
if (!Reg.isValid() || Reg.isStack())
return Layout.VirtRegBase;
unsigned LocalIndex = getRegisterOperandIndex(Reg);
size_t BaseOffset =
Reg.isPhysical() ? Layout.PhyRegBase : Layout.VirtRegBase;
return BaseOffset + LocalIndex;
}

public:
/// Static method for extracting base opcode names (public for testing)
static std::string extractBaseOpcodeName(StringRef InstrName);
@@ -201,6 +225,20 @@ class MIRVocabulary {

unsigned getDimension() const { return Storage.getDimension(); }

/// Get entity ID (flat index) for an opcode
/// This is used for triplet generation
unsigned getEntityIDForOpcode(unsigned Opcode) const {
return Layout.OpcodeBase + getCanonicalOpcodeIndex(Opcode);
}

/// Get entity ID (flat index) for a machine operand
/// This is used for triplet generation
unsigned getEntityIDForMachineOperand(const MachineOperand &MO) const {
if (MO.getType() == MachineOperand::MO_Register)
return getEntityIDForRegister(MO.getReg());
return getEntityIDForCommonOperand(MO.getType());
}

// Accessor methods
const Embedding &operator[](unsigned Opcode) const {
unsigned LocalIndex = getCanonicalOpcodeIndex(Opcode);
28 changes: 28 additions & 0 deletions llvm/test/tools/llvm-ir2vec/entities.mir
@@ -0,0 +1,28 @@
# REQUIRES: x86_64-linux
# RUN: llvm-ir2vec entities --mode=mir %s -o 2>&1 %t1.log
# RUN: diff %S/output/reference_x86_entities.txt %t1.log

--- |
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local noundef i32 @test_function(i32 noundef %a) {
entry:
ret i32 %a
}
...
---
name: test_function
alignment: 16
tracksRegLiveness: true
registers:
- { id: 0, class: gr32 }
liveins:
- { reg: '$edi', virtual-reg: '%0' }
body: |
bb.0.entry:
liveins: $edi
%0:gr32 = COPY $edi
$eax = COPY %0
RET 0, $eax
3 changes: 3 additions & 0 deletions llvm/test/tools/llvm-ir2vec/output/lit.local.cfg
@@ -0,0 +1,3 @@
# Don't treat files in this directory as tests
# These are reference data files, not test scripts
config.suffixes = []
33 changes: 33 additions & 0 deletions llvm/test/tools/llvm-ir2vec/output/reference_triplets.txt
@@ -0,0 +1,33 @@
MAX_RELATION=4
187 7072 1
187 6968 2
187 187 0
187 7072 1
187 6969 2
187 10 0
10 7072 1
10 7072 2
10 7072 3
10 6961 4
10 187 0
187 6952 1
187 7072 2
187 1555 0
1555 6882 1
1555 6952 2
187 7072 1
187 6968 2
187 187 0
187 7072 1
187 6969 2
187 601 0
601 7072 1
601 7072 2
601 7072 3
601 6961 4
601 187 0
187 6952 1
187 7072 2
187 1555 0
1555 6882 1
1555 6952 2