Commit b52a038

Merge branch 'main' into sg_distr_minor_fixes
2 parents 88715e1 + f295617 commit b52a038

9 files changed

+626
-1
lines changed

llvm/docs/CommandGuide/index.rst

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -27,6 +27,7 @@ Basic Commands
2727
llvm-dis
2828
llvm-dwarfdump
2929
llvm-dwarfutil
30+
llvm-ir2vec
3031
llvm-lib
3132
llvm-libtool-darwin
3233
llvm-link
Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
1+
llvm-ir2vec - IR2Vec Embedding Generation Tool
2+
==============================================
3+
4+
.. program:: llvm-ir2vec
5+
6+
SYNOPSIS
7+
--------
8+
9+
:program:`llvm-ir2vec` [*options*] *input-file*
10+
11+
DESCRIPTION
12+
-----------
13+
14+
:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
15+
generates IR2Vec embeddings for LLVM IR and supports triplet generation
16+
for vocabulary training. It provides two main operation modes:
17+
18+
1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
19+
training from LLVM IR.
20+
21+
2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
22+
at different granularity levels (instruction, basic block, or function).
23+
24+
The tool is designed to facilitate machine learning applications that work with
25+
LLVM IR by converting the IR into numerical representations that can be used by
26+
ML models.
27+
28+
.. note::
29+
30+
For information about using IR2Vec programmatically within LLVM passes and
31+
the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
32+
section in the MLGO documentation.
33+
34+
OPERATION MODES
35+
---------------
36+
37+
Triplet Generation Mode
38+
~~~~~~~~~~~~~~~~~~~~~~~
39+
40+
In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
41+
consisting of opcodes, types, and operands. These triplets can be used to train
42+
vocabularies for embedding generation.
43+
44+
Usage:
45+
46+
.. code-block:: bash
47+
48+
llvm-ir2vec --mode=triplets input.bc -o triplets.txt
49+
50+
Embedding Generation Mode
51+
~~~~~~~~~~~~~~~~~~~~~~~~~~
52+
53+
In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
54+
generate numerical embeddings for LLVM IR at different levels of granularity.
55+
56+
Usage:
57+
58+
.. code-block:: bash
59+
60+
llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
61+
62+
OPTIONS
63+
-------
64+
65+
.. option:: --mode=<mode>
66+
67+
Specify the operation mode. Valid values are:
68+
69+
* ``triplets`` - Generate triplets for vocabulary training
70+
* ``embeddings`` - Generate embeddings using trained vocabulary (default)
71+
72+
.. option:: --level=<level>
73+
74+
Specify the embedding generation level. Valid values are:
75+
76+
* ``inst`` - Generate instruction-level embeddings
77+
* ``bb`` - Generate basic block-level embeddings
78+
* ``func`` - Generate function-level embeddings (default)
79+
80+
.. option:: --function=<name>
81+
82+
Process only the specified function instead of all functions in the module.
83+
84+
.. option:: --ir2vec-vocab-path=<path>
85+
86+
Specify the path to the vocabulary file (required for embedding mode).
87+
The vocabulary file should be in JSON format and contain the trained
88+
vocabulary for embedding generation. See ``llvm/lib/Analysis/models``
89+
for pre-trained vocabulary files.
90+
91+
.. option:: --ir2vec-opc-weight=<weight>
92+
93+
Specify the weight for opcode embeddings (default: 1.0). This controls
94+
the relative importance of instruction opcodes in the final embedding.
95+
96+
.. option:: --ir2vec-type-weight=<weight>
97+
98+
Specify the weight for type embeddings (default: 0.5). This controls
99+
the relative importance of type information in the final embedding.
100+
101+
.. option:: --ir2vec-arg-weight=<weight>
102+
103+
Specify the weight for argument embeddings (default: 0.2). This controls
104+
the relative importance of operand information in the final embedding.
105+
106+
.. option:: -o <filename>
107+
108+
Specify the output filename. Use ``-`` to write to standard output (default).
109+
110+
.. option:: --help
111+
112+
Print a summary of command line options.
113+
114+
.. note::
115+
116+
``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
117+
``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
118+
mode. These options are ignored in triplet mode.
119+
120+
INPUT FILE FORMAT
121+
-----------------
122+
123+
:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
124+
(``.ll``) as input. The input file should contain valid LLVM IR.
125+
126+
OUTPUT FORMAT
127+
-------------
128+
129+
Triplet Mode Output
130+
~~~~~~~~~~~~~~~~~~~
131+
132+
In triplet mode, the output consists of lines containing space-separated triplets:
133+
134+
.. code-block:: text
135+
136+
<opcode> <type> <operand1> <operand2> ...
137+
138+
Each line represents the information of one instruction, with the opcode, type,
139+
and operands.
140+
141+
Embedding Mode Output
142+
~~~~~~~~~~~~~~~~~~~~~
143+
144+
In embedding mode, the output format depends on the specified level:
145+
146+
* **Function Level**: One embedding vector per function
147+
* **Basic Block Level**: One embedding vector per basic block, grouped by function
148+
* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
149+
150+
Each embedding is printed as a bracketed, space-separated vector of floating-point values.
151+
152+
EXIT STATUS
153+
-----------
154+
155+
:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
156+
157+
Common failure cases include:
158+
159+
* Invalid or missing input file
160+
* Missing or invalid vocabulary file (in embedding mode)
161+
* Specified function not found in the module
162+
* Invalid command line options
163+
164+
SEE ALSO
165+
--------
166+
167+
:doc:`../MLGO`
168+
169+
For more information about the IR2Vec algorithm and approach, see:
170+
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
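
The options documented above can be combined into a command line from a script. The following is a minimal sketch, not part of the commit: the helper name ``build_ir2vec_command`` and its parameters are hypothetical, and it uses only the flags described in this page (``--mode``, ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``-o``).

```python
from typing import List, Optional


def build_ir2vec_command(
    input_file: str,
    mode: str = "embeddings",
    vocab_path: Optional[str] = None,
    level: str = "func",
    function: Optional[str] = None,
    output: str = "-",
) -> List[str]:
    """Assemble an llvm-ir2vec argument list (hypothetical helper)."""
    if mode not in ("triplets", "embeddings"):
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["llvm-ir2vec", f"--mode={mode}"]
    if mode == "embeddings":
        # The documentation states these options only apply in embedding mode.
        if level not in ("inst", "bb", "func"):
            raise ValueError(f"unknown level: {level}")
        if vocab_path is None:
            raise ValueError("embedding mode requires a vocabulary path")
        cmd += [f"--ir2vec-vocab-path={vocab_path}", f"--level={level}"]
        if function is not None:
            cmd.append(f"--function={function}")
    cmd += [input_file, "-o", output]
    return cmd


# Mirrors the two usage examples from the documentation.
embed_cmd = build_ir2vec_command(
    "input.bc", vocab_path="vocab.json", output="embeddings.txt"
)
triplet_cmd = build_ir2vec_command(
    "input.bc", mode="triplets", output="triplets.txt"
)
```

The resulting list can be handed to ``subprocess.run(cmd, check=True)``, which raises on the non-zero exit status described under EXIT STATUS.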

llvm/docs/MLGO.rst

Lines changed: 11 additions & 1 deletion
@@ -468,6 +468,13 @@ The core components are:
468468
Using IR2Vec
469469
------------
470470

471+
.. note::
472+
473+
This section describes how to use IR2Vec within LLVM passes. A standalone
474+
tool, :doc:`CommandGuide/llvm-ir2vec`, is available for generating
475+
embeddings and triplets from LLVM IR files, which can be useful for
476+
training vocabularies and generating embeddings outside of compiler passes.
477+
471478
For generating embeddings, first the vocabulary should be obtained. Then, the
472479
embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
473480

@@ -524,6 +531,10 @@ Further Details
524531
For more detailed information about the IR2Vec algorithm, its parameters, and
525532
advanced usage, please refer to the original paper:
526533
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
534+
535+
For information about using the IR2Vec tool for generating embeddings and
536+
triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
537+
527538
The LLVM source code for ``IR2Vec`` can also be explored to understand the
528539
implementation details.
529540

@@ -595,4 +606,3 @@ optimizations that are currently MLGO-enabled, it may be used as follows:
595606
where the ``name`` is a path fragment. We will expect to find 2 files,
596607
``<name>.in`` (readable, data incoming from the managing process) and
597608
``<name>.out`` (writable, the model runner sends data to the managing process)
598-

llvm/test/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -97,6 +97,7 @@ set(LLVM_TEST_DEPENDS
9797
llvm-exegesis
9898
llvm-extract
9999
llvm-gsymutil
100+
llvm-ir2vec
100101
llvm-isel-fuzzer
101102
llvm-ifs
102103
llvm-install-name-tool

llvm/test/lit.cfg.py

Lines changed: 8 additions & 0 deletions
@@ -93,6 +93,13 @@ def get_asan_rtlib():
9393
config.substitutions.append(("%exeext", config.llvm_exe_ext))
9494
config.substitutions.append(("%llvm_src_root", config.llvm_src_root))
9595

96+
# Add IR2Vec test vocabulary path substitution
97+
config.substitutions.append(
98+
(
99+
"%ir2vec_test_vocab_dir",
100+
os.path.join(config.test_source_root, "Analysis", "IR2Vec", "Inputs"),
101+
)
102+
)
96103

97104
lli_args = []
98105
# The target triple used by default by lli is the process target triple (some
@@ -197,6 +204,7 @@ def get_asan_rtlib():
197204
"llvm-dlltool",
198205
"llvm-exegesis",
199206
"llvm-extract",
207+
"llvm-ir2vec",
200208
"llvm-isel-fuzzer",
201209
"llvm-ifs",
202210
"llvm-install-name-tool",
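
To illustrate what the new ``%ir2vec_test_vocab_dir`` substitution does to a RUN line, here is a simplified model. It treats substitutions as plain text replacements applied in order; lit's real expansion logic is more involved, and the ``expand_substitutions`` helper and the ``test_source_root`` value below are assumptions made for the example.

```python
import os


def expand_substitutions(run_line, substitutions):
    """Apply (key, value) pairs in order using plain text replacement.

    This is a simplified stand-in for lit's own expansion logic.
    """
    for key, value in substitutions:
        run_line = run_line.replace(key, value)
    return run_line


# Hypothetical test source root; lit supplies the real one via config.
test_source_root = os.path.join("llvm", "test")

substitutions = [
    (
        "%ir2vec_test_vocab_dir",
        os.path.join(test_source_root, "Analysis", "IR2Vec", "Inputs"),
    ),
]

run_line = (
    "llvm-ir2vec --mode=embeddings "
    "--ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json"
)
expanded = expand_substitutions(run_line, substitutions)
```

After expansion, the vocabulary path points into the ``Analysis/IR2Vec/Inputs`` directory under the test source root, so tests in other directories can reuse the same vocabulary files.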
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
1+
; RUN: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
2+
; RUN: llvm-ir2vec --mode=embeddings --level=func --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
3+
; RUN: llvm-ir2vec --mode=embeddings --level=func --function=abc --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
4+
; RUN: not llvm-ir2vec --mode=embeddings --level=func --function=def --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
5+
; RUN: llvm-ir2vec --mode=embeddings --level=bb --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
6+
; RUN: llvm-ir2vec --mode=embeddings --level=bb --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
7+
; RUN: llvm-ir2vec --mode=embeddings --level=inst --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT
8+
9+
define dso_local noundef float @abc(i32 noundef %a, float noundef %b) #0 {
10+
entry:
11+
%a.addr = alloca i32, align 4
12+
%b.addr = alloca float, align 4
13+
store i32 %a, ptr %a.addr, align 4
14+
store float %b, ptr %b.addr, align 4
15+
%0 = load i32, ptr %a.addr, align 4
16+
%1 = load i32, ptr %a.addr, align 4
17+
%mul = mul nsw i32 %0, %1
18+
%conv = sitofp i32 %mul to float
19+
%2 = load float, ptr %b.addr, align 4
20+
%add = fadd float %conv, %2
21+
ret float %add
22+
}
23+
24+
define dso_local noundef float @abc_repeat(i32 noundef %a, float noundef %b) #0 {
25+
entry:
26+
%a.addr = alloca i32, align 4
27+
%b.addr = alloca float, align 4
28+
store i32 %a, ptr %a.addr, align 4
29+
store float %b, ptr %b.addr, align 4
30+
%0 = load i32, ptr %a.addr, align 4
31+
%1 = load i32, ptr %a.addr, align 4
32+
%mul = mul nsw i32 %0, %1
33+
%conv = sitofp i32 %mul to float
34+
%2 = load float, ptr %b.addr, align 4
35+
%add = fadd float %conv, %2
36+
ret float %add
37+
}
38+
39+
; CHECK-DEFAULT: Function: abc
40+
; CHECK-DEFAULT-NEXT: [ 878.00 889.00 900.00 ]
41+
; CHECK-DEFAULT-NEXT: Function: abc_repeat
42+
; CHECK-DEFAULT-NEXT: [ 878.00 889.00 900.00 ]
43+
44+
; CHECK-FUNC-LEVEL: Function: abc
45+
; CHECK-FUNC-LEVEL-NEXT: [ 878.00 889.00 900.00 ]
46+
; CHECK-FUNC-LEVEL-NEXT: Function: abc_repeat
47+
; CHECK-FUNC-LEVEL-NEXT: [ 878.00 889.00 900.00 ]
48+
49+
; CHECK-FUNC-LEVEL-ABC: Function: abc
50+
; CHECK-FUNC-LEVEL-ABC-NEXT: [ 878.00 889.00 900.00 ]
51+
52+
; CHECK-FUNC-DEF: Error: Function 'def' not found
53+
54+
; CHECK-BB-LEVEL: Function: abc
55+
; CHECK-BB-LEVEL-NEXT: entry: [ 878.00 889.00 900.00 ]
56+
; CHECK-BB-LEVEL-NEXT: Function: abc_repeat
57+
; CHECK-BB-LEVEL-NEXT: entry: [ 878.00 889.00 900.00 ]
58+
59+
; CHECK-BB-LEVEL-ABC-REPEAT: Function: abc_repeat
60+
; CHECK-BB-LEVEL-ABC-REPEAT-NEXT: entry: [ 878.00 889.00 900.00 ]
61+
62+
; CHECK-INST-LEVEL-ABC-REPEAT: Function: abc_repeat
63+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %a.addr = alloca i32, align 4 [ 91.00 92.00 93.00 ]
64+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %b.addr = alloca float, align 4 [ 91.00 92.00 93.00 ]
65+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: store i32 %a, ptr %a.addr, align 4 [ 97.00 98.00 99.00 ]
66+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: store float %b, ptr %b.addr, align 4 [ 97.00 98.00 99.00 ]
67+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %0 = load i32, ptr %a.addr, align 4 [ 94.00 95.00 96.00 ]
68+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %1 = load i32, ptr %a.addr, align 4 [ 94.00 95.00 96.00 ]
69+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %mul = mul nsw i32 %0, %1 [ 49.00 50.00 51.00 ]
70+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %conv = sitofp i32 %mul to float [ 130.00 131.00 132.00 ]
71+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %2 = load float, ptr %b.addr, align 4 [ 94.00 95.00 96.00 ]
72+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %add = fadd float %conv, %2 [ 40.00 41.00 42.00 ]
73+
; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: ret float %add [ 1.00 2.00 3.00 ]
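
The function-level CHECK lines above suggest an output shape of a ``Function: <name>`` line followed by a bracketed vector. As a sketch of how a consumer might read that output, the hypothetical ``parse_function_embeddings`` helper below assumes exactly that layout; FileCheck matches substrings, so the tool's real output may contain additional text that this parser does not handle.

```python
from typing import Dict, List


def parse_function_embeddings(text: str) -> Dict[str, List[float]]:
    """Map each function name to its embedding vector.

    Assumes a 'Function: <name>' line followed by a '[ v1 v2 ... ]' line,
    as suggested by the CHECK lines in the test.
    """
    embeddings: Dict[str, List[float]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Function:"):
            current = line[len("Function:"):].strip()
        elif current is not None and line.startswith("[") and line.endswith("]"):
            # Drop the brackets and convert the remaining tokens to floats.
            embeddings[current] = [float(tok) for tok in line[1:-1].split()]
            current = None
    return embeddings


# Sample text copied from the CHECK-FUNC-LEVEL expectations.
sample_output = """Function: abc
[ 878.00 889.00 900.00 ]
Function: abc_repeat
[ 878.00 889.00 900.00 ]
"""
parsed_embeddings = parse_function_embeddings(sample_output)
```

Basic-block and instruction-level output place the vector on the same line as a label, so they would need a different parser.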
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
; RUN: llvm-ir2vec --mode=triplets %s | FileCheck %s -check-prefix=TRIPLETS
2+
3+
define i32 @simple_add(i32 %a, i32 %b) {
4+
entry:
5+
%add = add i32 %a, %b
6+
ret i32 %add
7+
}
8+
9+
define i32 @simple_mul(i32 %x, i32 %y) {
10+
entry:
11+
%mul = mul i32 %x, %y
12+
ret i32 %mul
13+
}
14+
15+
define i32 @test_function(i32 %arg1, i32 %arg2) {
16+
entry:
17+
%local1 = alloca i32, align 4
18+
%local2 = alloca i32, align 4
19+
store i32 %arg1, ptr %local1, align 4
20+
store i32 %arg2, ptr %local2, align 4
21+
%load1 = load i32, ptr %local1, align 4
22+
%load2 = load i32, ptr %local2, align 4
23+
%result = add i32 %load1, %load2
24+
ret i32 %result
25+
}
26+
27+
; TRIPLETS: Add IntegerTy Variable Variable
28+
; TRIPLETS-NEXT: Ret VoidTy Variable
29+
; TRIPLETS-NEXT: Mul IntegerTy Variable Variable
30+
; TRIPLETS-NEXT: Ret VoidTy Variable
31+
; TRIPLETS-NEXT: Alloca PointerTy Constant
32+
; TRIPLETS-NEXT: Alloca PointerTy Constant
33+
; TRIPLETS-NEXT: Store VoidTy Variable Pointer
34+
; TRIPLETS-NEXT: Store VoidTy Variable Pointer
35+
; TRIPLETS-NEXT: Load IntegerTy Pointer
36+
; TRIPLETS-NEXT: Load IntegerTy Pointer
37+
; TRIPLETS-NEXT: Add IntegerTy Variable Variable
38+
; TRIPLETS-NEXT: Ret VoidTy Variable
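
The expected triplet lines above follow the ``<opcode> <type> <operand1> <operand2> ...`` layout from the documentation. A minimal sketch of reading them back for vocabulary training could look like the following; the ``Triplet`` type and ``parse_triplets`` helper are hypothetical and assume whitespace-separated fields with one instruction per line.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class Triplet(NamedTuple):
    opcode: str
    type_name: str
    operands: Tuple[str, ...]


def parse_triplets(text: str) -> List[Triplet]:
    """Split each non-empty line into opcode, type, and operand kinds."""
    triplets = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # Skip blank or malformed lines.
        triplets.append(Triplet(fields[0], fields[1], tuple(fields[2:])))
    return triplets


# Sample lines copied from the TRIPLETS expectations.
sample_triplets = """Add IntegerTy Variable Variable
Ret VoidTy Variable
Alloca PointerTy Constant
Store VoidTy Variable Pointer
Load IntegerTy Pointer
"""
parsed_triplets = parse_triplets(sample_triplets)
opcode_counts = Counter(t.opcode for t in parsed_triplets)
```

Counting opcodes, as in the last line, is one simple way to check that a corpus covers the instructions a vocabulary is expected to contain.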
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
1+
set(LLVM_LINK_COMPONENTS
2+
Analysis
3+
Core
4+
IRReader
5+
Support
6+
)
7+
8+
add_llvm_tool(llvm-ir2vec
9+
llvm-ir2vec.cpp
10+
)
