Merge branch 'main' into sg_distr_minor_fixes

charithaintc · charithaintc · commit b52a0383ad6d · 2025-07-17T19:13:48.000Z
diff --git a/llvm/docs/CommandGuide/index.rst b/llvm/docs/CommandGuide/index.rst
@@ -27,6 +27,7 @@ Basic Commands
    llvm-dis
    llvm-dwarfdump
    llvm-dwarfutil
+   llvm-ir2vec
    llvm-lib
    llvm-libtool-darwin
    llvm-link
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -0,0 +1,170 @@
+llvm-ir2vec - IR2Vec Embedding Generation Tool
+==============================================
+
+.. program:: llvm-ir2vec
+
+SYNOPSIS
+--------
+
+:program:`llvm-ir2vec` [*options*] *input-file*
+
+DESCRIPTION
+-----------
+
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
+generates IR2Vec embeddings for LLVM IR and supports triplet generation 
+for vocabulary training. It provides two main operation modes:
+
+1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+   training from LLVM IR.
+
+2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+   at different granularity levels (instruction, basic block, or function).
+
+The tool is designed to facilitate machine learning applications that work with
+LLVM IR by converting the IR into numerical representations that can be used by
+ML models.
+
+.. note::
+
+   For information about using IR2Vec programmatically within LLVM passes and 
+   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_ 
+   section in the MLGO documentation.
+
+OPERATION MODES
+---------------
+
+Triplet Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
+consisting of opcodes, types, and operands. These triplets can be used to train
+vocabularies for embedding generation.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+
+Embedding Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
+generate numerical embeddings for LLVM IR at different levels of granularity.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+
+OPTIONS
+-------
+
+.. option:: --mode=<mode>
+
+ Specify the operation mode. Valid values are:
+
+ * ``triplets`` - Generate triplets for vocabulary training
+ * ``embeddings`` - Generate embeddings using a trained vocabulary (default)
+
+.. option:: --level=<level>
+
+ Specify the embedding generation level. Valid values are:
+
+ * ``inst`` - Generate instruction-level embeddings
+ * ``bb`` - Generate basic block-level embeddings  
+ * ``func`` - Generate function-level embeddings (default)
+
+.. option:: --function=<name>
+
+ Process only the specified function instead of all functions in the module.
+
+.. option:: --ir2vec-vocab-path=<path>
+
+ Specify the path to the vocabulary file (required for embedding mode).
+ The vocabulary file should be in JSON format and contain the trained
+ vocabulary for embedding generation. See ``llvm/lib/Analysis/models``
+ for pre-trained vocabulary files.
+
+.. option:: --ir2vec-opc-weight=<weight>
+
+ Specify the weight for opcode embeddings (default: 1.0). This controls
+ the relative importance of instruction opcodes in the final embedding.
+
+.. option:: --ir2vec-type-weight=<weight>
+
+ Specify the weight for type embeddings (default: 0.5). This controls
+ the relative importance of type information in the final embedding.
+
+.. option:: --ir2vec-arg-weight=<weight>
+
+ Specify the weight for argument embeddings (default: 0.2). This controls
+ the relative importance of operand information in the final embedding.
+
+.. option:: -o <filename>
+
+ Specify the output filename. Use ``-`` to write to standard output (default).
+
+.. option:: --help
+
+ Print a summary of command line options.
+
+.. note::
+
+   ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``, 
+   ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding 
+   mode. These options are ignored in triplet mode.
+
+INPUT FILE FORMAT
+-----------------
+
+:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files 
+(``.ll``) as input. The input file should contain valid LLVM IR.
+
+OUTPUT FORMAT
+-------------
+
+Triplet Mode Output
+~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, each output line holds one space-separated triplet:
+
+.. code-block:: text
+
+   <opcode> <type> <operand1> <operand2> ...
+
+Each line describes one instruction: its opcode, its type, and
+its operands.
+
+Embedding Mode Output
+~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, the output format depends on the specified level:
+
+* **Function Level**: One embedding vector per function
+* **Basic Block Level**: One embedding vector per basic block, grouped by function
+* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
+
+Each embedding is represented as a floating-point vector.
+
+EXIT STATUS
+-----------
+
+:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
+
+Common failure cases include:
+
+* Invalid or missing input file
+* Missing or invalid vocabulary file (in embedding mode)
+* Specified function not found in the module
+* Invalid command line options
+
+SEE ALSO
+--------
+
+:doc:`../MLGO`
+
+For more information about the IR2Vec algorithm and approach, see:
+`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
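
Note on the documentation above: the two invocations it shows can be assembled programmatically. The following is a minimal sketch, not part of this commit. The helper name `build_ir2vec_command` is hypothetical; the flags and their valid values are taken only from the OPTIONS section above.

```python
def build_ir2vec_command(input_file, mode="embeddings", vocab_path=None,
                         level=None, function=None, output=None):
    """Build an argv list for llvm-ir2vec from the documented options.

    Hypothetical helper: it only assembles the command line and does not
    run the tool.
    """
    if mode not in ("triplets", "embeddings"):
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["llvm-ir2vec", f"--mode={mode}"]
    if mode == "embeddings":
        # The docs state that the vocabulary path is required in this mode.
        if vocab_path is None:
            raise ValueError("embedding mode requires a vocabulary path")
        cmd.append(f"--ir2vec-vocab-path={vocab_path}")
        if level is not None:
            if level not in ("inst", "bb", "func"):
                raise ValueError(f"unknown level: {level!r}")
            cmd.append(f"--level={level}")
        if function is not None:
            cmd.append(f"--function={function}")
    # --level and --function are documented as ignored in triplet mode,
    # so they are simply not emitted there.
    cmd.append(input_file)
    if output is not None:
        cmd += ["-o", output]
    return cmd
```

The returned list can be passed to `subprocess.run`, assuming `llvm-ir2vec` is on `PATH`. Omitting flags that triplet mode ignores keeps the generated command line aligned with the note in the OPTIONS section.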
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
@@ -468,6 +468,13 @@ The core components are:
 Using IR2Vec
 ------------
 
+.. note::
+
+   This section describes how to use IR2Vec within LLVM passes. A standalone
+   tool, :doc:`CommandGuide/llvm-ir2vec`, is also available. It generates
+   embeddings and triplets from LLVM IR files, which is useful for training
+   vocabularies and for generating embeddings outside of compiler passes.
+
 For generating embeddings, first the vocabulary should be obtained. Then, the 
 embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
 
@@ -524,6 +531,10 @@ Further Details
 For more detailed information about the IR2Vec algorithm, its parameters, and
 advanced usage, please refer to the original paper:
 `IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+
+For information about using the ``llvm-ir2vec`` tool to generate embeddings and
+triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
 The LLVM source code for ``IR2Vec`` can also be explored to understand the 
 implementation details.
 
@@ -595,4 +606,3 @@ optimizations that are currently MLGO-enabled, it may be used as follows:
 where the ``name`` is a path fragment. We will expect to find 2 files,
 ``<name>.in`` (readable, data incoming from the managing process) and
 ``<name>.out`` (writable, the model runner sends data to the managing process)
-
diff --git a/llvm/test/CMakeLists.txt b/llvm/test/CMakeLists.txt
@@ -97,6 +97,7 @@ set(LLVM_TEST_DEPENDS
           llvm-exegesis
           llvm-extract
           llvm-gsymutil
+          llvm-ir2vec
           llvm-isel-fuzzer
           llvm-ifs
           llvm-install-name-tool
diff --git a/llvm/test/lit.cfg.py b/llvm/test/lit.cfg.py
@@ -93,6 +93,13 @@ def get_asan_rtlib():
 config.substitutions.append(("%exeext", config.llvm_exe_ext))
 config.substitutions.append(("%llvm_src_root", config.llvm_src_root))
 
+# Add IR2Vec test vocabulary path substitution
+config.substitutions.append(
+    (
+        "%ir2vec_test_vocab_dir",
+        os.path.join(config.test_source_root, "Analysis", "IR2Vec", "Inputs"),
+    )
+)
 
 lli_args = []
 # The target triple used by default by lli is the process target triple (some
@@ -197,6 +204,7 @@ def get_asan_rtlib():
         "llvm-dlltool",
         "llvm-exegesis",
         "llvm-extract",
+        "llvm-ir2vec",
         "llvm-isel-fuzzer",
         "llvm-ifs",
         "llvm-install-name-tool",
diff --git a/llvm/test/tools/llvm-ir2vec/embeddings.ll b/llvm/test/tools/llvm-ir2vec/embeddings.ll
@@ -0,0 +1,73 @@
+; RUN: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
+; RUN: llvm-ir2vec --mode=embeddings --level=func --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
+; RUN: llvm-ir2vec --mode=embeddings --level=func --function=abc --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
+; RUN: not llvm-ir2vec --mode=embeddings --level=func --function=def --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
+; RUN: llvm-ir2vec --mode=embeddings --level=bb --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
+; RUN: llvm-ir2vec --mode=embeddings --level=bb --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
+; RUN: llvm-ir2vec --mode=embeddings --level=inst --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT
+
+define dso_local noundef float @abc(i32 noundef %a, float noundef %b) #0 {
+entry:
+  %a.addr = alloca i32, align 4
+  %b.addr = alloca float, align 4
+  store i32 %a, ptr %a.addr, align 4
+  store float %b, ptr %b.addr, align 4
+  %0 = load i32, ptr %a.addr, align 4
+  %1 = load i32, ptr %a.addr, align 4
+  %mul = mul nsw i32 %0, %1
+  %conv = sitofp i32 %mul to float
+  %2 = load float, ptr %b.addr, align 4
+  %add = fadd float %conv, %2
+  ret float %add
+}
+
+define dso_local noundef float @abc_repeat(i32 noundef %a, float noundef %b) #0 {
+entry:
+  %a.addr = alloca i32, align 4
+  %b.addr = alloca float, align 4
+  store i32 %a, ptr %a.addr, align 4
+  store float %b, ptr %b.addr, align 4
+  %0 = load i32, ptr %a.addr, align 4
+  %1 = load i32, ptr %a.addr, align 4
+  %mul = mul nsw i32 %0, %1
+  %conv = sitofp i32 %mul to float
+  %2 = load float, ptr %b.addr, align 4
+  %add = fadd float %conv, %2
+  ret float %add
+}
+
+; CHECK-DEFAULT: Function: abc
+; CHECK-DEFAULT-NEXT: [ 878.00  889.00  900.00 ]
+; CHECK-DEFAULT-NEXT: Function: abc_repeat
+; CHECK-DEFAULT-NEXT: [ 878.00  889.00  900.00 ]
+
+; CHECK-FUNC-LEVEL: Function: abc 
+; CHECK-FUNC-LEVEL-NEXT: [ 878.00  889.00  900.00 ]
+; CHECK-FUNC-LEVEL-NEXT: Function: abc_repeat 
+; CHECK-FUNC-LEVEL-NEXT: [ 878.00  889.00  900.00 ]
+
+; CHECK-FUNC-LEVEL-ABC: Function: abc
+; CHECK-FUNC-LEVEL-ABC-NEXT: [ 878.00  889.00  900.00 ]
+
+; CHECK-FUNC-DEF: Error: Function 'def' not found
+
+; CHECK-BB-LEVEL: Function: abc
+; CHECK-BB-LEVEL-NEXT: entry: [ 878.00  889.00  900.00 ]
+; CHECK-BB-LEVEL-NEXT: Function: abc_repeat
+; CHECK-BB-LEVEL-NEXT: entry: [ 878.00  889.00  900.00 ]
+
+; CHECK-BB-LEVEL-ABC-REPEAT: Function: abc_repeat
+; CHECK-BB-LEVEL-ABC-REPEAT-NEXT: entry: [ 878.00  889.00  900.00 ]
+
+; CHECK-INST-LEVEL-ABC-REPEAT: Function: abc_repeat
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %a.addr = alloca i32, align 4 [ 91.00  92.00  93.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %b.addr = alloca float, align 4 [ 91.00  92.00  93.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: store i32 %a, ptr %a.addr, align 4 [ 97.00  98.00  99.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: store float %b, ptr %b.addr, align 4 [ 97.00  98.00  99.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %0 = load i32, ptr %a.addr, align 4 [ 94.00  95.00  96.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %1 = load i32, ptr %a.addr, align 4 [ 94.00  95.00  96.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %mul = mul nsw i32 %0, %1 [ 49.00  50.00  51.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %conv = sitofp i32 %mul to float [ 130.00  131.00  132.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %2 = load float, ptr %b.addr, align 4 [ 94.00  95.00  96.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: %add = fadd float %conv, %2 [ 40.00  41.00  42.00 ]
+; CHECK-INST-LEVEL-ABC-REPEAT-NEXT: ret float %add [ 1.00  2.00  3.00 ]
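
Note on the test above: the CHECK lines suggest the shape of the embedding output, namely a `Function: <name>` header followed by lines that end in a bracketed vector, optionally preceded by a basic block label or the instruction text. The following is a minimal parsing sketch under that assumption. FileCheck matches substrings and tolerates whitespace differences, so the exact spacing of the real output is not confirmed here, and `parse_embeddings` is a hypothetical name.

```python
import re

# A line that ends in "[ v1 v2 ... ]"; whatever precedes the vector is the label.
_VECTOR_LINE = re.compile(r"^(?P<label>.*?)\s*\[(?P<values>[^\[\]]*)\]\s*$")


def parse_embeddings(text):
    """Return {function name: [(label, vector), ...]}.

    The label is "" at function level, the block name at basic block level,
    and the instruction text at instruction level.
    """
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Function:"):
            current = line[len("Function:"):].strip()
            result[current] = []
            continue
        match = _VECTOR_LINE.match(line)
        if match is None or current is None:
            continue  # skip anything that does not look like a vector line
        label = match.group("label").rstrip(":")
        vector = [float(v) for v in match.group("values").split()]
        result[current].append((label, vector))
    return result
```

Anchoring the vector at the end of the line means instruction text that itself contains brackets, such as an array type, stays in the label.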
diff --git a/llvm/test/tools/llvm-ir2vec/triplets.ll b/llvm/test/tools/llvm-ir2vec/triplets.ll
@@ -0,0 +1,38 @@
+; RUN: llvm-ir2vec --mode=triplets %s | FileCheck %s -check-prefix=TRIPLETS
+
+define i32 @simple_add(i32 %a, i32 %b) {
+entry:
+  %add = add i32 %a, %b
+  ret i32 %add
+}
+
+define i32 @simple_mul(i32 %x, i32 %y) {
+entry:
+  %mul = mul i32 %x, %y
+  ret i32 %mul
+}
+
+define i32 @test_function(i32 %arg1, i32 %arg2) {
+entry:
+  %local1 = alloca i32, align 4
+  %local2 = alloca i32, align 4
+  store i32 %arg1, ptr %local1, align 4
+  store i32 %arg2, ptr %local2, align 4
+  %load1 = load i32, ptr %local1, align 4
+  %load2 = load i32, ptr %local2, align 4
+  %result = add i32 %load1, %load2
+  ret i32 %result
+}
+
+; TRIPLETS: Add IntegerTy Variable Variable
+; TRIPLETS-NEXT: Ret VoidTy Variable
+; TRIPLETS-NEXT: Mul IntegerTy Variable Variable
+; TRIPLETS-NEXT: Ret VoidTy Variable
+; TRIPLETS-NEXT: Alloca PointerTy Constant
+; TRIPLETS-NEXT: Alloca PointerTy Constant
+; TRIPLETS-NEXT: Store VoidTy Variable Pointer
+; TRIPLETS-NEXT: Store VoidTy Variable Pointer
+; TRIPLETS-NEXT: Load IntegerTy Pointer
+; TRIPLETS-NEXT: Load IntegerTy Pointer
+; TRIPLETS-NEXT: Add IntegerTy Variable Variable
+; TRIPLETS-NEXT: Ret VoidTy Variable
diff --git a/llvm/tools/llvm-ir2vec/CMakeLists.txt b/llvm/tools/llvm-ir2vec/CMakeLists.txt
@@ -0,0 +1,10 @@
+set(LLVM_LINK_COMPONENTS
+  Analysis
+  Core
+  IRReader
+  Support
+  )
+
+add_llvm_tool(llvm-ir2vec
+  llvm-ir2vec.cpp
+  )
diff --git a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp