Merged
120 changes: 62 additions & 58 deletions llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -6,27 +6,27 @@ llvm-ir2vec - IR2Vec Embedding Generation Tool
SYNOPSIS
--------

:program:`llvm-ir2vec` [*options*] *input-file*
:program:`llvm-ir2vec` [*subcommand*] [*options*]

DESCRIPTION
-----------

:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
generates IR2Vec embeddings for LLVM IR and supports triplet generation
for vocabulary training. It provides three main operation modes:
for vocabulary training. The tool provides three main subcommands:

1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
1. **triplets**: Generates numeric triplets in train2id format for vocabulary
training from LLVM IR.

2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
training.

3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
at different granularity levels (instruction, basic block, or function).

The tool is designed to facilitate machine learning applications that work with
LLVM IR by converting the IR into numerical representations that can be used by
ML models. The triplet mode generates numeric IDs directly instead of string
ML models. The `triplets` subcommand generates numeric IDs directly instead of string
triplets, streamlining the training data preparation workflow.

.. note::
@@ -53,111 +53,115 @@ for details).
See `llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py` for more details on how
these two modes are used to generate the triplets and entity mappings.

Triplet Generation Mode
~~~~~~~~~~~~~~~~~~~~~~~
Triplet Generation
~~~~~~~~~~~~~~~~~~

In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
are generated in the standard format used for knowledge graph embedding training.
The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts
numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
are generated in the standard format used for knowledge graph embedding training.
The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
infrastructure, eliminating the need for string-to-ID preprocessing.

Usage:

.. code-block:: bash

llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
llvm-ir2vec triplets input.bc -o triplets_train2id.txt

Entity Mapping Generation Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Entity Mapping Generation
~~~~~~~~~~~~~~~~~~~~~~~~~

In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
IR2Vec in the standard format used for knowledge graph embedding training. This
mode outputs all supported entities (opcodes, types, and operands) with their
corresponding numeric IDs, and is not specific for an LLVM IR file.
With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
supported by IR2Vec in the standard format used for knowledge graph embedding
training. This subcommand outputs all supported entities (opcodes, types, and
operands) with their corresponding numeric IDs and does not depend on any
particular LLVM IR file.

Usage:

.. code-block:: bash

llvm-ir2vec --mode=entities -o entity2id.txt
llvm-ir2vec entities -o entity2id.txt

Embedding Generation Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~
Embedding Generation
~~~~~~~~~~~~~~~~~~~~

In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
generate numerical embeddings for LLVM IR at different levels of granularity.

Example Usage:

.. code-block:: bash

llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt

OPTIONS
-------

.. option:: --mode=<mode>
Global options:

.. option:: -o <filename>

Specify the output filename. Use ``-`` to write to standard output (default).

.. option:: --help

Print a summary of command line options.

Specify the operation mode. Valid values are:
Subcommand-specific options:

* ``triplets`` - Generate triplets for vocabulary training
* ``entities`` - Generate entity mappings for vocabulary training
* ``embeddings`` - Generate embeddings using trained vocabulary (default)
**embeddings** subcommand:

.. option:: <input-file>

The input LLVM IR or bitcode file to process. This positional argument is
required for the `embeddings` subcommand.

.. option:: --level=<level>

Specify the embedding generation level. Valid values are:
Specify the embedding generation level. Valid values are:

* ``inst`` - Generate instruction-level embeddings
* ``bb`` - Generate basic block-level embeddings
* ``func`` - Generate function-level embeddings (default)
* ``inst`` - Generate instruction-level embeddings
* ``bb`` - Generate basic block-level embeddings
* ``func`` - Generate function-level embeddings (default)

.. option:: --function=<name>

Process only the specified function instead of all functions in the module.
Process only the specified function instead of all functions in the module.

.. option:: --ir2vec-vocab-path=<path>

Specify the path to the vocabulary file (required for embedding mode).
The vocabulary file should be in JSON format and contain the trained
vocabulary for embedding generation. See `llvm/lib/Analysis/models`
for pre-trained vocabulary files.
Specify the path to the vocabulary file (required for embedding generation).
The vocabulary file should be in JSON format and contain the trained
vocabulary for embedding generation. See `llvm/lib/Analysis/models`
for pre-trained vocabulary files.

.. option:: --ir2vec-opc-weight=<weight>

Specify the weight for opcode embeddings (default: 1.0). This controls
the relative importance of instruction opcodes in the final embedding.
Specify the weight for opcode embeddings (default: 1.0). This controls
the relative importance of instruction opcodes in the final embedding.

.. option:: --ir2vec-type-weight=<weight>

Specify the weight for type embeddings (default: 0.5). This controls
the relative importance of type information in the final embedding.
Specify the weight for type embeddings (default: 0.5). This controls
the relative importance of type information in the final embedding.

.. option:: --ir2vec-arg-weight=<weight>

Specify the weight for argument embeddings (default: 0.2). This controls
the relative importance of operand information in the final embedding.
Specify the weight for argument embeddings (default: 0.2). This controls
the relative importance of operand information in the final embedding.

.. option:: -o <filename>

Specify the output filename. Use ``-`` to write to standard output (default).
**triplets** subcommand:

.. option:: --help

Print a summary of command line options.

.. note::
.. option:: <input-file>

``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
mode. These options are ignored in triplet and entity modes.
The input LLVM IR or bitcode file to process. This positional argument is
required for the `triplets` subcommand.

INPUT FILE FORMAT
-----------------
**entities** subcommand:

:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
(``.ll``) as input. The input file should contain valid LLVM IR.
No subcommand-specific options.

OUTPUT FORMAT
-------------
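Scripts that still call the tool with `--mode=` need their argument lists rewritten for the subcommand syntax documented above. A minimal Python sketch of that rewrite follows. `migrate_argv` is a hypothetical helper, not part of LLVM, and it assumes the legacy flag is always spelled `--mode=<mode>` as in the old examples. Invocations without `--mode` map to `embeddings`, since that was the old default.

```python
from typing import List

# Subcommand names introduced by this change; they match the old --mode values.
SUBCOMMANDS = ("triplets", "entities", "embeddings")


def migrate_argv(args: List[str]) -> List[str]:
    """Rewrite legacy `--mode=<mode>` arguments into subcommand form.

    `args` is the argument list without the program name. This helper is
    hypothetical and only handles the `--mode=<mode>` spelling.
    """
    mode = "embeddings"  # the removed --mode option defaulted to embeddings
    rest: List[str] = []
    for arg in args:
        if arg.startswith("--mode="):
            mode = arg[len("--mode="):]
            if mode not in SUBCOMMANDS:
                raise ValueError(f"unknown mode: {mode!r}")
        else:
            rest.append(arg)
    # The subcommand must come first, before any options or the input file.
    return [mode] + rest


if __name__ == "__main__":
    old = ["--mode=triplets", "input.bc", "-o", "triplets_train2id.txt"]
    print("llvm-ir2vec", *migrate_argv(old))
```

Options that the old tool only warned about (for example `--level` combined with `--mode=triplets`) are passed through unchanged by this sketch. The removal of the warning tests in this change suggests the new tool may reject them instead, so such calls probably need manual cleanup.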
14 changes: 7 additions & 7 deletions llvm/test/tools/llvm-ir2vec/embeddings.ll
@@ -1,10 +1,10 @@
; RUN: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
; RUN: llvm-ir2vec --mode=embeddings --level=func --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
; RUN: llvm-ir2vec --mode=embeddings --level=func --function=abc --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
; RUN: not llvm-ir2vec --mode=embeddings --level=func --function=def --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
; RUN: llvm-ir2vec --mode=embeddings --level=bb --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
; RUN: llvm-ir2vec --mode=embeddings --level=bb --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
; RUN: llvm-ir2vec --mode=embeddings --level=inst --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT
; RUN: llvm-ir2vec embeddings --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
; RUN: llvm-ir2vec embeddings --level=func --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
; RUN: llvm-ir2vec embeddings --level=func --function=abc --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
; RUN: not llvm-ir2vec embeddings --level=func --function=def --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
; RUN: llvm-ir2vec embeddings --level=bb --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
; RUN: llvm-ir2vec embeddings --level=bb --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
; RUN: llvm-ir2vec embeddings --level=inst --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT

define dso_local noundef float @abc(i32 noundef %a, float noundef %b) #0 {
entry:
2 changes: 1 addition & 1 deletion llvm/test/tools/llvm-ir2vec/entities.ll
@@ -1,4 +1,4 @@
; RUN: llvm-ir2vec --mode=entities | FileCheck %s
; RUN: llvm-ir2vec entities | FileCheck %s

CHECK: 92
CHECK-NEXT: Ret 0
13 changes: 2 additions & 11 deletions llvm/test/tools/llvm-ir2vec/error-handling.ll
@@ -1,14 +1,7 @@
; Test error handling and input validation for llvm-ir2vec tool

; RUN: not llvm-ir2vec --mode=embeddings %s 2>&1 | FileCheck %s -check-prefix=CHECK-NO-VOCAB

; RUN: not llvm-ir2vec --mode=embeddings --function=nonexistent --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-NOT-FOUND

; RUN: llvm-ir2vec --mode=triplets --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json --level=inst %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-LEVEL
; RUN: llvm-ir2vec --mode=entities --level=inst %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-LEVEL

; RUN: llvm-ir2vec --mode=triplets --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json --function=dummy %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-FUNC
; RUN: llvm-ir2vec --mode=entities --function=dummy %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-FUNC
; RUN: not llvm-ir2vec embeddings %s 2>&1 | FileCheck %s -check-prefix=CHECK-NO-VOCAB
; RUN: not llvm-ir2vec embeddings --function=nonexistent --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-NOT-FOUND

; Simple test function for valid IR
define i32 @test_func(i32 %a) {
@@ -18,5 +11,3 @@ entry:

; CHECK-NO-VOCAB: error: IR2Vec vocabulary file path not specified; You may need to set it using --ir2vec-vocab-path
; CHECK-FUNC-NOT-FOUND: Error: Function 'nonexistent' not found
; CHECK-UNUSED-LEVEL: Warning: --level option is ignored
; CHECK-UNUSED-FUNC: Warning: --function option is ignored
2 changes: 1 addition & 1 deletion llvm/test/tools/llvm-ir2vec/triplets.ll
@@ -1,4 +1,4 @@
; RUN: llvm-ir2vec --mode=triplets %s | FileCheck %s -check-prefix=TRIPLETS
; RUN: llvm-ir2vec triplets %s | FileCheck %s -check-prefix=TRIPLETS

define i32 @simple_add(i32 %a, i32 %b) {
entry:
65 changes: 28 additions & 37 deletions llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
@@ -9,22 +9,22 @@
/// \file
/// This file implements the IR2Vec embedding generation tool.
///
/// This tool provides three main modes:
/// This tool provides three main subcommands:
///
/// 1. Triplet Generation Mode (--mode=triplets):
/// 1. Triplet Generation (triplets):
/// Generates numeric triplets (head, tail, relation) for vocabulary
/// training. Output format: MAX_RELATION=N header followed by
/// head\ttail\trelation lines. Relations: 0=Type, 1=Next, 2+=Arg0,Arg1,...
/// Usage: llvm-ir2vec --mode=triplets input.bc -o train2id.txt
/// Usage: llvm-ir2vec triplets input.bc -o train2id.txt
///
/// 2. Entities Generation Mode (--mode=entities):
/// 2. Entity Mappings (entities):
/// Generates entity mappings for vocabulary training.
/// Output format: <total_entities> header followed by entity\tid lines.
/// Usage: llvm-ir2vec --mode=entities input.bc -o entity2id.txt
/// Usage: llvm-ir2vec entities -o entity2id.txt
///
/// 3. Embedding Generation Mode (--mode=embeddings):
/// 3. Embedding Generation (embeddings):
/// Generates IR2Vec embeddings using a trained vocabulary.
/// Usage: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json
/// Usage: llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json
/// --level=func input.bc -o embeddings.txt Levels: --level=inst
/// (instructions), --level=bb (basic blocks), --level=func (functions)
/// (See IR2Vec.cpp for more embedding generation options)
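The header comment above fixes the two text formats used for vocabulary training: a `MAX_RELATION=N` header followed by tab-separated `head`, `tail`, `relation` lines for triplets, and a `<total_entities>` header followed by tab-separated `entity`, `id` lines for entities. As a rough sketch of how a consumer might read them (Python, since the accompanying `generateTriplets.py` utility is Python; `parse_train2id` and `parse_entity2id` are hypothetical names, and the sample IDs below are invented for illustration):

```python
from typing import Dict, List, Tuple


def parse_train2id(text: str) -> Tuple[int, List[Tuple[int, int, int]]]:
    """Parse triplet output: a MAX_RELATION=N header, then head\\ttail\\trelation lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("MAX_RELATION="):
        raise ValueError("missing MAX_RELATION=N header")
    max_relation = int(lines[0].split("=", 1)[1])
    triplets: List[Tuple[int, int, int]] = []
    for ln in lines[1:]:
        head, tail, relation = (int(field) for field in ln.split("\t"))
        triplets.append((head, tail, relation))
    return max_relation, triplets


def parse_entity2id(text: str) -> Dict[str, int]:
    """Parse entity output: a <total_entities> header, then entity\\tid lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("missing <total_entities> header")
    total = int(lines[0])
    mapping: Dict[str, int] = {}
    for ln in lines[1:]:
        name, idx = ln.rsplit("\t", 1)
        mapping[name] = int(idx)
    if len(mapping) != total:
        raise ValueError(f"header says {total} entities, found {len(mapping)}")
    return mapping


# Invented sample data, shaped like the formats described in the comment.
sample_triplets = "MAX_RELATION=3\n10\t93\t0\n10\t12\t1\n10\t95\t2\n"
sample_entities = "2\nRet\t0\nBr\t1\n"
max_relation, triplets = parse_train2id(sample_triplets)
entities = parse_entity2id(sample_entities)
```

The entity parser checks the header count against the number of parsed lines, which is a cheap way to catch truncated files. The triplet parser deliberately does not interpret `MAX_RELATION` beyond reading it, because the comment does not spell out its exact meaning.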
@@ -55,36 +55,33 @@ namespace ir2vec {

static cl::OptionCategory IR2VecToolCategory("IR2Vec Tool Options");

// Subcommands
static cl::SubCommand
TripletsSubCmd("triplets", "Generate triplets for vocabulary training");
static cl::SubCommand
EntitiesSubCmd("entities",
"Generate entity mappings for vocabulary training");
static cl::SubCommand
EmbeddingsSubCmd("embeddings",
"Generate embeddings using trained vocabulary");

// Common options
static cl::opt<std::string>
InputFilename(cl::Positional,
cl::desc("<input bitcode file or '-' for stdin>"),
cl::init("-"), cl::cat(IR2VecToolCategory));
cl::init("-"), cl::sub(TripletsSubCmd),
cl::sub(EmbeddingsSubCmd), cl::cat(IR2VecToolCategory));

static cl::opt<std::string> OutputFilename("o", cl::desc("Output filename"),
cl::value_desc("filename"),
cl::init("-"),
cl::cat(IR2VecToolCategory));

enum ToolMode {
TripletMode, // Generate triplets for vocabulary training
EntityMode, // Generate entity mappings for vocabulary training
EmbeddingMode // Generate embeddings using trained vocabulary
};

static cl::opt<ToolMode> Mode(
"mode", cl::desc("Tool operation mode:"),
cl::values(clEnumValN(TripletMode, "triplets",
"Generate triplets for vocabulary training"),
clEnumValN(EntityMode, "entities",
"Generate entity mappings for vocabulary training"),
clEnumValN(EmbeddingMode, "embeddings",
"Generate embeddings using trained vocabulary")),
cl::init(EmbeddingMode), cl::cat(IR2VecToolCategory));

// Embedding-specific options
static cl::opt<std::string>
FunctionName("function", cl::desc("Process specific function only"),
cl::value_desc("name"), cl::Optional, cl::init(""),
cl::cat(IR2VecToolCategory));
cl::sub(EmbeddingsSubCmd), cl::cat(IR2VecToolCategory));

enum EmbeddingLevel {
InstructionLevel, // Generate instruction-level embeddings
@@ -93,14 +90,15 @@ enum EmbeddingLevel {
};

static cl::opt<EmbeddingLevel>
Level("level", cl::desc("Embedding generation level (for embedding mode):"),
Level("level", cl::desc("Embedding generation level:"),
cl::values(clEnumValN(InstructionLevel, "inst",
"Generate instruction-level embeddings"),
clEnumValN(BasicBlockLevel, "bb",
"Generate basic block-level embeddings"),
clEnumValN(FunctionLevel, "func",
"Generate function-level embeddings")),
cl::init(FunctionLevel), cl::cat(IR2VecToolCategory));
cl::init(FunctionLevel), cl::sub(EmbeddingsSubCmd),
cl::cat(IR2VecToolCategory));

namespace {

@@ -291,7 +289,7 @@ class IR2VecTool {
Error processModule(Module &M, raw_ostream &OS) {
IR2VecTool Tool(M);

if (Mode == EmbeddingMode) {
if (EmbeddingsSubCmd) {
// Initialize vocabulary for embedding generation
// Note: Requires --ir2vec-vocab-path option to be set
auto VocabStatus = Tool.initializeVocabulary();
@@ -311,6 +309,7 @@ Error processModule(Module &M, raw_ostream &OS) {
Tool.generateEmbeddings(OS);
}
} else {
// Only the triplets subcommand reaches this branch; entities is handled in main()
Tool.generateTriplets(OS);
}
return Error::success();
@@ -334,22 +333,14 @@ int main(int argc, char **argv) {
"See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more "
"information.\n");

// Validate command line options
if (Mode != EmbeddingMode) {
if (Level.getNumOccurrences() > 0)
errs() << "Warning: --level option is ignored\n";
if (FunctionName.getNumOccurrences() > 0)
errs() << "Warning: --function option is ignored\n";
}

std::error_code EC;
raw_fd_ostream OS(OutputFilename, EC);
if (EC) {
errs() << "Error opening output file: " << EC.message() << "\n";
return 1;
}

if (Mode == EntityMode) {
if (EntitiesSubCmd) {
// Just dump entity mappings without processing any IR
IR2VecTool::generateEntityMappings(OS);
return 0;
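With the `--mode` validation block removed from `main`, option scoping is left to the `cl::sub(...)` registrations: the positional input is attached to `triplets` and `embeddings`, while `--level` and `--function` are attached to `embeddings` only. A caller that assembles command lines can mirror those rules up front. The sketch below is an illustration under stated assumptions, not part of this change: `build_command` is a hypothetical helper, and treating `--ir2vec-vocab-path` as embeddings-only follows the documentation rather than a `cl::sub` registration visible in this diff.

```python
from typing import List, Optional

SUBCOMMANDS = ("triplets", "entities", "embeddings")


def build_command(
    subcommand: str,
    input_file: Optional[str] = None,
    output: str = "-",
    *,
    level: Optional[str] = None,
    function: Optional[str] = None,
    vocab_path: Optional[str] = None,
    tool: str = "llvm-ir2vec",
) -> List[str]:
    """Assemble an llvm-ir2vec argument list, enforcing per-subcommand scoping."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")

    # Options documented under the embeddings subcommand.
    embedding_opts = {
        "--level": level,
        "--function": function,
        "--ir2vec-vocab-path": vocab_path,
    }
    used = {flag: value for flag, value in embedding_opts.items() if value is not None}

    if subcommand != "embeddings" and used:
        raise ValueError(f"{', '.join(used)} only apply to the embeddings subcommand")
    if subcommand == "embeddings" and vocab_path is None:
        # The tool reports an error when no vocabulary path is given.
        raise ValueError("embeddings requires --ir2vec-vocab-path")
    if subcommand == "entities" and input_file is not None:
        raise ValueError("entities does not take an input file")

    cmd = [tool, subcommand]
    cmd += [f"{flag}={value}" for flag, value in used.items()]
    if input_file is not None:
        cmd.append(input_file)
    cmd += ["-o", output]
    return cmd
```

The resulting list can be handed to `subprocess.run`. Failing early in the wrapper keeps the error next to the call site instead of surfacing as a tool diagnostic.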