Commit 21f1f95
[IR2Vec][llvm-ir2vec] Changing clEnumValN to cl::SubCommand (#151384)
Refactor llvm-ir2vec to use subcommands instead of a mode flag for better CLI usability.

- Converted the `--mode` flag to three distinct subcommands: `triplets`, `entities`, and `embeddings`
- Updated documentation, tests, and python script
1 parent 5f83387 commit 21f1f95
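
To make the CLI change concrete, here is a minimal Python sketch of how a caller might translate an old `--mode=<name>` argument list into the new subcommand form. The helper `migrate_argv` is hypothetical and is not part of the commit. It assumes only the `--mode=<name>` spelling was used, and it falls back to `embeddings` because that was the default of the removed flag.

```python
from typing import List

# Subcommand names introduced by this commit (taken from the commit message).
SUBCOMMANDS = {"triplets", "entities", "embeddings"}


def migrate_argv(argv: List[str]) -> List[str]:
    """Rewrite old-style `--mode=<name>` arguments to subcommand form.

    Hypothetical helper for illustration only; it is not part of LLVM.
    """
    mode = "embeddings"  # default of the removed --mode flag
    rest = []
    for arg in argv:
        if arg.startswith("--mode="):
            mode = arg.split("=", 1)[1]
            if mode not in SUBCOMMANDS:
                raise ValueError(f"unknown mode: {mode}")
        else:
            rest.append(arg)
    # The subcommand comes first, directly after the program name.
    return [mode] + rest


old = ["--mode=triplets", "input.bc", "-o", "triplets_train2id.txt"]
print(["llvm-ir2vec"] + migrate_argv(old))
```

The same translation applies to the RUN lines in the updated tests, where `--mode=entities` becomes `entities` and `--mode=embeddings` becomes `embeddings`.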

7 files changed, +103 -117 lines changed
llvm/docs/CommandGuide/llvm-ir2vec.rst

Lines changed: 62 additions & 58 deletions
@@ -6,27 +6,27 @@ llvm-ir2vec - IR2Vec Embedding Generation Tool
 SYNOPSIS
 --------
 
-:program:`llvm-ir2vec` [*options*] *input-file*
+:program:`llvm-ir2vec` [*subcommand*] [*options*]
 
 DESCRIPTION
 -----------
 
 :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
 generates IR2Vec embeddings for LLVM IR and supports triplet generation
-for vocabulary training. It provides three main operation modes:
+for vocabulary training. The tool provides three main subcommands:
 
-1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
+1. **triplets**: Generates numeric triplets in train2id format for vocabulary
 training from LLVM IR.
 
-2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
+2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
 training.
 
-3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
 at different granularity levels (instruction, basic block, or function).
 
 The tool is designed to facilitate machine learning applications that work with
 LLVM IR by converting the IR into numerical representations that can be used by
-ML models. The triplet mode generates numeric IDs directly instead of string
+ML models. The `triplets` subcommand generates numeric IDs directly instead of string
 triplets, streamlining the training data preparation workflow.
 
 .. note::
@@ -53,111 +53,115 @@ for details).
 See `llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py` for more details on how
 these two modes are used to generate the triplets and entity mappings.
 
-Triplet Generation Mode
-~~~~~~~~~~~~~~~~~~~~~~~
+Triplet Generation
+~~~~~~~~~~~~~~~~~~
 
-In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
-triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
-are generated in the standard format used for knowledge graph embedding training.
-The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
+With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts
+numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
+are generated in the standard format used for knowledge graph embedding training.
+The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
 infrastructure, eliminating the need for string-to-ID preprocessing.
 
 Usage:
 
 .. code-block:: bash
 
-llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
+llvm-ir2vec triplets input.bc -o triplets_train2id.txt
 
-Entity Mapping Generation Mode
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Entity Mapping Generation
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
-In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
-IR2Vec in the standard format used for knowledge graph embedding training. This
-mode outputs all supported entities (opcodes, types, and operands) with their
-corresponding numeric IDs, and is not specific for an LLVM IR file.
+With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
+supported by IR2Vec in the standard format used for knowledge graph embedding
+training. This subcommand outputs all supported entities (opcodes, types, and
+operands) with their corresponding numeric IDs, and is not specific for an
+LLVM IR file.
 
 Usage:
 
 .. code-block:: bash
 
-llvm-ir2vec --mode=entities -o entity2id.txt
+llvm-ir2vec entities -o entity2id.txt
 
-Embedding Generation Mode
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+Embedding Generation
+~~~~~~~~~~~~~~~~~~~~
 
-In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
+With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
 generate numerical embeddings for LLVM IR at different levels of granularity.
 
 Example Usage:
 
 .. code-block:: bash
 
-llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
 
 OPTIONS
 -------
 
-.. option:: --mode=<mode>
+Global options:
+
+.. option:: -o <filename>
+
+Specify the output filename. Use ``-`` to write to standard output (default).
+
+.. option:: --help
+
+Print a summary of command line options.
 
-Specify the operation mode. Valid values are:
+Subcommand-specific options:
 
-* ``triplets`` - Generate triplets for vocabulary training
-* ``entities`` - Generate entity mappings for vocabulary training
-* ``embeddings`` - Generate embeddings using trained vocabulary (default)
+**embeddings** subcommand:
+
+.. option:: <input-file>
+
+The input LLVM IR or bitcode file to process. This positional argument is
+required for the `embeddings` subcommand.
 
 .. option:: --level=<level>
 
-Specify the embedding generation level. Valid values are:
+Specify the embedding generation level. Valid values are:
 
-* ``inst`` - Generate instruction-level embeddings
-* ``bb`` - Generate basic block-level embeddings
-* ``func`` - Generate function-level embeddings (default)
+* ``inst`` - Generate instruction-level embeddings
+* ``bb`` - Generate basic block-level embeddings
+* ``func`` - Generate function-level embeddings (default)
 
 .. option:: --function=<name>
 
-Process only the specified function instead of all functions in the module.
+Process only the specified function instead of all functions in the module.
 
 .. option:: --ir2vec-vocab-path=<path>
 
-Specify the path to the vocabulary file (required for embedding mode).
-The vocabulary file should be in JSON format and contain the trained
-vocabulary for embedding generation. See `llvm/lib/Analysis/models`
-for pre-trained vocabulary files.
+Specify the path to the vocabulary file (required for embedding generation).
+The vocabulary file should be in JSON format and contain the trained
+vocabulary for embedding generation. See `llvm/lib/Analysis/models`
+for pre-trained vocabulary files.
 
 .. option:: --ir2vec-opc-weight=<weight>
 
-Specify the weight for opcode embeddings (default: 1.0). This controls
-the relative importance of instruction opcodes in the final embedding.
+Specify the weight for opcode embeddings (default: 1.0). This controls
+the relative importance of instruction opcodes in the final embedding.
 
 .. option:: --ir2vec-type-weight=<weight>
 
-Specify the weight for type embeddings (default: 0.5). This controls
-the relative importance of type information in the final embedding.
+Specify the weight for type embeddings (default: 0.5). This controls
+the relative importance of type information in the final embedding.
 
 .. option:: --ir2vec-arg-weight=<weight>
 
-Specify the weight for argument embeddings (default: 0.2). This controls
-the relative importance of operand information in the final embedding.
+Specify the weight for argument embeddings (default: 0.2). This controls
+the relative importance of operand information in the final embedding.
 
-.. option:: -o <filename>
 
-Specify the output filename. Use ``-`` to write to standard output (default).
+**triplets** subcommand:
 
-.. option:: --help
-
-Print a summary of command line options.
-
-.. note::
+.. option:: <input-file>
 
-``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
-``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
-mode. These options are ignored in triplet and entity modes.
+The input LLVM IR or bitcode file to process. This positional argument is
+required for the `triplets` subcommand.
 
-INPUT FILE FORMAT
------------------
+**entities** subcommand:
 
-:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
-(``.ll``) as input. The input file should contain valid LLVM IR.
+No subcommand-specific options.
 
 OUTPUT FORMAT
 -------------

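The option layout documented above (global `-o`, an input file for `triplets` and `embeddings`, and `--level`, `--function`, `--ir2vec-vocab-path` only under `embeddings`) can be mirrored with Python's `argparse` subparsers. This is an analogy for the design, not the tool's implementation: the program name `llvm-ir2vec-sketch` and the `dest` names are made up, the weight options are omitted for brevity, and the input file is treated as required because the documentation text says so.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the documented subcommand layout (not the real tool)."""
    parser = argparse.ArgumentParser(prog="llvm-ir2vec-sketch")
    sub = parser.add_subparsers(dest="subcommand", required=True)

    # Shared by every subcommand, like the documented global option -o.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-o", dest="output", default="-")

    triplets = sub.add_parser("triplets", parents=[common])
    triplets.add_argument("input_file")

    # entities takes no input file and no subcommand-specific options.
    sub.add_parser("entities", parents=[common])

    emb = sub.add_parser("embeddings", parents=[common])
    emb.add_argument("input_file")
    emb.add_argument("--level", choices=["inst", "bb", "func"], default="func")
    emb.add_argument("--function", default="")
    emb.add_argument("--ir2vec-vocab-path", dest="vocab_path")
    return parser


args = build_parser().parse_args(
    ["embeddings", "--ir2vec-vocab-path=vocab.json", "--level=bb",
     "input.bc", "-o", "embeddings.txt"]
)
print(args.subcommand, args.level, args.input_file, args.output)
```

In this sketch, passing `--level` to `triplets` is a parse error rather than a warning, because the option is registered only on the `embeddings` subparser. That is the same scoping idea the commit expresses with `cl::sub(EmbeddingsSubCmd)`; how the real tool reports a misplaced option is not shown in this diff.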
llvm/test/tools/llvm-ir2vec/embeddings.ll

Lines changed: 7 additions & 7 deletions
@@ -1,10 +1,10 @@
-; RUN: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
-; RUN: llvm-ir2vec --mode=embeddings --level=func --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
-; RUN: llvm-ir2vec --mode=embeddings --level=func --function=abc --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
-; RUN: not llvm-ir2vec --mode=embeddings --level=func --function=def --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
-; RUN: llvm-ir2vec --mode=embeddings --level=bb --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
-; RUN: llvm-ir2vec --mode=embeddings --level=bb --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
-; RUN: llvm-ir2vec --mode=embeddings --level=inst --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT
+; RUN: llvm-ir2vec embeddings --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
+; RUN: llvm-ir2vec embeddings --level=func --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
+; RUN: llvm-ir2vec embeddings --level=func --function=abc --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
+; RUN: not llvm-ir2vec embeddings --level=func --function=def --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
+; RUN: llvm-ir2vec embeddings --level=bb --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
+; RUN: llvm-ir2vec embeddings --level=bb --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
+; RUN: llvm-ir2vec embeddings --level=inst --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT
 
 define dso_local noundef float @abc(i32 noundef %a, float noundef %b) #0 {
 entry:

llvm/test/tools/llvm-ir2vec/entities.ll

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-; RUN: llvm-ir2vec --mode=entities | FileCheck %s
+; RUN: llvm-ir2vec entities | FileCheck %s
 
 CHECK: 92
 CHECK-NEXT: Ret 0
Lines changed: 2 additions & 11 deletions
@@ -1,14 +1,7 @@
 ; Test error handling and input validation for llvm-ir2vec tool
 
-; RUN: not llvm-ir2vec --mode=embeddings %s 2>&1 | FileCheck %s -check-prefix=CHECK-NO-VOCAB
-
-; RUN: not llvm-ir2vec --mode=embeddings --function=nonexistent --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-NOT-FOUND
-
-; RUN: llvm-ir2vec --mode=triplets --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json --level=inst %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-LEVEL
-; RUN: llvm-ir2vec --mode=entities --level=inst %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-LEVEL
-
-; RUN: llvm-ir2vec --mode=triplets --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json --function=dummy %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-FUNC
-; RUN: llvm-ir2vec --mode=entities --function=dummy %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-FUNC
+; RUN: not llvm-ir2vec embeddings %s 2>&1 | FileCheck %s -check-prefix=CHECK-NO-VOCAB
+; RUN: not llvm-ir2vec embeddings --function=nonexistent --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-NOT-FOUND
 
 ; Simple test function for valid IR
 define i32 @test_func(i32 %a) {
@@ -18,5 +11,3 @@ entry:
 
 ; CHECK-NO-VOCAB: error: IR2Vec vocabulary file path not specified; You may need to set it using --ir2vec-vocab-path
 ; CHECK-FUNC-NOT-FOUND: Error: Function 'nonexistent' not found
-; CHECK-UNUSED-LEVEL: Warning: --level option is ignored
-; CHECK-UNUSED-FUNC: Warning: --function option is ignored

llvm/test/tools/llvm-ir2vec/triplets.ll

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-; RUN: llvm-ir2vec --mode=triplets %s | FileCheck %s -check-prefix=TRIPLETS
+; RUN: llvm-ir2vec triplets %s | FileCheck %s -check-prefix=TRIPLETS
 
 define i32 @simple_add(i32 %a, i32 %b) {
 entry:

llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp

Lines changed: 28 additions & 37 deletions
@@ -9,22 +9,22 @@
 /// \file
 /// This file implements the IR2Vec embedding generation tool.
 ///
-/// This tool provides three main modes:
+/// This tool provides three main subcommands:
 ///
-/// 1. Triplet Generation Mode (--mode=triplets):
+/// 1. Triplet Generation (triplets):
 /// Generates numeric triplets (head, tail, relation) for vocabulary
 /// training. Output format: MAX_RELATION=N header followed by
 /// head\ttail\trelation lines. Relations: 0=Type, 1=Next, 2+=Arg0,Arg1,...
-/// Usage: llvm-ir2vec --mode=triplets input.bc -o train2id.txt
+/// Usage: llvm-ir2vec triplets input.bc -o train2id.txt
 ///
-/// 2. Entities Generation Mode (--mode=entities):
+/// 2. Entity Mappings (entities):
 /// Generates entity mappings for vocabulary training.
 /// Output format: <total_entities> header followed by entity\tid lines.
-/// Usage: llvm-ir2vec --mode=entities input.bc -o entity2id.txt
+/// Usage: llvm-ir2vec entities input.bc -o entity2id.txt
 ///
-/// 3. Embedding Generation Mode (--mode=embeddings):
+/// 3. Embedding Generation (embeddings):
 /// Generates IR2Vec embeddings using a trained vocabulary.
-/// Usage: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json
+/// Usage: llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json
 /// --level=func input.bc -o embeddings.txt Levels: --level=inst
 /// (instructions), --level=bb (basic blocks), --level=func (functions)
 /// (See IR2Vec.cpp for more embedding generation options)
@@ -55,36 +55,33 @@ namespace ir2vec {
 
 static cl::OptionCategory IR2VecToolCategory("IR2Vec Tool Options");
 
+// Subcommands
+static cl::SubCommand
+    TripletsSubCmd("triplets", "Generate triplets for vocabulary training");
+static cl::SubCommand
+    EntitiesSubCmd("entities",
+                   "Generate entity mappings for vocabulary training");
+static cl::SubCommand
+    EmbeddingsSubCmd("embeddings",
+                     "Generate embeddings using trained vocabulary");
+
+// Common options
 static cl::opt<std::string>
     InputFilename(cl::Positional,
                   cl::desc("<input bitcode file or '-' for stdin>"),
-                  cl::init("-"), cl::cat(IR2VecToolCategory));
+                  cl::init("-"), cl::sub(TripletsSubCmd),
+                  cl::sub(EmbeddingsSubCmd), cl::cat(IR2VecToolCategory));
 
 static cl::opt<std::string> OutputFilename("o", cl::desc("Output filename"),
                                            cl::value_desc("filename"),
                                            cl::init("-"),
                                            cl::cat(IR2VecToolCategory));
 
-enum ToolMode {
-  TripletMode,  // Generate triplets for vocabulary training
-  EntityMode,   // Generate entity mappings for vocabulary training
-  EmbeddingMode // Generate embeddings using trained vocabulary
-};
-
-static cl::opt<ToolMode> Mode(
-    "mode", cl::desc("Tool operation mode:"),
-    cl::values(clEnumValN(TripletMode, "triplets",
-                          "Generate triplets for vocabulary training"),
-               clEnumValN(EntityMode, "entities",
-                          "Generate entity mappings for vocabulary training"),
-               clEnumValN(EmbeddingMode, "embeddings",
-                          "Generate embeddings using trained vocabulary")),
-    cl::init(EmbeddingMode), cl::cat(IR2VecToolCategory));
-
+// Embedding-specific options
 static cl::opt<std::string>
     FunctionName("function", cl::desc("Process specific function only"),
                  cl::value_desc("name"), cl::Optional, cl::init(""),
-                 cl::cat(IR2VecToolCategory));
+                 cl::sub(EmbeddingsSubCmd), cl::cat(IR2VecToolCategory));
 
 enum EmbeddingLevel {
   InstructionLevel, // Generate instruction-level embeddings
@@ -93,14 +90,15 @@ enum EmbeddingLevel {
 };
 
 static cl::opt<EmbeddingLevel>
-    Level("level", cl::desc("Embedding generation level (for embedding mode):"),
+    Level("level", cl::desc("Embedding generation level:"),
           cl::values(clEnumValN(InstructionLevel, "inst",
                                 "Generate instruction-level embeddings"),
                      clEnumValN(BasicBlockLevel, "bb",
                                 "Generate basic block-level embeddings"),
                      clEnumValN(FunctionLevel, "func",
                                 "Generate function-level embeddings")),
-          cl::init(FunctionLevel), cl::cat(IR2VecToolCategory));
+          cl::init(FunctionLevel), cl::sub(EmbeddingsSubCmd),
+          cl::cat(IR2VecToolCategory));
 
 namespace {
 
@@ -291,7 +289,7 @@ class IR2VecTool {
 Error processModule(Module &M, raw_ostream &OS) {
   IR2VecTool Tool(M);
 
-  if (Mode == EmbeddingMode) {
+  if (EmbeddingsSubCmd) {
     // Initialize vocabulary for embedding generation
     // Note: Requires --ir2vec-vocab-path option to be set
     auto VocabStatus = Tool.initializeVocabulary();
@@ -311,6 +309,7 @@ Error processModule(Module &M, raw_ostream &OS) {
       Tool.generateEmbeddings(OS);
     }
   } else {
+    // Both triplets and entities use triplet generation
     Tool.generateTriplets(OS);
   }
   return Error::success();
@@ -334,22 +333,14 @@ int main(int argc, char **argv) {
       "See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more "
       "information.\n");
 
-  // Validate command line options
-  if (Mode != EmbeddingMode) {
-    if (Level.getNumOccurrences() > 0)
-      errs() << "Warning: --level option is ignored\n";
-    if (FunctionName.getNumOccurrences() > 0)
-      errs() << "Warning: --function option is ignored\n";
-  }
-
   std::error_code EC;
   raw_fd_ostream OS(OutputFilename, EC);
   if (EC) {
     errs() << "Error opening output file: " << EC.message() << "\n";
     return 1;
   }
 
-  if (Mode == EntityMode) {
+  if (EntitiesSubCmd) {
     // Just dump entity mappings without processing any IR
     IR2VecTool::generateEntityMappings(OS);
     return 0;

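The file header comment in `llvm-ir2vec.cpp` describes the `triplets` output as a `MAX_RELATION=N` header followed by `head\ttail\trelation` lines, with relation 0 meaning Type, 1 meaning Next, and 2 and above meaning Arg0, Arg1, and so on. A minimal Python sketch of a reader for that layout follows. The function names and the sample data are invented for illustration, and the sketch assumes the output contains nothing beyond the header and the tab-separated lines.

```python
from typing import List, Tuple


def relation_name(rel: int) -> str:
    """Map a relation ID to the label given in the header comment."""
    if rel == 0:
        return "Type"
    if rel == 1:
        return "Next"
    return f"Arg{rel - 2}"


def parse_triplets(text: str) -> Tuple[int, List[Tuple[int, int, int]]]:
    """Parse a MAX_RELATION header plus head<TAB>tail<TAB>relation lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    if not header.startswith("MAX_RELATION="):
        raise ValueError("missing MAX_RELATION header")
    max_relation = int(header.split("=", 1)[1])
    triplets = []
    for ln in lines[1:]:
        head, tail, rel = (int(field) for field in ln.split("\t"))
        triplets.append((head, tail, rel))
    return max_relation, triplets


# Made-up sample; real IDs come from the ir2vec::Vocabulary mapping.
sample = "MAX_RELATION=3\n12\t40\t0\n12\t13\t1\n12\t70\t2\n12\t71\t3\n"
max_rel, triplets = parse_triplets(sample)
for head, tail, rel in triplets:
    print(head, tail, relation_name(rel))
```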