@@ -6,27 +6,27 @@ llvm-ir2vec - IR2Vec Embedding Generation Tool
66SYNOPSIS
77--------
88
9- :program:`llvm-ir2vec` [*options*] *input-file*
9+ :program:`llvm-ir2vec` [*subcommand*] [*options*]
1010
1111DESCRIPTION
1212-----------
1313
1414:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
1515generates IR2Vec embeddings for LLVM IR and supports triplet generation
16- for vocabulary training. It provides three main operation modes:
16+ for vocabulary training. The tool provides three main subcommands:
1717
18- 1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
18+ 1. **triplets**: Generates numeric triplets in train2id format for vocabulary
1919 training from LLVM IR.
2020
21- 2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
21+ 2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
2222 training.
2323
24- 3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
24+ 3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
2525 at different granularity levels (instruction, basic block, or function).
2626
2727The tool is designed to facilitate machine learning applications that work with
2828LLVM IR by converting the IR into numerical representations that can be used by
29- ML models. The triplet mode generates numeric IDs directly instead of string
29+ ML models. The `triplets` subcommand generates numeric IDs directly instead of string
3030triplets, streamlining the training data preparation workflow.
3131
3232.. note::
@@ -53,111 +53,115 @@ for details).
5353See `llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py` for more details on how
5454these two modes are used to generate the triplets and entity mappings.
5555
56- Triplet Generation Mode
57- ~~~~~~~~~~~~~~~~~~~~~~~
56+ Triplet Generation
57+ ~~~~~~~~~~~~~~~~~~
5858
59- In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
60- triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
61- are generated in the standard format used for knowledge graph embedding training.
62- The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
59+ With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts
60+ numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
61+ are generated in the standard format used for knowledge graph embedding training.
62+ The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
6363infrastructure, eliminating the need for string-to-ID preprocessing.
6464
6565Usage:
6666
6767.. code-block:: bash
6868
69- llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
69+ llvm-ir2vec triplets input.bc -o triplets_train2id.txt
7070
71- Entity Mapping Generation Mode
72- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
71+ Entity Mapping Generation
72+ ~~~~~~~~~~~~~~~~~~~~~~~~~
7373
74- In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
75- IR2Vec in the standard format used for knowledge graph embedding training. This
76- mode outputs all supported entities (opcodes, types, and operands) with their
77- corresponding numeric IDs, and is not specific for an LLVM IR file.
74+ With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
75+ supported by IR2Vec in the standard format used for knowledge graph embedding
76+ training. This subcommand outputs all supported entities (opcodes, types, and
77+ operands) with their corresponding numeric IDs, and is not specific to any
78+ LLVM IR file.
7879
7980Usage:
8081
8182.. code-block:: bash
8283
83- llvm-ir2vec --mode=entities -o entity2id.txt
84+ llvm-ir2vec entities -o entity2id.txt
8485
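The two training-data subcommands above are typically used together. The following is a minimal sketch, not part of the tool: the helper name ``training_data_cmds``, the ``IR2VEC`` variable, and the output directory layout are illustrative assumptions. It only prints the two commands, using the subcommands and the ``-o`` option exactly as documented here.

```shell
#!/bin/sh
# Sketch: print the commands that prepare vocabulary-training data.
# IR2VEC and the helper name are hypothetical; only the subcommands and
# the -o option come from this document.
IR2VEC="${IR2VEC:-llvm-ir2vec}"

training_data_cmds() {
  input="$1"   # LLVM IR or bitcode file
  outdir="$2"  # directory that will hold the generated files
  # entities takes no input file: it lists every supported entity.
  printf '%s entities -o %s/entity2id.txt\n' "$IR2VEC" "$outdir"
  # triplets reads the IR file and writes numeric train2id triplets.
  printf '%s triplets %s -o %s/triplets_train2id.txt\n' "$IR2VEC" "$input" "$outdir"
}

training_data_cmds input.bc out
```

Piping the printed lines to ``sh`` would run them, provided the paths contain no spaces or shell metacharacters.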
85- Embedding Generation Mode
86- ~~~~~~~~~~~~~~~~~~~~~~~~~~
86+ Embedding Generation
87+ ~~~~~~~~~~~~~~~~~~~~
8788
88- In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
89+ With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
8990generate numerical embeddings for LLVM IR at different levels of granularity.
9091
9192Example Usage:
9293
9394.. code-block:: bash
9495
95- llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
96+ llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
9697
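To compare granularities, the same input can be embedded once per level. The sketch below is an illustration under assumptions: the helper name ``embedding_cmds``, the ``IR2VEC`` variable, and the ``embeddings_<level>.txt`` naming scheme are hypothetical. The level values ``inst``, ``bb``, and ``func`` are the ones this document lists for ``--level``. The helper prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: print one embeddings command per granularity level.
# IR2VEC, the helper name, and the output file names are hypothetical.
IR2VEC="${IR2VEC:-llvm-ir2vec}"

embedding_cmds() {
  vocab="$1"  # trained vocabulary (JSON)
  input="$2"  # LLVM IR or bitcode file
  for level in inst bb func; do
    printf '%s embeddings --ir2vec-vocab-path=%s --level=%s %s -o embeddings_%s.txt\n' \
      "$IR2VEC" "$vocab" "$level" "$input" "$level"
  done
}

embedding_cmds vocab.json input.bc
```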
9798OPTIONS
9899-------
99100
100- .. option:: --mode=<mode>
101+ Global options:
102+
103+ .. option:: -o <filename>
104+
105+ Specify the output filename. Use ``-`` to write to standard output (default).
106+
107+ .. option:: --help
108+
109+ Print a summary of command line options.
101110
102- Specify the operation mode. Valid values are:
111+ Subcommand-specific options:
103112
104- * ``triplets`` - Generate triplets for vocabulary training
105- * ``entities`` - Generate entity mappings for vocabulary training
106- * ``embeddings`` - Generate embeddings using trained vocabulary (default)
113+ **embeddings** subcommand:
114+
115+ .. option:: <input-file>
116+
117+ The input LLVM IR or bitcode file to process. This positional argument is
118+ required for the `embeddings` subcommand.
107119
108120.. option:: --level=<level>
109121
110- Specify the embedding generation level. Valid values are:
122+ Specify the embedding generation level. Valid values are:
111123
112- * ``inst`` - Generate instruction-level embeddings
113- * ``bb`` - Generate basic block-level embeddings
114- * ``func`` - Generate function-level embeddings (default)
124+ * ``inst`` - Generate instruction-level embeddings
125+ * ``bb`` - Generate basic block-level embeddings
126+ * ``func`` - Generate function-level embeddings (default)
115127
116128.. option:: --function=<name>
117129
118- Process only the specified function instead of all functions in the module.
130+ Process only the specified function instead of all functions in the module.
119131
120132.. option:: --ir2vec-vocab-path=<path>
121133
122- Specify the path to the vocabulary file (required for embedding mode).
123- The vocabulary file should be in JSON format and contain the trained
124- vocabulary for embedding generation. See `llvm/lib/Analysis/models`
125- for pre-trained vocabulary files.
134+ Specify the path to the vocabulary file (required for embedding generation).
135+ The vocabulary file should be in JSON format and contain the trained
136+ vocabulary for embedding generation. See `llvm/lib/Analysis/models`
137+ for pre-trained vocabulary files.
126138
127139.. option:: --ir2vec-opc-weight=<weight>
128140
129- Specify the weight for opcode embeddings (default: 1.0). This controls
130- the relative importance of instruction opcodes in the final embedding.
141+ Specify the weight for opcode embeddings (default: 1.0). This controls
142+ the relative importance of instruction opcodes in the final embedding.
131143
132144.. option:: --ir2vec-type-weight=<weight>
133145
134- Specify the weight for type embeddings (default: 0.5). This controls
135- the relative importance of type information in the final embedding.
146+ Specify the weight for type embeddings (default: 0.5). This controls
147+ the relative importance of type information in the final embedding.
136148
137149.. option:: --ir2vec-arg-weight=<weight>
138150
139- Specify the weight for argument embeddings (default: 0.2). This controls
140- the relative importance of operand information in the final embedding.
151+ Specify the weight for argument embeddings (default: 0.2). This controls
152+ the relative importance of operand information in the final embedding.
141153
142- .. option:: -o <filename>
143154
144- Specify the output filename. Use ``-`` to write to standard output (default).
155+ **triplets** subcommand:
145156
146- .. option:: --help
147-
148- Print a summary of command line options.
149-
150- .. note::
157+ .. option:: <input-file>
151158
152- ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
153- ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
154- mode. These options are ignored in triplet and entity modes.
159+ The input LLVM IR or bitcode file to process. This positional argument is
160+ required for the `triplets` subcommand.
155161
156- INPUT FILE FORMAT
157- -----------------
162+ **entities** subcommand:
158163
159- :program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
160- (``.ll``) as input. The input file should contain valid LLVM IR.
164+ No subcommand-specific options.
161165
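The embeddings options listed above can be combined in a single invocation. As a hedged sketch, the helper below (``weighted_embedding_cmd`` and the ``IR2VEC`` variable are hypothetical names) prints a command that embeds one function with explicit weights. When the weights are omitted, it falls back to the documented defaults of 1.0, 0.5, and 0.2. It writes to standard output via ``-o -``.

```shell
#!/bin/sh
# Sketch: print an embeddings command for one function with explicit weights.
# IR2VEC and the helper name are hypothetical; the flags are the ones
# documented for the embeddings subcommand.
IR2VEC="${IR2VEC:-llvm-ir2vec}"

weighted_embedding_cmd() {
  vocab="$1"        # trained vocabulary (JSON)
  input="$2"        # LLVM IR or bitcode file
  func="$3"         # function to process
  opc="${4:-1.0}"   # opcode weight (documented default 1.0)
  typ="${5:-0.5}"   # type weight (documented default 0.5)
  arg="${6:-0.2}"   # argument weight (documented default 0.2)
  printf '%s embeddings --ir2vec-vocab-path=%s --function=%s --ir2vec-opc-weight=%s --ir2vec-type-weight=%s --ir2vec-arg-weight=%s %s -o -\n' \
    "$IR2VEC" "$vocab" "$func" "$opc" "$typ" "$arg" "$input"
}

weighted_embedding_cmd vocab.json input.bc main 1.0 0.5 0.2
```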
162166OUTPUT FORMAT
163167-------------