@@ -27,8 +27,180 @@ of models during training.
2727Corpus Tooling
2828==============
2929
30- ..
31- TODO(boomanaiden154): Write this section.
30+ Within the LLVM monorepo, the ``mlgo-utils`` Python package lives at
31+ ``llvm/utils/mlgo-utils``. This package primarily contains tooling
32+ for working with corpora, or collections of LLVM bitcode. We use these corpora
33+ to train and evaluate ML models. A corpus consists of a description in JSON
34+ format at ``corpus_description.json`` in the root of the corpus, and then
35+ a bitcode file and command line flags file for each extracted module. The
36+ corpus structure is designed to contain sufficient information to fully
37+ compile the bitcode to bit-identical object files.
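
The corpus layout described above can be checked with a short script. The
following is a minimal sketch, not part of ``mlgo-utils``: it assumes that the
description contains a ``modules`` list of paths relative to the corpus root,
and that each module is stored as ``<module>.bc`` plus ``<module>.cmd``. The
key name, the file extensions, and the helper names are assumptions made for
illustration.

```python
import json
from pathlib import Path


def list_corpus_files(corpus_dir):
    """Return (bitcode, flags) path pairs for each module in the description.

    Assumption: the description holds a "modules" list of relative paths,
    and each module has a .bc file and a .cmd file next to each other.
    """
    corpus_dir = Path(corpus_dir)
    description_path = corpus_dir / "corpus_description.json"
    description = json.loads(description_path.read_text())
    pairs = []
    for module in description["modules"]:
        bitcode = corpus_dir / (module + ".bc")
        flags = corpus_dir / (module + ".cmd")
        pairs.append((bitcode, flags))
    return pairs


def missing_files(corpus_dir):
    """Return the expected corpus files that do not exist on disk."""
    missing = []
    for pair in list_corpus_files(corpus_dir):
        for path in pair:
            if not path.exists():
                missing.append(str(path))
    return missing
```

Calling ``missing_files("./corpus")`` on a complete corpus should return an
empty list under these assumptions.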
38+
39+ .. program:: extract_ir.py
40+
41+ Synopsis
42+ --------
43+
44+ Extracts a corpus from some form of a structured compilation database. This
45+ tool supports a variety of different scenarios and input types.
46+
47+ Options
48+ -------
49+
50+ .. option:: --input
51+
52+   The path to the input. This should be a path to a supported structured
53+   compilation database. Currently, the supported inputs are
54+   ``compile_commands.json`` files, linker parameter files, a directory
55+   containing object files (for the local ThinLTO case only), and a JSON file
56+   containing a bazel aquery result.
57+
58+ .. option:: --input_type
59+
60+   The type of input that has been passed to the ``--input`` flag.
61+
62+ .. option:: --output_dir
63+
64+   The output directory to place the corpus in.
65+
66+ .. option:: --num_workers
67+
68+   The number of workers to use for extracting bitcode into the corpus. This
69+   defaults to the number of hardware threads available on the host system.
70+
71+ .. option:: --llvm_objcopy_path
72+
73+   The path to the llvm-objcopy binary to use when extracting bitcode.
74+
75+ .. option:: --obj_base_dir
76+
77+   The base directory for object files. Bitcode files that get extracted into
78+   the corpus will be placed into the output directory based on where their
79+   source object files are placed relative to this path.
80+
81+ .. option:: --cmd_filter
82+
83+   Allows filtering of modules by command line. If set, only modules that match
84+   the filter will be extracted into the corpus. Regular expressions are
85+   supported in some instances.
86+
87+ .. option:: --thinlto_build
88+
89+   If the build was performed with ThinLTO, this should be set to either
90+   ``distributed`` or ``local`` depending upon how the build was performed.
91+
92+ .. option:: --cmd_section_name
93+
94+   This flag allows specifying the command line section name. This is needed
95+   on non-ELF platforms where the section name might differ.
96+
97+ .. option:: --bitcode_section_name
98+
99+   This flag allows specifying the bitcode section name. This is needed on
100+   non-ELF platforms where the section name might differ.
101+
102+ Example: CMake
103+ --------------
104+
105+ CMake can output a ``compile_commands.json`` compilation database if the
106+ ``CMAKE_EXPORT_COMPILE_COMMANDS`` switch is turned on at configure time. It is
107+ also necessary to enable bitcode embedding (done by passing
108+ ``-Xclang -fembed-bitcode=all`` to all C/C++ compilation actions in the
109+ non-ThinLTO case). For example, to extract a corpus from clang, you would
110+ run the following commands (assuming that the system C/C++ compiler is clang):
111+
112+ .. code-block:: bash
113+
114+   cmake -GNinja \
115+     -DCMAKE_BUILD_TYPE=Release \
116+     -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
117+     -DCMAKE_C_FLAGS="-Xclang -fembed-bitcode=all" \
118+     -DCMAKE_CXX_FLAGS="-Xclang -fembed-bitcode=all" \
119+     ../llvm
120+   ninja
121+
122+ After running CMake and building the project, there should be a
123+ ``compile_commands.json`` file within the build directory. You can then
124+ run the following command to create a corpus:
125+
126+ .. code-block:: bash
127+
128+   python3 ./extract_ir.py \
129+     --input=./build/compile_commands.json \
130+     --input_type=json \
131+     --output_dir=./corpus
132+
133+ After running the above command, there should be a full
134+ corpus of bitcode within the ``./corpus`` directory.
135+
136+ Example: Bazel Aquery
137+ ---------------------
138+
139+ This tool also supports extracting bitcode from bazel in multiple ways
140+ depending upon the exact configuration. For ThinLTO, a linker parameters file
141+ is preferred. For the non-ThinLTO case, the script will accept the output of
142+ ``bazel aquery``, which it will use to find all the object files that are linked
143+ into a specific target and then extract bitcode from them. First, you need
144+ to generate the aquery output:
145+
146+ .. code-block:: bash
147+
148+   bazel aquery --output=jsonproto //path/to:target > /path/to/aquery.json
149+
150+ Afterwards, assuming that the build is already complete, you can run this
151+ script to create a corpus:
152+
153+ .. code-block:: bash
154+
155+   python3 ./extract_ir.py \
156+     --input=/path/to/aquery.json \
157+     --input_type=bazel_aquery \
158+     --output_dir=./corpus \
159+     --obj_base_dir=./bazel-bin
160+
161+ This will again leave a corpus that contains all the bitcode files. However,
162+ this mode does not capture all object files in the build, only the ones that
163+ are involved in the link for the binary passed to the ``bazel aquery``
164+ invocation.
165+
166+ .. program:: make_corpus.py
167+
168+ Synopsis
169+ --------
170+
171+ Creates a corpus from a collection of bitcode files.
172+
173+ Options
174+ -------
175+
176+ .. option:: --input_dir
177+
178+   The input directory to search for bitcode files in.
179+
180+ .. option:: --output_dir
181+
182+   The output directory to place the constructed corpus in.
183+
184+ .. option:: --default_args
185+
186+   A list of space separated flags that are put into the corpus description.
187+   These are used by some tooling when compiling the modules within the corpus.
188+
189+ .. program:: combine_training_corpus.py
190+
191+ Synopsis
192+ --------
193+
194+ Combines two training corpora that share the same parent folder by generating
195+ a new ``corpus_description.json`` that contains all the modules in both corpora.
196+
197+ Options
198+ -------
199+
200+ .. option:: --root_dir
201+
202+   The root directory that contains subfolders consisting of the corpora that
203+   should be combined.
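
To make the combination step concrete, the sketch below merges the
descriptions found in the immediate subfolders of a root directory. It is an
illustration under stated assumptions, not the implementation of
``combine_training_corpus.py``: it assumes each description holds a
``modules`` list of relative paths, it prefixes every module with its
subfolder name, and it ignores any other keys a real description may carry.
The function name ``combine_descriptions`` is hypothetical.

```python
import json
from pathlib import Path


def combine_descriptions(root_dir):
    """Write a combined corpus_description.json into root_dir.

    Assumption: each subfolder of root_dir is a corpus whose description
    contains a "modules" list of paths relative to that subfolder.
    """
    root = Path(root_dir)
    modules = []
    # Visit subfolders in sorted order so the result is deterministic.
    for description_path in sorted(root.glob("*/corpus_description.json")):
        prefix = description_path.parent.name
        description = json.loads(description_path.read_text())
        for module in description["modules"]:
            # Module paths become relative to the shared root directory.
            modules.append(prefix + "/" + module)
    combined = {"modules": modules}
    output_path = root / "corpus_description.json"
    output_path.write_text(json.dumps(combined, indent=2))
    return combined
```

Because module paths are rewritten relative to the shared root, the bitcode
files themselves stay in place and only the new description is written.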
32204
33205Interacting with ML models
34206==========================