[Questions] Problems while trying to train the regalloc-evict model
Hi, I'm glad to see that you have open-sourced the model and code for training a regalloc-evict model with ES. I'm an intern at a company, researching "new" technology for compiler optimization. While using MLGO and testing its performance, I've run into some difficulties.
- No performance improvement in my local tests. I tested with llvm-test-suite using the flag '-DTEST_SUITE_BENCHMARKING_ONLY=ON'. Whether I use the default benchmarks or specify SPEC2017, I do not see any performance improvement; instead, performance decreases by 0.2% ~ 0.7% on average over about 50 test runs. (The baseline is clang 17.0.6 with the flag '-O3', and MLGO is based on clang 17.0.6 with the flags '-mllvm -regalloc-enable-advisor=release -O3'.) The build commands are as follows; I'm not sure whether I'm missing something.
$ cmake -G Ninja \
-DCMAKE_C_COMPILER=$HOME/MLGO_test/build_clang_mlgo/bin/clang \
-DCMAKE_CXX_COMPILER=$HOME/MLGO_test/build_clang_mlgo/bin/clang++ \
-DTEST_SUITE_SPEC2017_ROOT=$HOME/MLGO_test/SPEC/speccpu \
-DTEST_SUITE_COLLECT_CODE_SIZE=Off \
-DTEST_SUITE_SUBDIRS=External/SPEC/CINT2017speed \
-DCMAKE_C_FLAGS="-mllvm -regalloc-enable-advisor=release -O3" \
-DCMAKE_CXX_FLAGS="-mllvm -regalloc-enable-advisor=release -O3" \
-DTEST_SUITE_BENCHMARKING_ONLY=ON \
-C../test-suite/cmake/caches/O3.cmake \
../test-suite
$ ninja -v -d keeprsp -j $(nproc)
$ llvm-lit -v -j 1 -o result_mlgo.json .
A test result with SPEC 2017:

MLGO 51.63522 vs. normal (baseline) 51.412342: a decrease of 0.433511%.
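With differences this small, it can help to compare the two runs per benchmark instead of as a single aggregate. The following is a minimal sketch, assuming both runs were written with `llvm-lit -o`, that each entry under `tests` carries an `exec_time` metric, and that the figures above are execution times (so a larger value means slower). The file name `result_baseline.json` is hypothetical; only `result_mlgo.json` appears in the commands above.

```python
import json
from pathlib import Path


def exec_times(results: dict) -> dict:
    """Map test name -> exec_time for tests that report that metric."""
    return {
        t["name"]: t["metrics"]["exec_time"]
        for t in results.get("tests", [])
        if "exec_time" in t.get("metrics", {})
    }


def relative_change_pct(baseline: float, candidate: float) -> float:
    """Positive result: the candidate value is larger than the baseline."""
    return (candidate - baseline) / baseline * 100.0


def compare(baseline: dict, candidate: dict) -> dict:
    """Per-test relative change for tests present in both result sets."""
    base, cand = exec_times(baseline), exec_times(candidate)
    return {
        name: relative_change_pct(base[name], cand[name])
        for name in sorted(base.keys() & cand.keys())
        if base[name] > 0
    }


if __name__ == "__main__":
    base_path = Path("result_baseline.json")  # hypothetical file name
    mlgo_path = Path("result_mlgo.json")
    if base_path.exists() and mlgo_path.exists():
        changes = compare(json.loads(base_path.read_text()),
                          json.loads(mlgo_path.read_text()))
        for name, pct in changes.items():
            print(f"{pct:+8.3f}%  {name}")
    # Sanity check against the two numbers reported above.
    print(f"{relative_change_pct(51.412342, 51.63522):.4f}%")
```

Applied to the two numbers reported above, `relative_change_pct` gives roughly 0.43%, which matches the quoted figure. A per-test listing makes it easier to see whether the change comes from a few benchmarks or is spread evenly and within run-to-run noise.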
- After the test in question 1, I'm wondering whether retraining the model on my company's business code would give better performance, so I tried to retrain the model. However, I encountered some difficulties.
2.1 Retraining the regalloc-evict-v1.1 model at commit 'd9cdc38' fails with the following error:
$ PYTHONPATH=$PYTHONPATH:. python3 compiler_opt/es/es_trainer.py \
--gin_bindings=RegallocTraceWorker.clang_path="'$WORKING_DIR/llvm-build/bin/clang'" \
--gin_bindings=RegallocTraceWorker.corpus_path="'$WORKING_DIR/corpus'" \
--gin_files=compiler_opt/es/regalloc_trace/gin_configs/regalloc_trace.gin \
--gin_files=compiler_opt/es/regalloc_trace/gin_configs/blackbox_learner.gin \
--output_path=$WORKING_DIR/output_model \
--pretrained_policy_path=$WORKING_DIR/regalloc_model \
--train_corpora=$WORKING_DIR/corpus
I0320 10:57:48.507808 140335615665984 es_trainer_lib.py:99] Reading policy parameters from $HOME/MLGO_test/regalloc_model
Traceback (most recent call last):
File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 42, in <module>
app.run(main)
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 35, in main
final_weights = es_trainer_lib.train()
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer_lib.py", line 109, in train
raise ValueError("Pretrained policy dimension is incorrect")
ValueError: Pretrained policy dimension is incorrect
(I replaced my actual path with '$HOME'.)
The dimension of the pretrained policy differs from that of the policy created by policy = policy_utils.create_actor_policy().
2.2 If I do not pass the parameter '--pretrained_policy_path', the policy is randomly initialized, and I get another error:
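The error suggests that two flattened parameter vectors have different lengths, so listing the per-variable shapes on both sides can show where they diverge. Below is a minimal sketch of such a comparison. The variable names and shapes are made up for illustration; they are not taken from the released model or from `policy_utils`, and how to extract the real shapes from each policy is left open.

```python
import math


def flat_size(shapes: dict) -> int:
    """Total number of scalars when all variables are flattened."""
    return sum(math.prod(shape) for shape in shapes.values())


def diff_shapes(expected: dict, loaded: dict) -> list:
    """Return (name, expected_shape, loaded_shape) for every mismatch."""
    report = []
    for name in sorted(expected.keys() | loaded.keys()):
        exp_shape = expected.get(name)  # None if the variable is missing
        got_shape = loaded.get(name)
        if exp_shape != got_shape:
            report.append((name, exp_shape, got_shape))
    return report


# Hypothetical variable name -> shape maps for the two policies.
expected = {"dense/kernel": (4, 8), "dense/bias": (8,)}
loaded = {"dense/kernel": (4, 16), "dense/bias": (16,)}

print("expected size:", flat_size(expected))
print("loaded size:  ", flat_size(loaded))
for name, exp_shape, got_shape in diff_shapes(expected, loaded):
    print(f"{name}: expected {exp_shape}, loaded {got_shape}")
```

If the mismatch is confined to a few layers, that points to a difference in network configuration between the released model and the current gin files rather than to a corrupted checkpoint.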
2025-03-20 11:02:23.346219: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2989] Estimated count of arithmetic ops: 0.437 M ops, equivalently 0.218 M MACs
Traceback (most recent call last):
File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 42, in <module>
app.run(main)
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 35, in main
final_weights = es_trainer_lib.train()
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer_lib.py", line 223, in train
learner.set_baseline(pool)
File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/blackbox_learner.py", line 234, in set_baseline
self._evaluator.set_baseline(pool)
File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/blackbox_evaluator.py", line 162, in set_baseline
self._baseline = futures[0].result()
File "$HOME/software/Python3.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "$HOME/software/Python3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '<basic block trace model path>'
In call to configurable 'train' (<function train at 0x7f22f44b43a0>)
I don't know which path I should pass for 'basic block trace model path'.
2.3 If I retrain the regalloc-evict-v1.1 model with the source code released in regalloc-evict-v1.1, the es directory seems to be incomplete (no gin files, and different arguments).
- If I collect a training corpus from another project, will the following commands work? Here the corpus is based on LLVM.
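The path in the error looks like a template placeholder that was never filled in, not a real file name. Here is a small sketch for finding such values, assuming unfilled placeholders in the gin files are quoted strings wrapped in angle brackets, as in the message above. The binding name used in the test string is hypothetical.

```python
import re
import sys
from pathlib import Path

# A quoted value of the form '<...>' or "<...>".
PLACEHOLDER = re.compile(r"""(['"])(<[^<>'"]+>)\1""")


def find_placeholders(text: str) -> list:
    """Return (line_number, placeholder) pairs for quoted '<...>' values."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip full-line comments
        for match in PLACEHOLDER.finditer(line):
            hits.append((lineno, match.group(2)))
    return hits


if __name__ == "__main__":
    # Usage: python find_placeholders.py path/to/a.gin path/to/b.gin
    for arg in sys.argv[1:]:
        for lineno, placeholder in find_placeholders(Path(arg).read_text()):
            print(f"{arg}:{lineno}: {placeholder}")
```

Running it over the two gin files passed via `--gin_files` should show which binding still holds the placeholder. That binding could then presumably be overridden with an extra `--gin_bindings=...` entry, in the same way the clang and corpus paths are set in the command above.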
# no-thinLTO
$ cmake -B out/Release -G Ninja \
-DCMAKE_CXX_COMPILER=$HOME/MLGO_test/build_clang_normal/bin/clang++ \
-DCMAKE_C_COMPILER=$HOME/MLGO_test/build_clang_normal/bin/clang \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DLLVM_ENABLE_PROJECTS="clang" \
-DLLVM_TARGETS_TO_BUILD=X86 \
-DCMAKE_CXX_FLAGS="-fembed-bitcode=all" \
./llvm
$ ninja -C ./out/Release -j$(nproc)
Thanks for your help! By the way, a tutorial like docs/regalloc-demo/demo.md for ES training would be very helpful.