[Questions] Problems during trying to train regalloc-evict model #475

Hi, I'm glad to see that you have open-sourced the model and code for training a regalloc-evict model with ES. I'm an intern at a company, and I'm researching "new" technologies for compiler optimization. While using MLGO and testing its performance, I've run into some difficulties.

1. No performance improvement in my local tests. I test with llvm-test-suite using the flag '-DTEST_SUITE_BENCHMARKING_ONLY=ON'. Whether I use the default benchmarks or specify SPEC2017, I do not see any performance improvement; instead, performance decreases by 0.2% ~ 0.7% on average over about 50 test runs. (The baseline is clang 17.0.6 with the flag '-O3'; MLGO is based on clang 17.0.6 with the flags '-mllvm -regalloc-enable-advisor=release -O3'.) The build commands are as follows, and I'm not sure whether I'm missing something.

```
$ cmake -G Ninja \
-DCMAKE_C_COMPILER=$HOME/MLGO_test/build_clang_mlgo/bin/clang \
-DCMAKE_CXX_COMPILER=$HOME/MLGO_test/build_clang_mlgo/bin/clang++ \
-DTEST_SUITE_SPEC2017_ROOT=$HOME/MLGO_test/SPEC/speccpu \
-DTEST_SUITE_COLLECT_CODE_SIZE=Off \
-DTEST_SUITE_SUBDIRS=External/SPEC/CINT2017speed \
-DCMAKE_C_FLAGS="-mllvm -regalloc-enable-advisor=release -O3" \
-DCMAKE_CXX_FLAGS="-mllvm -regalloc-enable-advisor=release -O3" \
-DTEST_SUITE_BENCHMARKING_ONLY=ON \
-C../test-suite/cmake/caches/O3.cmake \
../test-suite
$ ninja -v -d keeprsp -j $(nproc)
$ llvm-lit -v -j 1 -o result_mlgo.json .
```
A test result with SPEC 2017:
<img width="1648" alt="Image" src="https://github.com/user-attachments/assets/e2ac6584-990e-41b1-9008-6388e2c16cf3" />
MLGO 51.63522 vs. normal (baseline) 51.412342: a performance decrease of 0.433511%.
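
For reference, the relative difference quoted above can be reproduced with a short sketch. It assumes the two numbers are aggregate execution times where lower is better, and that the lit JSON reports store per-test timings under `tests[*].metrics.exec_time`; that layout is an assumption and should be checked against the actual `result_mlgo.json`. The helper names are made up for illustration.

```python
from __future__ import annotations

import json


def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def load_exec_times(path: str) -> dict[str, float]:
    """Read per-test exec_time values from a lit JSON report.

    Assumed layout: {"tests": [{"name": ..., "metrics": {"exec_time": ...}}]}.
    """
    with open(path) as f:
        report = json.load(f)
    return {
        test["name"]: test["metrics"]["exec_time"]
        for test in report.get("tests", [])
        if "exec_time" in test.get("metrics", {})
    }


def compare(baseline: dict[str, float],
            candidate: dict[str, float]) -> dict[str, float]:
    """Per-test percent change for tests present in both reports."""
    common = baseline.keys() & candidate.keys()
    return {
        name: percent_change(baseline[name], candidate[name])
        for name in sorted(common)
        if baseline[name] > 0  # skip tests without a usable baseline time
    }


# The two aggregate numbers quoted above (baseline first, MLGO second).
print(f"{percent_change(51.412342, 51.63522):.6f}%")
```

A per-benchmark view would be `compare(load_exec_times("result_baseline.json"), load_exec_times("result_mlgo.json"))`, where `result_baseline.json` is a hypothetical name for the baseline report. With differences this small, the per-test spread across repeated runs matters as much as the average.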

2. After the test in question 1, I'm wondering whether retraining the model on the company's business code would give better performance. So I tried to retrain the model. However, I encountered some difficulties.

2.1 If I retrain the regalloc-evict-v1.1 model at commit 'd9cdc38':
```
$ PYTHONPATH=$PYTHONPATH:. python3 compiler_opt/es/es_trainer.py \
  --gin_bindings=RegallocTraceWorker.clang_path="'$WORKING_DIR/llvm-build/bin/clang'" \
  --gin_bindings=RegallocTraceWorker.corpus_path="'$WORKING_DIR/corpus'" \
  --gin_files=compiler_opt/es/regalloc_trace/gin_configs/regalloc_trace.gin \
  --gin_files=compiler_opt/es/regalloc_trace/gin_configs/blackbox_learner.gin \
  --output_path=$WORKING_DIR/output_model \
  --pretrained_policy_path=$WORKING_DIR/regalloc_model \
  --train_corpora=$WORKING_DIR/corpus

I0320 10:57:48.507808 140335615665984 es_trainer_lib.py:99] Reading policy parameters from $HOME/MLGO_test/regalloc_model
Traceback (most recent call last):
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 42, in <module>
    app.run(main)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 35, in main
    final_weights = es_trainer_lib.train()
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer_lib.py", line 109, in train
    raise ValueError("Pretrained policy dimension is incorrect")
ValueError: Pretrained policy dimension is incorrect
```
(I replaced the actual paths with '$HOME'.)

The dimension of pretrained_policy differs from that of policy = policy_utils.create_actor_policy().
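
To make the mismatch concrete, here is a minimal sketch of the kind of length check that can raise this error, assuming both policies are flattened into 1-D parameter vectors before being compared. The function names and the layer shapes are made up for illustration; they are not the real regalloc network and not the code in es_trainer_lib.py.

```python
from math import prod


def flat_size(shapes):
    """Number of scalars once every tensor of the given shapes is flattened."""
    return sum(prod(shape) for shape in shapes)


def check_pretrained(pretrained_shapes, fresh_shapes):
    """Raise if the pretrained vector would not fit the freshly built policy."""
    expected = flat_size(fresh_shapes)
    actual = flat_size(pretrained_shapes)
    if actual != expected:
        raise ValueError(
            f"Pretrained policy dimension is incorrect: "
            f"got {actual} parameters, expected {expected}"
        )
    return expected


# Hypothetical shapes: the same layers, but a different first-layer width.
fresh = [(33, 40), (40,), (40, 1), (1,)]
pretrained = [(33, 80), (80,), (80, 1), (1,)]

try:
    check_pretrained(pretrained, fresh)
except ValueError as err:
    print(err)
```

Printing the two totals (or the per-layer shapes) for the saved model and for the policy built from the gin config would show which layer differs.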

2.2 If I do not pass the parameter '--pretrained_policy_path', the policy is randomly initialized, and I get another error:

```

2025-03-20 11:02:23.346219: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2989] Estimated count of arithmetic ops: 0.437 M  ops, equivalently 0.218 M  MACs
Traceback (most recent call last):
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 42, in <module>
    app.run(main)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 35, in main
    final_weights = es_trainer_lib.train()
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer_lib.py", line 223, in train
    learner.set_baseline(pool)
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/blackbox_learner.py", line 234, in set_baseline
    self._evaluator.set_baseline(pool)
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/blackbox_evaluator.py", line 162, in set_baseline
    self._baseline = futures[0].result()
  File "$HOME/software/Python3.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "$HOME/software/Python3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '<basic block trace model path>'
    In call to configurable 'train' (<function train at 0x7f22f44b43a0>)
```
I don't know which path I should pass for 'basic block trace model path'.
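
The quoted path looks like an unfilled placeholder from one of the gin files rather than a real location. A small sketch that lists such placeholders, assuming they are written as quoted `'<...>'` values on single-line `name = value` bindings; the helper names are made up, and the comment handling is naive (everything after a `#` is dropped):

```python
import re
from pathlib import Path

# Matches lines such as:  Some.binding_name = '<some placeholder>'
PLACEHOLDER = re.compile(
    r"""^\s*([\w./@%]+)\s*=\s*['"](<[^'"<>]+>)['"]\s*$"""
)


def find_placeholders(text):
    """Return (line number, binding name, placeholder) for unfilled values."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive: drop trailing comments
        match = PLACEHOLDER.match(code)
        if match:
            hits.append((lineno, match.group(1), match.group(2)))
    return hits


def scan_files(paths):
    """Print every placeholder binding found in the given gin files."""
    for path in paths:
        for lineno, name, value in find_placeholders(Path(path).read_text()):
            print(f"{path}:{lineno}: {name} = {value}")


if __name__ == "__main__":
    # The two gin files passed to es_trainer.py in the command above.
    gin_files = [
        "compiler_opt/es/regalloc_trace/gin_configs/regalloc_trace.gin",
        "compiler_opt/es/regalloc_trace/gin_configs/blackbox_learner.gin",
    ]
    scan_files(p for p in gin_files if Path(p).exists())
```

Each binding reported this way would presumably need a `--gin_bindings=` override, in the same style as the `clang_path` and `corpus_path` overrides in 2.1. What file the binding expects still has to come from the maintainers.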

2.3 If I retrain the regalloc-evict-v1.1 model with the source code released in regalloc-evict-v1.1, the es directory seems to be incomplete (no gin files and different args).

3. If I collect a training corpus from another project, will the following commands work? Here the corpus is based on llvm.

```
# no-thinLTO
$ cmake -B out/Release -G Ninja \
-DCMAKE_CXX_COMPILER=$HOME/MLGO_test/build_clang_normal/bin/clang++ \
-DCMAKE_C_COMPILER=$HOME/MLGO_test/build_clang_normal/bin/clang \
-DBUILD_SHARED_LIBS=ON  \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DLLVM_ENABLE_PROJECTS="clang" \
-DLLVM_TARGETS_TO_BUILD=X86 \
-DCMAKE_CXX_FLAGS="-fembed-bitcode=all" \
./llvm
$ ninja -C ./out/Release -j$(nproc)
```
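
One way to sanity-check such a build before extracting a corpus is to confirm that the bitcode flag reached every compile command. A minimal sketch, assuming the standard compile_commands.json layout (each entry has `file` plus either `command` or `arguments`) and the `out/Release` build directory used above; the function name is made up:

```python
import json
import shlex
from pathlib import Path


def commands_missing_flag(entries, flag="-fembed-bitcode=all"):
    """Return the source files whose compile command lacks the given flag."""
    missing = []
    for entry in entries:
        # An entry carries either an argument list or a single command string.
        args = entry.get("arguments") or shlex.split(entry.get("command", ""))
        if flag not in args:
            missing.append(entry["file"])
    return missing


if __name__ == "__main__":
    db_path = Path("out/Release/compile_commands.json")
    if db_path.exists():
        entries = json.loads(db_path.read_text())
        missing = commands_missing_flag(entries)
        print(f"{len(entries) - len(missing)} of {len(entries)} commands carry the flag")
        for name in missing[:10]:
            print("missing:", name)
```

Only C++ translation units are expected to pass here, since the flag is set through CMAKE_CXX_FLAGS and not CMAKE_C_FLAGS. Whether the resulting objects are what the corpus extraction step expects is a separate question this check does not answer.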

Thanks for your help! Btw, a tutorial like docs/regalloc-demo/demo.md for ES training is needed.
