
[Questions] Problems during trying to train regalloc-evict model #475

@NanQin555


Hi, I'm glad to see that you have open-sourced the model and the code for training a regalloc-evict model with ES. I'm an intern at a company, researching "new" technologies for compiler optimization. While using MLGO and testing its performance, I've run into some difficulties.

  1. No performance improvement in my local tests. I tested with llvm-test-suite using the flag '-DTEST_SUITE_BENCHMARKING_ONLY=ON'. Whether I use the default benchmarks or specify SPEC2017, I see no performance improvement; on average, performance decreases by 0.2% ~ 0.7% over about 50 test runs. (The baseline is clang 17.0.6 with the flag '-O3'; MLGO is the same clang 17.0.6 with the flags '-mllvm -regalloc-enable-advisor=release -O3'.) The build commands are as follows; I'm not sure whether I'm missing something.
$ cmake -G Ninja \
-DCMAKE_C_COMPILER=$HOME/MLGO_test/build_clang_mlgo/bin/clang \
-DCMAKE_CXX_COMPILER=$HOME/MLGO_test/build_clang_mlgo/bin/clang++ \
-DTEST_SUITE_SPEC2017_ROOT=$HOME/MLGO_test/SPEC/speccpu \
-DTEST_SUITE_COLLECT_CODE_SIZE=Off \
-DTEST_SUITE_SUBDIRS=External/SPEC/CINT2017speed \
-DCMAKE_C_FLAGS="-mllvm -regalloc-enable-advisor=release -O3" \
-DCMAKE_CXX_FLAGS="-mllvm -regalloc-enable-advisor=release -O3" \
-DTEST_SUITE_BENCHMARKING_ONLY=ON \
-C../test-suite/cmake/caches/O3.cmake \
../test-suite
$ ninja -v -d keeprsp -j $(nproc)
$ llvm-lit -v -j 1 -o result_mlgo.json .

A test result with SPEC 2017:
MLGO: 51.63522 vs. normal (baseline): 51.412342, a performance decrease of 0.433511%.
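
The percentage above is the relative change of the MLGO total with respect to the baseline total. A minimal sketch of that arithmetic follows; the helper name is hypothetical, and it assumes both numbers are lower-is-better totals (for example, summed execution time), which the figures above do not state explicitly.

```python
def relative_change_percent(baseline: float, candidate: float) -> float:
    """Return how much larger `candidate` is than `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Totals quoted in the SPEC 2017 result above.
baseline_total = 51.412342  # clang 17.0.6 with -O3
mlgo_total = 51.63522       # same clang with -regalloc-enable-advisor=release

change = relative_change_percent(baseline_total, mlgo_total)
print(f"MLGO vs. baseline: {change:+.4f}%")
```

A positive value means the MLGO build scored higher than the baseline, which is a regression under the lower-is-better assumption.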

  2. After the test in question 1, I'm wondering whether retraining the model on the company's business code would give better performance. So I tried to retrain the model, but I encountered some difficulties.

2.1 If I retrain the regalloc-evict-v1.1 model at commit 'd9cdc38', I get the following error:

$ PYTHONPATH=$PYTHONPATH:. python3 compiler_opt/es/es_trainer.py \
  --gin_bindings=RegallocTraceWorker.clang_path="'$WORKING_DIR/llvm-build/bin/clang'" \
  --gin_bindings=RegallocTraceWorker.corpus_path="'$WORKING_DIR/corpus'" \
  --gin_files=compiler_opt/es/regalloc_trace/gin_configs/regalloc_trace.gin \
  --gin_files=compiler_opt/es/regalloc_trace/gin_configs/blackbox_learner.gin \
  --output_path=$WORKING_DIR/output_model \
  --pretrained_policy_path=$WORKING_DIR/regalloc_model \
  --train_corpora=$WORKING_DIR/corpus

I0320 10:57:48.507808 140335615665984 es_trainer_lib.py:99] Reading policy parameters from $HOME/MLGO_test/regalloc_model
Traceback (most recent call last):
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 42, in <module>
    app.run(main)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 35, in main
    final_weights = es_trainer_lib.train()
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer_lib.py", line 109, in train
    raise ValueError("Pretrained policy dimension is incorrect")
ValueError: Pretrained policy dimension is incorrect

(I replaced my home directory in the paths with '$HOME'.)

The dimension of pretrained_policy differs from that of the policy created by policy = policy_utils.create_actor_policy().
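
To make the mismatch concrete, here is a minimal sketch of the kind of check that could raise such an error. It is not the code in es_trainer_lib.py; it assumes the pretrained policy is stored as one flat parameter vector whose length must equal the total number of scalars in the freshly created policy. The function names and layer shapes are hypothetical.

```python
import numpy as np


def expected_parameter_count(shapes):
    """Total number of scalars across all trainable tensors."""
    return int(sum(np.prod(shape, dtype=np.int64) for shape in shapes))


def check_pretrained_policy(flat_params, shapes):
    """Raise ValueError if the flat vector does not match the policy size."""
    expected = expected_parameter_count(shapes)
    actual = int(np.asarray(flat_params).size)
    if actual != expected:
        raise ValueError(
            f"Pretrained policy has {actual} parameters, "
            f"but the created policy expects {expected}")
    return expected


# Hypothetical layer shapes; the real policy's shapes are not shown here.
layer_shapes = [(4, 3), (3,), (3, 1), (1,)]

# A flat vector of the right length passes the check.
matched_count = check_pretrained_policy(np.zeros(19), layer_shapes)

# A vector saved from a differently shaped network fails it.
try:
    check_pretrained_policy(np.zeros(25), layer_shapes)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

Under this assumption, printing both counts would show whether the released model and the policy built from the current gin configuration describe networks of different sizes.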

2.2 If I do not pass the '--pretrained_policy_path' parameter, the policy is randomly initialized, and I get another error:


2025-03-20 11:02:23.346219: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:2989] Estimated count of arithmetic ops: 0.437 M  ops, equivalently 0.218 M  MACs
Traceback (most recent call last):
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 42, in <module>
    app.run(main)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer.py", line 35, in main
    final_weights = es_trainer_lib.train()
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "$HOME/MLGO_test/mlgo-venv/lib/python3.10/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/es_trainer_lib.py", line 223, in train
    learner.set_baseline(pool)
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/blackbox_learner.py", line 234, in set_baseline
    self._evaluator.set_baseline(pool)
  File "$HOME/MLGO_test/ml-compiler-opt/compiler_opt/es/blackbox_evaluator.py", line 162, in set_baseline
    self._baseline = futures[0].result()
  File "$HOME/software/Python3.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "$HOME/software/Python3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '<basic block trace model path>'
    In call to configurable 'train' (<function train at 0x7f22f44b43a0>)

I don't know which path I should pass for 'basic block trace model path'.
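
The missing path in the error is literally '<basic block trace model path>', which suggests (this is an inference, not something the traceback confirms) that a placeholder value in one of the gin files was never overridden. A minimal sketch for listing such unfilled placeholders in gin text follows; the helper name and the binding names in the sample are hypothetical, and comment stripping is simplified.

```python
import re

# Matches a quoted value of the form '<some description>'.
PLACEHOLDER = re.compile(r"['\"]<[^<>]+>['\"]")


def find_unfilled_placeholders(gin_text):
    """Return (line_number, line) pairs whose value still looks like '<...>'."""
    hits = []
    for line_number, line in enumerate(gin_text.splitlines(), start=1):
        # Simplification: treat everything after '#' as a comment.
        code = line.split("#", 1)[0]
        if PLACEHOLDER.search(code):
            hits.append((line_number, code.strip()))
    return hits


# Hypothetical config text; the binding names are illustrative only.
sample = """\
# Example bindings
ExampleWorker.model_path = '<basic block trace model path>'
ExampleWorker.thread_count = 8
"""

placeholders = find_unfilled_placeholders(sample)
for line_number, text in placeholders:
    print(f"line {line_number}: {text}")
```

Any binding reported this way would need a real value, either edited in the gin file or supplied through an additional --gin_bindings flag like the ones in the command above.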

2.3 If I retrain the regalloc-evict-v1.1 model with the source code released with regalloc-evict-v1.1, the es directory seems to be incomplete (no gin files and different arguments).

  3. If I collect a training corpus from another project, will the following commands work? Here the corpus is based on LLVM.
# no-thinLTO
$ cmake -B out/Release -G Ninja \
-DCMAKE_CXX_COMPILER=$HOME/MLGO_test/build_clang_normal/bin/clang++ \
-DCMAKE_C_COMPILER=$HOME/MLGO_test/build_clang_normal/bin/clang \
-DBUILD_SHARED_LIBS=ON  \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DLLVM_ENABLE_PROJECTS="clang" \
-DLLVM_TARGETS_TO_BUILD=X86 \
-DCMAKE_CXX_FLAGS="-fembed-bitcode=all" \
./llvm
$ ninja -C ./out/Release -j$(nproc)
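
One thing that can be checked independently of MLGO: the configuration above sets only CMAKE_CXX_FLAGS, so C sources would not receive '-fembed-bitcode=all'. A minimal sketch that scans compile_commands.json entries (enabled above via CMAKE_EXPORT_COMPILE_COMMANDS) for files compiled without the flag follows; the helper name and the sample entries are hypothetical.

```python
import json
import shlex


def entries_missing_flag(entries, flag="-fembed-bitcode=all"):
    """Return the source files whose compile command lacks `flag`."""
    missing = []
    for entry in entries:
        # An entry carries either an "arguments" list or a "command" string.
        args = entry.get("arguments") or shlex.split(entry.get("command", ""))
        if flag not in args:
            missing.append(entry["file"])
    return missing


# Hypothetical entries in the compilation database format.
sample_entries = json.loads("""
[
  {"directory": "/tmp/build",
   "command": "clang++ -O3 -fembed-bitcode=all -c a.cpp -o a.o",
   "file": "a.cpp"},
  {"directory": "/tmp/build",
   "arguments": ["clang", "-O3", "-c", "b.c", "-o", "b.o"],
   "file": "b.c"}
]
""")

missing_files = entries_missing_flag(sample_entries)
print(missing_files)
```

On a real build, the entries would be loaded with json.load from out/Release/compile_commands.json instead of the inline sample.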

Thanks for your help! By the way, a tutorial for ES training, similar to docs/regalloc-demo/demo.md, would be very helpful.
