Debugging miscompilations with bugpoint (using cross compilation and a remote host)

This is a short writeup of what I did to debug https://github.com/CTSRD-CHERI/llvm-project/issues/385. Most of it should apply to any miscompilation bug, but the later steps are specific to this particular miscompilation.

Step 1: Generate a reproducer

First you will have to generate a reproducer (this does not need to be minimal since bugpoint will take care of that). The reproducer should be a simple program with a main function that prints some given output on success and different output on failure (miscompilation).

For the remainder of this page we will assume that the success output is "SUCCESSFUL" and the failure output is either "FAILED" or "received signal 34" (i.e. a CHERI trap). Exiting with a non-zero exit code (or crashing) is also acceptable for the failure case.

For example:

#include <stdio.h>

int miscompiled_function(int arg) { ... }
int main(void) {
   if (miscompiled_function(1) == 1) {
      printf("SUCCESSFUL");
   } else {
      __builtin_trap();
   }
   return 0;
}

Ideally, you then create a pre-processed reproducer from that (or use the Makefile listed below to achieve that result).
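
The success/failure convention above can be written down as a small helper. This is only a minimal sketch and not part of the original setup: the name classify_run is hypothetical, and the marker strings are the ones assumed on this page.

```python
# Hypothetical helper: classify one run of the reproducer.
# The marker strings follow the convention assumed on this page.
FAILURE_MARKERS = ("FAILED", "received signal 34")
SUCCESS_MARKER = "SUCCESSFUL"


def classify_run(stdout: str, returncode: int) -> str:
    """Return "good" if the run looks correct, otherwise "bad"."""
    if returncode != 0:
        return "bad"  # non-zero exit code or crash
    if any(marker in stdout for marker in FAILURE_MARKERS):
        return "bad"
    # Treat a missing success marker as a failure too.
    return "good" if SUCCESS_MARKER in stdout else "bad"
```

Treating a missing success marker as a failure is a deliberate choice here: a reproducer that prints nothing at all is more likely broken than correct.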

Step 2: Validate the reproducer works

I use the following Makefile to build the reproducer and verify that it works.

all: broken-reproducer.exe good-reproducer.exe

# Tunables
CHERI_SDK_ROOT?=${HOME}/cheri/output/sdk
CHERI_SYSROOT?=$(CHERI_SDK_ROOT)/sysroot128
CHERI_CHERIBSD?=${HOME}/cheri/cheribsd
SSH_HOSTNAME?=cheribsd
GOOD_OPTFLAGS=-O1
BROKEN_OPTFLAGS=-O2

# Compiler and compile flags
CC=$(CHERI_SDK_ROOT)/bin/clang
OPT_BIN=$(CHERI_SDK_ROOT)/bin/opt
CFLAGS= -target cheri-unknown-freebsd13.0 -integrated-as -fcolor-diagnostics -mcpu=beri -fuse-ld=lld -Qunused-arguments -target cheri-unknown-freebsd13.0 --sysroot=$(CHERI_SYSROOT) -B$(CHERI_SYSROOT)/usr/bin -ftls-model=local-exec -ftls-model=initial-exec -ftls-model=initial-exec -O -pipe -G0 -mcpu=beri -EB -mabi=purecap -integrated-as -fpic -cheri-cap-table-abi=pcrel -Wno-deprecated-declarations -cheri=128 -mstack-alignment=16 -D__LP64__=1 -Qunused-arguments -Werror=cheri-bitwise-operations -msoft-float   -DNO__SCCSID -DNO__RCSID -I$(CHERI_CHERIBSD)/lib/libc/include -I$(CHERI_CHERIBSD)/include -I$(CHERI_CHERIBSD)/lib/libc/mips -DNLS  -D__DBINTERFACE_PRIVATE -I$(CHERI_CHERIBSD)/contrib/gdtoa -DNO_COMPAT7 -I$(CHERI_CHERIBSD)/contrib/libc-vis -DINET6 -I$(CHERI_CHERIBSD)/lib/libc/resolv -D_ACL_PRIVATE -DPOSIX_MISTAKE -I$(CHERI_CHERIBSD)/lib/libmd -I$(CHERI_CHERIBSD)/contrib/jemalloc/include -DMALLOC_PRODUCTION -I$(CHERI_CHERIBSD)/contrib/tzcode/stdtime -I$(CHERI_CHERIBSD)/lib/libc/stdtime -I$(CHERI_CHERIBSD)/lib/libc/locale -DBROKEN_DES -DPORTMAP -DDES_BUILTIN -I$(CHERI_CHERIBSD)/lib/libc/rpc -I$(CHERI_CHERIBSD)/lib/libc/mips/softfloat  -I$(CHERI_CHERIBSD)/lib/libc/softfloat -DSOFTFLOAT_FOR_GCC -DYP -DSYMBOL_VERSIONING -g -MD  -MF.depend.getaddrinfo.o -MTgetaddrinfo.o -std=gnu99 -Wno-format-zero-length -nobuiltininc -Wsystem-headers -Werror -Wall -Wno-format-y2k -Wno-uninitialized -Wno-pointer-sign -Wno-error=pass-failed -Wno-error=misleading-indentation -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable -Wno-tautological-compare -Wno-unused-value -Wno-parentheses-equality -Wno-unused-function -Wno-enum-conversion -Wno-unused-local-typedef -Wno-address-of-packed-member -Wno-switch -Wno-switch-enum -Wno-knr-promoted-parameter  -Qunused-arguments -I$(CHERI_CHERIBSD)/lib/libutil -I$(CHERI_CHERIBSD)/lib/msun/mips -I$(CHERI_CHERIBSD)/lib/msun/src -I$(CHERI_CHERIBSD)/lib/libc/net
CFLAGS+=-Wno-unused-variable -Wunused-function
LDFLAGS=-lc++

.PHONY: echo-ir-compile-command
echo-ir-compile-command:
	@if [ -z "$(INPUT_FILE)" ] || [ -z "$(OUTPUT_FILE)" ]; then echo "Must set INPUT_FILE and OUTPUT_FILE"; false; fi
	@echo "$(CC) $(CFLAGS) $(GOOD_OPTFLAGS) $(LDFLAGS) -o $(OUTPUT_FILE) -x ir $(INPUT_FILE)"

reproducer.preprocessed.c: reproducer.c Makefile
	$(CC) $(CFLAGS) -E -o $@ reproducer.c
reproducer-O0.ll: reproducer.c Makefile
	$(CC) $(CFLAGS) -O0 -Xclang -disable-O0-optnone -emit-llvm -S -o $@ reproducer.c
	if grep "Function Attrs" "$@" | grep optnone; then \
		echo "Found optnone attribute in -O0 IR?"; \
		rm -f "$@"; \
		false; \
	fi
reproducer-O1.ll: reproducer.c Makefile
	$(CC) $(CFLAGS) -O1 -disable-llvm-optzns -emit-llvm -S -o $@ reproducer.c
reproducer-O2.ll: reproducer.c Makefile
	$(CC) $(CFLAGS) -O2 -disable-llvm-optzns -emit-llvm -S -o $@ reproducer.c
reproducer-O3.ll: reproducer.c Makefile
	$(CC) $(CFLAGS) -O3 -disable-llvm-optzns -emit-llvm -S -o $@ reproducer.c

broken-reproducer.exe: reproducer.c Makefile
	$(CC) $(CFLAGS) $(BROKEN_OPTFLAGS) $(LDFLAGS) -o $@ reproducer.c
good-reproducer.exe:  reproducer.c Makefile
	$(CC) $(CFLAGS) $(GOOD_OPTFLAGS) $(LDFLAGS) -o $@ reproducer.c

broken-reproducer-from-%-ir.exe: reproducer-%.ll Makefile
	$(CC) $(CFLAGS) $(BROKEN_OPTFLAGS) $(LDFLAGS) -o $@ -x ir $<
good-reproducer-from-%-ir.exe: reproducer-%.ll Makefile
	$(CC) $(CFLAGS) $(GOOD_OPTFLAGS) $(LDFLAGS) -o $@ -x ir $<

LLVM_OPTFLAGS?=-O3
good-reproducer-%-with-opt.exe: reproducer-%.ll Makefile
	$(OPT_BIN) -S $(LLVM_OPTFLAGS) -o $<.opt3.ll $<
	$(CC) $(CFLAGS) $(GOOD_OPTFLAGS) $(LDFLAGS) -o $@ -x ir $<.opt3.ll

# Force TTY allocation with -tt to report the right exit code
RUN_TARGETS=run-good-reproducer run-broken-reproducer \
	run-good-reproducer-from-O0-ir run-broken-reproducer-from-O0-ir \
	run-good-reproducer-from-O1-ir run-broken-reproducer-from-O1-ir \
	run-good-reproducer-from-O2-ir run-broken-reproducer-from-O2-ir \
	run-good-reproducer-from-O3-ir run-broken-reproducer-from-O3-ir \
	run-good-reproducer-O1-with-opt run-good-reproducer-O0-with-opt
.PHONY: $(RUN_TARGETS)
$(RUN_TARGETS): run-%: %.exe
	scp -q "$^" "$(SSH_HOSTNAME):/tmp/$<"
	ssh -tt $(SSH_HOSTNAME) -- "/tmp/$<"; echo "Exit code was $$?"

Step 2a: SSH setup

Example SSH config to allow fast upload (via the ControlMaster setting) to a CheriBSD instance. The values below are for an instance reachable on localhost (e.g. one started by cheribuild); adjust the host and port if you want to run somewhere else, e.g. on an FPGA.

Host cheribsd
	User root
	HostName localhost
	Port 12374
	StrictHostKeyChecking no
	ControlPath ~/.ssh/controlmasters/%r@%h:%p
	ControlMaster auto
	ControlPersist 5m

Step 2b: Run reproducer generated from C

Using the Makefile above you can build the reproducers from the C source and run them on the remote host with make run-good-reproducer and make run-broken-reproducer.

The run-good-reproducer target should produce output similar to the following (the exit code should be zero):

scp -q "good-reproducer.exe" "cheribsd:/tmp/good-reproducer.exe"
ssh -tt cheribsd -- "/tmp/good-reproducer.exe"; echo "Exit code was $?"
TEST SUCCESSFUL!
Shared connection to localhost closed.
Exit code was 0

The run-broken-reproducer target should produce output similar to the following (the exit code should be non-zero):

ssh -tt cheribsd -- "/tmp/broken-reproducer.exe"; echo "Exit code was $?"
TEST FAILED!
Shared connection to localhost closed.
Exit code was 1

Step 2c: Find the correct LLVM IR input

Using the Makefile above you can generate LLVM IR and check reproducers by running make run-{good,broken}-reproducer-from-{O0,O1,O2,O3}-ir.

Ideally you should get the success output when running make run-good-reproducer-from-O0-ir and the failure output when running make run-broken-reproducer-from-O0-ir. However, it is possible that the bug is not exposed by the -O0 IR since e.g. clang lifetime markers will not be emitted. In that case, check whether make run-{good,broken}-reproducer-from-{O1,O2,O3}-ir behaves as expected. Once you have found the right IR target, the LLVM IR input will be in the same directory, named reproducer-<optlevel>.ll (e.g. for O0, it will be reproducer-O0.ll).
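
Choosing the IR level can be automated once the exit codes of the run targets are known. The following is a rough sketch under a few assumptions: the helper names are hypothetical, and it relies on the "Exit code was N" line that the run-* targets of the Makefile above print.

```python
import re

LEVELS = ("O0", "O1", "O2", "O3")


def parse_exit_code(make_output: str):
    """Extract the last "Exit code was N" value printed by a run-* target."""
    matches = re.findall(r"Exit code was (\d+)", make_output)
    return int(matches[-1]) if matches else None


def pick_ir_level(results):
    """Pick the lowest level whose good run passes and broken run fails.

    `results` maps a level such as "O1" to a tuple
    (good_exit_code, broken_exit_code).
    """
    for level in LEVELS:
        if level not in results:
            continue
        good, broken = results[level]
        if good == 0 and broken is not None and broken != 0:
            return level
    return None
```

The text passed to parse_exit_code could, for example, be the captured standard output of make run-good-reproducer-from-O0-ir. The make invocation itself is expected to succeed even when the test binary fails, because the recipe ends with echo, which is why the printed line is parsed instead of using the exit status of make.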

Step 3: Run bugpoint

Now that you have the LLVM IR input (we will assume it's called reproducer-O1.ll here), you can start using the LLVM bugpoint tool (see the bugpoint documentation on the LLVM website for more details).

The command line options of bugpoint are quite difficult to use (it took me many hours to initially figure out how to do this miscompilation debugging), so I've written some scripts to make this easier.

We make use of the --run-custom flag to pass a custom script for running the compiled binary remotely.

To do this we need two helper scripts.

The first script is run_bugpoint.sh (make sure to set LLVM_BINDIR and adjust the -opt-args flags). Since we are compiling for CHERI-MIPS in this example, we need to pass -mtriple=mips64c128-unknown-freebsd13-purecap -mcpu=cheri128 -mattr=+cheri128 to opt in order to compile for the right target architecture.

#!/bin/sh
set -xe
# bugpoint searches for tools in $PATH
if ! test -e "$LLVM_BINDIR/opt"; then
   echo "FATAL: cannot find opt command. Please set \$LLVM_BINDIR"
   exit 1
fi
if ! test -e "$LLVM_BINDIR/bugpoint"; then
   echo "FATAL: cannot find bugpoint command. Please set \$LLVM_BINDIR"
   exit 1
fi
input_file=$1
export PATH=$LLVM_BINDIR:$PATH
# Select the set of passes to debug for miscompilation (default to -O3). This can be any valid argument to the opt tool
# passes=-O2
# passes="-simplifycfg -memcpyopt"
passes=-O3
bugpoint -verbose-errors $passes "$input_file" -compile-command /usr/bin/false --safe-run-custom --run-custom -exec-command $(pwd)/run_remote.py --opt-args -mcpu=cheri128 -mattr=+cheri128

TODO: integrate this with the makefile

We tell bugpoint not to attempt to compile the .ll file (we do that as part of the -run-custom script).

TODO: I can't remember why I had to do that

The run-custom script takes care of compiling, uploading and running the test binary based on the (partially) optimized input that bugpoint passes to it. It is important to treat cases where the input fails to compile (e.g. due to linker errors when building the executable) as a "success", i.e. to return zero. Otherwise bugpoint will assume that this invalid input is an interesting test case and reduce it to an empty IR file.

The script can be a lot faster if you avoid the SCP overhead by using an NFS/SMB network share. This matters for CHERI since the SSH authentication overhead is significant on a slow QEMU VM. Using a network file system avoids one authentication (the scp step), so that only the SSH connection that executes the binary remains.
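
The difference between the two transfer methods can be sketched as follows. This is an illustrative helper only (build_run_commands and its parameters are hypothetical names); it merely builds the command lines and does not execute anything. The tmp/ subdirectory and the ssh -tt invocation mirror what is used elsewhere on this page.

```python
from pathlib import PurePosixPath


def build_run_commands(exe, ssh_host, host_share=None, guest_share=None):
    """Return the argv lists needed to get `exe` onto the guest and run it.

    With a shared directory (host_share and guest_share both set) the binary
    is copied locally and only one SSH connection is needed. Otherwise it is
    uploaded with scp first, which costs a second authentication.
    """
    exe = PurePosixPath(exe)
    if host_share is not None and guest_share is not None:
        guest_path = PurePosixPath(guest_share) / "tmp" / exe.name
        host_path = PurePosixPath(host_share) / "tmp" / exe.name
        copy = ["cp", "-f", str(exe), str(host_path)]
    else:
        guest_path = PurePosixPath("/tmp") / exe.name
        copy = ["scp", "-q", str(exe), f"{ssh_host}:{guest_path}"]
    run = ["ssh", "-tt", ssh_host, "--", str(guest_path)]
    return [copy, run]
```

Each returned list could be passed to subprocess.run; keeping command construction separate from execution makes it easy to check the paths before starting a long bugpoint run.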

Step 3a: Validate that the run_remote script works

If the run_remote.py script is not working as expected, bugpoint will not produce useful output. Try executing python3 ./run_remote.py reproducer-O1.ll; echo $?. If this prints a non-zero exit status, it is likely that some of the flags are wrong.

One potential error that happened to me was using a network file system but forgetting to mount it after rebooting the guest VM. The command output will show something like /nfsroot//tmp/bugpoint-test.exe: Command not found.

Another useful sanity check (if the -O3 LLVM IR is broken) is to run python3 ./run_remote.py reproducer-O1.ll; echo $?.

Step 3b: WAAAAIIIT (possibly a very long time)

Run LLVM_BINDIR=/path/to/llvm sh ./run_bugpoint.sh reproducer-O1.ll and wait for bugpoint to complete. It should print the pass that causes the problem and try to produce a minimal IR file by optimizing individual functions and basic blocks with the broken pass until a minimal output has been produced.

The output should look like the following:

Generating reference output from raw program: 
Reference output is: bugpoint.reference.out-48bc76b

*** Checking the code generator...

*** Output matches: Debugging miscompilation!
Checking to see if '' compiles correctly:  yup.
Checking to see if '-ee-instrument -tbaa -scoped-noalias -simplifycfg -sroa -early-cse -lower-expect -forceattrs -tbaa -scoped-noalias -inferattrs -callsite-splitting -ipsccp -called-value-propagation -attributor -globalopt -mem2reg -deadargelim -instcombine -simplifycfg -globals-aa -prune-eh -inline -functionattrs -argpromotion -sroa -early-cse-memssa -speculative-execution -jump-threading -correlated-propagation -simplifycfg -aggressive-instcombine -instcombine -libcalls-shrinkwrap -pgo-memop-opt -tailcallelim -simplifycfg -reassociate -loop-rotate -licm -loop-unswitch -simplifycfg -instcombine -indvars -loop-idiom -loop-deletion -loop-unroll -mldst-motion -gvn -memcpyopt -sccp -bdce -instcombine -jump-threading -correlated-propagation -dse -licm -adce -simplifycfg -instcombine -barrier -elim-avail-extern -rpo-functionattrs -globalopt -globaldce -globals-aa -float2int -lower-constant-intrinsics -loop-rotate -loop-distribute -loop-vectorize -loop-load-elim -instcombine -simplifycfg -instcombine -loop-unroll -instcombine -licm -transform-warning -alignment-from-assumptions -strip-dead-prototypes -globaldce -constmerge -loop-sink -instsimplify -div-rem-pairs -simplifycfg' compiles correctly:  nope.
Checking to see if '-indvars -loop-idiom -loop-deletion -loop-unroll -mldst-motion -gvn -memcpyopt -sccp -bdce -instcombine -jump-threading -correlated-propagation -dse -licm -adce -simplifycfg -instcombine -barrier -elim-avail-extern -rpo-functionattrs -globalopt -globaldce -globals-aa -float2int -lower-constant-intrinsics -loop-rotate -loop-distribute -loop-vectorize -loop-load-elim -instcombine -simplifycfg -instcombine -loop-unroll -instcombine -licm -transform-warning -alignment-from-assumptions -strip-dead-prototypes -globaldce -constmerge -loop-sink -instsimplify -div-rem-pairs -simplifycfg' compiles correctly:  yup.
Checking to see if '-ee-instrument -tbaa -scoped-noalias -simplifycfg -sroa -early-cse -lower-expect -forceattrs -tbaa -scoped-noalias -inferattrs -callsite-splitting -ipsccp -called-value-propagation -attributor -globalopt -mem2reg -deadargelim -instcombine -simplifycfg -globals-aa -prune-eh -inline -functionattrs -argpromotion -sroa -early-cse-memssa -speculative-execution -jump-threading -correlated-propagation -simplifycfg -aggressive-instcombine -instcombine -libcalls-shrinkwrap -pgo-memop-opt -tailcallelim -simplifycfg -reassociate -loop-rotate -licm -loop-unswitch -simplifycfg -instcombine' compiles correctly:  yup.
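
Since these lines get long, it can help to extract the pass lists from a saved log programmatically. The following is a small sketch with hypothetical helper names; it assumes the "Checking to see if '...' compiles correctly:  yup./nope." lines look exactly like the ones shown above.

```python
import re

CHECK_RE = re.compile(
    r"Checking to see if '(.*?)' compiles correctly:\s+(yup|nope)\."
)


def parse_bugpoint_checks(log: str):
    """Return a list of (passes, compiled_correctly) tuples."""
    checks = []
    for passes, verdict in CHECK_RE.findall(log):
        checks.append((passes.split(), verdict == "yup"))
    return checks


def smallest_failing(checks):
    """Return the shortest non-empty pass list that was reported as "nope"."""
    failing = [passes for passes, ok in checks if not ok and passes]
    return min(failing, key=len) if failing else None
```

Calling smallest_failing(parse_bugpoint_checks(log_text)) on a partial log gives a quick view of how far bugpoint has narrowed down the pass list so far.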

If bugpoint starts debugging a code generator crash instead, see the next section for some common problems.

Step 3c: bugpoint not working? Some possible solutions

One problem that often happens is that bugpoint cannot detect that you are debugging a miscompilation, prints "bugpoint can't help you with your problem!" and starts trying to debug a code generator crash.

*** Checking the code generator...

*** Output matches: Debugging miscompilation!

*** Optimized program matches reference output!  No problem detected...
bugpoint can't help you with your problem!

In that case it makes sense to check whether the reference output is as expected. For example, in one case I was getting a -Werror failure for the reference program, so the broken program produced the same output.

error: overriding the module target triple with mips64c128-unknown-freebsd13.0-purecap [-Werror,-Woverride-module]
1 error generated.
Compiling test binary...
FAILED TO COMPILE
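
A quick way to catch this situation is to scan the reference output before trusting it. The sketch below uses hypothetical helper names and assumes that the reference output file (named like bugpoint.reference.out-48bc76b in the output shown earlier) contains whatever the run-custom script printed, and that the success marker is "SUCCESSFUL" as assumed on this page.

```python
from pathlib import Path


def reference_problems(text: str):
    """Return a list of reasons why a reference output looks unusable."""
    problems = []
    if "FAILED TO COMPILE" in text:
        problems.append("reference program failed to compile")
    if "GOT EXCEPTION!!" in text:
        problems.append("run script raised an exception")
    if "SUCCESSFUL" not in text:
        problems.append("success marker missing")
    return problems


def check_reference_files(directory="."):
    """Map each bugpoint.reference.out-* file in `directory` to its problems."""
    return {
        path.name: reference_problems(path.read_text(errors="replace"))
        for path in sorted(Path(directory).glob("bugpoint.reference.out-*"))
    }
```

An empty list for a file means that none of these known problems were found; it does not prove that the reference output is correct.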

run_remote.py:

#!/usr/bin/env python3
import subprocess
import sys
import shlex
import datetime
import os
from pathlib import Path

# Adjust these variables as required
NFS_DIR_ON_HOST = Path.home() / "cheri/output/rootfs128/"
NFS_DIR_IN_GUEST = "/nfsroot"
SSH_HOST = "cheribsd"
LOG = Path(__file__).parent / ("log-" + str(datetime.datetime.now()) + "-pid" + str(os.getpid()) + ".txt")


def main():
    ## Main script body
    input_ll = sys.argv[1]
    exe_name = "bugpoint-test.exe"
    output_exe = Path(exe_name).absolute()
    exepath_in_shared_dir_rel = "tmp/" + exe_name
    host_nfs_exepath = NFS_DIR_ON_HOST / exepath_in_shared_dir_rel
    guest_nfs_exepath = Path(NFS_DIR_IN_GUEST, exepath_in_shared_dir_rel)


    print("Compiling test binary...")
    subprocess.check_call(["rm", "-f", str(output_exe)])
    compile_command = subprocess.check_output(["make", "-C", Path(__file__).parent,
                                               "INPUT_FILE=" + str(input_ll),
                                               "OUTPUT_FILE=" + str(output_exe),
                                               "echo-ir-compile-command"]).decode().strip()
    compile_command += " -Wno-error"    # Avoid random compilation failures
    # print("Running", compile_command)
    if subprocess.call(compile_command, shell=True) != 0:
        print("FAILED TO COMPILE")
        sys.exit(0)  # Failed to compile -> not an interesting test case

    print("Running test binary...")
    # TODO: handle the non-NFS case where we have to upload first
    subprocess.check_call(["cp", "-f", str(output_exe), str(host_nfs_exepath)])
    result = subprocess.run(["ssh", "-tt", SSH_HOST, "--", str(guest_nfs_exepath)], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # TODO: could check stuff here
    print(result.stdout.decode().strip())  # Print the expected output
    print("EXIT CODE =", result.returncode)

    LOG.write_bytes(result.stdout)
    sys.exit(result.returncode)


if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        print("GOT EXCEPTION!!", e)
        LOG.write_text(str(e))
        sys.exit(0)  # not interesting test case, but should be debugged

Step 4: Look at the pass that causes the miscompilation

In the example miscompilation (https://github.com/CTSRD-CHERI/llvm-project/issues/385), bugpoint reported that the pass causing the miscompilation is GVN.

Step 4a: Pass commandline flags to disable parts of the affected pass

Many LLVM passes have internal options (grep for cl::opt<) that can be used to change the behaviour of the optimization pass. Looking at GVN.cpp, I saw there are various boolean flags that are true by default and some integer recursion depth flags. As a first step I therefore extended the Makefile to pass -mllvm flags that disable those parts of GVN:

CFLAGS+=-mllvm -enable-pre=false
CFLAGS+=-mllvm -enable-load-pre=false
CFLAGS+=-mllvm -enable-gvn-memdep=false
CFLAGS+=-mllvm -gvn-max-recurse-depth=0
CFLAGS+=-mllvm -gvn-max-num-deps=0

None of these had an effect on the resulting miscompilation, which meant I had to modify GVN for finer-grained control.
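
When trying many such options it is convenient to generate the -mllvm arguments instead of editing the Makefile by hand. This is a minimal sketch with hypothetical helper names; the option names are exactly the GVN options listed above, and the generated arguments would still need to be appended to the compiler command line.

```python
def mllvm_args(options):
    """Turn {"enable-pre": "false"} into ["-mllvm", "-enable-pre=false"]."""
    args = []
    for name, value in options.items():
        args += ["-mllvm", f"-{name}={value}"]
    return args


def one_at_a_time(options):
    """Yield (name, args) pairs that each apply a single option."""
    for name, value in options.items():
        yield name, mllvm_args({name: value})


# The GVN options tried in this step.
GVN_OPTIONS = {
    "enable-pre": "false",
    "enable-load-pre": "false",
    "enable-gvn-memdep": "false",
    "gvn-max-recurse-depth": "0",
    "gvn-max-num-deps": "0",
}
```

Testing the options one at a time shows which single option (if any) hides the miscompilation, whereas mllvm_args(GVN_OPTIONS) disables everything at once.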

Step 4b: Add flags to disable parts of the pass

To see what transforms are being performed by GVN I added -mllvm -stats. This flag dumps the value of all LLVM statistics counters on exit (I believe it's only available in assertions-enabled builds of LLVM). To see which statistics are available you can search for STATISTIC( in LLVM. These statistics can either be dumped to stderr, or to a file in JSON format (-Xclang -stats-file=/path/to/file).

GVN defines various statistics and the following was printed by the broken reproducer:

    2 gvn                          - Number of blocks merged
    6 gvn                          - Number of equalities propagated
   74 gvn                          - Number of instructions deleted
    4 gvn                          - Number of loads deleted
   58 gvn                          - Number of instructions simplified
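
To compare the counters of a good and a broken build, the textual output can be parsed into a dictionary. This is a rough sketch (parse_stats is a hypothetical name) that assumes the stderr line format shown above: a count, the statistic group, a dash and a description.

```python
import re

STAT_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+-\s+(.*\S)\s*$")


def parse_stats(text: str):
    """Return {(group, description): count} for lines such as
    "    2 gvn   - Number of blocks merged"."""
    stats = {}
    for line in text.splitlines():
        match = STAT_RE.match(line)
        if match:
            count, group, description = match.groups()
            stats[(group, description)] = int(count)
    return stats
```

Lines that do not match the pattern (such as header lines) are skipped, so the whole stderr output can be passed in unchanged.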

I then checked the GVN pass to see where those statistics are being incremented and added some new flags to GVN to skip those parts of the pass:

-mllvm -gvn-propagate-equality=false to skip GVN::propagateEquality(...)
-mllvm -gvn-process-loads=false to skip GVN::processLoad(LoadInst *L)

Step 4c: Finding the culprit

After adding these flags, it turned out that -O2 (with the GVN flags listed above) together with -mllvm -gvn-process-loads=false succeeded, but with -mllvm -gvn-process-loads=true it produced the wrong output. Therefore, we now know that the miscompilation is in GVN::processLoad(LoadInst *L) (or at least that a transformation is happening there that exposes a bug in a later pass).
