-
Notifications
You must be signed in to change notification settings - Fork 18
Description
Hi,
I have sunk a few days of my time into this and was unable to come up with a solution. I'm documenting it here in case anybody else wants to pick this up where I left of.
The problem is, that on the Debian buildds, on the Debian gitlab CI as well as on tests.reproducible-builds.org, the bmaptool test suite runs into a dead-lock around ~50% of the time it is run. I am unable to reproduce this problem on my own laptop, so debugging this was not trivial. I hacked in a periodic call to pstree -l -a to have a look at what the process tree looks like when the deadlock happens:
- bzip2 stalls:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
| `-pstree -l -a 1641
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c bzip2 -c -k /tmp/reprotest.GU9bw5/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_vq64huef.img
- xz stalls:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 13086 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 13086 >&2; done
| `-pstree -l -a 13086
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c xz -c -k /tmp/reprotest.jfoiN4/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_kqm3eugf.img
- tar stalls:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 13126 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 13126 >&2; done
| `-pstree -l -a 13126
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-sh -c tar -c -j -O -P -C / /tmp/reprotest.gP4xrW/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_gn6_77ax.img
- gzip stalls:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1640 >&2; done' & pid=$!; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- -vvv; \\\012kill $pid; \\\012wait $pid || :
|-sh -c while sleep 10; do pstree -l -a 1640 >&2; done
| `-pstree -l -a 1640
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- -vvv
`-dh_auto_test /usr/bin/dh_auto_test -- -vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 -vvv
`-sh -c cd /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest tests
`-python3.13 -m pytest tests
`-sh -c gzip -c /tmp/reprotest.M8T3OF/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_minus_1_4up2jzoj.img
All of these tools are invoked by _generate_compressed_files() in tests/test_api_base.py so my guess was that the way that the tools are called using subprocess.Popen(), followed by Popen.wait() somehow causes the kind of deadlock that is described in the documentation of Popen.wait(). So I replaced those subprocess calls with calls to subprocess.check_output() and slurped the whole standard output into a Python string (it's only ~2 MB of data). But this did not change the behaviour at all.
Searching online for deadlocks involving the subprocess module revealed that in Python 2.7 there was a problem of deadlocks when the subprocess module was called from Python threads. The fix was supposedly to add close_fds=True to the Popen call. I tried that too but it resulted no change in behaviour.
Then I stumbled across a slightly different deadlock:
rules -f debian/rules override_dh_auto_test
`-sh -c sh -c 'while sleep 10; do pstree -l -a 1641 >&2; done' & pid=$!; \\\012ret=0; \\\012timeout 15m faketime '2022-06-26 17:10:50' dh_auto_test -- --test-args=-vvv || ret=$?; \\\012kill $pid; \\\012wait $pid || : ; \\\012exit $ret
|-sh -c while sleep 10; do pstree -l -a 1641 >&2; done
| `-pstree -l -a 1641
`-timeout 15m faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-faketime 2022-06-26 17:10:50 dh_auto_test -- --test-args=-vvv
`-dh_auto_test /usr/bin/dh_auto_test -- --test-args=-vvv
`-pybuild /usr/bin/pybuild --test --test-pytest -i python{version} -p 3.13 --test-args=-vvv
`-sh -c cd /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build; python3.13 -m pytest -vvv
`-python3.13 -m pytest -vvv
`-df -PT -- /tmp/reprotest.dL6CBt/const_build_path/const_build_path/.pybuild/cpython3_3.13/build/4Khole_plus_1_3wg1y132.img
The invocation of "df -PT" is from a completely different part of the code. It comes from get_file_system_type() in bmaptool/BmapHelpers.py which just runs:
proc = subprocess.Popen(["df", "-PT", "--", abspath], stdout=PIPE, stderr=PIPE)
stdout, stderr = proc.communicate()But this should be completely harmless, no???
So maybe the problem is not with tests/test_api_base.py but the problem is a more fundamental one. I looked into other parts of the codebase and _open_compressed_file() in bmaptool/TransRead.py does an aweful lot of threading in combination with opening decompressors via the subprocess module.
I have no solution to the problem yet but I do have a workaround:
--- a/tests/helpers.py
+++ b/tests/helpers.py
@@ -284,29 +284,19 @@ def calculate_chksum(file_path):
return hash_obj.hexdigest()
█
█
+import subprocess
+
+
def copy_and_verify_image(image, dest, bmap, image_chksum, image_size):
"""
Copy image 'image' using bmap file 'bmap' to the destination file 'dest'
and verify the resulting image checksum.
"""
█
- f_image = TransRead.TransRead(image)
- f_dest = open(dest, "w+b")
- if bmap:
- f_bmap = open(bmap, "r")
- else:
- f_bmap = None
-
- writer = BmapCopy.BmapCopy(f_image, f_dest, f_bmap, image_size)
- # Randomly decide whether we want the progress bar or not
- if bool(random.getrandbits(1)) and sys.stdout.isatty():
- writer.set_progress_indicator(sys.stdout, None)
- writer.copy(bool(random.getrandbits(1)), bool(random.getrandbits(1)))
+ subprocess.check_call(
+ ["python3", "-m", "bmaptool", "copy"]
+ + (["--bmap", bmap] if bmap else ["--nobmap"])
+ + [image, dest]
+ )
█
- # Compare the original file and the copy are identical
assert calculate_chksum(dest) == image_chksum
-
- if f_bmap:
- f_bmap.close()
- f_dest.close()
- f_image.close()Instead of calling BmapCopy.BmapCopy().copy(), spawn a new process running the
full bmaptool utility with the given options. This is slower than the existing
code because bmaptool gets spawned a lot but it will effectively test the same
thing. Except that with this change we do not anymore test using BmapCopy as a
library. But I do not believe that this test is useful for Debian because we do
not ship this as a library but we ship it as a program that is to be called
from the terminal. And when run like that, the deadlock problem never occurs.
I also documented this in the associated Debian bug: https://bugs.debian.org/1081336
Maybe somebody else feels motivated to look into this. It would probably help being able to reproduce this issue locally instead of a remote machine without ssh access. :)