Description
We see unexpected failures of a single random test on the lldb-remote-linux-ubuntu and lldb-remote-linux-win buildbots 1-4 times per day.
Error 1: A test is reported as UNRESOLVED with the exception "failed to create a socket to the launched debug monitor after 20 tries".
We usually get this error on the Linux host (lldb-remote-linux-ubuntu), e.g. in TestGdbRemoteMemoryAllocation.py, TestNonStop.py, TestGdbRemoteSingleStep.py. We have also seen the same error (very rarely) on the Windows host (lldb-remote-linux-win): TestGdbRemoteHostInfo.py.
Error 2: A test times out after 600 seconds.
We usually (99% of the time) get this error on the Windows host (lldb-remote-linux-win) in TestModuleLoadedNotifys.py, and less often in other tests, e.g. TestLldbGdbServer.py. We have also seen the same error (very rarely) on the Linux host (lldb-remote-linux-ubuntu): TestCancelAttach.py.
I believe that the cause of both issues is the same - leaking sockets.
Error 1 is raised in connect_to_debug_monitor() in gdbremote_testcase.py.
It picks a random port, 12000 + random.randint(0, 3999), and launches a new instance of lldb-server gdbserver *:port on the target.
It then tries to connect to the lldb-server up to 10 times with a 0.5-second delay between tries, and terminates the lldb-server if the connection fails.
After that it tries another port, up to 20 times, with a random delay of 1-5 seconds between attempts to avoid collisions.
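The retry scheme described above can be sketched roughly as follows. This is not the actual code from gdbremote_testcase.py: the function name connect_with_retries and the launch_server, try_connect and stop_server callables are hypothetical placeholders, and the integer 1-5 second delay is an assumption. Only the port range, the attempt counts and the 0.5-second delay come from the description.

```python
import random
import time


def connect_with_retries(launch_server, try_connect, stop_server,
                         max_attempts=20, connect_tries=10,
                         connect_delay=0.5, sleep=time.sleep):
    """Hypothetical sketch of the port/retry scheme described in this issue.

    launch_server(port) -> handle, try_connect(port) -> socket or None and
    stop_server(handle) are placeholders supplied by the caller.
    """
    for _ in range(max_attempts):
        # Random port in 12000..15999, as in the description.
        port = 12000 + random.randint(0, 3999)
        server = launch_server(port)

        # Inner loop: up to 10 connection tries, 0.5 s apart.
        for _ in range(connect_tries):
            sock = try_connect(port)
            if sock is not None:
                return sock, port
            sleep(connect_delay)

        # Connection never succeeded: stop this server and back off
        # for a random 1-5 s (assumed to be whole seconds here).
        stop_server(server)
        sleep(random.randint(1, 5))

    raise RuntimeError(
        "failed to create a socket to the launched debug monitor "
        "after %d tries" % max_attempts)
```

In this sketch a single failed outer attempt costs at least 10 * 0.5 = 5 seconds of connection tries plus the 1-5 second back-off, so exhausting all 20 attempts takes on the order of two to three minutes before the exception is raised.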
We checked netstat at the beginning of the test run and saw 164 connections in the TIME_WAIT state between the host and the target:
24 connections to target IP:1234 (platform)
100 connections to target IP:43107 (gdbserver)
40 connections to target IP with a random port
and 2 connections in the state ESTABLISHED.
We checked netstat again 15 minutes into the test run and saw 641 connections in the TIME_WAIT state between the host and the target:
310 connections to target IP:1234 (platform)
331 connections to target IP:43107 (gdbserver)
and 9 connections in the state ESTABLISHED.
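Counts like the ones above can be collected with a small script instead of by hand. The following is a minimal sketch that assumes the six-column row layout printed by Linux `netstat -tn` (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); Windows `netstat -n` uses a different layout and would need different column indices. The function name count_connections and the IP addresses in the usage example are hypothetical.

```python
from collections import Counter


def count_connections(netstat_output, target_ip):
    """Count TCP connections to target_ip, grouped by (state, remote port).

    Assumes Linux `netstat -tn` rows:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP connection row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        # Split on the last ':' so the port is separated from the address.
        ip, _, port = foreign.rpartition(":")
        if ip == target_ip:
            counts[(state, int(port))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; real input would come from `netstat -tn`.
    sample = (
        "Proto Recv-Q Send-Q Local Address   Foreign Address  State\n"
        "tcp   0      0      10.0.0.1:50000  10.0.0.2:1234    TIME_WAIT\n"
        "tcp   0      0      10.0.0.1:50001  10.0.0.2:43107   TIME_WAIT\n"
    )
    for (state, port), n in sorted(count_connections(sample, "10.0.0.2").items()):
        print(state, port, n)
```

Running this periodically during a test run would show whether the TIME_WAIT count per port keeps growing, as in the two snapshots above.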
A connection in the TIME_WAIT state is usually held for a timeout (up to 4 minutes, depending on the OS) before its resources are released.
Both buildbots run tests in 8 threads.
Both buildbots use Python 3.12. Note that the results with Python 3.13 are worse, probably due to the incremental GC. The average build/test time with Python 3.13 is also longer.
Increasing MAX_ATTEMPTS (currently 20) in connect_to_debug_monitor() may be enough to fix error 1.
But I have no idea how to fix or even debug error 2. It is very hard to reproduce.