[lldb][tests] Sockets leaks in API tests with a remote target #118032

@slydiman

We see unexpected failures in a single, seemingly random test on the lldb-remote-linux-ubuntu and lldb-remote-linux-win buildbots, 1-4 times per day.

Error 1: a test is reported as unresolved with the exception "failed to create a socket to the launched debug monitor after 20 tries".
We usually see this error on the Linux host (lldb-remote-linux-ubuntu), e.g. in TestGdbRemoteMemoryAllocation.py, TestNonStop.py, and TestGdbRemoteSingleStep.py. We have also seen it, very rarely, on the Windows host (lldb-remote-linux-win): TestGdbRemoteHostInfo.py.

Error 2: a 600-second timeout.
Usually (99% of the time) we see this error on the Windows host (lldb-remote-linux-win) in TestModuleLoadedNotifys.py, and less often in other tests, e.g. TestLldbGdbServer.py. We have also seen it, very rarely, on the Linux host (lldb-remote-linux-ubuntu): TestCancelAttach.py.

I believe both errors have the same cause: leaking sockets.

Error 1 is raised in connect_to_debug_monitor() in gdbremote_testcase.py.
It picks a random port, 12000 + random.randint(0, 3999), and launches a new instance of lldb-server gdbserver *:port on the target.
It then tries to connect to that lldb-server up to 10 times with a 0.5-second delay between tries, and terminates the lldb-server if the connection fails.
It then repeats the whole process with another port, up to 20 times, with a random delay of 1-5 seconds between attempts to avoid collisions.
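The retry scheme described above can be sketched roughly as follows. This is not the actual code from gdbremote_testcase.py: the function name connect_with_retries, the constants MAX_CONNECT_ATTEMPTS and CONNECT_DELAY, and the injected launch_server, try_connect and sleep callables are hypothetical, and the exact form of the 1-5 second delay is an assumption.

```python
import random

MAX_ATTEMPTS = 20           # outer attempts, each on a fresh random port
MAX_CONNECT_ATTEMPTS = 10   # hypothetical name: connection tries per launched server
CONNECT_DELAY = 0.5         # hypothetical name: seconds between connection tries


def pick_port(rng=random):
    # Port range taken from the description: 12000..15999 inclusive.
    return 12000 + rng.randint(0, 3999)


def connect_with_retries(launch_server, try_connect, sleep, rng=random):
    """Illustrative retry loop; all three callables are supplied by the caller.

    launch_server(port) -> object with terminate()
    try_connect(port)   -> connected socket, or None on failure
    sleep(seconds)      -> e.g. time.sleep
    """
    for _ in range(MAX_ATTEMPTS):
        port = pick_port(rng)
        server = launch_server(port)
        for _ in range(MAX_CONNECT_ATTEMPTS):
            sock = try_connect(port)
            if sock is not None:
                return server, sock, port
            sleep(CONNECT_DELAY)
        # No connection on this port: stop the server, back off, try another port.
        server.terminate()
        sleep(rng.uniform(1.0, 5.0))  # assumption: a uniform 1-5 s delay
    raise RuntimeError(
        "failed to create a socket to the launched debug monitor "
        "after %d tries" % MAX_ATTEMPTS
    )
```

In the worst case this loop spends 20 * 10 * 0.5 = 100 seconds in connection delays plus 20 to 100 seconds in back-off delays before raising, and every failed outer attempt leaves one more launched and terminated server behind on the target.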

Near the start of a test run, netstat showed 164 connections in the TIME_WAIT state between the host and the target:
24 connections to target IP:1234 (platform)
100 connections to target IP:43107 (gdbserver)
40 connections to the target IP on random ports
plus 2 connections in the ESTABLISHED state.

After 15 minutes of testing, netstat showed 641 connections in the TIME_WAIT state between the host and the target:
310 connections to target IP:1234 (platform)
331 connections to target IP:43107 (gdbserver)
plus 9 connections in the ESTABLISHED state.
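For anyone who wants to repeat this measurement, a per-port breakdown like the ones above can be computed from saved netstat output with a small script. The sketch below is only an illustration: the function name count_by_remote_port is hypothetical, and it assumes numeric IPv4 output where the connection state is the last column and the remote address is the column before it (as in plain `netstat -n` on Windows or `netstat -tn` on Linux, without extra columns such as the process name).

```python
from collections import Counter


def count_by_remote_port(netstat_text, target_ip, state="TIME_WAIT"):
    """Count connections to target_ip in the given state, keyed by remote port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Need at least: proto, local address, remote address, state.
        if len(fields) < 4 or fields[-1] != state:
            continue
        host, _, port = fields[-2].rpartition(":")
        if host == target_ip and port.isdigit():
            counts[int(port)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): netstat -n | python count_time_wait.py <target-ip>
    for port, n in count_by_remote_port(sys.stdin.read(), sys.argv[1]).most_common():
        print(f"{n:5d} connections to port {port}")
```

Running it periodically during a test run would show whether the TIME_WAIT count keeps growing or levels off.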

A connection in the TIME_WAIT state is usually held for a timeout of up to 4 minutes before its resources are released; the exact value depends on the OS.
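As a rough back-of-the-envelope estimate (not a measurement), the 15-minute snapshot can be turned into a connection churn rate: if the TIME_WAIT count has reached a steady state and each entry really lives for 4 minutes, the number of entries is approximately the close rate multiplied by the TIME_WAIT duration. Both the steady-state assumption and the 4-minute figure are assumptions here, and the helper name is hypothetical.

```python
def estimated_close_rate(time_wait_count, time_wait_seconds):
    # Steady state: count ~= close_rate * time_wait_seconds
    return time_wait_count / time_wait_seconds


# 641 TIME_WAIT connections observed after 15 minutes; assumed 4-minute timeout.
rate = estimated_close_rate(641, 4 * 60)
print(f"about {rate:.2f} connections closed per second")
```

If the actual timeout on the host is shorter than 4 minutes, the same count implies a proportionally higher close rate.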

Both buildbots run tests in 8 threads.

Both buildbots use Python 3.12. Note that the results with Python 3.13 are worse, probably due to the incremental GC; the average build/test time with Python 3.13 is also longer.

Increasing MAX_ATTEMPTS (currently 20) in connect_to_debug_monitor() may be enough to fix error 1.
But I have no idea how to fix, or even debug, error 2. It is very hard to reproduce.
