Skip to content

Conversation

@ashgti
Copy link
Contributor

@ashgti ashgti commented Oct 31, 2025

We need to use an unbuffered socket / pipe to the lldb-dap subprocess to make sure that when we perform a select, there is no partial messages in the file buffer. If there are partial messages in the file buffer we could end up doing a read that crosses the buffer size and results in a hang until a new message is sent, that may or may not occur.

As a result, when we perform a read we can get less than the requested number of bytes. We need to retry the read operation if this occurs.

I think this should resolve issue #165784

We need to use an unbuffered socket / pipe to the lldb-dap subprocess to make sure that when we perform a select, there is no partial messages in the file buffer. If there are partial messages in the file buffer we could end up doing a read that crosses the buffer size and results in a hang until a new message is sent, that may or may not occur.

As a result, when we perform a read we can get less than the requested number of bytes. We need to retry the read operation if this occurs.
@ashgti ashgti requested a review from JDevlieghere as a code owner October 31, 2025 04:04
@ashgti ashgti requested a review from da-viper October 31, 2025 04:04
@llvmbot llvmbot added the lldb label Oct 31, 2025
@llvmbot
Copy link
Member

llvmbot commented Oct 31, 2025

@llvm/pr-subscribers-lldb

Author: John Harrison (ashgti)

Changes

We need to use an unbuffered socket / pipe to the lldb-dap subprocess to make sure that when we perform a select, there is no partial messages in the file buffer. If there are partial messages in the file buffer we could end up doing a read that crosses the buffer size and results in a hang until a new message is sent, that may or may not occur.

As a result, when we perform a read we can get less than the requested number of bytes. We need to retry the read operation if this occurs.

I think this should resolve issue #165784


Full diff: https://github.com/llvm/llvm-project/pull/165823.diff

1 Files Affected:

  • (modified) lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py (+13-2)
diff --git a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py
index 8f3652172dfdf..5000a8ec0a0ce 100644
--- a/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py
+++ b/lldb/packages/Python/lldbsuite/test/tools/lldb-dap/dap_server.py
@@ -220,6 +220,11 @@ def _read_packet(
         followed by the JSON bytes from self.recv. Returns None on EOF.
         """
 
+        # NOTE: We open the socket or pipe to the subprocess in an unbuffered
+        # mode to ensure we do not end up with a partial message in the buffer
+        # when we perform our select. Otherwise, we may run into a case where we
+        # attempt a read and the buffer is partially full but a new message is
+        # not sent to fill in the requested buffer size.
         ready = self.selector.select(timeout)
         if not ready:
             warnings.warn(
@@ -243,10 +248,16 @@ def _read_packet(
             if separator != "":
                 Exception("malformed DAP content header, unexpected line: " + separator)
             # Read JSON bytes
-            json_str = self.recv.read(length).decode()
+            # NOTE: Because the read channel is unbuffered we may receive less
+            # than the requested number of bytes. In, which case we need to
+            # perform a new read.
+            json_str = b""
+            while len(json_str) < length:
+                json_str += self.recv.read(length - len(json_str))
+
             if self.trace_file:
                 self.trace_file.write(
-                    "%s from adapter:\n%s\n" % (time.time(), json_str)
+                    "%s from adapter:\n%s\n" % (time.time(), json_str.decode())
                 )
             # Decode the JSON bytes into a python dictionary
             return json.loads(json_str)

@ashgti ashgti requested a review from DavidSpickett October 31, 2025 04:05
Copy link
Collaborator

@DavidSpickett DavidSpickett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this in theory makes the tests more stable and does not defeat the point of them in the process, fine with me.

Someone more familiar with lldb-dap should approve.

(I'm still going to make changes to Linaro's Arm bot, but stability has been an issue elsewhere so I'm sure this change will help some anyway)

Copy link
Member

@JDevlieghere JDevlieghere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change LGTM, but it makes me wonder if we should adopt non-blocking reads like we do in JSONTransport. We already have a thread queueing up ProtocolMessages, so maybe it's time to double down on that and prevent anyone from doing a blocking read directly?

That would also help with the randomized delay idea I posted yesterday (where we could add the messages to the queue after a random amount of time to simulate slow CI nodes and non-deterministic ordering between request/responses and events).

@ashgti
Copy link
Contributor Author

ashgti commented Oct 31, 2025

A further improvement would be to switch to the python asyncio library for making the reads truly async. I had prototyped that some in ashgti@4468af2 but it was getting to be pretty big/invasive so I was trying to make this into a more targeted change, which is where I switched to using the selectors module to do this without any of the other async parts.

Its a bit of a refactor to push the async operations down from the read but you can use asyncio.loop.run_until_complete docs to block a thread until a task/coroutine is complete.

@JDevlieghere
Copy link
Member

Is there a middle ground that does async reads but still provides an abstraction over that that's not necessarily async, or is that just going to leave us in a "worst-of-both-worlds" situation? While I'm sure that using an async model would be a better fit for the kinds of tests we write for DAP, I'm also somewhat worried that it's not a common paradigm in LLDB.

I think a contributing factor to the instability of the DAP tests is that we (LLDB developers) are not really used to writing tests that have this asynchronous nature. Finding a principled way to do this (with something like asyncio) may force that, or it may lead to the same kind of issues if we use it wrong. I'm not arguing against it, just trying to think out loud about how we can make DAP tests easy to write correctly and hard to write incorrectly.

@ashgti
Copy link
Contributor Author

ashgti commented Oct 31, 2025

I also think that it would be beneficial to move towards waiting on specific state instead of on specific events. E.g. most of the places we use wait_for_event could be updated to track and check a specific state.

When we trigger a request, any events emitted during the request COULD come before or after the response, depending on the timing.

Moving to tracking state specifically, could help improve this since we could then better handle out of order packets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants