[lldb][CPlusPlus] Always use CPlusPlusNameParser for parsing C++ function names #137072

Michael137 · 2025-04-23T22:10:26Z

When doing function lookup in an LLDB module by name, we may need to hand-parse C++ method names to determine properties such as their "basename". LLDB currently has two ways of parsing methods:

Using TrySimplifiedParse, which was historically the go-to parser
Using the CPlusPlusNameParser, which is the more recent parser and has more knowledge of C++ syntax

When CPlusPlusNameParser was introduced (https://reviews.llvm.org/D31451), TrySimplifiedParse was kept around as the first parser in order to keep performance in check. Though the performance difference didn't seem very convincing from the numbers provided. 11s vs 10s when setting a breakpoint in a project with 500k functions. The main difference being that we're invoking the clang::Lexer in the new parser.

It would be nice to just get rid of it entirely. There's an edge-case for which I added a test-case where the TrySimplifiedParse code-path succeeds but incorrectly deduces the basename of operator().

Happy to do benchmarking to see what the performance impact of this is these days.

…ction names When doing function lookup in an LLDB module by name, we may need to hand-parse C++ method names to determine properties of the function name such as its "basename". LLDB currently has two ways of parsing methods: 1. Using `TrySimplifiedParse`, which was historically the go-to parser 2. Using the `CPlusPlusNameParser`, which is the more recent parser and has more knowledge of C++ syntax When `CPlusPlusNameParser` was introduced (https://reviews.llvm.org/D31451), `TrySimplifiedParse` was kept around as the first parser to keep performance in check. Though the performance difference didn't seem very convincing from the numbers provided. 11s vs 10s when setting a breakpoint in a project with 500k functions. The main difference being that we're invoking the clang::Lexer in the new parser. It would be nice to just get rid of it entirely. There's an edge-case for which I added a test-case where the `TrySimplifiedParse` code-path succeeds but incorrectly deduces the basename of `operator()`. Happy to do benchmarking to see what the performance impact of this is these days.

llvmbot · 2025-04-23T22:11:01Z

@llvm/pr-subscribers-lldb

Author: Michael Buch (Michael137)

Changes

When doing function lookup in an LLDB module by name, we may need to hand-parse C++ method names to determine properties of the function name such as its "basename". LLDB currently has two ways of parsing methods:

Using TrySimplifiedParse, which was historically the go-to parser
Using the CPlusPlusNameParser, which is the more recent parser and has more knowledge of C++ syntax

When CPlusPlusNameParser was introduced (https://reviews.llvm.org/D31451), TrySimplifiedParse was kept around as the first parser to keep performance in check. Though the performance difference didn't seem very convincing from the numbers provided. 11s vs 10s when setting a breakpoint in a project with 500k functions. The main difference being that we're invoking the clang::Lexer in the new parser.

It would be nice to just get rid of it entirely. There's an edge-case for which I added a test-case where the TrySimplifiedParse code-path succeeds but incorrectly deduces the basename of operator().

Happy to do benchmarking to see what the performance impact of this is these days.

Full diff: https://github.com/llvm/llvm-project/pull/137072.diff

2 Files Affected:

(modified) lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp (+9-83)
(modified) lldb/test/API/lang/cpp/operators/main.cpp (+2)

diff --git a/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp b/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
index 40b6563aeb410..f8b53c0837a54 100644
--- a/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
+++ b/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
@@ -173,34 +173,6 @@ static bool ReverseFindMatchingChars(const llvm::StringRef &s,
   return false;
 }
 
-static bool IsTrivialBasename(const llvm::StringRef &basename) {
-  // Check that the basename matches with the following regular expression
-  // "^~?([A-Za-z_][A-Za-z_0-9]*)$" We are using a hand written implementation
-  // because it is significantly more efficient then using the general purpose
-  // regular expression library.
-  size_t idx = 0;
-  if (basename.starts_with('~'))
-    idx = 1;
-
-  if (basename.size() <= idx)
-    return false; // Empty string or "~"
-
-  if (!std::isalpha(basename[idx]) && basename[idx] != '_')
-    return false; // First character (after removing the possible '~'') isn't in
-                  // [A-Za-z_]
-
-  // Read all characters matching [A-Za-z_0-9]
-  ++idx;
-  while (idx < basename.size()) {
-    if (!std::isalnum(basename[idx]) && basename[idx] != '_')
-      break;
-    ++idx;
-  }
-
-  // We processed all characters. It is a vaild basename.
-  return idx == basename.size();
-}
-
 /// Writes out the function name in 'full_name' to 'out_stream'
 /// but replaces each argument type with the variable name
 /// and the corresponding pretty-printed value
@@ -235,66 +207,20 @@ static bool PrettyPrintFunctionNameWithArgs(Stream &out_stream,
   return true;
 }
 
-bool CPlusPlusLanguage::CxxMethodName::TrySimplifiedParse() {
-  // This method tries to parse simple method definitions which are presumably
-  // most comman in user programs. Definitions that can be parsed by this
-  // function don't have return types and templates in the name.
-  // A::B::C::fun(std::vector<T> &) const
-  size_t arg_start, arg_end;
-  llvm::StringRef full(m_full.GetCString());
-  llvm::StringRef parens("()", 2);
-  if (ReverseFindMatchingChars(full, parens, arg_start, arg_end)) {
-    m_arguments = full.substr(arg_start, arg_end - arg_start + 1);
-    if (arg_end + 1 < full.size())
-      m_qualifiers = full.substr(arg_end + 1).ltrim();
-
-    if (arg_start == 0)
-      return false;
-    size_t basename_end = arg_start;
-    size_t context_start = 0;
-    size_t context_end = full.rfind(':', basename_end);
-    if (context_end == llvm::StringRef::npos)
-      m_basename = full.substr(0, basename_end);
-    else {
-      if (context_start < context_end)
-        m_context = full.substr(context_start, context_end - 1 - context_start);
-      const size_t basename_begin = context_end + 1;
-      m_basename = full.substr(basename_begin, basename_end - basename_begin);
-    }
-
-    if (IsTrivialBasename(m_basename)) {
-      return true;
-    } else {
-      // The C++ basename doesn't match our regular expressions so this can't
-      // be a valid C++ method, clear everything out and indicate an error
-      m_context = llvm::StringRef();
-      m_basename = llvm::StringRef();
-      m_arguments = llvm::StringRef();
-      m_qualifiers = llvm::StringRef();
-      m_return_type = llvm::StringRef();
-      return false;
-    }
-  }
-  return false;
-}
-
 void CPlusPlusLanguage::CxxMethodName::Parse() {
   if (!m_parsed && m_full) {
-    if (TrySimplifiedParse()) {
+    CPlusPlusNameParser parser(m_full.GetStringRef());
+    if (auto function = parser.ParseAsFunctionDefinition()) {
+      m_basename = function->name.basename;
+      m_context = function->name.context;
+      m_arguments = function->arguments;
+      m_qualifiers = function->qualifiers;
+      m_return_type = function->return_type;
       m_parse_error = false;
     } else {
-      CPlusPlusNameParser parser(m_full.GetStringRef());
-      if (auto function = parser.ParseAsFunctionDefinition()) {
-        m_basename = function->name.basename;
-        m_context = function->name.context;
-        m_arguments = function->arguments;
-        m_qualifiers = function->qualifiers;
-        m_return_type = function->return_type;
-        m_parse_error = false;
-      } else {
-        m_parse_error = true;
-      }
+      m_parse_error = true;
     }
+
     if (m_context.empty()) {
       m_scope_qualified = std::string(m_basename);
     } else {
diff --git a/lldb/test/API/lang/cpp/operators/main.cpp b/lldb/test/API/lang/cpp/operators/main.cpp
index c52ef1c8cac47..139c16b02ffe9 100644
--- a/lldb/test/API/lang/cpp/operators/main.cpp
+++ b/lldb/test/API/lang/cpp/operators/main.cpp
@@ -174,6 +174,8 @@ int main(int argc, char **argv) {
   //% self.expect("expr (new struct C[1]); side_effect", endstr=" = 4\n")
   //% self.expect("expr delete c2; side_effect", endstr=" = 1\n")
   //% self.expect("expr delete[] c3; side_effect", endstr=" = 2\n")
+  //% self.expect("image lookup -n operator()", substrs=["C::operator()(int)"])
+  //% self.expect("image lookup -n C::operator()", substrs=["C::operator()(int)"])
   delete c2;
   delete[] c3;
   return 0;

bulbazord · 2025-04-24T00:19:06Z

Happy to do benchmarking to see what the performance impact of this is these days.

I think that would be a good idea personally. This certainly simplifies the code and even improves correctness in some situations, but it would be a shame if the perf was noticeably worse.

labath · 2025-04-24T06:35:40Z

Yeah, getting some numbers would be nice. Maybe just repeating the benchmark in that diff. It might be a good idea to do it with and without the accelerator tables, just to rule out any effect from those.

FWIW, I expect this to not matter these days. It looks like this change (https://reviews.llvm.org/D31451) predates our use of the ItaniumPartialDemangler (https://reviews.llvm.org/D50071), which means that we were getting the name information by demangling the name and then parsing the result. Now, we should get this straight from the demangler, so the parsing code should be less hot.

labath · 2025-04-24T06:36:44Z

lldb/test/API/lang/cpp/operators/main.cpp

  //% self.expect("expr (new struct C[1]); side_effect", endstr=" = 4\n")
  //% self.expect("expr delete c2; side_effect", endstr=" = 1\n")
  //% self.expect("expr delete[] c3; side_effect", endstr=" = 2\n")
+  //% self.expect("image lookup -n operator()", substrs=["C::operator()(int)"])


I take it this means that the "simplified" parser got operator() wrong somehow? Could you also add this to the unit tests in unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp

Yea it was deducing that operator is the basename because it just does a simple search for () and then checks whether the rest of the name looks like a valid function identifier.

Could you also add this to the unit tests in unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp

Yup!

Michael137 · 2025-04-24T09:26:43Z

Yeah, getting some numbers would be nice. Maybe just repeating the benchmark in that diff. It might be a good idea to do it with and without the accelerator tables, just to rule out any effect from those.

Will do

On Darwin (without accelerator tables and statically linked Clang) I don't see this codepath being hit at all when setting a named breakpoint. If the function name doesn't exist we don't seem to call CxxMethodName at all. If the name does exist we hit it a couple of times (~10). Possibly missing something. I'll try on Linux too.

Michael137 requested review from jimingham and labath April 23, 2025 22:10

Michael137 requested a review from JDevlieghere as a code owner April 23, 2025 22:10

llvmbot added the lldb label Apr 23, 2025

fixup! clang-format

45aeebe

labath reviewed Apr 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[lldb][CPlusPlus] Always use CPlusPlusNameParser for parsing C++ function names #137072

[lldb][CPlusPlus] Always use CPlusPlusNameParser for parsing C++ function names #137072

Uh oh!

Michael137 commented Apr 23, 2025 •

edited

Loading

Uh oh!

llvmbot commented Apr 23, 2025

Uh oh!

bulbazord commented Apr 24, 2025

Uh oh!

labath commented Apr 24, 2025

Uh oh!

labath Apr 24, 2025

Uh oh!

Michael137 Apr 24, 2025 •

edited

Loading

Uh oh!

Michael137 commented Apr 24, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[lldb][CPlusPlus] Always use CPlusPlusNameParser for parsing C++ function names #137072

Are you sure you want to change the base?

[lldb][CPlusPlus] Always use CPlusPlusNameParser for parsing C++ function names #137072

Uh oh!

Conversation

Michael137 commented Apr 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Apr 23, 2025

Uh oh!

bulbazord commented Apr 24, 2025

Uh oh!

labath commented Apr 24, 2025

Uh oh!

labath Apr 24, 2025

Choose a reason for hiding this comment

Uh oh!

Michael137 Apr 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Michael137 commented Apr 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Michael137 commented Apr 23, 2025 •

edited

Loading

Michael137 Apr 24, 2025 •

edited

Loading

Michael137 commented Apr 24, 2025 •

edited

Loading