Skip to content

Conversation

@Michael137
Copy link
Member

@Michael137 Michael137 commented Apr 23, 2025

When doing function lookup in an LLDB module by name, we may need to hand-parse C++ method names to determine properties such as their "basename". LLDB currently has two ways of parsing methods:

  1. Using TrySimplifiedParse, which was historically the go-to parser
  2. Using the CPlusPlusNameParser, which is the more recent parser and has more knowledge of C++ syntax

When CPlusPlusNameParser was introduced (https://reviews.llvm.org/D31451), TrySimplifiedParse was kept around as the first parser in order to keep performance in check. Though the performance difference didn't seem very convincing from the numbers provided. 11s vs 10s when setting a breakpoint in a project with 500k functions. The main difference being that we're invoking the clang::Lexer in the new parser.

It would be nice to just get rid of it entirely. There's an edge-case for which I added a test-case where the TrySimplifiedParse code-path succeeds but incorrectly deduces the basename of operator().

Happy to do benchmarking to see what the performance impact of this is these days.

…ction names

When doing function lookup in an LLDB module by name, we may need to hand-parse C++ method names to determine properties of the function name such as its "basename". LLDB currently has two ways of parsing methods:
1. Using `TrySimplifiedParse`, which was historically the go-to parser
2. Using the `CPlusPlusNameParser`, which is the more recent parser and
   has more knowledge of C++ syntax

When `CPlusPlusNameParser` was introduced (https://reviews.llvm.org/D31451), `TrySimplifiedParse` was kept around as the first parser to keep performance in check. Though the performance difference didn't seem very convincing from the numbers provided. 11s vs 10s when setting a breakpoint in a project with 500k functions. The main difference being that we're invoking the clang::Lexer in the new parser.

It would be nice to just get rid of it entirely. There's an edge-case for which I added a test-case where the `TrySimplifiedParse` code-path succeeds but incorrectly deduces the basename of `operator()`.

Happy to do benchmarking to see what the performance impact of this is these days.
@llvmbot
Copy link
Member

llvmbot commented Apr 23, 2025

@llvm/pr-subscribers-lldb

Author: Michael Buch (Michael137)

Changes

When doing function lookup in an LLDB module by name, we may need to hand-parse C++ method names to determine properties of the function name such as its "basename". LLDB currently has two ways of parsing methods:

  1. Using TrySimplifiedParse, which was historically the go-to parser
  2. Using the CPlusPlusNameParser, which is the more recent parser and has more knowledge of C++ syntax

When CPlusPlusNameParser was introduced (https://reviews.llvm.org/D31451), TrySimplifiedParse was kept around as the first parser to keep performance in check. Though the performance difference didn't seem very convincing from the numbers provided. 11s vs 10s when setting a breakpoint in a project with 500k functions. The main difference being that we're invoking the clang::Lexer in the new parser.

It would be nice to just get rid of it entirely. There's an edge-case for which I added a test-case where the TrySimplifiedParse code-path succeeds but incorrectly deduces the basename of operator().

Happy to do benchmarking to see what the performance impact of this is these days.


Full diff: https://github.com/llvm/llvm-project/pull/137072.diff

2 Files Affected:

  • (modified) lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp (+9-83)
  • (modified) lldb/test/API/lang/cpp/operators/main.cpp (+2)
diff --git a/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp b/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
index 40b6563aeb410..f8b53c0837a54 100644
--- a/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
+++ b/lldb/source/Plugins/Language/CPlusPlus/CPlusPlusLanguage.cpp
@@ -173,34 +173,6 @@ static bool ReverseFindMatchingChars(const llvm::StringRef &s,
   return false;
 }
 
-static bool IsTrivialBasename(const llvm::StringRef &basename) {
-  // Check that the basename matches with the following regular expression
-  // "^~?([A-Za-z_][A-Za-z_0-9]*)$" We are using a hand written implementation
-  // because it is significantly more efficient then using the general purpose
-  // regular expression library.
-  size_t idx = 0;
-  if (basename.starts_with('~'))
-    idx = 1;
-
-  if (basename.size() <= idx)
-    return false; // Empty string or "~"
-
-  if (!std::isalpha(basename[idx]) && basename[idx] != '_')
-    return false; // First character (after removing the possible '~'') isn't in
-                  // [A-Za-z_]
-
-  // Read all characters matching [A-Za-z_0-9]
-  ++idx;
-  while (idx < basename.size()) {
-    if (!std::isalnum(basename[idx]) && basename[idx] != '_')
-      break;
-    ++idx;
-  }
-
-  // We processed all characters. It is a vaild basename.
-  return idx == basename.size();
-}
-
 /// Writes out the function name in 'full_name' to 'out_stream'
 /// but replaces each argument type with the variable name
 /// and the corresponding pretty-printed value
@@ -235,66 +207,20 @@ static bool PrettyPrintFunctionNameWithArgs(Stream &out_stream,
   return true;
 }
 
-bool CPlusPlusLanguage::CxxMethodName::TrySimplifiedParse() {
-  // This method tries to parse simple method definitions which are presumably
-  // most comman in user programs. Definitions that can be parsed by this
-  // function don't have return types and templates in the name.
-  // A::B::C::fun(std::vector<T> &) const
-  size_t arg_start, arg_end;
-  llvm::StringRef full(m_full.GetCString());
-  llvm::StringRef parens("()", 2);
-  if (ReverseFindMatchingChars(full, parens, arg_start, arg_end)) {
-    m_arguments = full.substr(arg_start, arg_end - arg_start + 1);
-    if (arg_end + 1 < full.size())
-      m_qualifiers = full.substr(arg_end + 1).ltrim();
-
-    if (arg_start == 0)
-      return false;
-    size_t basename_end = arg_start;
-    size_t context_start = 0;
-    size_t context_end = full.rfind(':', basename_end);
-    if (context_end == llvm::StringRef::npos)
-      m_basename = full.substr(0, basename_end);
-    else {
-      if (context_start < context_end)
-        m_context = full.substr(context_start, context_end - 1 - context_start);
-      const size_t basename_begin = context_end + 1;
-      m_basename = full.substr(basename_begin, basename_end - basename_begin);
-    }
-
-    if (IsTrivialBasename(m_basename)) {
-      return true;
-    } else {
-      // The C++ basename doesn't match our regular expressions so this can't
-      // be a valid C++ method, clear everything out and indicate an error
-      m_context = llvm::StringRef();
-      m_basename = llvm::StringRef();
-      m_arguments = llvm::StringRef();
-      m_qualifiers = llvm::StringRef();
-      m_return_type = llvm::StringRef();
-      return false;
-    }
-  }
-  return false;
-}
-
 void CPlusPlusLanguage::CxxMethodName::Parse() {
   if (!m_parsed && m_full) {
-    if (TrySimplifiedParse()) {
+    CPlusPlusNameParser parser(m_full.GetStringRef());
+    if (auto function = parser.ParseAsFunctionDefinition()) {
+      m_basename = function->name.basename;
+      m_context = function->name.context;
+      m_arguments = function->arguments;
+      m_qualifiers = function->qualifiers;
+      m_return_type = function->return_type;
       m_parse_error = false;
     } else {
-      CPlusPlusNameParser parser(m_full.GetStringRef());
-      if (auto function = parser.ParseAsFunctionDefinition()) {
-        m_basename = function->name.basename;
-        m_context = function->name.context;
-        m_arguments = function->arguments;
-        m_qualifiers = function->qualifiers;
-        m_return_type = function->return_type;
-        m_parse_error = false;
-      } else {
-        m_parse_error = true;
-      }
+      m_parse_error = true;
     }
+
     if (m_context.empty()) {
       m_scope_qualified = std::string(m_basename);
     } else {
diff --git a/lldb/test/API/lang/cpp/operators/main.cpp b/lldb/test/API/lang/cpp/operators/main.cpp
index c52ef1c8cac47..139c16b02ffe9 100644
--- a/lldb/test/API/lang/cpp/operators/main.cpp
+++ b/lldb/test/API/lang/cpp/operators/main.cpp
@@ -174,6 +174,8 @@ int main(int argc, char **argv) {
   //% self.expect("expr (new struct C[1]); side_effect", endstr=" = 4\n")
   //% self.expect("expr delete c2; side_effect", endstr=" = 1\n")
   //% self.expect("expr delete[] c3; side_effect", endstr=" = 2\n")
+  //% self.expect("image lookup -n operator()", substrs=["C::operator()(int)"])
+  //% self.expect("image lookup -n C::operator()", substrs=["C::operator()(int)"])
   delete c2;
   delete[] c3;
   return 0;

@bulbazord
Copy link
Member

Happy to do benchmarking to see what the performance impact of this is these days.

I think that would be a good idea personally. This certainly simplifies the code and even improves correctness in some situations, but it would be a shame if the perf was noticeably worse.

@labath
Copy link
Collaborator

labath commented Apr 24, 2025

Yeah, getting some numbers would be nice. Maybe just repeating the benchmark in that diff. It might be a good idea to do it with and without the accelerator tables, just to rule out any effect from those.

FWIW, I expect this to not matter these days. It looks like this change (https://reviews.llvm.org/D31451) predates our use of the ItaniumPartialDemangler (https://reviews.llvm.org/D50071), which means that we were getting the name information by demangling the name and then parsing the result. Now, we should get this straight from the demangler, so the parsing code should be less hot.

//% self.expect("expr (new struct C[1]); side_effect", endstr=" = 4\n")
//% self.expect("expr delete c2; side_effect", endstr=" = 1\n")
//% self.expect("expr delete[] c3; side_effect", endstr=" = 2\n")
//% self.expect("image lookup -n operator()", substrs=["C::operator()(int)"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it this means that the "simplified" parser got operator() wrong somehow? Could you also add this to the unit tests in unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp

Copy link
Member Author

@Michael137 Michael137 Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it was deducing that operator is the basename because it just does a simple search for () and then checks whether the rest of the name looks like a valid function identifier.

Could you also add this to the unit tests in unittests/Language/CPlusPlus/CPlusPlusLanguageTest.cpp

Yup!

@Michael137
Copy link
Member Author

Michael137 commented Apr 24, 2025

Yeah, getting some numbers would be nice. Maybe just repeating the benchmark in that diff. It might be a good idea to do it with and without the accelerator tables, just to rule out any effect from those.

Will do

On Darwin (without accelerator tables and statically linked Clang) I don't see this codepath being hit at all when setting a named breakpoint. If the function name doesn't exist we don't seem to call CxxMethodName at all. If the name does exist we hit it a couple of times (~10). Possibly missing something. I'll try on Linux too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants