Skip to content

Conversation

Bertik23
Copy link
Contributor

@Bertik23 Bertik23 commented Aug 28, 2025

This PR is part of the LLVM IR LSP server project (RFC)

To be able to make a LSP server, it's crutial to have location information about the LLVM objects (Functions, BasicBlocks and Instructions).

This PR adds:

  • Position tracking to the Lexer
  • A new AsmParserContext class, to hold the new position info
  • Fixes some OOB issues in the Lexer
  • Tests to check if the location is correct

The AsmParserContext can be passed as an optional parameter into the parser. Which populates it and it can be then used by other tools, such as the LSP server.

The AsmParserContext idea was borrowed from MLIR. As we didn't want to store data no one else uses inside the objects themselves. But the implementation is different, this class holds several maps of Functions, BasicBlocks and Instructions, to map them to their location.

The Lexer had all of it's manual pointer increments changed to getNextChar() which checks for linebreaks and updates the line and column counters accordingly.
And some utility methods were added to get the positions of the processed tokens.

Copy link

github-actions bot commented Aug 28, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@Bertik23 Bertik23 marked this pull request as ready for review August 28, 2025 11:41
@llvmbot
Copy link
Member

llvmbot commented Aug 28, 2025

@llvm/pr-subscribers-llvm-ir

Author: Bertik23 (Bertik23)

Changes

This PR is part of the LLVM IR LSP server project (RFC)

To be able to make a LSP server, it's crutial to have location information about the LLVM objects (Functions, BasicBlocks and Instructions).

This PR adds:

  • Position tracking to the Lexer
  • A new AsmParserContext class, to hold the new position info
  • Fixes some OOB issues in the Lexer
  • Tests to check if the location is correct

The AsmParserContext can be passed as an optional parameter into the parser. Which populates it and it can be then used by other tools, such as the LSP server.

The AsmParserContext idea was borrowed from MLIR. As we didn't want to store data no one else uses inside the objects themselves. But the implementation is different, this class holds several maps of Functions, BasicBlocks and Instructions, to map them to their location.

The Lexer had all of it's manual pointer increments changed to getNextChar() which checks for linebreaks and updates the line and column counters accordingly.
And some utility methods were added to get the positions of the processed tokens.


Patch is 39.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/155797.diff

13 Files Affected:

  • (added) llvm/include/llvm/AsmParser/AsmParserContext.h (+53)
  • (modified) llvm/include/llvm/AsmParser/LLLexer.h (+35-3)
  • (modified) llvm/include/llvm/AsmParser/LLParser.h (+7-2)
  • (modified) llvm/include/llvm/AsmParser/Parser.h (+9-7)
  • (modified) llvm/include/llvm/IR/Value.h (+32)
  • (modified) llvm/include/llvm/IRReader/IRReader.h (+9-8)
  • (added) llvm/lib/AsmParser/AsmParserContext.cpp (+91)
  • (modified) llvm/lib/AsmParser/CMakeLists.txt (+1)
  • (modified) llvm/lib/AsmParser/LLLexer.cpp (+108-61)
  • (modified) llvm/lib/AsmParser/LLParser.cpp (+24-3)
  • (modified) llvm/lib/AsmParser/Parser.cpp (+19-12)
  • (modified) llvm/lib/IRReader/IRReader.cpp (+9-4)
  • (modified) llvm/unittests/AsmParser/AsmParserTest.cpp (+60)
diff --git a/llvm/include/llvm/AsmParser/AsmParserContext.h b/llvm/include/llvm/AsmParser/AsmParserContext.h
new file mode 100644
index 0000000000000..78ea32ac7ca08
--- /dev/null
+++ b/llvm/include/llvm/AsmParser/AsmParserContext.h
@@ -0,0 +1,53 @@
+//===-- AsmParserContext.h --------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_ASMPARSER_ASMPARSER_STATE_H
+#define LLVM_ASMPARSER_ASMPARSER_STATE_H
+
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/IR/Value.h"
+#include <optional>
+
+namespace llvm {
+
+/// Registry of file location information for LLVM IR constructs
+///
+/// This class provides access to the file location information
+/// for various LLVM IR constructs. Currently, it supports Function,
+/// BasicBlock and Instruction locations.
+///
+/// When available, it can answer queries about what is at a given
+/// file location, as well as where in a file a given IR construct
+/// is.
+///
+/// This information is optionally emitted by the LLParser while
+/// it reads LLVM textual IR.
+class AsmParserContext {
+public:
+  std::optional<FileLocRange> getFunctionLocation(const Function *) const;
+  std::optional<FileLocRange> getBlockLocation(const BasicBlock *) const;
+  std::optional<FileLocRange> getInstructionLocation(const Instruction *) const;
+  std::optional<Function *> getFunctionAtLocation(const FileLocRange &) const;
+  std::optional<Function *> getFunctionAtLocation(const FileLoc &) const;
+  std::optional<BasicBlock *> getBlockAtLocation(const FileLocRange &) const;
+  std::optional<BasicBlock *> getBlockAtLocation(const FileLoc &) const;
+  std::optional<Instruction *>
+  getInstructionAtLocation(const FileLocRange &) const;
+  std::optional<Instruction *> getInstructionAtLocation(const FileLoc &) const;
+  bool addFunctionLocation(Function *, const FileLocRange &);
+  bool addBlockLocation(BasicBlock *, const FileLocRange &);
+  bool addInstructionLocation(Instruction *, const FileLocRange &);
+
+private:
+  DenseMap<Function *, FileLocRange> Functions;
+  DenseMap<BasicBlock *, FileLocRange> Blocks;
+  DenseMap<Instruction *, FileLocRange> Instructions;
+};
+} // namespace llvm
+
+#endif
diff --git a/llvm/include/llvm/AsmParser/LLLexer.h b/llvm/include/llvm/AsmParser/LLLexer.h
index 501a7aefccd7f..5008ef029f3ff 100644
--- a/llvm/include/llvm/AsmParser/LLLexer.h
+++ b/llvm/include/llvm/AsmParser/LLLexer.h
@@ -29,6 +29,20 @@ namespace llvm {
     const char *CurPtr;
     StringRef CurBuf;
 
+    // The line number at `CurPtr-1`, zero-indexed
+    unsigned CurLineNum = 0;
+    // The column number at `CurPtr-1`, zero-indexed
+    unsigned CurColNum = -1;
+    // The line number of the start of the current token, zero-indexed
+    unsigned CurTokLineNum = 0;
+    // The column number of the start of the current token, zero-indexed
+    unsigned CurTokColNum = 0;
+    // The line number of the end of the current token, zero-indexed
+    unsigned PrevTokEndLineNum = -1;
+    // The column number of the end (exclusive) of the current token,
+    // zero-indexed
+    unsigned PrevTokEndColNum = -1;
+
     enum class ErrorPriority {
       None,   // No error message present.
       Parser, // Errors issued by parser.
@@ -62,9 +76,7 @@ namespace llvm {
     explicit LLLexer(StringRef StartBuf, SourceMgr &SM, SMDiagnostic &,
                      LLVMContext &C);
 
-    lltok::Kind Lex() {
-      return CurKind = LexToken();
-    }
+    lltok::Kind Lex() { return CurKind = LexToken(); }
 
     typedef SMLoc LocTy;
     LocTy getLoc() const { return SMLoc::getFromPointer(TokStart); }
@@ -79,6 +91,21 @@ namespace llvm {
       IgnoreColonInIdentifiers = val;
     }
 
+    // Get the current line number, zero-indexed
+    unsigned getLineNum() { return CurLineNum; }
+    // Get the current column number, zero-indexed
+    unsigned getColNum() { return CurColNum; }
+    // Get the line number of the start of the current token, zero-indexed
+    unsigned getTokLineNum() { return CurTokLineNum; }
+    // Get the column number of the start of the current token, zero-indexed
+    unsigned getTokColNum() { return CurTokColNum; }
+    // Get the line number of the end of the previous token, zero-indexed,
+    // exclusive
+    unsigned getPrevTokEndLineNum() { return PrevTokEndLineNum; }
+    // Get the column number of the end of the previous token, zero-indexed,
+    // exclusive
+    unsigned getPrevTokEndColNum() { return PrevTokEndColNum; }
+
     // This returns true as a convenience for the parser functions that return
     // true on error.
     bool ParseError(LocTy ErrorLoc, const Twine &Msg) {
@@ -93,7 +120,12 @@ namespace llvm {
   private:
     lltok::Kind LexToken();
 
+    // Return closest pointer after `Ptr` that is an end of a label.
+    // Returns nullptr if `Ptr` doesn't point into a label.
+    const char *getLabelTail(const char *Ptr);
     int getNextChar();
+    const char *skipNChars(unsigned N);
+    void advancePositionTo(const char *Ptr);
     void SkipLineComment();
     bool SkipCComment();
     lltok::Kind ReadString(lltok::Kind kind);
diff --git a/llvm/include/llvm/AsmParser/LLParser.h b/llvm/include/llvm/AsmParser/LLParser.h
index c01de4a289a69..02460e5e52203 100644
--- a/llvm/include/llvm/AsmParser/LLParser.h
+++ b/llvm/include/llvm/AsmParser/LLParser.h
@@ -13,6 +13,7 @@
 #ifndef LLVM_ASMPARSER_LLPARSER_H
 #define LLVM_ASMPARSER_LLPARSER_H
 
+#include "AsmParserContext.h"
 #include "LLLexer.h"
 #include "llvm/ADT/StringMap.h"
 #include "llvm/AsmParser/NumberedValues.h"
@@ -177,6 +178,9 @@ namespace llvm {
     // Map of module ID to path.
     std::map<unsigned, StringRef> ModuleIdMap;
 
+    /// Keeps track of source locations for Values, BasicBlocks, and Functions
+    AsmParserContext *ParserContext;
+
     /// Only the llvm-as tool may set this to false to bypass
     /// UpgradeDebuginfo so it can generate broken bitcode.
     bool UpgradeDebugInfo;
@@ -189,10 +193,11 @@ namespace llvm {
   public:
     LLParser(StringRef F, SourceMgr &SM, SMDiagnostic &Err, Module *M,
              ModuleSummaryIndex *Index, LLVMContext &Context,
-             SlotMapping *Slots = nullptr)
+             SlotMapping *Slots = nullptr,
+             AsmParserContext *ParserContext = nullptr)
         : Context(Context), OPLex(F, SM, Err, Context),
           Lex(F, SM, Err, Context), M(M), Index(Index), Slots(Slots),
-          BlockAddressPFS(nullptr) {}
+          BlockAddressPFS(nullptr), ParserContext(ParserContext) {}
     bool Run(
         bool UpgradeDebugInfo,
         DataLayoutCallbackTy DataLayoutCallback = [](StringRef, StringRef) {
diff --git a/llvm/include/llvm/AsmParser/Parser.h b/llvm/include/llvm/AsmParser/Parser.h
index c900b79665404..22b0881d92b53 100644
--- a/llvm/include/llvm/AsmParser/Parser.h
+++ b/llvm/include/llvm/AsmParser/Parser.h
@@ -15,6 +15,7 @@
 
 #include "llvm/ADT/STLFunctionalExtras.h"
 #include "llvm/ADT/StringRef.h"
+#include "llvm/AsmParser/AsmParserContext.h"
 #include "llvm/Support/Compiler.h"
 #include <memory>
 #include <optional>
@@ -62,7 +63,8 @@ parseAssemblyFile(StringRef Filename, SMDiagnostic &Err, LLVMContext &Context,
 ///              parsing.
 LLVM_ABI std::unique_ptr<Module>
 parseAssemblyString(StringRef AsmString, SMDiagnostic &Err,
-                    LLVMContext &Context, SlotMapping *Slots = nullptr);
+                    LLVMContext &Context, SlotMapping *Slots = nullptr,
+                    AsmParserContext *ParserContext = nullptr);
 
 /// Holds the Module and ModuleSummaryIndex returned by the interfaces
 /// that parse both.
@@ -128,9 +130,9 @@ parseSummaryIndexAssemblyString(StringRef AsmString, SMDiagnostic &Err);
 LLVM_ABI std::unique_ptr<Module> parseAssembly(
     MemoryBufferRef F, SMDiagnostic &Err, LLVMContext &Context,
     SlotMapping *Slots = nullptr,
-    DataLayoutCallbackTy DataLayoutCallback = [](StringRef, StringRef) {
-      return std::nullopt;
-    });
+    DataLayoutCallbackTy DataLayoutCallback =
+        [](StringRef, StringRef) { return std::nullopt; },
+    AsmParserContext *ParserContext = nullptr);
 
 /// Parse LLVM Assembly including the summary index from a MemoryBuffer.
 ///
@@ -169,9 +171,9 @@ parseSummaryIndexAssembly(MemoryBufferRef F, SMDiagnostic &Err);
 LLVM_ABI bool parseAssemblyInto(
     MemoryBufferRef F, Module *M, ModuleSummaryIndex *Index, SMDiagnostic &Err,
     SlotMapping *Slots = nullptr,
-    DataLayoutCallbackTy DataLayoutCallback = [](StringRef, StringRef) {
-      return std::nullopt;
-    });
+    DataLayoutCallbackTy DataLayoutCallback =
+        [](StringRef, StringRef) { return std::nullopt; },
+    AsmParserContext *ParserContext = nullptr);
 
 /// Parse a type and a constant value in the given string.
 ///
diff --git a/llvm/include/llvm/IR/Value.h b/llvm/include/llvm/IR/Value.h
index 04d0391c04098..9e27797d151e2 100644
--- a/llvm/include/llvm/IR/Value.h
+++ b/llvm/include/llvm/IR/Value.h
@@ -55,6 +55,38 @@ class User;
 
 using ValueName = StringMapEntry<Value *>;
 
+struct FileLoc {
+  unsigned Line;
+  unsigned Col;
+
+  bool operator<=(const FileLoc &RHS) const {
+    return Line < RHS.Line || (Line == RHS.Line && Col <= RHS.Col);
+  }
+
+  bool operator<(const FileLoc &RHS) const {
+    return Line < RHS.Line || (Line == RHS.Line && Col < RHS.Col);
+  }
+
+  FileLoc(unsigned L, unsigned C) : Line(L), Col(C) {}
+};
+
+struct FileLocRange {
+  FileLoc Start;
+  FileLoc End;
+
+  FileLocRange() : Start(0, 0), End(0, 0) {}
+
+  FileLocRange(FileLoc S, FileLoc E) : Start(S), End(E) {
+    assert(Start <= End);
+  }
+
+  bool contains(FileLoc L) const { return Start <= L && L <= End; }
+
+  bool contains(FileLocRange LR) const {
+    return contains(LR.Start) && contains(LR.End);
+  }
+};
+
 //===----------------------------------------------------------------------===//
 //                                 Value Class
 //===----------------------------------------------------------------------===//
diff --git a/llvm/include/llvm/IRReader/IRReader.h b/llvm/include/llvm/IRReader/IRReader.h
index 790140f19934e..00cf12d342ae0 100644
--- a/llvm/include/llvm/IRReader/IRReader.h
+++ b/llvm/include/llvm/IRReader/IRReader.h
@@ -15,6 +15,7 @@
 #define LLVM_IRREADER_IRREADER_H
 
 #include "llvm/ADT/StringRef.h"
+#include "llvm/AsmParser/AsmParserContext.h"
 #include "llvm/Bitcode/BitcodeReader.h"
 #include "llvm/Support/Compiler.h"
 #include <memory>
@@ -50,19 +51,19 @@ getLazyIRFileModule(StringRef Filename, SMDiagnostic &Err, LLVMContext &Context,
 /// for it.  Otherwise, attempt to parse it as LLVM Assembly and return
 /// a Module for it.
 /// \param DataLayoutCallback Override datalayout in the llvm assembly.
-LLVM_ABI std::unique_ptr<Module> parseIR(MemoryBufferRef Buffer,
-                                         SMDiagnostic &Err,
-                                         LLVMContext &Context,
-                                         ParserCallbacks Callbacks = {});
+LLVM_ABI std::unique_ptr<Module>
+parseIR(MemoryBufferRef Buffer, SMDiagnostic &Err, LLVMContext &Context,
+        ParserCallbacks Callbacks = {},
+        AsmParserContext *ParserContext = nullptr);
 
 /// If the given file holds a bitcode image, return a Module for it.
 /// Otherwise, attempt to parse it as LLVM Assembly and return a Module
 /// for it.
 /// \param DataLayoutCallback Override datalayout in the llvm assembly.
-LLVM_ABI std::unique_ptr<Module> parseIRFile(StringRef Filename,
-                                             SMDiagnostic &Err,
-                                             LLVMContext &Context,
-                                             ParserCallbacks Callbacks = {});
+LLVM_ABI std::unique_ptr<Module>
+parseIRFile(StringRef Filename, SMDiagnostic &Err, LLVMContext &Context,
+            ParserCallbacks Callbacks = {},
+            AsmParserContext *ParserContext = nullptr);
 }
 
 #endif
diff --git a/llvm/lib/AsmParser/AsmParserContext.cpp b/llvm/lib/AsmParser/AsmParserContext.cpp
new file mode 100644
index 0000000000000..f5e3d83f5d346
--- /dev/null
+++ b/llvm/lib/AsmParser/AsmParserContext.cpp
@@ -0,0 +1,91 @@
+//===-- AsmParserContext.cpp ------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/AsmParser/AsmParserContext.h"
+
+namespace llvm {
+
+std::optional<FileLocRange>
+AsmParserContext::getFunctionLocation(const Function *F) const {
+  if (!Functions.contains(F))
+    return std::nullopt;
+  return Functions.at(F);
+}
+
+std::optional<FileLocRange>
+AsmParserContext::getBlockLocation(const BasicBlock *BB) const {
+  if (!Blocks.contains(BB))
+    return std::nullopt;
+  return Blocks.at(BB);
+}
+
+std::optional<FileLocRange>
+AsmParserContext::getInstructionLocation(const Instruction *I) const {
+  if (!Instructions.contains(I))
+    return std::nullopt;
+  return Instructions.at(I);
+}
+
+std::optional<Function *>
+AsmParserContext::getFunctionAtLocation(const FileLocRange &Query) const {
+  for (auto &[F, Loc] : Functions) {
+    if (Loc.contains(Query))
+      return F;
+  }
+  return std::nullopt;
+}
+
+std::optional<Function *>
+AsmParserContext::getFunctionAtLocation(const FileLoc &Query) const {
+  return getFunctionAtLocation(FileLocRange(Query, Query));
+}
+
+std::optional<BasicBlock *>
+AsmParserContext::getBlockAtLocation(const FileLocRange &Query) const {
+  for (auto &[BB, Loc] : Blocks) {
+    if (Loc.contains(Query))
+      return BB;
+  }
+  return std::nullopt;
+}
+
+std::optional<BasicBlock *>
+AsmParserContext::getBlockAtLocation(const FileLoc &Query) const {
+  return getBlockAtLocation(FileLocRange(Query, Query));
+}
+
+std::optional<Instruction *>
+AsmParserContext::getInstructionAtLocation(const FileLocRange &Query) const {
+  for (auto &[I, Loc] : Instructions) {
+    if (Loc.contains(Query))
+      return I;
+  }
+  return std::nullopt;
+}
+
+std::optional<Instruction *>
+AsmParserContext::getInstructionAtLocation(const FileLoc &Query) const {
+  return getInstructionAtLocation(FileLocRange(Query, Query));
+}
+
+bool AsmParserContext::addFunctionLocation(Function *F,
+                                           const FileLocRange &Loc) {
+  return Functions.insert({F, Loc}).second;
+}
+
+bool AsmParserContext::addBlockLocation(BasicBlock *BB,
+                                        const FileLocRange &Loc) {
+  return Blocks.insert({BB, Loc}).second;
+}
+
+bool AsmParserContext::addInstructionLocation(Instruction *I,
+                                              const FileLocRange &Loc) {
+  return Instructions.insert({I, Loc}).second;
+}
+
+} // namespace llvm
diff --git a/llvm/lib/AsmParser/CMakeLists.txt b/llvm/lib/AsmParser/CMakeLists.txt
index 20d0c50a029ca..dcfcc06f093a7 100644
--- a/llvm/lib/AsmParser/CMakeLists.txt
+++ b/llvm/lib/AsmParser/CMakeLists.txt
@@ -1,5 +1,6 @@
 # AsmParser
 add_llvm_component_library(LLVMAsmParser
+  AsmParserContext.cpp
   LLLexer.cpp
   LLParser.cpp
   Parser.cpp
diff --git a/llvm/lib/AsmParser/LLLexer.cpp b/llvm/lib/AsmParser/LLLexer.cpp
index 3d5bd6155536e..be5b2b9bce0ca 100644
--- a/llvm/lib/AsmParser/LLLexer.cpp
+++ b/llvm/lib/AsmParser/LLLexer.cpp
@@ -155,15 +155,6 @@ static bool isLabelChar(char C) {
          C == '.' || C == '_';
 }
 
-/// isLabelTail - Return true if this pointer points to a valid end of a label.
-static const char *isLabelTail(const char *CurPtr) {
-  while (true) {
-    if (CurPtr[0] == ':') return CurPtr+1;
-    if (!isLabelChar(CurPtr[0])) return nullptr;
-    ++CurPtr;
-  }
-}
-
 //===----------------------------------------------------------------------===//
 // Lexer definition.
 //===----------------------------------------------------------------------===//
@@ -174,27 +165,75 @@ LLLexer::LLLexer(StringRef StartBuf, SourceMgr &SM, SMDiagnostic &Err,
   CurPtr = CurBuf.begin();
 }
 
+const char *LLLexer::getLabelTail(const char *Ptr) {
+  while (Ptr != CurBuf.end()) {
+    if (Ptr[0] == ':')
+      return Ptr + 1;
+    if (!isLabelChar(Ptr[0]))
+      return nullptr;
+    ++Ptr;
+  }
+  return nullptr;
+}
+
 int LLLexer::getNextChar() {
-  char CurChar = *CurPtr++;
-  switch (CurChar) {
-  default: return (unsigned char)CurChar;
-  case 0:
-    // A nul character in the stream is either the end of the current buffer or
-    // a random nul in the file.  Disambiguate that here.
-    if (CurPtr-1 != CurBuf.end())
-      return 0;  // Just whitespace.
-
-    // Otherwise, return end of file.
-    --CurPtr;  // Another call to lex will return EOF again.
+  if (CurPtr == CurBuf.end())
     return EOF;
+  // Increment line number if this is the first character after a newline
+  if (CurPtr > CurBuf.begin() && *(CurPtr - 1) == '\n') {
+    CurLineNum++;
+    CurColNum = 0;
+  } else
+    CurColNum++;
+  return *CurPtr++;
+}
+
+const char *LLLexer::skipNChars(unsigned N) {
+  while (N--)
+    getNextChar();
+  return CurPtr;
+}
+
+void LLLexer::advancePositionTo(const char *Ptr) {
+  bool RecalculateColumn = false;
+  while (CurPtr != Ptr) {
+    if (CurPtr > Ptr) {
+      --CurPtr;
+      --CurColNum;
+      // Since CurPtr is one char ahead of the stored position, check if the
+      // previous char is not a newline
+      if (CurPtr != CurBuf.begin() && *(CurPtr - 1) == '\n') {
+        --CurLineNum;
+        RecalculateColumn = true;
+      }
+    } else
+      getNextChar();
+  }
+  if (RecalculateColumn) {
+    CurColNum = 0;
+    // Count the number of chars to the previous newline or start of buffer
+    for (const char *Ptr = CurPtr; Ptr != CurBuf.begin() && *(Ptr - 1) != '\n';
+         --Ptr, ++CurColNum)
+      ;
   }
 }
 
 lltok::Kind LLLexer::LexToken() {
+  // Set token end to next location, since the end is
+  // exclusive
+  if (CurPtr != CurBuf.begin() && *(CurPtr - 1) == '\n') {
+    PrevTokEndLineNum = CurLineNum + 1;
+    PrevTokEndColNum = 0;
+  } else {
+    PrevTokEndLineNum = CurLineNum;
+    PrevTokEndColNum = CurColNum + 1;
+  }
   while (true) {
     TokStart = CurPtr;
-
     int CurChar = getNextChar();
+    CurTokColNum = CurColNum;
+    CurTokLineNum = CurLineNum;
+
     switch (CurChar) {
     default:
       // Handle letters: [a-zA-Z_]
@@ -215,13 +254,13 @@ lltok::Kind LLLexer::LexToken() {
     case '%': return LexPercent();
     case '"': return LexQuote();
     case '.':
-      if (const char *Ptr = isLabelTail(CurPtr)) {
-        CurPtr = Ptr;
+      if (const char *Ptr = getLabelTail(CurPtr)) {
+        advancePositionTo(Ptr);
         StrVal.assign(TokStart, CurPtr-1);
         return lltok::LabelStr;
       }
       if (CurPtr[0] == '.' && CurPtr[1] == '.') {
-        CurPtr += 2;
+        skipNChars(2);
         return lltok::dotdotdot;
       }
       return lltok::Error;
@@ -298,15 +337,15 @@ lltok::Kind LLLexer::LexAt() {
 }
 
 lltok::Kind LLLexer::LexDollar() {
-  if (const char *Ptr = isLabelTail(TokStart)) {
-    CurPtr = Ptr;
+  if (const char *Ptr = getLabelTail(TokStart)) {
+    advancePositionTo(Ptr);
     StrVal.assign(TokStart, CurPtr - 1);
     return lltok::LabelStr;
   }
 
   // Handle DollarStringConstant: $\"[^\"]*\"
   if (CurPtr[0] == '"') {
-    ++CurPtr;
+    getNextChar();
 
     while (true) {
       int CurChar = getNextChar();
@@ -358,11 +397,11 @@ bool LLLexer::ReadVarName() {
   if (isalpha(static_cast<unsigned char>(CurPtr[0])) ||
       CurPtr[0] == '-' || CurPtr[0] == '$' ||
       CurPtr[0] == '.' || CurPtr[0] == '_') {
-    ++CurPtr;
+    getNextChar();
     while (isalnum(static_cast<unsigned char>(CurPtr[0])) ||
            CurPtr[0] == '-' || CurPtr[0] == '$' ||
            CurPtr[0] == '.' || CurPtr[0] == '_')
-      ++CurPtr;
+      getNextChar();
 
     StrVal.assign(NameStart, CurPtr);
     return true;
@@ -376,7 +415,8 @@ lltok::Kind LLLexer::LexUIntID(lltok::Kind Token) {
   if (!isdigit(static_cast<unsigned char>(CurPtr[0])))
     return lltok::Error;
 
-  for (++CurPtr; isdigit(static_cast<unsigned char>(CurPtr[0])); ++CurPtr)
+  for (getNextChar(); isdigit(static_cast<u...
[truncated]

Comment on lines 39 to 45
std::optional<Function *> getFunctionAtLocation(const FileLocRange &) const;
std::optional<Function *> getFunctionAtLocation(const FileLoc &) const;
std::optional<BasicBlock *> getBlockAtLocation(const FileLocRange &) const;
std::optional<BasicBlock *> getBlockAtLocation(const FileLoc &) const;
std::optional<Instruction *>
getInstructionAtLocation(const FileLocRange &) const;
std::optional<Instruction *> getInstructionAtLocation(const FileLoc &) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use nullptr here to indicate absence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 22 to 24
if (!Blocks.contains(BB))
return std::nullopt;
return Blocks.at(BB);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps we can use DenseMap::lookup_or here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that suggestion, makes the code a bit cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it sadly is not possible. Because the return type of lookup_or is the value type of the map, so it can't return a nullopt

Comment on lines 490 to 495
dbgs() << #Loc1 " location: " << Loc1.Start.Line << ":" \
<< Loc1.Start.Col << " - " << Loc1.End.Line << ":" \
<< Loc1.End.Col << "\n"; \
dbgs() << #Loc2 " location: " << Loc2.Start.Line << ":" \
<< Loc2.Start.Col << " - " << Loc2.End.Line << ":" \
<< Loc2.End.Col << "\n"; \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please wrap these line with LLVM_DEBUG. Though it's a bit unusual to have debug prints in a unit test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapped it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gtest has it's own support for printing additional debugging information if an expectation fails. In the simplest case you can write something like ASSERT_TRUE(AreLocsEqual) << .... As this test is not hot, you shouldn't need the more sophisticated methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also EXPECT seems more appropriate than ASSERT here.

@Bertik23
Copy link
Contributor Author

Gentle ping @mshockwave could you please have a look at the changes I did

@mshockwave mshockwave requested a review from nikic September 11, 2025 17:57
Copy link
Member

@mshockwave mshockwave left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me. Could you tag other potential reviewers (and I'll help you to add them as I figured you probably don't have the permissions to do that yet) to give another pass?

@Bertik23
Copy link
Contributor Author

I have not really an idea, most code this PR changes is from the initial lexer from Chris
Maybe @jurahul might be interested

#include "llvm/Support/SourceMgr.h"
#include "gtest/gtest.h"

#define DEBUG_TYPE "Unittest-asm-parser-tests"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's relatively rare to capitalize DEBUG_TYPE nowadays

Suggested change
#define DEBUG_TYPE "Unittest-asm-parser-tests"
#define DEBUG_TYPE "unittest-asm-parser-tests"

///
/// This information is optionally emitted by the LLParser while
/// it reads LLVM textual IR.
class AsmParserContext {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this class makes it sound like something that should be owned by the LLParser. Maybe something like LLParserLocationInfo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think AsmParserContext is better, as it's more future proof. In the future we might want to add some more info, other than location. As for the LLParser vs AsmParser, I think AsmParser is better since it uses what is parsed other then how it's parserd.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with @arichardson -- I'd expect AsmParserContext to be something internally used by AsmParser. What kind of future extensions do you have in mind here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is supposed to be used similarly to AsmParserState in MLIR, In tools, that need to know some things from the parser, in this case we want to use it to be able to access locations or instructions, functions, ... in IR files for the purposes of navigating in the file via a LSP server (such as goto definition/references)

@@ -0,0 +1,70 @@
//===-- AsmParserContext.h --------------------------------------*- C++ -*-===//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omit file name and emacs marker in new files.

///
/// This information is optionally emitted by the LLParser while
/// it reads LLVM textual IR.
class AsmParserContext {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with @arichardson -- I'd expect AsmParserContext to be something internally used by AsmParser. What kind of future extensions do you have in mind here?

Function *
AsmParserContext::getFunctionAtLocation(const FileLocRange &Query) const {
for (auto &[F, Loc] : Functions) {
if (Loc.contains(Query))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very inefficient...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually not that bad. When testing this even on large files the LSP was pretty responsive. So I wouldn't consider this an issue for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the changes in this file still necessary, given that you how determine the line/column based on the pointer, rather than updating it on the fly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes are needed, to track the locations of previous tokens.

Changing the manual pointer increments to getNextChar() is not crutial to this PR now, altho it would bring the benefit of eliminating some OOB errors. They were actually filed first in a separate PR #152103
I'm willing to remove them from this PR if requested

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that some of these changes still make sense, but with the "everything must go through getNextChar" constraint removed, some things here are pretty weird now, like how skipNChars() works by doing N individual increments. I'd prefer dropping these changes. If there are OOB issues, we should address them separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped the changes from this PR

Comment on lines 490 to 495
dbgs() << #Loc1 " location: " << Loc1.Start.Line << ":" \
<< Loc1.Start.Col << " - " << Loc1.End.Line << ":" \
<< Loc1.End.Col << "\n"; \
dbgs() << #Loc2 " location: " << Loc2.Start.Line << ":" \
<< Loc2.Start.Col << " - " << Loc2.End.Line << ":" \
<< Loc2.End.Col << "\n"; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gtest has it's own support for printing additional debugging information if an expectation fails. In the simplest case you can write something like ASSERT_TRUE(AreLocsEqual) << .... As this test is not hot, you shouldn't need the more sophisticated methods.

Comment on lines 490 to 495
dbgs() << #Loc1 " location: " << Loc1.Start.Line << ":" \
<< Loc1.Start.Col << " - " << Loc1.End.Line << ":" \
<< Loc1.End.Col << "\n"; \
dbgs() << #Loc2 " location: " << Loc2.Start.Line << ":" \
<< Loc2.Start.Col << " - " << Loc2.End.Line << ":" \
<< Loc2.End.Col << "\n"; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also EXPECT seems more appropriate than ASSERT here.


TEST(AsmParserTest, ParserObjectLocations) {
// Expected to fail with function location starting one character later, needs
// a fix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit unclear on what this comment refers to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was left there from a development iteration where the test was failing, removed the comment

@Bertik23
Copy link
Contributor Author

Bertik23 commented Oct 8, 2025

I hope I have addressed / responded all of the issues that nikic addressed


Instruction *
AsmParserContext::getInstructionAtLocation(const FileLocRange &Query) const {
for (auto &[I, Loc] : Instructions) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterating over all instructions seems rather inefficient e.g. if you're trying to get info towards the end of large file. Can't we have a list sorted by file location and do a binary search?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it doesn't matter if the lookup by function is more common. Do you have a link to code using this new class?

I'm not worried about performance for the LSP, optimizations can always be done later. Please don't interpret my review comments as negative towards this change, I am very excited to see this support land eventually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants