String only plugin support by Hungerarray · Pull Request #49 · vertexcover-io/locatr

Hungerarray · 2025-09-15T13:51:55Z

String Only plugin

This PR adds an additional plugin for DOM text evaluation mode, raw-text.
This plugin allows users to add DOM text directly, without making
additional calls to the corresponding automation platforms. Along with page source,
users also need to provide information on whether the page source is html and xml,
and if the platform is android, ios, or web. Depending on these values, the
logic for page source changes to fit the selected element.

Additions

raw-text plugin, that accepts page source when creation and provides working
implementation for text only plugin that can be using with DOMEvaluation mode.

Limitations

For every page source, we tend to trim of elements that aren't visible. In case of
android and ios all the style values that affect is provided in the xml itself,
however for web platform, html alone isn't the entire truth, there are cases
where css might have hidden the element. Since we don't have access to DOM
API or the CSS to query, we evaluate just on the basis of HTML, which may not be
the entire truth

Summary by CodeRabbit

New Features
- Added raw-text plugin to minify HTML/XML, produce DOM+locator maps, and extract first-unique IDs.
- Generates concise XPath selectors and validates locators against provided content.
- Added ability to disable reranking (affects sorting); visual analysis now requires reranker.
Chores
- Added an HTML/XPath parsing dependency to support DOM processing and XPath evaluation.

coderabbitai · 2025-09-15T13:52:03Z

Walkthrough

Adds HTML and XML processing utilities: an HTML DOM adapter, XPath optimalizer, HTML minifier and locator map builder, XML XPath validator, and a raw-text plugin. Also updates go.mod with an indirect htmlquery dependency and adds an option to disable the reranker with corresponding conditional logic in modes and Locatr initialization.

Changes

Cohort / File(s)	Summary
Dependencies `go.mod`	Adds indirect dependency `github.com/antchfx/htmlquery v1.3.4`.
HTML DOM Adapter `pkg/internal/html/adapter.go`	Adds `Document` and `Node` interfaces with `HTMLDoc`/`HTMLNode` implementations, constructors, XPath querying via htmlquery, node utilities (TagName, IsElement, attributes, parent/children, Index, Equal), and `IsValidXPath`.
HTML Minifier & Locator Map `pkg/internal/html/minifier.go`	Adds HTML parsing/minification, visibility heuristics, recursive ElementSpec construction (`MinifySource`), locator-map creation (`CreateLocatorMap`), and debug printers.
XPath Generation (HTML) `pkg/internal/html/xpath.go`	Adds `GetOptimalXPath` with prioritized attribute heuristics, two-attribute candidates, semi-unique tracking, and hierarchical index fallback; includes helper functions for candidate generation and uniqueness checks.
XML XPath Utilities `pkg/internal/xml/adapter.go`	Adds `IsValidXPath(xpath, dom string) (bool, error)` using xmlquery to validate XPath against XML content; updates imports to include `strings`.
Raw-Text Plugin `pkg/plugins/raw-text.go`	Adds `rawTextPlugin` for raw HTML/XML: PageType/Platform enums, constructor, `ErrModeNotSupported`, `GetMinifiedDOM`, `ExtractFirstUniqueID`, `IsLocatorValid`, and unsupported-mode methods returning `ErrModeNotSupported`.
Locatr config / Reranker Option `pkg/locatr.go`	Adds `WithRerankerDisabled` option and internal flag to skip creating a reranker client; prevents disabling reranker in VisualAnalysisMode.
DOM Analysis Mode `pkg/mode/dom-analysis.go`	Adds nil-check for `rerankerClient`: if present, perform rerank and sort chunks; if nil, skip reranking.
Visual Analysis Mode `pkg/mode/visual-analysis.go`	Adds precondition: return error if `rerankerClient` is nil (required for visual analysis).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Plugin as rawTextPlugin
  participant HTML as pkg/internal/html
  participant XML as pkg/internal/xml
  participant Types as types

  rect rgba(230,245,255,0.6)
  Note over Client,Plugin: GetMinifiedDOM (HTML or XML)
  Client->>Plugin: GetMinifiedDOM(ctx)
  alt pageType == HTML
    Plugin->>HTML: MinifySource(content)
    HTML-->>Plugin: *types.ElementSpec
    Plugin->>HTML: CreateLocatorMap(content)
    HTML-->>Plugin: map[id][]xpath
    Plugin->>Types: Build types.DOM
  else pageType == XML
    Plugin->>XML: MinifySource(content)
    XML-->>Plugin: *types.ElementSpec
    Plugin->>XML: CreateLocatorMap(content)
    XML-->>Plugin: map[id][]xpath
    Plugin->>Types: Build types.DOM
  end
  Plugin-->>Client: *types.DOM
  end

sequenceDiagram
  autonumber
  actor Client
  participant Plugin as rawTextPlugin
  participant HTML as pkg/internal/html
  participant XML as pkg/internal/xml

  rect rgba(240,255,240,0.6)
  Note over Client,Plugin: IsLocatorValid
  Client->>Plugin: IsLocatorValid(ctx, locator)
  alt HTML page
    Plugin->>HTML: IsValidXPath(locator, content)
    HTML-->>Plugin: bool, err
  else XML page
    Plugin->>XML: IsValidXPath(locator, content)
    XML-->>Plugin: bool, err
  end
  Plugin-->>Client: bool, err
  end

sequenceDiagram
  autonumber
  participant Doc as HTMLDoc
  participant Node as HTMLNode
  participant X as GetOptimalXPath

  rect rgba(255,248,230,0.6)
  Note over Doc,X: High-level XPath selection flow
  Node->>X: GetOptimalXPath(doc, node)
  X->>Doc: doc.Find(xpathCandidate)
  alt Unique match found
    X-->>Node: return attribute-based XPath
  else No unique candidate
    X->>Node: GetParent()/Index()
    X-->>Node: return hierarchical /.../tag[n]
  end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nibbled through tags in a moonlit scroll,
XPath trails twinkled—each a rabbit hole.
I stitched a map of nodes so neat,
With minified leaves beneath my feet.
Now plugins hum and locators bloom—🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Title Check	❓ Inconclusive	The title “String only plugin support” is too vague and does not clearly convey that a new raw-text plugin for direct HTML/XML input was added; it uses non-descriptive terminology and omits the core change, making it hard for teammates to understand the purpose of the PR at a glance.	Consider renaming the title to explicitly reference the new raw-text plugin and its purpose, for example “Add raw-text plugin for direct HTML/XML string input” so that the main change is clear and specific.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch text-evaluation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/internal/xml/adapter.go (1)

55-57: Nil-safety: prevent panics when parent is nil and when receiver wraps a nil node.

Current value receivers and unconditional parent wrapping can panic (e.g., recursive calls that walk to a nil parent). Make receivers pointer-based with nil-guards and return nil when no parent.

Apply this diff:

-func (n XMLNode) TagName() string {
-	return n.node.Data
-}
+func (n *XMLNode) TagName() string {
+	if n == nil || n.node == nil {
+		return ""
+	}
+	return n.node.Data
+}

-func (n XMLNode) IsElement() bool {
-	return n.node.Type == xmlquery.ElementNode
-}
+func (n *XMLNode) IsElement() bool {
+	if n == nil || n.node == nil {
+		return false
+	}
+	return n.node.Type == xmlquery.ElementNode
+}

-func (n *XMLNode) GetParent() Node {
-	return NewXMLNode(n.node.Parent)
-}
+func (n *XMLNode) GetParent() Node {
+	if n == nil || n.node == nil || n.node.Parent == nil {
+		return nil
+	}
+	return NewXMLNode(n.node.Parent)
+}

-func (n *XMLNode) ChildNodes() []Node {
+func (n *XMLNode) ChildNodes() []Node {
+	if n == nil || n.node == nil {
+		return nil
+	}
 	var nodes []Node
@@
-}
+}

-func (n *XMLNode) Equal(n1 Node) bool {
+func (n *XMLNode) Equal(n1 Node) bool {
+	if n == nil {
+		_, ok := n1.(*XMLNode)
+		return !ok // nil receiver equal only to nil *XMLNode in interface
+	}
 	xn1, ok := n1.(*XMLNode)
 	if !ok {
 		return false
 	}
 	return n.node == xn1.node
 }

-func (n *XMLNode) Index() int {
-	if n.node.Parent == nil {
+func (n *XMLNode) Index() int {
+	if n == nil || n.node == nil || n.node.Parent == nil {
 		return 1
 	}

Also applies to: 59-61, 76-78, 80-88, 90-97, 99-116

🧹 Nitpick comments (10)

pkg/internal/xml/adapter.go (1)

118-129: Reasonable helper; consider parsed‑doc overload to avoid repeated parsing in loops.

IsValidXPath reparses the DOM string on each call. If used in a loop, an overload that accepts a parsed *xmlquery.Node (or your Document) would reduce CPU/allocs.

pkg/internal/html/adapter.go (1)

119-130: IsValidXPath is fine; align with XML helper and consider parsed‑doc overload.

Parity looks good. Add an overload that accepts *html.Node/Document to avoid repeated parsing hot paths.
pkg/internal/html/xpath.go (1)
35-39: Typos in comments.

Use “hierarchical” and “prepend”.
-// than hierarchial query.
+// than hierarchical query.
@@
-// otherwise fall back to a purely hierarchial expression
+// otherwise fall back to a purely hierarchical expression
@@
-// Make a recursive call to this nodes parents and preprend it to this xpath
+// Make a recursive call to this node's parent and prepend it to this xpath
Also applies to: 67-71, 90-93
pkg/internal/html/minifier.go (2)
83-90: Inline style visibility: match common “opacity:0” and collapsed visibility.

Include opacity:0 (no space) and visibility:collapse to catch more hidden cases.
-		if strings.Contains(s, "display:none") ||
-			strings.Contains(s, "visibility:hidden") ||
-			strings.Contains(s, "opacity: 0") {
+		if strings.Contains(s, "display:none") ||
+			strings.Contains(s, "visibility:hidden") ||
+			strings.Contains(s, "visibility:collapse") ||
+			strings.Contains(s, "opacity:0") ||
+			strings.Contains(s, "opacity: 0") {
 			return false
 		}
143-156: Typo: PrintLocatrs → PrintLocators (debug helper).

Minor naming nit; helps grepping when needed.
-func PrintLocatrs(locatrs []string) {
+func PrintLocators(locatrs []string) {
pkg/plugins/raw-text.go (5)
13-15: Typo in error message + add explicit error kinds used by guards.

Apply this diff:
 var (
-	ErrModeNotSupported = errors.New("raw-text plugin only supports DOM evalutaion mode")
+	ErrModeNotSupported = errors.New("raw-text plugin only supports DOM evaluation mode")
+	ErrInvalidPageType  = errors.New("invalid page type")
+	ErrInvalidPlatform  = errors.New("invalid platform")
+	ErrEmptyContent     = errors.New("empty page source")
 )
66-86: Wrap errors with context and fix locatrMap typo.

Apply this diff:
 	eSpec, err := html.MinifySource(pageSource)
 	if err != nil {
-		return nil, err
+		return nil, fmt.Errorf("html minify: %w", err)
 	}
-	locatrMap, err := html.CreateLocatorMap(pageSource)
+	locatorMap, err := html.CreateLocatorMap(pageSource)
 	if err != nil {
-		return nil, err
+		return nil, fmt.Errorf("html locator map: %w", err)
 	}
 
 	dom := &types.DOM{
 		RootElement: eSpec,
 		Metadata: &types.DOMMetadata{
 			LocatorType: types.XPathType,
-			LocatorMap:  locatrMap,
+			LocatorMap:  locatorMap,
 		},
 	}
 	return dom, nil
124-131: Validate empty locators and handle unknown PageType explicitly.

Apply this diff:
 func (plugin *rawTextPlugin) IsLocatorValid(ctx context.Context, locator string) (bool, error) {
-	pageSource := plugin.content
-
-	if plugin.pageType == HTMLPageType {
-		return html.IsValidXPath(locator, pageSource)
-	}
-	return xml.IsValidXPath(locator, pageSource)
+	if strings.TrimSpace(locator) == "" {
+		return false, errors.New("empty locator")
+	}
+	pageSource := plugin.content
+	switch plugin.pageType {
+	case HTMLPageType:
+		return html.IsValidXPath(locator, pageSource)
+	case XMLPageType:
+		return xml.IsValidXPath(locator, pageSource)
+	default:
+		return false, ErrInvalidPageType
+	}
 }
39-50: Optional: implement String() for Platform and make Str() a delegator.

Idiomatic Go favors Stringer. Keep Str() for compatibility and delegate to String().

Add:
func (p Platform) String() string {
	switch p {
	case AndroidPlatform:
		return androidText
	case IosPlatform:
		return iosText
	case WebPlatform:
		return webText
	default:
		return invalidText
	}
}
Then:
-func (platform Platform) Str() string {
-	switch platform {
-	case AndroidPlatform:
-		return androidText
-	case IosPlatform:
-		return iosText
-	case WebPlatform:
-		return webText
-	default:
-		return invalidText
-	}
-}
+func (platform Platform) Str() string { return platform.String() }
117-122: API naming clarity: method name implies HTML-only but returns XML IDs when PageType is XML.

Consider renaming to ExtractFirstUniqueID and keeping this as a shim for compatibility, or at least add a doc comment clarifying the behavior.

Would changing the interface to a neutral method name be acceptable across callers?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9cfd0b2 and f8dd83e.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (6)

go.mod (1 hunks)
pkg/internal/html/adapter.go (1 hunks)
pkg/internal/html/minifier.go (1 hunks)
pkg/internal/html/xpath.go (1 hunks)
pkg/internal/xml/adapter.go (2 hunks)
pkg/plugins/raw-text.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

pkg/internal/html/xpath.go (1)

pkg/internal/html/adapter.go (2)

Document (10-12)

Node (14-23)

pkg/internal/html/minifier.go (4)

pkg/internal/html/adapter.go (3)

Node (14-23)

NewHTMLDoc (29-31)

NewHTMLNode (52-54)

pkg/types/dom.go (1)

ElementSpec (14-20)

pkg/internal/html/xpath.go (1)

GetOptimalXPath (18-93)

pkg/internal/utils/utils.go (1)

GenerateUniqueId (324-327)

pkg/plugins/raw-text.go (2)

pkg/types/dom.go (3)

DOM (49-52)

DOMMetadata (43-46)

XPathType (10-10)

pkg/internal/utils/utils.go (2)

ExtractFirstUniqueHTMLID (187-245)

ExtractFirstUniqueXMLID (152-173)

coderabbitai · 2025-09-15T14:03:12Z

go.mod

 )

 require (
+	github.com/antchfx/htmlquery v1.3.4 // indirect


💡 Verification agent

🧩 Analysis chain

Make htmlquery a direct dependency (not indirect).

github.com/antchfx/htmlquery is imported directly in the codebase, so it should not be marked // indirect. This can cause drift and unexpected tidy rewrites.

Apply this diff:

- github.com/antchfx/htmlquery v1.3.4 // indirect + github.com/antchfx/htmlquery v1.3.4

Run to confirm direct imports and tidy the module:

🏁 Script executed:

#!/bin/bash # Verify direct imports exist rg -nP --type=go '^\s*import\s*\(\s*(?s).*?github\.com/antchfx/htmlquery' -C2 || rg -nP --type=go '^\s*import\s*"github\.com/antchfx/htmlquery"' # Show current go.mod line rg -n '^(\s*)github\.com/antchfx/htmlquery\b' go.mod -n # Suggest tidying (to be run locally) echo "Run: go mod tidy"

Length of output: 68

🏁 Script executed:

#!/bin/bash # Search for direct imports of htmlquery package rg -n "github.com/antchfx/htmlquery" --type=go # Also search for any usage of htmlquery functions/types rg -n "htmlquery\." --type=go -A1 -B1 # Check if htmlquery is used in any import statements (both single line and multi-line) rg -nP --type=go 'import.*"github\.com/antchfx/htmlquery"' || rg -nP --type=go -U '(?s)import\s*\([^)]*github\.com/antchfx/htmlquery[^)]*\)'

Length of output: 1776

Make github.com/antchfx/htmlquery a direct dependency (remove // indirect from go.mod)

github.com/antchfx/htmlquery is imported directly in pkg/internal/html/adapter.go and pkg/internal/html/minifier.go — remove the // indirect marker in go.mod and run go mod tidy. (go.mod: line ~23)

- github.com/antchfx/htmlquery v1.3.4 // indirect + github.com/antchfx/htmlquery v1.3.4

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

github.com/antchfx/htmlquery v1.3.4 // indirect

github.com/antchfx/htmlquery v1.3.4

🤖 Prompt for AI Agents

In go.mod around line 23, the dependency github.com/antchfx/htmlquery is marked as indirect though it is imported directly in pkg/internal/html/adapter.go and pkg/internal/html/minifier.go; remove the "// indirect" marker from that require line (making it a direct dependency) and then run `go mod tidy` to update module metadata and ensure go.sum is correct.

coderabbitai · 2025-09-15T14:03:13Z

pkg/internal/html/adapter.go

+func (n HTMLNode) TagName() string {
+	return n.node.Data
+}


⚠️ Potential issue

Nil-safety: fix potential panics on root recursion and nil receivers.

Same issues as XML adapter. Use pointer receivers with guards and return nil parent explicitly to keep GetOptimalXPath recursion safe.

Apply this diff:

-func (n HTMLNode) TagName() string { - return n.node.Data -} +func (n *HTMLNode) TagName() string { + if n == nil || n.node == nil { + return "" + } + return n.node.Data +} -func (n HTMLNode) IsElement() bool { - return n.node.Type == html.ElementNode -} +func (n *HTMLNode) IsElement() bool { + if n == nil || n.node == nil { + return false + } + return n.node.Type == html.ElementNode +} -func (n *HTMLNode) GetParent() Node { - return NewHTMLNode(n.node.Parent) -} +func (n *HTMLNode) GetParent() Node { + if n == nil || n.node == nil || n.node.Parent == nil { + return nil + } + return NewHTMLNode(n.node.Parent) +} -func (n *HTMLNode) ChildNodes() []Node { +func (n *HTMLNode) ChildNodes() []Node { + if n == nil || n.node == nil { + return nil + } var nodes []Node @@ -} +} -func (n *HTMLNode) Equal(n1 Node) bool { +func (n *HTMLNode) Equal(n1 Node) bool { + if n == nil { + _, ok := n1.(*HTMLNode) + return !ok + } xn1, ok := n1.(*HTMLNode) if !ok { return false } return n.node == xn1.node } -func (n *HTMLNode) Index() int { - if n.node.Parent == nil { +func (n *HTMLNode) Index() int { + if n == nil || n.node == nil || n.node.Parent == nil { return 1 }

Also applies to: 60-62, 77-79, 81-89, 91-98, 100-117

🤖 Prompt for AI Agents

In pkg/internal/html/adapter.go around lines 56-58 (and similarly for 60-62, 77-79, 81-89, 91-98, 100-117), change the HTMLNode methods to use pointer receivers and add nil-safety guards: if the receiver is nil or the underlying n.node is nil return safe zero values (for TagName and similar string-returning methods return empty string; for Parent/ParentNode-returning methods return nil explicitly) so recursive GetOptimalXPath calls don't panic; ensure every method checks receiver and n.node before accessing fields and adjust signatures to func (n *HTMLNode) ... accordingly.

coderabbitai · 2025-09-15T14:03:13Z

pkg/internal/html/minifier.go

+func getVisibleText(element *html.Node) string {
+	txt := element.Data
+	return escapeString(strings.TrimSpace(txt))
+}


🛠️ Refactor suggestion

⚠️ Potential issue

Incorrect element “text” extraction and inconsistent document root for XPath generation.

getVisibleText returns the tag name, not the element’s textual content.

MinifySource builds XPaths with the first element as the Document root, while CreateLocatorMap uses the parsed root. This inconsistency can yield different XPaths/IDs for the same element.

Apply this diff:

-func getVisibleText(element *html.Node) string { - txt := element.Data - return escapeString(strings.TrimSpace(txt)) -} +func getVisibleText(element *html.Node) string { + if element == nil { + return "" + } + var b strings.Builder + var walk func(*html.Node) + walk = func(n *html.Node) { + if n.Type == html.TextNode { + t := strings.TrimSpace(n.Data) + if t != "" { + if b.Len() > 0 { + b.WriteByte(' ') + } + b.WriteString(t) + } + return + } + for c := n.FirstChild; c != nil; c = c.NextSibling { + walk(c) + } + } + walk(element) + return escapeString(b.String()) +} @@ - spec, err := createElementSpec(node, node) + spec, err := createElementSpec(node, root)

Also applies to: 169-183, 193-199

coderabbitai · 2025-09-15T14:03:13Z

pkg/internal/html/xpath.go

+	isNodeName := len(attrs) == 0
+	if isNodeName {
+		xpath := fmt.Sprintf("//%s", domNode.TagName())
+
+		isUnique, _ := determineXpathUniqueness(xpath, doc, domNode)
+		if isUnique {
+			if !domNode.HasParent() {
+				xpath = fmt.Sprintf("/%s", domNode.TagName())
+			}
+			return true, true, xpath
+		}
+		return false, false, ""
+	}


🛠️ Refactor suggestion

⚠️ Potential issue

Fix uniqueness logic and avoid invalid XPaths; also escape attribute values.

determineXpathUniqueness returns true for 0 or 1 mismatched results, which is incorrect.

Building a semi-unique XPath with idx 0 yields invalid (...)[0].

Attribute values are embedded unescaped; quotes inside values break XPath literals.

Avoid panic on malformed pairs; keep this resilient.

Apply this diff:

@@ - if isNodeName { - xpath := fmt.Sprintf("//%s", domNode.TagName()) + if isNodeName { + xpath := fmt.Sprintf("//%s", domNode.TagName()) @@ - isUnique, _ := determineXpathUniqueness(xpath, doc, domNode) + isUnique, _ := determineXpathUniqueness(xpath, doc, domNode) if isUnique { if !domNode.HasParent() { xpath = fmt.Sprintf("/%s", domNode.TagName()) } return true, true, xpath } return false, false, "" } @@ - isPair := len(strings.Fields(attrs[0])) > 1 + isPair := len(strings.Fields(attrs[0])) > 1 @@ - if isPair { - attr1, attr2, ok := strings.Cut(attr, " ") - if !ok { - panic("generateUniqueXPATH invalid state") - } + if isPair { + attr1, attr2, ok := strings.Cut(attr, " ") + if !ok { + continue + } @@ - xpath = fmt.Sprintf( - "//%s[@%s=\"%s\" and @%s=\"%s\"]", - tagForXpath, - attr1, attr1Value, - attr2, attr2Value, - ) + xpath = fmt.Sprintf("//%s[@%s=%s and @%s=%s]", + tagForXpath, + attr1, quoteXPathLiteral(attr1Value), + attr2, quoteXPathLiteral(attr2Value), + ) } else { attrValue := domNode.GetAttribute(attr) if attrValue == "" { continue } - xpath = fmt.Sprintf( - "//%s[@%s=\"%s\"]", - tagForXpath, - attr, attrValue, - ) + xpath = fmt.Sprintf("//%s[@%s=%s]", + tagForXpath, + attr, quoteXPathLiteral(attrValue), + ) } @@ - isUnique, idx := determineXpathUniqueness(xpath, doc, domNode) + isUnique, idx := determineXpathUniqueness(xpath, doc, domNode) if isUnique { uniqueXPath = xpath break } - if semiUniqueXPath == "" { - semiUniqueXPath = fmt.Sprintf("(%s)[%d]", xpath, idx) + if semiUniqueXPath == "" && idx > 0 { + semiUniqueXPath = fmt.Sprintf("(%s)[%d]", xpath, idx) } } @@ -func determineXpathUniqueness(xpath string, doc Document, domNode Node) (bool, int) { - elems := doc.Find(xpath) - if len(elems) > 1 { - idx := slices.IndexFunc(elems, func(node Node) bool { - return domNode.Equal(node) - }) - return false, idx + 1 - } - return true, 0 -} +func determineXpathUniqueness(xpath string, doc Document, domNode Node) (bool, int) { + elems := doc.Find(xpath) + switch len(elems) { + case 0: + return false, 0 + case 1: + if domNode.Equal(elems[0]) { + return true, 1 + } + return false, 0 + default: + idx := slices.IndexFunc(elems, func(node Node) bool { + return domNode.Equal(node) + }) + if idx >= 0 { + return false, idx + 1 + } + return false, 0 + } +} + +// quoteXPathLiteral quotes s as an XPath literal, handling embedded quotes via concat() when needed. +func quoteXPathLiteral(s string) string { + switch { + case !strings.ContainsAny(s, `'"`): + return "'" + s + "'" + case !strings.Contains(s, `'`): + return "'" + s + "'" + case !strings.Contains(s, `"`): + return `"` + s + `"` + default: + parts := strings.Split(s, `"`) + var b strings.Builder + b.WriteString("concat(") + for i, p := range parts { + if i > 0 { + b.WriteString(`, '"', `) + } + b.WriteString("'" + p + "'") + } + b.WriteString(")") + return b.String() + } +}

Also applies to: 138-176, 187-196

coderabbitai · 2025-09-15T14:03:13Z

pkg/plugins/raw-text.go

+import (
+	"context"
+	"errors"
+
+	"github.com/vertexcover-io/locatr/pkg/internal/html"
+	"github.com/vertexcover-io/locatr/pkg/internal/utils"
+	"github.com/vertexcover-io/locatr/pkg/internal/xml"
+	"github.com/vertexcover-io/locatr/pkg/types"
+)


🛠️ Refactor suggestion

Add missing imports for error wrapping and content checks.

Apply this diff:

import ( "context" "errors" + "fmt" + "strings" "github.com/vertexcover-io/locatr/pkg/internal/html" "github.com/vertexcover-io/locatr/pkg/internal/utils" "github.com/vertexcover-io/locatr/pkg/internal/xml" "github.com/vertexcover-io/locatr/pkg/types" )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

import (

"context"

"errors"

"github.com/vertexcover-io/locatr/pkg/internal/html"

"github.com/vertexcover-io/locatr/pkg/internal/utils"

"github.com/vertexcover-io/locatr/pkg/internal/xml"

"github.com/vertexcover-io/locatr/pkg/types"

)

import (

"context"

"errors"

"fmt"

"strings"

"github.com/vertexcover-io/locatr/pkg/internal/html"

"github.com/vertexcover-io/locatr/pkg/internal/utils"

"github.com/vertexcover-io/locatr/pkg/internal/xml"

"github.com/vertexcover-io/locatr/pkg/types"

)

🤖 Prompt for AI Agents

In pkg/plugins/raw-text.go around lines 3 to 11, the file is missing imports needed for wrapping errors and performing content checks; add the standard "fmt" package to wrap/annotate errors with fmt.Errorf("%w", err) and add the "strings" package to run content checks (e.g., TrimSpace/HasPrefix/Contains) where the code validates or examines text content, then update the error construction sites to use fmt.Errorf for wrapping and replace any ad-hoc content checks with the appropriate strings helpers.

coderabbitai · 2025-09-15T14:03:13Z

pkg/plugins/raw-text.go

+func (plugin *rawTextPlugin) minifyXML() (*types.DOM, error) {
+	pageSource := plugin.content
+	platform := plugin.platform.Str()
+
+	eSpec, err := xml.MinifySource(pageSource, platform)
+	if err != nil {
+		return nil, err
+	}
+	locatrMap, err := xml.CreateLocatorMap(pageSource, platform)
+	if err != nil {
+		return nil, err
+	}
+	dom := &types.DOM{
+		RootElement: eSpec,
+		Metadata: &types.DOMMetadata{
+			LocatorType: types.XPathType,
+			LocatorMap:  locatrMap,
+		},
+	}
+	return dom, nil
+}


⚠️ Potential issue

XML path should reject unknown platforms; also wrap underlying errors and fix a variable name nit.

Apply this diff:

func (plugin *rawTextPlugin) minifyXML() (*types.DOM, error) { pageSource := plugin.content platform := plugin.platform.Str() + if platform == invalidText { + return nil, ErrInvalidPlatform + } + eSpec, err := xml.MinifySource(pageSource, platform) if err != nil { - return nil, err + return nil, fmt.Errorf("xml minify: %w", err) } - locatrMap, err := xml.CreateLocatorMap(pageSource, platform) + locatorMap, err := xml.CreateLocatorMap(pageSource, platform) if err != nil { - return nil, err + return nil, fmt.Errorf("xml locator map: %w", err) } dom := &types.DOM{ RootElement: eSpec, Metadata: &types.DOMMetadata{ LocatorType: types.XPathType, - LocatorMap: locatrMap, + LocatorMap: locatorMap, }, } return dom, nil }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func (plugin *rawTextPlugin) minifyXML() (*types.DOM, error) {

pageSource := plugin.content

platform := plugin.platform.Str()

eSpec, err := xml.MinifySource(pageSource, platform)

if err != nil {

return nil, err

}

locatrMap, err := xml.CreateLocatorMap(pageSource, platform)

if err != nil {

return nil, err

}

dom := &types.DOM{

RootElement: eSpec,

Metadata: &types.DOMMetadata{

LocatorType: types.XPathType,

LocatorMap: locatrMap,

},

}

return dom, nil

}

func (plugin *rawTextPlugin) minifyXML() (*types.DOM, error) {

pageSource := plugin.content

platform := plugin.platform.Str()

if platform == invalidText {

return nil, ErrInvalidPlatform

}

eSpec, err := xml.MinifySource(pageSource, platform)

if err != nil {

return nil, fmt.Errorf("xml minify: %w", err)

}

locatorMap, err := xml.CreateLocatorMap(pageSource, platform)

if err != nil {

return nil, fmt.Errorf("xml locator map: %w", err)

}

dom := &types.DOM{

RootElement: eSpec,

Metadata: &types.DOMMetadata{

LocatorType: types.XPathType,

LocatorMap: locatorMap,

},

}

return dom, nil

}

🤖 Prompt for AI Agents

In pkg/plugins/raw-text.go around lines 88 to 108, the code should reject unknown platforms, wrap underlying errors, and fix a variable-name nit: validate plugin.platform.Str() against accepted platform strings (return an error like fmt.Errorf("xml: unknown platform %q", platform) on default), rename locatrMap to locatorMap, and when returning errors from xml.MinifySource and xml.CreateLocatorMap wrap them with context using fmt.Errorf("minify xml: %w", err) and fmt.Errorf("create locator map: %w", err) respectively so callers get wrapped errors with clear context.

coderabbitai · 2025-09-15T14:03:13Z

pkg/plugins/raw-text.go

+func (plugin *rawTextPlugin) GetMinifiedDOM(ctx context.Context) (*types.DOM, error) {
+	if plugin.pageType == HTMLPageType {
+		return plugin.minifyHTML()
+	}
+	return plugin.minifyXML()
+}


💡 Verification agent

🧩 Analysis chain

Blocker: invalid PageType currently falls through to XML; also missing empty-content guard.

Ensure we don’t silently run XML logic for unknown values and fail fast on empty page source.

Apply this diff:

func (plugin *rawTextPlugin) GetMinifiedDOM(ctx context.Context) (*types.DOM, error) { - if plugin.pageType == HTMLPageType { - return plugin.minifyHTML() - } - return plugin.minifyXML() + if strings.TrimSpace(plugin.content) == "" { + return nil, ErrEmptyContent + } + switch plugin.pageType { + case HTMLPageType: + return plugin.minifyHTML() + case XMLPageType: + return plugin.minifyXML() + default: + return nil, ErrInvalidPageType + } }

Run to surface all call sites and double‑check correct PageType/Platform combos:

🏁 Script executed:

#!/bin/bash rg -nP -C2 --type=go 'NewRawTextPlugin\('

Length of output: 275

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "=== cat pkg/plugins/raw-text.go ===" nl -ba pkg/plugins/raw-text.go | sed -n '1,240p' || true echo echo "=== search repo for symbols ===" rg -n --hidden --no-ignore-vcs 'NewRawTextPlugin\(' || true rg -n --hidden --no-ignore-vcs '\brawTextPlugin\b' || true rg -n --hidden --no-ignore-vcs '\bGetMinifiedDOM\(' || true rg -n --hidden --no-ignore-vcs 'minifyHTML\(|minifyXML\(' || true rg -n --hidden --no-ignore-vcs '\btype\s+PageType\b|HTMLPageType|XMLPageType' -S || true rg -n --hidden --no-ignore-vcs 'ErrEmptyContent|ErrInvalidPageType' -S || true rg -n --hidden --no-ignore-vcs '\bcontent\b' -S pkg/plugins || true

Length of output: 4788

Blocker — fail fast on empty content & make PageType handling explicit in raw-text plugin (pkg/plugins/raw-text.go)

Replace fall‑through XML path in GetMinifiedDOM (pkg/plugins/raw-text.go:110–115) with an empty‑content guard (return ErrEmptyContent when strings.TrimSpace(plugin.content) == "") and a switch on plugin.pageType with cases HTMLPageType and XMLPageType; default must return ErrInvalidPageType.

Define ErrEmptyContent and ErrInvalidPageType (or reuse existing error values) and add the strings import if missing.

Apply the same explicit PageType validation to other methods that branch on pageType (ExtractFirstUniqueHTMLID, IsLocatorValid, GetCurrentContext — ~lines 117–137) to avoid silent XML fall‑through.

Confirm callers construct rawTextPlugin with valid PageType/Platform values (rg shows NewRawTextPlugin is only defined here; verify external usages if any).

🤖 Prompt for AI Agents

In pkg/plugins/raw-text.go around lines 110–115 (and extending through ~117–137), replace the current fall-through XML path in GetMinifiedDOM with an early empty-content guard that returns ErrEmptyContent when strings.TrimSpace(plugin.content) == "" and then use a switch on plugin.pageType with explicit cases for HTMLPageType -> plugin.minifyHTML() and XMLPageType -> plugin.minifyXML(), returning ErrInvalidPageType in the default; define ErrEmptyContent and ErrInvalidPageType (or reuse existing error values) and add the strings import if missing. Apply the same explicit pageType validation (switch with HTMLPageType and XMLPageType and default ErrInvalidPageType) to ExtractFirstUniqueHTMLID, IsLocatorValid, and GetCurrentContext to avoid silent XML fall-through. Finally, verify callers construct rawTextPlugin with valid PageType/Platform values (check NewRawTextPlugin and any external usages) and update tests or callsites if they rely on the old implicit behavior.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

pkg/mode/visual-analysis.go (1)
83-85: Fail fast before expensive DOM work

Thanks for adding the guard. Could we move it right after m.applyDefaults() so we short‑circuit before calling GetMinifiedDOM and splitting chunks? That avoids needless DOM/minifier work (and any plugin side effects) when the reranker is missing.
@@
 	m.applyDefaults()
 
+	if rerankerClient == nil {
+		return fmt.Errorf("reranker client is required for visual analysis mode")
+	}
+
 	dom, err := plugin.GetMinifiedDOM(ctx)
 	if err != nil {
 		return err
@@
-	if rerankerClient == nil {
-		return fmt.Errorf("reranker client is required for visual analysis mode")
-	}

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8dd83e and 41a00df.

📒 Files selected for processing (3)

pkg/locatr.go (4 hunks)
pkg/mode/dom-analysis.go (1 hunks)
pkg/mode/visual-analysis.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

pkg/locatr.go (4)

pkg/types/llm.go (1)

LLMClientInterface (9-19)

pkg/types/reranker.go (1)

RerankerClientInterface (22-24)

pkg/types/locatr.go (1)

LocatrMode (24-34)

pkg/mode/visual-analysis.go (1)

VisualAnalysisMode (43-48)

pkg/mode/dom-analysis.go (2)

pkg/types/reranker.go (1)

RerankRequest (9-13)

pkg/internal/utils/utils.go (1)

SortRerankChunks (118-138)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit (3.12, ubuntu-latest)

🔇 Additional comments (3)

pkg/mode/dom-analysis.go (1)

70-81: Reranker guard reads clean

The conditional keeps the old behavior when a reranker is supplied and gracefully skips the call when it’s disabled. Looks good.

pkg/locatr.go (2)

37-62: Configuration toggle looks solid

The new disableReranker flag and option wire-up are straightforward and keep the config surface tidy. No concerns here.

124-145: Initialization & mode validation make sense

Nulling out the reranker when disabled and blocking the combination with VisualAnalysisMode keeps the constructor’s invariants intact. Nicely scoped.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41a00df and 6accc44.

📒 Files selected for processing (1)

pkg/plugins/raw-text.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

pkg/plugins/raw-text.go (3)

pkg/types/dom.go (3)

DOM (49-52)

DOMMetadata (43-46)

XPathType (10-10)

pkg/internal/utils/utils.go (2)

ExtractFirstUniqueHTMLID (187-245)

ExtractFirstUniqueXMLID (152-173)

pkg/types/plugin.go (2)

Location (6-11)

PluginInterface (14-40)

🔇 Additional comments (3)

pkg/plugins/raw-text.go (3)
88-108: Reject unknown platforms before minifying XML (and wrap failures).

When plugin.platform has any value outside the enum, Str() returns "n/a", and we blindly hand that to xml.MinifySource/xml.CreateLocatorMap. Today this bubbles up as opaque parser failures and, depending on the implementation details, can even build an invalid locator map. Please fail fast on the unsupported platform and wrap the downstream errors with context. You’ll need to add a fmt import.
 func (plugin *rawTextPlugin) minifyXML() (*types.DOM, error) {
 	pageSource := plugin.content
 	platform := plugin.platform.Str()
 
+	if platform == invalidText {
+		return nil, fmt.Errorf("raw-text: unsupported platform %d", plugin.platform)
+	}
+
 	eSpec, err := xml.MinifySource(pageSource, platform)
 	if err != nil {
-		return nil, err
+		return nil, fmt.Errorf("minify xml source: %w", err)
 	}
-	locatrMap, err := xml.CreateLocatorMap(pageSource, platform)
+	locatorMap, err := xml.CreateLocatorMap(pageSource, platform)
 	if err != nil {
-		return nil, err
+		return nil, fmt.Errorf("create xml locator map: %w", err)
 	}
 	dom := &types.DOM{
 		RootElement: eSpec,
 		Metadata: &types.DOMMetadata{
 			LocatorType: types.XPathType,
-			LocatorMap:  locatrMap,
+			LocatorMap:  locatorMap,
 		},
 	}
 	return dom, nil
 }
110-115: Fail fast on empty content and invalid page types.

Right now any non-HTML PageType silently falls through to the XML path, so a bad enum value causes HTML payloads to be parsed as XML, producing confusing errors. We also try to minify empty/whitespace-only content, which just defers failure to the parser. Please guard on content first, then switch on the two supported page types. This needs a strings import (and continues to rely on fmt from the previous change).
 func (plugin *rawTextPlugin) GetMinifiedDOM(ctx context.Context) (*types.DOM, error) {
-	if plugin.pageType == HTMLPageType {
-		return plugin.minifyHTML()
-	}
-	return plugin.minifyXML()
+	if strings.TrimSpace(plugin.content) == "" {
+		return nil, fmt.Errorf("raw-text: empty page source")
+	}
+
+	switch plugin.pageType {
+	case HTMLPageType:
+		return plugin.minifyHTML()
+	case XMLPageType:
+		return plugin.minifyXML()
+	default:
+		return nil, fmt.Errorf("raw-text: unsupported page type %d", plugin.pageType)
+	}
 }
117-131: Apply the same explicit PageType handling to ID/locator helpers.

ExtractFirstUniqueID and IsLocatorValid share the same silent XML fall-through: an invalid PageType (or a future enum extension) routes HTML through XML utilities, yielding misleading errors. Mirror the guarded switch here so callers get a deterministic “unsupported page type” failure instead.
 func (plugin *rawTextPlugin) ExtractFirstUniqueID(ctx context.Context, fragment string) (string, error) {
-	if plugin.pageType == HTMLPageType {
-		return utils.ExtractFirstUniqueHTMLID(fragment)
-	}
-	return utils.ExtractFirstUniqueXMLID(fragment)
+	switch plugin.pageType {
+	case HTMLPageType:
+		return utils.ExtractFirstUniqueHTMLID(fragment)
+	case XMLPageType:
+		return utils.ExtractFirstUniqueXMLID(fragment)
+	default:
+		return "", fmt.Errorf("raw-text: unsupported page type %d", plugin.pageType)
+	}
 }
 
 func (plugin *rawTextPlugin) IsLocatorValid(ctx context.Context, locator string) (bool, error) {
 	pageSource := plugin.content
 
-	if plugin.pageType == HTMLPageType {
-		return html.IsValidXPath(locator, pageSource)
-	}
-	return xml.IsValidXPath(locator, pageSource)
+	switch plugin.pageType {
+	case HTMLPageType:
+		return html.IsValidXPath(locator, pageSource)
+	case XMLPageType:
+		return xml.IsValidXPath(locator, pageSource)
+	default:
+		return false, fmt.Errorf("raw-text: unsupported page type %d", plugin.pageType)
+	}
 }

coderabbitai · 2025-09-25T04:03:15Z

pkg/plugins/raw-text.go

+var (
+	ErrModeNotSupported = errors.New("raw-text plugin only supports DOM evalutaion mode")
+)


⚠️ Potential issue

Fix typo in exported error message.

The user-facing error string currently says “evalutaion”.

-var ( - ErrModeNotSupported = errors.New("raw-text plugin only supports DOM evalutaion mode") -) +var ( + ErrModeNotSupported = errors.New("raw-text plugin only supports DOM evaluation mode") +)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

var (

ErrModeNotSupported = errors.New("raw-text plugin only supports DOM evalutaion mode")

)

var (

ErrModeNotSupported = errors.New("raw-text plugin only supports DOM evaluation mode")

)

🤖 Prompt for AI Agents

In pkg/plugins/raw-text.go around lines 13 to 15, the exported error string contains a typo ("evalutaion"); update the error message value to use the correct spelling "evaluation" so it reads "raw-text plugin only supports DOM evaluation mode".

Hungerarray added 3 commits September 15, 2025 19:20

add: valid xpath check for xml string

853cd86

add: raw HTML mapper

34113ed

add: string only plugin

f8dd83e

coderabbitai bot reviewed Sep 15, 2025

View reviewed changes

update: add option to disable rerank for dom analysis

41a00df

coderabbitai bot reviewed Sep 25, 2025

View reviewed changes

fix: plugin implementation typo

6accc44

coderabbitai bot reviewed Sep 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String only plugin support#49

String only plugin support#49
Hungerarray wants to merge 5 commits intomainfrom
text-evaluation

Hungerarray commented Sep 15, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Sep 15, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Sep 15, 2025

Uh oh!

coderabbitai bot Sep 15, 2025

Uh oh!

coderabbitai bot Sep 15, 2025

Uh oh!

coderabbitai bot Sep 15, 2025

Uh oh!

coderabbitai bot Sep 15, 2025

Uh oh!

coderabbitai bot Sep 15, 2025

Uh oh!

coderabbitai bot Sep 15, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Sep 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	github.com/antchfx/htmlquery v1.3.4 // indirect
	github.com/antchfx/htmlquery v1.3.4

Conversation

Hungerarray commented Sep 15, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

String Only plugin

Additions

Limitations

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Hungerarray commented Sep 15, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 15, 2025 •

edited

Loading