π― High-compatibility XPath library for Go with precise location tracking
- π― High Compatibility - Strives for close compatibility with jsdom's XPath evaluation
- π Precise Location Tracking - Character-level positioning in source HTML/XML
- π Dual Extraction Modes - Extract full elements or content-only with
contentsOnly
option - β‘ High Performance - Optimized evaluation engine with smart caching
- π§ Production Ready - Comprehensive error handling and extensive testing
- π§ͺ Battle Tested - Extensively tested against reference implementations
- π¦ Zero Dependencies - Pure Go implementation, no external dependencies
- π¨ Developer Friendly - Rich debugging support with trace logging
package main
import (
"fmt"
"log"
"github.com/reclaimprotocol/xpath-go"
)
func main() {
html := `<html><body><div id="content" class="main">Hello World</div></body></html>`
// Simple query
results, err := xpath.Query("//div[@id='content']", html)
if err != nil {
log.Fatal(err)
}
for _, result := range results {
fmt.Printf("Found: %s\n", result.TextContent)
fmt.Printf("Location: %d-%d\n", result.StartLocation, result.EndLocation)
fmt.Printf("Path: %s\n", result.Path)
}
}
go get github.com/reclaimprotocol/xpath-go
child::
,parent::
,ancestor::
,descendant::
following::
,preceding::
,following-sibling::
,preceding-sibling::
attribute::
,namespace::
,self::
descendant-or-self::
,ancestor-or-self::
- Node Functions:
text()
,node()
,position()
,last()
,count()
- String Functions:
string()
,normalize-space()
,starts-with()
,contains()
,substring()
- Boolean Functions:
boolean()
,not()
- Number Functions:
number()
,string-length()
- Comparison:
=
,!=
,<
,>
,<=
,>=
- Logical:
and
,or
,not()
- Arithmetic:
+
,-
,*
,div
,mod
- Union:
|
(pipe operator)
- Attribute predicates:
[@id='test']
,[@class and @id]
- Position predicates:
[1]
,[last()]
,[position()>2]
- Content predicates:
[text()='value']
,[contains(text(), 'substring')]
- Complex boolean expressions:
[@id='a' or @class='b'] and [position()=1]
Comprehensive XPath 1.0 implementation with extensive test coverage
Feature Category | Support | Details |
---|---|---|
Basic Selection | β Full | Element, attribute, wildcard selection |
Attribute Queries | β Full | Attribute existence, value matching, complex conditions |
Text Functions | β Full | text(), contains(), starts-with(), normalize-space() |
Position Functions | β Full | position(), last(), numeric positions |
Axes Navigation | β Full | All XPath axes including ancestor/descendant |
Complex Predicates | β Full | Boolean logic, nested predicates, unions |
String Functions | β Full | substring(), string-length() with edge cases |
XPath-Go preserves HTML entities in their original encoded form, which differs from JavaScript's DOM behavior:
<!-- Source HTML -->
<p>Text with & < > characters</p>
Implementation | Text Content | XPath Query |
---|---|---|
JavaScript DOM | "Text with & < > characters" |
//p[contains(text(), '&')] β
|
XPath-Go | "Text with & < > characters" |
//p[contains(text(), '&')] β
|
Why this difference exists:
- β Preserves original HTML content exactly as written
- β No information loss - you can decode when needed
- β Predictable behavior - what you see is what you get
- β Security benefits - prevents entity-related parsing issues
Working with entities:
// Method 1: Query with encoded entities
results, _ := xpath.Query("//p[contains(text(), '&')]", html)
// Method 2: Decode after extraction
results, _ := xpath.Query("//p", html)
decoded := html.UnescapeString(results[0].TextContent)
π Read the complete HTML Entity Handling guide for detailed information and best practices.
XPath-Go uses byte-based position tracking for performance and Go ecosystem compatibility:
<!-- Source HTML -->
<p>Hello δΈη</p>
Implementation | Position Calculation | StartLocation |
EndLocation |
---|---|---|---|
JavaScript DOM | Character-based | 0 | 27 |
XPath-Go | Byte-based | 0 | 31 |
Why byte-based positioning:
- β Go idiomatic - Aligns with Go's string handling and byte slice operations
- β Performance - No Unicode code point counting overhead during parsing
- β Memory efficient - Direct byte offset calculations
- β Deterministic - Consistent across all platforms and Go versions
Working with Unicode positions:
// Method 1: Use byte positions directly (recommended for Go)
html := `<p>Hello δΈη</p>`
results, _ := xpath.Query("//p", html)
content := html[results[0].StartLocation:results[0].EndLocation]
// Method 2: Convert to character positions if needed
import "unicode/utf8"
func ByteToCharPos(s string, bytePos int) int {
return utf8.RuneCountInString(s[:bytePos])
}
While XPath-Go aims for high compatibility with web standards, there are some intentional design choices:
- HTML Entity Preservation: Maintains original
&
vs&
for security and consistency - Unicode Position Tracking: Uses byte offsets for Go ecosystem compatibility
- Performance Optimizations: Some complex expressions may have subtle evaluation differences
For complete compatibility details, see docs/COMPATIBILITY.md.
Get precise character positions for all matched nodes:
results, _ := xpath.Query("//div[@class='content']", htmlContent)
for _, result := range results {
fmt.Printf("Element: <%s>\n", result.NodeName)
fmt.Printf("Text: %s\n", result.TextContent)
fmt.Printf("Character Range: %d-%d\n", result.StartLocation, result.EndLocation)
fmt.Printf("XPath: %s\n", result.Path)
fmt.Printf("Attributes: %+v\n", result.Attributes)
}
For repeated queries, compile once and reuse:
// Compile once
compiled, err := xpath.Compile("//div[@class='item'][position()>1]")
if err != nil {
log.Fatal(err)
}
// Use multiple times (faster)
for _, htmlDoc := range documents {
results, err := compiled.Evaluate(htmlDoc)
if err != nil {
log.Printf("Error: %v", err)
continue
}
// Process results...
}
Control output format and extraction mode:
results, err := xpath.QueryWithOptions("//p", html, xpath.Options{
IncludeLocation: true,
OutputFormat: "values", // "nodes", "values", "paths"
ContentsOnly: false, // Extract full elements (default)
})
// Extract only inner content between tags
results, err := xpath.QueryWithOptions("//div", html, xpath.Options{
ContentsOnly: true, // Extract content-only: <div>content</div> β "content"
})
xpath.EnableTrace()
defer xpath.DisableTrace()
results, err := xpath.Query("//div[contains(@class, 'complex')]//p[last()]", html)
// Detailed evaluation steps logged to stderr
// Element selection
xpath.Query("//div", html) // All div elements
xpath.Query("/html/body/div", html) // Specific path
xpath.Query("//div[@id='main']", html) // Div with specific ID
// Attribute selection
xpath.Query("//div/@class", html) // Class attributes
xpath.Query("//*[@href]", html) // Elements with href
xpath.Query("//a[@href and @title]", html) // Links with both attributes
// Text content
xpath.Query("//p[text()='Hello']", html) // Exact text match
xpath.Query("//div[contains(text(), 'world')]", html) // Text contains
xpath.Query("//span[normalize-space(text())='Clean']", html) // Normalized text
// Position-based
xpath.Query("//li[1]", html) // First list item
xpath.Query("//tr[last()]", html) // Last table row
xpath.Query("//div[position()>2]", html) // Divs after second
// Boolean logic
xpath.Query("//div[@id='a' or @class='b']", html) // OR condition
xpath.Query("//p[@class and text()]", html) // AND condition
xpath.Query("//div[not(@class)]", html) // NOT condition
// Nested conditions
xpath.Query("//ul[li[@class='active']]", html) // UL containing active LI
xpath.Query("//div[@class='container']//p[position()=2]", html) // Second P in container
// Complex expressions
xpath.Query("//article[.//h1 and count(.//p)>2]", html) // Articles with H1 and 3+ paragraphs
// Family relationships
xpath.Query("//h2/following-sibling::p", html) // P elements after H2
xpath.Query("//span/parent::div[@class='box']", html) // Parent div with class
xpath.Query("//td/ancestor::table[@id='data']", html) // Ancestor table with ID
// Advanced navigation
xpath.Query("//div[@id='start']/descendant-or-self::*[@class]", html) // Descendants with class
xpath.Query("//li[3]/preceding-sibling::li", html) // Previous siblings
Extract either full elements or just their inner content:
html := `<div class="box">Hello <span>World</span>!</div>`
// Full element extraction (default)
results, _ := xpath.QueryWithOptions("//div", html, xpath.Options{
ContentsOnly: false,
})
// StartLocation/EndLocation: <div class="box">Hello <span>World</span>!</div>
// Content-only extraction
results, _ := xpath.QueryWithOptions("//div", html, xpath.Options{
ContentsOnly: true,
})
// StartLocation/EndLocation: Hello <span>World</span>!
// Fine-grained control (always available)
fmt.Printf("Full element: %s\n", html[result.StartLocation:result.EndLocation])
fmt.Printf("Inner content: %s\n", html[result.ContentStart:result.ContentEnd])
Use Cases:
- Full elements (
ContentsOnly: false
): HTML processing, DOM manipulation, complete element extraction - Content only (
ContentsOnly: true
): Text processing, content analysis, clean text extraction without tags
Optimized for production use:
- Fast parsing with caching support
- Efficient evaluation with minimal memory allocations
- Thread-safe design for concurrent usage
- Compiled expressions for repeated queries
// Compile once, use many times
compiled, _ := xpath.Compile("//div[@class='item'][position()>1]")
results, _ := compiled.Evaluate(html)
// Basic query evaluation
func Query(xpathExpr, content string) ([]Result, error)
// Query with custom options
func QueryWithOptions(xpathExpr, content string, opts Options) ([]Result, error)
// Compile XPath for reuse (performance optimization)
func Compile(xpathExpr string) (*XPath, error)
// Enable/disable debug tracing
func EnableTrace()
func DisableTrace()
type Result struct {
Value string // Node value or text content
NodeName string // Element name (div, span, etc.)
NodeType int // Node type (1=element, 2=attribute, 3=text)
Attributes map[string]string // Element attributes
StartLocation int // Character start position (full element or content-only)
EndLocation int // Character end position (full element or content-only)
ContentStart int // Start of inner content (after opening tag)
ContentEnd int // End of inner content (before closing tag)
Path string // Generated XPath path
TextContent string // Text content of node and children
}
type Options struct {
IncludeLocation bool // Include character positions (default: true)
OutputFormat string // "nodes", "values", "paths" (default: "nodes")
ContentsOnly bool // Extract only inner content between tags (default: false)
}
ContentsOnly Mode:
false
(default): Extract full elements including tags:<div>content</div>
true
: Extract only inner content:content
Both modes maintain precise position tracking. With ContentsOnly: true
, StartLocation
/EndLocation
point to the content boundaries, while ContentStart
/ContentEnd
are always available for fine-grained control.
# Go tests
go test ./...
# Compatibility tests (requires Node.js)
cd tests && npm install && npm test
# Benchmarks
go test -bench=. -benchmem ./...
We welcome contributions!
# Clone and setup
git clone https://github.com/reclaimprotocol/xpath-go.git
cd xpath-go && go mod download
# Run tests
go test ./... && cd tests && npm install && npm test
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with high compatibility goals for jsdom and web standards
- Inspired by the W3C XPath 1.0 Specification
- Thanks to the Go community for excellent tooling and libraries
π Production Ready: This library is actively used in production and provides reliable XPath evaluation for Go applications.