xpath-go

🎯 High-compatibility XPath library for Go with precise location tracking


✨ Features

  • 🎯 High Compatibility - Strives for close compatibility with jsdom's XPath evaluation
  • 📍 Precise Location Tracking - Character-level positioning in source HTML/XML
  • 📄 Dual Extraction Modes - Extract full elements or content-only with the ContentsOnly option
  • ⚡ High Performance - Optimized evaluation engine with smart caching
  • 🔧 Production Ready - Comprehensive error handling and extensive testing
  • 🧪 Battle Tested - Extensively tested against reference implementations
  • 📦 Zero Dependencies - Pure Go implementation, no external dependencies
  • 🎨 Developer Friendly - Rich debugging support with trace logging

🚀 Quick Start

package main

import (
    "fmt"
    "log"
    
    "github.com/reclaimprotocol/xpath-go"
)

func main() {
    html := `<html><body><div id="content" class="main">Hello World</div></body></html>`
    
    // Simple query
    results, err := xpath.Query("//div[@id='content']", html)
    if err != nil {
        log.Fatal(err)
    }
    
    for _, result := range results {
        fmt.Printf("Found: %s\n", result.TextContent)
        fmt.Printf("Location: %d-%d\n", result.StartLocation, result.EndLocation)
        fmt.Printf("Path: %s\n", result.Path)
    }
}

📦 Installation

go get github.com/reclaimprotocol/xpath-go

🎯 Complete XPath Support

✅ Axes (100% Compatible)

  • child::, parent::, ancestor::, descendant::
  • following::, preceding::, following-sibling::, preceding-sibling::
  • attribute::, namespace::, self::
  • descendant-or-self::, ancestor-or-self::

✅ Functions (Comprehensive Support)

  • Node Functions: text(), node(), position(), last(), count()
  • String Functions: string(), normalize-space(), starts-with(), contains(), substring()
  • Boolean Functions: boolean(), not()
  • Number Functions: number(), string-length()
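
A few illustrative queries exercising the string and number functions above (the sample expressions and comments are hypothetical; xpath.Query is the entry point shown in Quick Start):

// String and number functions used inside predicates (illustrative)
xpath.Query("//a[starts-with(@href, 'https://')]", html)   // Links whose href starts with https://
xpath.Query("//p[string-length(text()) > 100]", html)      // Paragraphs with long text
xpath.Query("//li[substring(@id, 1, 4)='item']", html)     // IDs whose first 4 characters are "item"
xpath.Query("//ul[count(li) > 3]", html)                    // Lists with more than three items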

✅ Operators (Full Support)

  • Comparison: =, !=, <, >, <=, >=
  • Logical: and, or, not()
  • Arithmetic: +, -, *, div, mod
  • Union: | (pipe operator)
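
The union and arithmetic operators are easy to miss, so here are a couple of illustrative queries (the sample expressions are hypothetical; xpath.Query is the documented entry point):

// Union and arithmetic operators (illustrative)
xpath.Query("//h1 | //h2", html)                        // All h1 and h2 headings
xpath.Query("//li[position() mod 2 = 1]", html)         // Odd-numbered list items
xpath.Query("//tr[position() > last() div 2]", html)    // Rows in the second half of a table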

✅ Predicates (Full Support)

  • Attribute predicates: [@id='test'], [@class and @id]
  • Position predicates: [1], [last()], [position()>2]
  • Content predicates: [text()='value'], [contains(text(), 'substring')]
  • Complex boolean expressions: [(@id='a' or @class='b') and position()=1]

📊 XPath Support

Comprehensive XPath 1.0 implementation with extensive test coverage

| Feature Category   | Support | Details                                                 |
|--------------------|---------|---------------------------------------------------------|
| Basic Selection    | ✅ Full | Element, attribute, wildcard selection                  |
| Attribute Queries  | ✅ Full | Attribute existence, value matching, complex conditions |
| Text Functions     | ✅ Full | text(), contains(), starts-with(), normalize-space()    |
| Position Functions | ✅ Full | position(), last(), numeric positions                   |
| Axes Navigation    | ✅ Full | All XPath axes including ancestor/descendant            |
| Complex Predicates | ✅ Full | Boolean logic, nested predicates, unions                |
| String Functions   | ✅ Full | substring(), string-length() with edge cases            |

📋 Important Compatibility Notes

HTML Entity Handling

XPath-Go preserves HTML entities in their original encoded form, which differs from JavaScript's DOM behavior:

<!-- Source HTML -->
<p>Text with &amp; &lt; &gt; characters</p>

| Implementation | Text Content                           | XPath Query                       |
|----------------|----------------------------------------|-----------------------------------|
| JavaScript DOM | "Text with & < > characters"           | //p[contains(text(), '&')] ✅     |
| XPath-Go       | "Text with &amp; &lt; &gt; characters" | //p[contains(text(), '&amp;')] ✅ |

Why this difference exists:

  • ✅ Preserves original HTML content exactly as written
  • ✅ No information loss - you can decode when needed
  • ✅ Predictable behavior - what you see is what you get
  • ✅ Security benefits - prevents entity-related parsing issues

Working with entities:

// Method 1: Query with encoded entities
results, _ := xpath.Query("//p[contains(text(), '&amp;')]", doc)

// Method 2: Decode after extraction using the standard library "html" package
// (the document string is named doc here so it does not shadow that package)
results, _ := xpath.Query("//p", doc)
decoded := html.UnescapeString(results[0].TextContent)

📖 Read the complete HTML Entity Handling guide for detailed information and best practices.

Unicode Position Tracking

XPath-Go uses byte-based position tracking for performance and Go ecosystem compatibility:

<!-- Source HTML -->
<p>Hello 世界</p>

| Implementation | Position Calculation | StartLocation | EndLocation |
|----------------|----------------------|---------------|-------------|
| JavaScript DOM | Character-based      | 0             | 27          |
| XPath-Go       | Byte-based           | 0             | 31          |

Why byte-based positioning:

  • ✅ Go idiomatic - Aligns with Go's string handling and byte slice operations
  • ✅ Performance - No Unicode code point counting overhead during parsing
  • ✅ Memory efficient - Direct byte offset calculations
  • ✅ Deterministic - Consistent across all platforms and Go versions

Working with Unicode positions:

// Method 1: Use byte positions directly (recommended for Go)
html := `<p>Hello 世界</p>`
results, _ := xpath.Query("//p", html)
content := html[results[0].StartLocation:results[0].EndLocation]

// Method 2: Convert byte offsets to character positions if needed
// (requires: import "unicode/utf8")
func ByteToCharPos(s string, bytePos int) int {
    return utf8.RuneCountInString(s[:bytePos])
}
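
A short follow-up sketch using the helper above to report character (rune) offsets for a match; it assumes the Result fields shown in the API Reference and reuses html and results from Method 1:

// Convert the first match's byte offsets into character offsets
startChar := ByteToCharPos(html, results[0].StartLocation)
endChar := ByteToCharPos(html, results[0].EndLocation)
fmt.Printf("Character range: %d-%d\n", startChar, endChar) // requires: import "fmt"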

⚠️ Compatibility Considerations

While XPath-Go aims for high compatibility with web standards, there are some intentional design choices:

  • HTML Entity Preservation: Maintains original &amp; vs & for security and consistency
  • Unicode Position Tracking: Uses byte offsets for Go ecosystem compatibility
  • Performance Optimizations: Some complex expressions may have subtle evaluation differences

For complete compatibility details, see docs/COMPATIBILITY.md.

🔍 Advanced Usage

Location Tracking

Get precise source positions (byte offsets) for all matched nodes:

results, _ := xpath.Query("//div[@class='content']", htmlContent)
for _, result := range results {
    fmt.Printf("Element: <%s>\n", result.NodeName)
    fmt.Printf("Text: %s\n", result.TextContent) 
    fmt.Printf("Character Range: %d-%d\n", result.StartLocation, result.EndLocation)
    fmt.Printf("XPath: %s\n", result.Path)
    fmt.Printf("Attributes: %+v\n", result.Attributes)
}

Compiled XPath (Performance)

For repeated queries, compile once and reuse:

// Compile once
compiled, err := xpath.Compile("//div[@class='item'][position()>1]")
if err != nil {
    log.Fatal(err)
}

// Use multiple times (faster)
for _, htmlDoc := range documents {
    results, err := compiled.Evaluate(htmlDoc)
    if err != nil {
        log.Printf("Error: %v", err)
        continue
    }
    // Process results...
}

Custom Options

Control output format and extraction mode:

results, err := xpath.QueryWithOptions("//p", html, xpath.Options{
    IncludeLocation: true,
    OutputFormat:    "values", // "nodes", "values", "paths"
    ContentsOnly:    false,    // Extract full elements (default)
})

// Extract only inner content between tags
results, err := xpath.QueryWithOptions("//div", html, xpath.Options{
    ContentsOnly: true,  // Extract content-only: <div>content</div> → "content"
})

Debug Tracing

xpath.EnableTrace()
defer xpath.DisableTrace()

results, err := xpath.Query("//div[contains(@class, 'complex')]//p[last()]", html)
// Detailed evaluation steps logged to stderr

📚 Examples

Basic Selections

// Element selection
xpath.Query("//div", html)                    // All div elements
xpath.Query("/html/body/div", html)           // Specific path
xpath.Query("//div[@id='main']", html)        // Div with specific ID

// Attribute selection  
xpath.Query("//div/@class", html)             // Class attributes
xpath.Query("//*[@href]", html)               // Elements with href
xpath.Query("//a[@href and @title]", html)   // Links with both attributes

Text and Content

// Text content
xpath.Query("//p[text()='Hello']", html)           // Exact text match
xpath.Query("//div[contains(text(), 'world')]", html) // Text contains
xpath.Query("//span[normalize-space(text())='Clean']", html) // Normalized text

// Position-based
xpath.Query("//li[1]", html)                    // First list item
xpath.Query("//tr[last()]", html)               // Last table row  
xpath.Query("//div[position()>2]", html)        // Divs after second

Complex Predicates

// Boolean logic
xpath.Query("//div[@id='a' or @class='b']", html)           // OR condition
xpath.Query("//p[@class and text()]", html)                 // AND condition
xpath.Query("//div[not(@class)]", html)                     // NOT condition

// Nested conditions
xpath.Query("//ul[li[@class='active']]", html)              // UL containing active LI
xpath.Query("//div[@class='container']//p[position()=2]", html) // Second P in container

// Complex expressions
xpath.Query("//article[.//h1 and count(.//p)>2]", html)     // Articles with H1 and 3+ paragraphs

Axes Navigation

// Family relationships
xpath.Query("//h2/following-sibling::p", html)        // P elements after H2
xpath.Query("//span/parent::div[@class='box']", html)  // Parent div with class
xpath.Query("//td/ancestor::table[@id='data']", html)  // Ancestor table with ID

// Advanced navigation
xpath.Query("//div[@id='start']/descendant-or-self::*[@class]", html) // Descendants with class
xpath.Query("//li[3]/preceding-sibling::li", html)                    // Previous siblings

Dual Extraction Modes

Extract either full elements or just their inner content:

html := `<div class="box">Hello <span>World</span>!</div>`

// Full element extraction (default)
results, _ := xpath.QueryWithOptions("//div", html, xpath.Options{
    ContentsOnly: false,
})
// StartLocation/EndLocation: <div class="box">Hello <span>World</span>!</div>

// Content-only extraction  
results, _ := xpath.QueryWithOptions("//div", html, xpath.Options{
    ContentsOnly: true,
})
// StartLocation/EndLocation: Hello <span>World</span>!

// Fine-grained control (always available)
fmt.Printf("Full element: %s\n", html[result.StartLocation:result.EndLocation])
fmt.Printf("Inner content: %s\n", html[result.ContentStart:result.ContentEnd])

Use Cases:

  • Full elements (ContentsOnly: false): HTML processing, DOM manipulation, complete element extraction
  • Content only (ContentsOnly: true): Text processing, content analysis, clean text extraction without tags (see the sketch below)
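
For the content-only case, a minimal sketch of collecting the inner content of every matched element (the //p expression and variable names are illustrative; QueryWithOptions and Options are documented in the API Reference):

// Gather the inner content of every paragraph (illustrative sketch)
results, err := xpath.QueryWithOptions("//p", html, xpath.Options{ContentsOnly: true})
if err != nil {
    log.Fatal(err)
}
var contents []string
for _, r := range results {
    // With ContentsOnly: true, StartLocation/EndLocation bound the inner content
    contents = append(contents, html[r.StartLocation:r.EndLocation])
}
fmt.Println(strings.Join(contents, "\n")) // requires: import "strings"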

📈 Performance

Optimized for production use:

  • Fast parsing with caching support
  • Efficient evaluation with minimal memory allocations
  • Thread-safe design for concurrent usage (see the sketch below)
  • Compiled expressions for repeated queries
// Compile once, use many times
compiled, _ := xpath.Compile("//div[@class='item'][position()>1]")
results, _ := compiled.Evaluate(html)
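
Building on the thread-safety note above, a minimal sketch of sharing one compiled expression across goroutines; the documents slice and worker structure are illustrative, and concurrent Evaluate calls are assumed safe per that note:

// Share a single compiled expression across goroutines (illustrative sketch)
compiled, _ := xpath.Compile("//div[@class='item'][position()>1]")

var wg sync.WaitGroup // requires: import "sync"
for _, doc := range documents {
    wg.Add(1)
    go func(doc string) {
        defer wg.Done()
        results, err := compiled.Evaluate(doc)
        if err != nil {
            log.Printf("query failed: %v", err)
            return
        }
        _ = results // process results...
    }(doc)
}
wg.Wait()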

🛠️ API Reference

Core Functions

// Basic query evaluation
func Query(xpathExpr, content string) ([]Result, error)

// Query with custom options
func QueryWithOptions(xpathExpr, content string, opts Options) ([]Result, error)

// Compile XPath for reuse (performance optimization)
func Compile(xpathExpr string) (*XPath, error)

// Enable/disable debug tracing
func EnableTrace()
func DisableTrace()

Result Structure

type Result struct {
    Value         string            // Node value or text content
    NodeName      string            // Element name (div, span, etc.)
    NodeType      int               // Node type (1=element, 2=attribute, 3=text)
    Attributes    map[string]string // Element attributes
    StartLocation int               // Byte offset where the match starts (full element or content-only)
    EndLocation   int               // Byte offset where the match ends (full element or content-only)
    ContentStart  int               // Start of inner content (after opening tag)
    ContentEnd    int               // End of inner content (before closing tag)
    Path          string            // Generated XPath path
    TextContent   string            // Text content of node and children
}

Options

type Options struct {
    IncludeLocation bool   // Include character positions (default: true)
    OutputFormat    string // "nodes", "values", "paths" (default: "nodes")
    ContentsOnly    bool   // Extract only inner content between tags (default: false)
}

ContentsOnly Mode:

  • false (default): Extract full elements including tags: <div>content</div>
  • true: Extract only inner content: content

Both modes maintain precise position tracking. With ContentsOnly: true, StartLocation/EndLocation point to the content boundaries, while ContentStart/ContentEnd are always available for fine-grained control.

🔧 Development

Testing

# Go tests
go test ./...

# Compatibility tests (requires Node.js)
cd tests && npm install && npm test

# Benchmarks
go test -bench=. -benchmem ./...

🤝 Contributing

We welcome contributions!

# Clone and setup
git clone https://github.com/reclaimprotocol/xpath-go.git
cd xpath-go && go mod download

# Run tests
go test ./... && cd tests && npm install && npm test

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with high compatibility goals for jsdom and web standards
  • Inspired by the W3C XPath 1.0 Specification
  • Thanks to the Go community for excellent tooling and libraries

🎉 Production Ready: This library is actively used in production and provides reliable XPath evaluation for Go applications.
