Streaming XML with XPath

SETLr supports efficient streaming parsing of large XML files using XPath filtering.

Overview

For large XML files, loading the entire document into memory can exhaust available RAM. SETLr's streaming XML parser uses iterparse to process XML elements incrementally, combined with XPath expressions to keep only the elements you need.
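
The incremental approach can be illustrated with Python's standard library. This is a minimal sketch, not SETLr's actual implementation: it assumes a plain tag-name match in place of a real XPath filter, and `stream_elements` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

def stream_elements(source, tag):
    """Yield one dict per element named `tag`, releasing each element after use."""
    # "end" events fire once an element and all of its children are fully parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # release this element's children and text

sample = io.BytesIO(
    b"<catalog>"
    b"<book><title>Midnight Rain</title></book>"
    b"<magazine><title>Tech Weekly</title></magazine>"
    b"</catalog>"
)
rows = list(stream_elements(sample, "book"))
print(rows)
```

Because rows are produced one at a time, a consumer can process each record without ever holding the whole document tree.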

Basic XML Extraction

@prefix setl: <http://purl.org/twc/vocab/setl/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix : <http://example.com/> .

:xmlTable a setl:Table ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <data.xml> ;
    ] .

This extracts all elements from the XML file into a pandas DataFrame.

XPath Filtering

Use setl:xpath to select specific elements:

:bookTable a setl:Table ;
    setl:xpath "//book" ;  # Select only <book> elements
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <catalog.xml> ;
    ] .

Example XML File

<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
  </book>
  <magazine id="mg001">
    <title>Tech Weekly</title>
    <price>9.99</price>
  </magazine>
</catalog>

With setl:xpath "//book", only the <book> elements are extracted, not the <magazine>.
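
To check what a pattern like this selects before running a full pipeline, the same catalog can be queried with Python's standard library. This is only a sketch for experimentation: ElementTree implements a subset of XPath and requires a relative path (`.//book`) when searching from an element, so it may not match SETLr's behaviour in every case.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring("""<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
  <magazine id="mg001"><title>Tech Weekly</title></magazine>
</catalog>""")

# ".//book" matches <book> descendants at any depth; <magazine> is skipped.
selected_ids = [book.get("id") for book in catalog.findall(".//book")]
print(selected_ids)
```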

Advanced XPath Patterns

Select by Child Element Value

:expensiveBooks a setl:Table ;
    setl:xpath "//book[price > 10]" ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <catalog.xml> ;
    ] .
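
The predicate `[price > 10]` compares the numeric value of the `<price>` child against 10. The sketch below spells out that comparison in plain Python, since the standard library's ElementTree does not support numeric comparisons in predicates. The sample data is abbreviated from the catalog above, and this illustrates the semantics only, not how SETLr evaluates the expression.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<catalog>"
    "<book id='bk101'><price>44.95</price></book>"
    "<book id='bk102'><price>5.95</price></book>"
    "</catalog>"
)

# A missing <price> becomes NaN, which compares as False, so that book is skipped.
expensive_ids = [
    book.get("id")
    for book in catalog.iter("book")
    if float(book.findtext("price", default="nan")) > 10
]
print(expensive_ids)
```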

Select Nested Elements

:chapters a setl:Table ;
    setl:xpath "//book/chapter" ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <book.xml> ;
    ] .

Combine Conditions

:computerBooks a setl:Table ;
    setl:xpath "//book[genre='Computer' and price > 10]" ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <catalog.xml> ;
    ] .
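
Conditions can also be combined by chaining predicates, which keeps only the elements that satisfy every one of them. The sketch below tries this with the standard library's ElementTree, which supports chained `[tag='text']` predicates. The third book (`bk103`) is hypothetical sample data added so that the two conditions select different sets.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<catalog>"
    "<book id='bk101'><author>Gambardella, Matthew</author><genre>Computer</genre></book>"
    "<book id='bk102'><author>Ralls, Kim</author><genre>Fantasy</genre></book>"
    "<book id='bk103'><author>Ralls, Kim</author><genre>Computer</genre></book>"
    "</catalog>"
)

# Each predicate narrows the previous result: genre first, then author.
matches = catalog.findall(".//book[genre='Computer'][author='Ralls, Kim']")
match_ids = [book.get("id") for book in matches]
print(match_ids)
```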

DTD Validation

For XML files with DTD declarations, you can enable validation by adding the setl:DTDValidatedXML type to the table:

:validatedTable a setl:Table, setl:DTDValidatedXML ;
    setl:xpath "//record" ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <data.xml> ;
    ] .

Performance Considerations

Memory Efficiency

Streaming XML parsing is particularly useful for:

Large files (> 100 MB)
Many elements (thousands of records)
Limited memory environments

The parser only keeps the current element in memory, not the entire document.

Progress Tracking

SETLr shows a progress bar when parsing XML:

Processing XML: 45%|████▌     | 1234/2750 [00:12<00:15, 98.2 elements/s]

Complete Example

SETL Script (`books.setl.ttl`)

@prefix setl: <http://purl.org/twc/vocab/setl/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix : <http://example.com/> .

# Extract: Parse XML with XPath
:booksTable a setl:Table, csvw:Table ;
    setl:xpath "//book" ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <catalog.xml> ;
    ] .

# Transform: Convert to RDF
:booksGraph a void:Dataset ;
    prov:wasGeneratedBy [
        a setl:Transform, setl:JSLDT ;
        prov:used :booksTable ;
        prov:value '''[{
            "@id": "http://example.com/book/{{row['@id']}}",
            "@type": "http://schema.org/Book",
            "http://schema.org/author": "{{row.author}}",
            "http://schema.org/name": "{{row.title}}",
            "http://schema.org/genre": "{{row.genre}}"
        }]''' ;
    ] .

Run from Python

from rdflib import Graph, URIRef
import setlr

# Load SETL script
setl_graph = Graph()
setl_graph.parse("books.setl.ttl", format="turtle")

# Execute (streaming XML parse happens here)
resources = setlr.run_setl(setl_graph)

# Access parsed data
books_df = resources[URIRef('http://example.com/booksTable')]
print(f"Extracted {len(books_df)} books")
print(books_df.head())

# Access generated RDF
books_graph = resources[URIRef('http://example.com/booksGraph')]
print(f"Generated {len(books_graph)} triples")

XML Attributes

XML attributes are accessible in the DataFrame with an @ prefix:

<book id="bk101" isbn="1234567890">
  <title>My Book</title>
</book>
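
One way to picture the resulting row is as a flat mapping in which attribute names carry the @ prefix and child elements map to their text. The sketch below builds such a mapping with the standard library. `element_to_row` is a hypothetical helper, and the exact column layout SETLr produces is an assumption here.

```python
import xml.etree.ElementTree as ET

def element_to_row(elem):
    """Flatten one element: attributes get an '@' prefix, children map tag -> text."""
    row = {f"@{name}": value for name, value in elem.attrib.items()}
    row.update({child.tag: child.text for child in elem})
    return row

book = ET.fromstring(
    '<book id="bk101" isbn="1234567890"><title>My Book</title></book>'
)
row = element_to_row(book)
print(row)
```

Because `@id` is not a valid identifier, keys like this need bracket notation (`row['@id']`) rather than dot access.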

Access them in a template:

"{{row['@id']}}"     # → "bk101"
"{{row['@isbn']}}"   # → "1234567890"
"{{row.title}}"      # → "My Book"

Nested Elements

For nested XML structures:

<book>
  <metadata>
    <author>John Doe</author>
    <year>2024</year>
  </metadata>
  <title>Example</title>
</book>

Use a nested XPath expression to select the inner element:

:metadata a setl:Table ;
    setl:xpath "//book/metadata" ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <books.xml> ;
    ] .
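
To see which fields a nested pattern like this yields, the sketch below runs the equivalent relative path with the standard library's ElementTree. The `<books>` wrapper element is an assumption added so that `<book>` is a descendant of the root. The result is illustrative and may differ from the columns SETLr produces.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<books><book>"
    "<metadata><author>John Doe</author><year>2024</year></metadata>"
    "<title>Example</title>"
    "</book></books>"
)

# Each <metadata> element becomes one row built from its own children only.
rows = [
    {child.tag: child.text for child in metadata}
    for metadata in root.findall(".//book/metadata")
]
print(rows)
```

Note that `<title>` does not appear in the row: it is a sibling of `<metadata>`, not a child of it.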

Limitations

XPath 1.0 syntax only (not full XPath 2.0)
Element text content and attributes only (no CDATA sections)
Cannot access parent or sibling elements after extraction
