
Advanced Features

SETLr provides advanced capabilities for complex data transformation workflows, large-scale processing, and production deployments.

Overview

This guide covers advanced topics including:

  • Multi-source transforms
  • Conditional loading and filtering
  • Performance optimization
  • Error handling and debugging
  • Integration patterns

Multi-Source Transforms

SETLr can combine data from multiple sources in a single transform.

Combining Multiple Tables

@prefix setl: <http://purl.org/twc/vocab/setl/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix : <http://example.com/> .

# Load first table
:users a csvw:Table, setl:Table ;
    prov:wasGeneratedBy [ 
        a setl:Extract ; 
        prov:used <users.csv> 
    ] .

# Load second table
:orders a csvw:Table, setl:Table ;
    prov:wasGeneratedBy [ 
        a setl:Extract ; 
        prov:used <orders.csv> 
    ] .

# Transform using both tables
:output prov:wasGeneratedBy [
    a setl:Transform, setl:JSLDT ;
    prov:used :users, :orders ;
    prov:value '''
    [{
        "@for": "user in users",
        "@do": {
            "@id": "http://example.com/user/{{user.ID}}",
            "@type": "Person",
            "name": "{{user.Name}}",
            "orders": [{
                "@for": "order in orders",
                "@if": "order.UserID == user.ID",
                "@do": {
                    "@id": "http://example.com/order/{{order.OrderID}}"
                }
            }]
        }
    }]
    '''
] .

Loading from Different Formats

# CSV data
:csv_table a csvw:Table, setl:Table ;
    prov:wasGeneratedBy [ 
        a setl:Extract ; 
        prov:used <data.csv> 
    ] .

# JSON data  
:json_data a setl:Table ;
    prov:wasGeneratedBy [ 
        a setl:Extract ;
        prov:used <data.json> ;
        setl:hasJSONSelector "$.items[*]"
    ] .

# XML data
:xml_data a setl:Table ;
    prov:wasGeneratedBy [ 
        a setl:Extract ;
        prov:used <data.xml> ;
        setl:hasXPathSelector "//item"
    ] .

Conditional Loading

Use conditional logic to selectively process data based on runtime conditions.

Filtering with @if

[{
    "@for": "row in table",
    "@if": "row.Status == 'active' and row.Score > 50",
    "@do": {
        "@id": "http://example.com/entity/{{row.ID}}",
        "@type": "ActiveEntity",
        "score": "{{row.Score}}"
    }
}]

Conditional Fields

{
    "@id": "http://example.com/person/{{row.ID}}",
    "@type": "Person",
    "name": "{{row.Name}}",
    "email": {
        "@if": "row.Email",
        "@do": "mailto:{{row.Email}}"
    },
    "phone": {
        "@if": "row.Phone and row.PhoneVerified",
        "@do": "{{row.Phone}}"
    }
}

Performance Optimization

Streaming Processing

See Streaming XML documentation for details.

Batch Processing

Process data in batches to control memory usage:

from rdflib import Graph, URIRef
import setlr

# For very large datasets, process in chunks
chunk_size = 10000
offset = 0

output_graph = Graph()

while True:
    # Create SETL script for this batch
    setl_graph = create_batch_setl(offset, chunk_size)
    
    # Process batch
    resources = setlr.run_setl(setl_graph)
    
    # Accumulate results
    batch_output = resources[URIRef('http://example.com/output')]
    output_graph += batch_output
    
    # Check if done
    if len(batch_output) < chunk_size:
        break
    
    offset += chunk_size

# Save final results
output_graph.serialize('output.ttl', format='turtle')
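The `create_batch_setl` helper above is not shown; here is one minimal sketch, assuming the source supports paging via offset/limit query parameters (the template, the URL, and the placeholder names are illustrative assumptions, not part of setlr's API). It returns Turtle text that the caller parses into an rdflib Graph:

```python
from string import Template

# Illustrative template; $offset and $limit are filled in per batch.
# The offset/limit query parameters assume a source that supports
# paging -- a plain CSV file would need a different slicing strategy.
SETL_TEMPLATE = Template("""
@prefix : <http://example.com/> .
@prefix setl: <http://purl.org/twc/vocab/setl/> .
@prefix prov: <http://www.w3.org/ns/prov#> .

:batch a setl:Table ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <http://example.com/data?offset=$offset&limit=$limit>
    ] .
""")

def create_batch_setl_text(offset, chunk_size):
    """Return Turtle text for one batch; parse it into an
    rdflib.Graph before passing it to setlr.run_setl()."""
    return SETL_TEMPLATE.substitute(offset=offset, limit=chunk_size)
```

The batch loop would then build each graph with `setl_graph = Graph(); setl_graph.parse(data=create_batch_setl_text(offset, chunk_size), format='turtle')`.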

Pandas Optimization

For CSV and Excel files, SETLr loads data with pandas. To keep memory in check:

  • Use compact dtypes where the source data allows it
  • For very wide tables, pre-process the source so that only the columns the transform needs are loaded
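The hints above can be made concrete with pandas; this sketch pre-trims a wide CSV before handing it to SETLr (the column names, dtypes, and inline data are illustrative assumptions):

```python
import io
import pandas as pd

# Illustrative data; in practice this would be your wide source CSV file.
csv_text = io.StringIO("ID,Name,Score,Notes\n1,Ada,97,long text\n2,Bob,55,more text\n")

# Load only the columns the transform needs, with compact dtypes.
df = pd.read_csv(
    csv_text,
    usecols=["ID", "Name", "Score"],   # drop the unused wide columns
    dtype={"ID": "int32", "Score": "int16", "Name": "string"},
)

# Write the slimmed-down table for the SETL script to consume
# (a StringIO buffer here; a slim.csv file in practice).
buf = io.StringIO()
df.to_csv(buf, index=False)
```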

Error Handling and Debugging

Verbose Logging

Enable detailed logging to diagnose issues:

import logging
import setlr

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('setlr')
logger.setLevel(logging.DEBUG)

# Now run SETL
resources = setlr.run_setl(setl_graph)

Progress Tracking

SETLr uses tqdm internally, so progress bars appear automatically on large datasets:

import setlr

# Progress bars are automatically shown for:
# - Large file processing
# - Batch operations
# - Network transfers

resources = setlr.run_setl(setl_graph)

Validation During Development

Validate intermediate results to catch issues early:

from rdflib import Graph, URIRef
from rdflib.namespace import RDF
import setlr

# Process data
resources = setlr.run_setl(setl_graph)
output = resources[URIRef('http://example.com/output')]

# Validate results
print(f"Generated {len(output)} triples")
print(f"Subjects: {len(set(output.subjects()))}")
print(f"Predicates: {len(set(output.predicates()))}")
print(f"Objects: {len(set(output.objects()))}")

# Check for specific patterns
for s, p, o in output.triples((None, RDF.type, None)):
    print(f"Type: {o}")

Error Recovery

Handle errors gracefully in production:

import setlr
from rdflib import Graph

try:
    setl_graph = Graph()
    setl_graph.parse('transform.setl.ttl', format='turtle')
    resources = setlr.run_setl(setl_graph)
    
except setlr.SetlrError as e:
    print(f"SETL processing error: {e}")
    # Handle gracefully
    
except Exception as e:
    print(f"Unexpected error: {e}")
    # Log and notify
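Transient failures, such as a remote source timing out during extraction, are often worth retrying before failing the run. A minimal, setlr-independent retry wrapper might look like this (the attempt counts and delays are illustrative defaults):

```python
import time

def run_with_retries(fn, attempts=3, delay=1.0, backoff=2.0):
    """Call fn(), retrying with exponential backoff on any exception.
    Re-raises the last error once attempts are exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:
            last_error = e
            if attempt < attempts - 1:
                time.sleep(delay)
                delay *= backoff
    raise last_error

# Usage sketch:
# resources = run_with_retries(lambda: setlr.run_setl(setl_graph))
```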

Integration Patterns

CI/CD Integration

Integrate SETLr into your CI/CD pipeline:

# GitHub Actions example
name: Generate RDF

on: [push]

jobs:
  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install SETLr
        run: pip install setlr
      
      - name: Generate RDF
        run: setlr transform.setl.ttl -o output.ttl
      
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: rdf-output
          path: output.ttl

Docker Integration

Use SETLr in containerized environments:

FROM python:3.11-slim

# Install SETLr
RUN pip install setlr

# Copy your SETL scripts and data
COPY transform.setl.ttl /app/
COPY data/ /app/data/

WORKDIR /app

# Run transformation
CMD ["setlr", "transform.setl.ttl", "-o", "/output/result.ttl"]

Build and run the image:

docker build -t my-setl-transform .
docker run -v $(pwd)/output:/output my-setl-transform

Scheduled Processing

Run SETLr transformations on a schedule:

# scheduled_transform.py
import schedule
import time
from rdflib import Graph, URIRef
import setlr

def run_transform():
    """Run the SETL transformation"""
    print("Starting transformation...")
    
    setl_graph = Graph()
    setl_graph.parse('transform.setl.ttl', format='turtle')
    
    resources = setlr.run_setl(setl_graph)
    
    # Save output with timestamp
    timestamp = time.strftime('%Y%m%d_%H%M%S')
    output_file = f'output_{timestamp}.ttl'
    
    output = resources[URIRef('http://example.com/output')]
    output.serialize(output_file, format='turtle')
    
    print(f"Transformation complete: {output_file}")

# Schedule to run every day at 2 AM
schedule.every().day.at("02:00").do(run_transform)

while True:
    schedule.run_pending()
    time.sleep(60)
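Where the host already runs cron, the same daily job can be scheduled without a long-lived Python process. A crontab entry for a 2 AM run (the paths and the one-shot `run_transform.py` script, which would call `run_transform()` once and exit, are illustrative):

```shell
# min hour day month weekday  command
0 2 * * * cd /app && python3 run_transform.py >> /var/log/setlr.log 2>&1
```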

REST API Wrapper

Expose SETLr as a REST API:

from flask import Flask, request
from rdflib import Graph, URIRef
import setlr
import tempfile

app = Flask(__name__)

@app.route('/transform', methods=['POST'])
def transform():
    """Accept CSV data and SETL script, return RDF"""
    
    # Get input
    csv_data = request.files['data']
    setl_script = request.form['setl']
    
    # Save to temp files
    with tempfile.NamedTemporaryFile(mode='w', suffix='.csv') as csv_file:
        csv_data.save(csv_file.name)
        
        # Parse the SETL script; it must reference the uploaded file's path
        setl_graph = Graph()
        setl_graph.parse(data=setl_script, format='turtle')
        
        # Run transformation
        resources = setlr.run_setl(setl_graph)
        
        # Return RDF
        output = resources[URIRef('http://example.com/output')]
        return output.serialize(format='turtle'), 200, {
            'Content-Type': 'text/turtle'
        }

if __name__ == '__main__':
    app.run(debug=True)

Best Practices

2. Version Control

  • Store SETL scripts in version control
  • Track changes to transforms with your data processing pipeline
  • Use branches for experimental transforms

3. Testing

  • Test SETL scripts with sample data before production use
  • Validate output with SHACL shapes
  • Compare output to expected results

4. Documentation

  • Document complex transforms with comments (use rdfs:comment)
  • Maintain README files for transform collections
  • Include example data with your scripts

5. Monitoring

  • Log transformation results (record counts, errors)
  • Monitor resource usage for large datasets
  • Set up alerts for transformation failures
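The first and last points can be covered by wrapping the run in a small helper that logs outcome and duration; the `alert_fn` hook is a placeholder for whatever notification channel you use (email, Slack, PagerDuty), not part of setlr:

```python
import logging
import time

logger = logging.getLogger("setl.monitor")

def monitored_run(run_fn, alert_fn=None):
    """Run a transformation callable, logging duration and outcome.
    On failure, calls the optional alert_fn hook and re-raises."""
    start = time.time()
    try:
        result = run_fn()
    except Exception as e:
        logger.error("Transformation failed after %.1fs: %s", time.time() - start, e)
        if alert_fn:
            alert_fn(e)
        raise
    logger.info("Transformation finished in %.1fs", time.time() - start)
    return result

# Usage sketch:
# resources = monitored_run(lambda: setlr.run_setl(setl_graph))
```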

Next Steps

Support

For questions about advanced features: