SETLr provides advanced capabilities for complex data transformation workflows, large-scale processing, and production deployments.
This guide covers advanced topics, including:
- Multi-source transforms
- Conditional loading and filtering
- Performance optimization
- Error handling and debugging
- Integration patterns
For specific advanced features, see:
- Streaming XML with XPath - Efficient large file processing
- Python Functions in Transforms - Custom Python code
- SPARQL Support - Query and update endpoints
- SHACL Validation - Validate your RDF output
## Multi-Source Transforms

SETLr can combine data from multiple sources in a single transform:

```turtle
@prefix setl: <http://purl.org/twc/vocab/setl/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix csvw: <http://www.w3.org/ns/csvw#> .

# Load first table
:users a csvw:Table, setl:Table ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <users.csv>
    ] .

# Load second table
:orders a csvw:Table, setl:Table ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <orders.csv>
    ] .

# Transform using both tables
:output prov:wasGeneratedBy [
    a setl:Transform, setl:JSLDT ;
    prov:used :users, :orders ;
    prov:value '''
        [{
            "@for": "user in users",
            "@do": {
                "@id": "http://example.com/user/{{user.ID}}",
                "@type": "Person",
                "name": "{{user.Name}}",
                "orders": [{
                    "@for": "order in orders",
                    "@if": "order.UserID == user.ID",
                    "@do": {
                        "@id": "http://example.com/order/{{order.OrderID}}"
                    }
                }]
            }
        }]
    '''
] .
```

Sources in different formats can be combined in the same script:

```turtle
# CSV data
:csv_table a csvw:Table, setl:Table ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <data.csv>
    ] .

# JSON data
:json_data a setl:Table ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <data.json> ;
        setl:hasJSONSelector "$.items[*]"
    ] .

# XML data
:xml_data a setl:Table ;
    prov:wasGeneratedBy [
        a setl:Extract ;
        prov:used <data.xml> ;
        setl:hasXPathSelector "//item"
    ] .
```

## Conditional Loading and Filtering

Use conditional logic to selectively process data based on runtime conditions:
```json
[{
    "@for": "row in table",
    "@if": "row.Status == 'active' and row.Score > 50",
    "@do": {
        "@id": "http://example.com/entity/{{row.ID}}",
        "@type": "ActiveEntity",
        "score": "{{row.Score}}"
    }
}]
```

Conditions can also gate individual properties:

```json
{
    "@id": "http://example.com/person/{{row.ID}}",
    "@type": "Person",
    "name": "{{row.Name}}",
    "email": {
        "@if": "row.Email",
        "@do": "mailto:{{row.Email}}"
    },
    "phone": {
        "@if": "row.Phone and row.PhoneVerified",
        "@do": "{{row.Phone}}"
    }
}
```

For filtering large XML sources efficiently, see the Streaming XML documentation.
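The `@if` conditions above are ordinary boolean expressions over each row's fields. As a sanity check, the row filter from the first template can be reproduced in plain Python (the sample rows here are hypothetical):

```python
# Hypothetical rows standing in for a parsed CSV table.
rows = [
    {"ID": 1, "Status": "active", "Score": 72},
    {"ID": 2, "Status": "inactive", "Score": 90},
    {"ID": 3, "Status": "active", "Score": 40},
]

# Equivalent of "@if": "row.Status == 'active' and row.Score > 50"
kept = [row for row in rows if row["Status"] == "active" and row["Score"] > 50]

# Only row 1 passes both tests.
print([row["ID"] for row in kept])  # [1]
```

Reproducing a filter this way outside SETLr makes it easy to spot conditions that silently drop every row (e.g. a misspelled column name).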
## Performance Optimization

Process data in batches to control memory usage:

```python
from rdflib import Graph, URIRef
import setlr

# For very large datasets, process in chunks
chunk_size = 10000
offset = 0
output_graph = Graph()

while True:
    # Create the SETL script for this batch (create_batch_setl is user-supplied)
    setl_graph = create_batch_setl(offset, chunk_size)

    # Process the batch
    resources = setlr.run_setl(setl_graph)

    # Accumulate results
    batch_output = resources[URIRef('http://example.com/output')]
    output_graph += batch_output

    # A short batch means the end of the data has been reached
    if len(batch_output) < chunk_size:
        break
    offset += chunk_size

# Save the final results
output_graph.serialize('output.ttl', format='turtle')
```

For CSV/Excel files, pandas is used automatically. Optimize with:
- Use appropriate dtypes to reduce memory; specify them when loading the data if possible.
- For very wide tables, select only the columns you need by pre-processing the source data.

## Error Handling and Debugging

Enable detailed logging to diagnose issues:
```python
import logging
import setlr

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('setlr')
logger.setLevel(logging.DEBUG)

# Now run SETL
resources = setlr.run_setl(setl_graph)
```

Use tqdm for progress tracking on large datasets:
```python
import setlr

# Progress bars (via tqdm) are shown automatically for:
# - Large file processing
# - Batch operations
# - Network transfers
resources = setlr.run_setl(setl_graph)
```

Validate intermediate results to catch issues early:
```python
from rdflib import Graph, URIRef
from rdflib.namespace import RDF
import setlr

# Process data
resources = setlr.run_setl(setl_graph)
output = resources[URIRef('http://example.com/output')]

# Validate results
print(f"Generated {len(output)} triples")
print(f"Subjects: {len(set(output.subjects()))}")
print(f"Predicates: {len(set(output.predicates()))}")
print(f"Objects: {len(set(output.objects()))}")

# Check for specific patterns
for s, p, o in output.triples((None, RDF.type, None)):
    print(f"Type: {o}")
```

Handle errors gracefully in production:
```python
import setlr
from rdflib import Graph

try:
    setl_graph = Graph()
    setl_graph.parse('transform.setl.ttl', format='turtle')
    resources = setlr.run_setl(setl_graph)
except setlr.SetlrError as e:
    print(f"SETL processing error: {e}")
    # Handle gracefully
except Exception as e:
    print(f"Unexpected error: {e}")
    # Log and notify
```

## Integration Patterns

Integrate SETLr into your CI/CD pipeline:
```yaml
# GitHub Actions example
name: Generate RDF

on: [push]

jobs:
  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install SETLr
        run: pip install setlr
      - name: Generate RDF
        run: setlr transform.setl.ttl -o output.ttl
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: rdf-output
          path: output.ttl
```

Use SETLr in containerized environments:
```dockerfile
FROM python:3.11-slim

# Install SETLr
RUN pip install setlr

# Copy your SETL scripts and data
COPY transform.setl.ttl /app/
COPY data/ /app/data/

WORKDIR /app

# Run transformation
CMD ["setlr", "transform.setl.ttl", "-o", "/output/result.ttl"]
```

```shell
# Build and run
docker build -t my-setl-transform .
docker run -v $(pwd)/output:/output my-setl-transform
```

Run SETLr transformations on a schedule:
```python
# scheduled_transform.py
import time

import schedule
from rdflib import Graph, URIRef
import setlr

def run_transform():
    """Run the SETL transformation."""
    print("Starting transformation...")
    setl_graph = Graph()
    setl_graph.parse('transform.setl.ttl', format='turtle')
    resources = setlr.run_setl(setl_graph)

    # Save the output with a timestamp
    timestamp = time.strftime('%Y%m%d_%H%M%S')
    output_file = f'output_{timestamp}.ttl'
    output = resources[URIRef('http://example.com/output')]
    output.serialize(output_file, format='turtle')
    print(f"Transformation complete: {output_file}")

# Schedule to run every day at 2 AM
schedule.every().day.at("02:00").do(run_transform)

while True:
    schedule.run_pending()
    time.sleep(60)
```

Expose SETLr as a REST API:
```python
import tempfile

from flask import Flask, request
from rdflib import Graph, URIRef
import setlr

app = Flask(__name__)

@app.route('/transform', methods=['POST'])
def transform():
    """Accept CSV data and a SETL script; return RDF."""
    # Get the input
    csv_data = request.files['data']
    setl_script = request.form['setl']

    # Save the uploaded CSV to a temporary file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.csv') as csv_file:
        csv_data.save(csv_file.name)

        # Parse the SETL script (it should reference the temporary file's path)
        setl_graph = Graph()
        setl_graph.parse(data=setl_script, format='turtle')

        # Run the transformation
        resources = setlr.run_setl(setl_graph)

    # Return the RDF
    output = resources[URIRef('http://example.com/output')]
    return output.serialize(format='turtle'), 200, {
        'Content-Type': 'text/turtle'
    }

if __name__ == '__main__':
    app.run(debug=True)
```

## Best Practices

### Version Control

- Store SETL scripts in version control
- Track changes to transforms with your data processing pipeline
- Use branches for experimental transforms
### Testing

- Test SETL scripts with sample data before production use
- Validate output with SHACL shapes
- Compare output to expected results
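One way to compare output to expected results is a gold-standard check: keep a small set of expected triples alongside the script and diff the generated set against it on each run. A format-agnostic sketch using plain triple tuples (the sample triples are hypothetical; for rdflib graphs with blank nodes, `rdflib.compare.isomorphic` is the more robust comparison):

```python
# Hypothetical triples, written as plain (subject, predicate, object) tuples.
expected = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
}
actual = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:age", "41"),
}

# Report what the transform dropped or invented.
missing = expected - actual
unexpected = actual - expected
print(f"missing={len(missing)} unexpected={len(unexpected)}")  # missing=0 unexpected=1
```

Set difference makes the failure report actionable: it names the exact triples that are missing or unexpected, rather than just flagging a mismatch.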
- Document complex transforms with comments (use rdfs:comment)
- Maintain README files for transform collections
- Include example data with your scripts
### Monitoring

- Log transformation results (record counts, errors)
- Monitor resource usage for large datasets
- Set up alerts for transformation failures
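The monitoring points above can start as a thin wrapper that records the record count of each run and logs failures before re-raising them. A sketch using only the standard library (`run_transform` here is a hypothetical stand-in for your `setlr.run_setl` call):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("transform-monitor")

def monitored(run_transform):
    """Wrap a transform callable: log its record count, log and re-raise failures."""
    def wrapper(*args, **kwargs):
        try:
            output = run_transform(*args, **kwargs)
        except Exception:
            log.exception("transformation failed")  # hook alerting in here
            raise
        log.info("transformation produced %d records", len(output))
        return output
    return wrapper

# Hypothetical transform returning a collection of triples.
demo = monitored(lambda: [("ex:a", "rdf:type", "ex:Thing")])
result = demo()
```

Because the wrapper re-raises, a scheduler or CI job still sees the failure; the log line is purely additive.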
## Next Steps

- Explore Streaming XML for large file processing
- Learn about Python Functions for custom logic
- Set up SPARQL endpoints for data loading
- Implement SHACL validation for quality control
## Getting Help

For questions about advanced features:
- Check the documentation
- Open a discussion
- Report issues on GitHub