A Python tool for efficiently processing RDF HDT files to generate VOID (Vocabulary of Interlinked Datasets) descriptions.
void-hdt analyzes HDT files and produces comprehensive metadata about RDF datasets using the VOID vocabulary. It leverages the efficiency of the HDT format and uses iterator-based processing to handle large datasets.
- Dataset Statistics: Total triples, distinct subjects, predicates, and objects
- Dataset Property Partitions: Triple counts per property across all triples
- Class Partitions: Identifies classes (via rdf:type) and counts instances
- Property Partitions: For each class, documents property usage and triple counts
- Object Class Partitions: Breakdown of object classes for each property partition
- Efficient Processing: Iterator-based design for memory-efficient handling of large files
- Turtle Output: Generates VOID descriptions in Turtle format
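The dataset-level statistics and per-property counts listed above can be gathered in a single pass over a triple iterator. The following is a minimal sketch of that idea, not the tool's actual implementation: `dataset_statistics` and `sample_triples` are hypothetical names, and the generator stands in for whatever iterator the HDT reader provides.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]


def dataset_statistics(triples: Iterable[Triple]) -> dict:
    """Consume triples one at a time and return dataset-level counts."""
    total = 0
    subjects, objects = set(), set()
    per_property = Counter()  # predicate -> number of triples using it
    for s, p, o in triples:
        total += 1
        subjects.add(s)
        objects.add(o)
        per_property[p] += 1
    return {
        "triples": total,
        "distinct_subjects": len(subjects),
        "properties": len(per_property),
        "distinct_objects": len(objects),
        "property_partitions": dict(per_property),
    }


def sample_triples() -> Iterator[Triple]:
    """Hypothetical stand-in for an HDT triple iterator."""
    yield ("ex:alice", "rdf:type", "ex:Person")
    yield ("ex:alice", "ex:name", '"Alice"')
    yield ("ex:alice", "ex:worksFor", "ex:acme")
    yield ("ex:acme", "rdf:type", "ex:Company")


stats = dataset_statistics(sample_triples())
```

Only one triple is held at a time, but the sets of distinct subjects and objects still grow with the data, so this sketch is memory-light rather than constant-memory.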
Build the Docker image:
docker build -t void-hdt .

Or install locally with uv:
uv sync

Or with pip:
pip install -e .

Process an HDT file using Docker:
docker run --rm -v /path/to/data:/data void-hdt /data/input.hdt -o /data/output.ttl

Or with custom dataset URI:
docker run --rm -v /path/to/data:/data void-hdt \
/data/input.hdt \
-o /data/output.ttl \
--dataset-uri http://example.org/mydata

Or run the installed command directly:
void-hdt input.hdt -o output.ttl

Arguments and options:
- HDT_FILE: Path to the input HDT file (required)
- -o, --output PATH: Output file path for VOID description (required)
- --dataset-uri URI: URI for the dataset being described (default: http://example.org/dataset)
- --use-blank-nodes: Use blank nodes for partition nodes instead of URI references (optional)

Example:
void-hdt data/mydata.hdt -o void-description.ttl --dataset-uri http://example.org/mydata

The tool generates a VOID description in Turtle format that includes:
- Dataset-level statistics (triples, distinct subjects/predicates/objects)
- Dataset-level property partitions showing triple counts per property
- Class partitions with entity and triple counts
- Property partitions within each class partition showing triple counts per property
- Object class partitions showing the breakdown of object types for each property
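To make the nesting of these partitions concrete, here is a small sketch that derives class partitions, their property partitions, and the object class breakdown from an in-memory list of triples. It is an illustration under stated assumptions, not the tool's algorithm: `class_partitions` and the `UNTYPED` key are hypothetical, the data is read twice instead of streamed, and rdf:type triples are counted inside the class partition, which the real tool may or may not do.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
UNTYPED = None  # key used here for objects that have no rdf:type


def class_partitions(triples):
    """Group triples by the class of their subject (illustrative only)."""
    triples = list(triples)  # kept in memory so the sample can be read twice

    # Pass 1: record the classes of every typed resource.
    classes_of = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes_of[s].add(o)

    # Pass 2: attribute each triple to every class of its subject.
    parts = {}
    for s, p, o in triples:
        for cls in classes_of.get(s, ()):
            part = parts.setdefault(
                cls, {"entities": set(), "triples": 0, "properties": {}}
            )
            part["entities"].add(s)
            part["triples"] += 1
            prop = part["properties"].setdefault(
                p, {"triples": 0, "object_classes": defaultdict(int)}
            )
            prop["triples"] += 1
            # Literals and untyped resources fall into the UNTYPED bucket.
            for obj_cls in classes_of.get(o) or (UNTYPED,):
                prop["object_classes"][obj_cls] += 1
    return parts


EX = "http://example.org/"
sample = [
    (EX + "alice", RDF_TYPE, EX + "Person"),
    (EX + "alice", EX + "name", '"Alice"'),
    (EX + "alice", EX + "worksFor", EX + "acme"),
    (EX + "bob", RDF_TYPE, EX + "Person"),
    (EX + "bob", EX + "worksFor", EX + "acme"),
    (EX + "acme", RDF_TYPE, EX + "Company"),
]
parts = class_partitions(sample)
person = parts[EX + "Person"]
```

In this sample, `person` covers two entities and five triples, and its `worksFor` property partition points entirely at objects typed as `Company`.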
Example output structure:
@prefix void: <http://rdfs.org/ns/void#> .
@prefix voidext: <http://ldf.fi/void-ext#> .
<http://example.org/dataset> a void:Dataset ;
void:triples 1000000 ;
void:distinctSubjects 50000 ;
void:properties 25 ;
void:distinctObjects 75000 ;
void:propertyPartition <http://example.org/dataset/property/abc123...> ;
void:classPartition <http://example.org/dataset/class/def456...> .
# Dataset-level property partition
<http://example.org/dataset/property/abc123...> a void:Dataset ;
void:property <http://example.org/name> ;
void:triples 50000 .
# Class partition
<http://example.org/dataset/class/def456...> a void:Dataset ;
void:class <http://example.org/Person> ;
void:entities 10000 ;
void:triples 30000 ;
void:propertyPartition <http://example.org/dataset/class/def456.../property/789abc...> .
# Property partition within class
<http://example.org/dataset/class/def456.../property/789abc...> a void:Dataset ;
void:property <http://example.org/worksFor> ;
void:triples 8000 ;
voidext:objectClassPartition <.../target/xyz789...> .
# Object class partition
<.../target/xyz789...> a void:Dataset ;
void:class <http://example.org/Company> ;
void:triples 8000 .

Note: Partition URIs use MD5 hashes of the original IRIs to ensure syntactically valid URIs. The original IRIs are preserved via void:class and void:property predicates.
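As a rough illustration of the hashing scheme in the note, a partition URI could be assembled as below. The helper name `partition_uri` and the exact path layout are assumptions inferred from the example output; the tool's real construction may differ.

```python
import hashlib


def partition_uri(base: str, kind: str, iri: str) -> str:
    """Append an MD5 hex digest of `iri` to `base` under the path segment `kind`."""
    digest = hashlib.md5(iri.encode("utf-8")).hexdigest()
    return f"{base}/{kind}/{digest}"


dataset = "http://example.org/dataset"
# Class partition for ex:Person, then a property partition nested beneath it.
class_part = partition_uri(dataset, "class", "http://example.org/Person")
prop_part = partition_uri(class_part, "property", "http://example.org/worksFor")
```

Because the digest is a fixed-length hexadecimal string, characters in the original IRI such as `#`, `?`, or spaces never end up in the partition URI, and the same IRI always maps to the same partition node.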
The following Mermaid diagram summarizes the VoID structure produced by this tool. The source for this diagram is also available in void-output-schema-class.mmd.
classDiagram
class Dataset {
void:triples : xsd:integer
void:distinctSubjects : xsd:integer
void:properties : xsd:integer
void:distinctObjects : xsd:integer
}
class DatasetPropertyPartition {
void:property : IRI
void:triples : xsd:integer
}
class ClassPartition {
void:class : IRI
void:entities : xsd:integer
void:triples : xsd:integer
}
class ClassPropertyPartition {
void:property : IRI
void:triples : xsd:integer
}
class ObjectClassPartition {
void:class : IRI
void:triples : xsd:integer
}
class UntypedObjectPartition {
void:triples : xsd:integer
}
Dataset "1" --> "0..*" DatasetPropertyPartition : void#58;propertyPartition
Dataset "1" --> "0..*" ClassPartition : void#58;classPartition
ClassPartition "1" --> "0..*" ClassPropertyPartition : void#58;propertyPartition
ClassPropertyPartition "1" --> "0..*" ObjectClassPartition : voidext#58;objectClassPartition
ClassPropertyPartition "1" --> "0..*" UntypedObjectPartition : voidext#58;objectClassPartition
note for Dataset "rdf:type void:Dataset"
note for DatasetPropertyPartition "rdf:type void:Dataset"
note for ClassPartition "rdf:type void:Dataset"
note for ClassPropertyPartition "rdf:type void:Dataset"
note for ObjectClassPartition "rdf:type void:Dataset"
note for UntypedObjectPartition "rdf:type void:Dataset"
- Python 3.12+
- uv for dependency management
- rdflib-hdt for HDT file access
- rdflib for VOID vocabulary generation
- click for CLI
- ty for type checking
- ruff for formatting and linting

Run the tests:
uv run pytest -v

Run the type checker:
uvx ty check

Format the code:
uv run ruff format

Lint the code:
uv run ruff check

License: MIT

Contributions are welcome! Please feel free to submit issues or pull requests.