frink-okn/void-hdt

void-hdt

A Python tool for efficiently processing RDF HDT files to generate VOID (Vocabulary of Interlinked Datasets) descriptions.

Overview

void-hdt analyzes HDT files and produces comprehensive metadata about RDF datasets using the VOID vocabulary. It builds on the compact HDT format and processes triples through iterators, which keeps memory use low when handling large datasets.

Features

  • Dataset Statistics: Total triples, distinct subjects, predicates, and objects
  • Dataset Property Partitions: Triple counts per property across all triples
  • Class Partitions: Identifies classes (via rdf:type) and counts instances
  • Property Partitions: For each class, documents property usage and triple counts
  • Object Class Partitions: Breakdown of object classes for each property partition
  • Efficient Processing: Iterator-based design for memory-efficient handling of large files
  • Turtle Output: Generates VOID descriptions in Turtle format
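As a rough illustration of the iterator-based idea behind the dataset statistics and dataset property partitions, the sketch below consumes triples one at a time and keeps only counters and sets of distinct terms. It is not the tool's actual implementation: triples are modeled as plain string tuples rather than HDT terms, and the names `dataset_statistics` and `sample_triples` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator

Triple = tuple[str, str, str]


def dataset_statistics(triples: Iterable[Triple]) -> dict:
    """Compute dataset-level counts in a single pass over a triple iterator."""
    subjects: set[str] = set()
    objects: set[str] = set()
    per_property: Counter[str] = Counter()
    total = 0
    for s, p, o in triples:
        # Each triple is seen once and then discarded; only aggregates are kept.
        total += 1
        subjects.add(s)
        objects.add(o)
        per_property[p] += 1
    return {
        "triples": total,
        "distinctSubjects": len(subjects),
        "properties": len(per_property),
        "distinctObjects": len(objects),
        "propertyPartitions": dict(per_property),
    }


def sample_triples() -> Iterator[Triple]:
    """Toy data standing in for triples read from an HDT file (assumption)."""
    yield ("ex:alice", "rdf:type", "ex:Person")
    yield ("ex:alice", "ex:name", '"Alice"')
    yield ("ex:alice", "ex:worksFor", "ex:acme")
    yield ("ex:acme", "rdf:type", "ex:Company")


stats = dataset_statistics(sample_triples())
print(stats["triples"], stats["properties"])
```

The sketch still holds the distinct subjects and objects in memory, so it only illustrates the streaming pattern, not how the tool bounds memory on very large files.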

Installation

Using Docker (recommended for quick usage)

Build the Docker image:

docker build -t void-hdt .

Using uv (for development)

uv sync

Using pip

pip install -e .

Usage

Docker

Process an HDT file using Docker:

docker run --rm -v /path/to/data:/data void-hdt /data/input.hdt -o /data/output.ttl

Or with a custom dataset URI:

docker run --rm -v /path/to/data:/data void-hdt \
  /data/input.hdt \
  -o /data/output.ttl \
  --dataset-uri http://example.org/mydata

Local Installation

Command Line

void-hdt input.hdt -o output.ttl

Options

  • HDT_FILE: Path to the input HDT file (required)
  • -o, --output PATH: Output file path for VOID description (required)
  • --dataset-uri URI: URI for the dataset being described (default: http://example.org/dataset)
  • --use-blank-nodes: Use blank nodes for partition nodes instead of URI references (optional)

Example

void-hdt data/mydata.hdt -o void-description.ttl --dataset-uri http://example.org/mydata

Output Format

The tool generates a VOID description in Turtle format that includes:

  • Dataset-level statistics (triples, distinct subjects/predicates/objects)
  • Dataset-level property partitions showing triple counts per property
  • Class partitions with entity and triple counts
  • Property partitions within each class partition showing triple counts per property
  • Object class partitions showing the breakdown of object types for each property

Example output structure:

@prefix void: <http://rdfs.org/ns/void#> .
@prefix voidext: <http://ldf.fi/void-ext#> .

<http://example.org/dataset> a void:Dataset ;
    void:triples 1000000 ;
    void:distinctSubjects 50000 ;
    void:properties 25 ;
    void:distinctObjects 75000 ;
    void:propertyPartition <http://example.org/dataset/property/abc123...> ;
    void:classPartition <http://example.org/dataset/class/def456...> .

# Dataset-level property partition
<http://example.org/dataset/property/abc123...> a void:Dataset ;
    void:property <http://example.org/name> ;
    void:triples 50000 .

# Class partition
<http://example.org/dataset/class/def456...> a void:Dataset ;
    void:class <http://example.org/Person> ;
    void:entities 10000 ;
    void:triples 30000 ;
    void:propertyPartition <http://example.org/dataset/class/def456.../property/789abc...> .

# Property partition within class
<http://example.org/dataset/class/def456.../property/789abc...> a void:Dataset ;
    void:property <http://example.org/worksFor> ;
    void:triples 8000 ;
    voidext:objectClassPartition <.../target/xyz789...> .

# Object class partition
<.../target/xyz789...> a void:Dataset ;
    void:class <http://example.org/Company> ;
    void:triples 8000 .

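Note: Partition URIs embed MD5 hashes of the original class and property IRIs, so they are always syntactically valid URIs. The original IRIs are preserved via the void:class and void:property predicates. The hashes and partition URIs in the example above are abbreviated.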
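The hashing step can be sketched with Python's standard `hashlib` module. The path layout (`<parent>/<kind>/<hash>`) is inferred from the example output above, and the function name `partition_uri` is hypothetical; the tool's exact URI scheme and the exact string it hashes may differ.

```python
import hashlib


def partition_uri(parent_uri: str, kind: str, iri: str) -> str:
    """Derive a partition URI by appending the MD5 hex digest of an IRI.

    `kind` is a path segment such as "class", "property", or "target"
    (segment names taken from the example output; the layout is an assumption).
    """
    # MD5 is used here only as a stable identifier, not for security.
    digest = hashlib.md5(iri.encode("utf-8"), usedforsecurity=False).hexdigest()
    return f"{parent_uri}/{kind}/{digest}"


dataset = "http://example.org/dataset"
class_uri = partition_uri(dataset, "class", "http://example.org/Person")
prop_uri = partition_uri(class_uri, "property", "http://example.org/worksFor")
print(class_uri)
print(prop_uri)
```

Because the digest contains only hexadecimal characters, the resulting URI stays valid even when the original IRI contains characters that would need escaping in a path segment.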

Output Schema

The following Mermaid diagram summarizes the VoID structure produced by this tool. The source for this diagram is also available in void-output-schema-class.mmd.

classDiagram
    class Dataset {
        void:triples : xsd:integer
        void:distinctSubjects : xsd:integer
        void:properties : xsd:integer
        void:distinctObjects : xsd:integer
    }

    class DatasetPropertyPartition {
        void:property : IRI
        void:triples : xsd:integer
    }

    class ClassPartition {
        void:class : IRI
        void:entities : xsd:integer
        void:triples : xsd:integer
    }

    class ClassPropertyPartition {
        void:property : IRI
        void:triples : xsd:integer
    }

    class ObjectClassPartition {
        void:class : IRI
        void:triples : xsd:integer
    }

    class UntypedObjectPartition {
        void:triples : xsd:integer
    }

    Dataset "1" --> "0..*" DatasetPropertyPartition : void#58;propertyPartition
    Dataset "1" --> "0..*" ClassPartition : void#58;classPartition
    ClassPartition "1" --> "0..*" ClassPropertyPartition : void#58;propertyPartition
    ClassPropertyPartition "1" --> "0..*" ObjectClassPartition : voidext#58;objectClassPartition
    ClassPropertyPartition "1" --> "0..*" UntypedObjectPartition : voidext#58;objectClassPartition

    note for Dataset "rdf:type void:Dataset"
    note for DatasetPropertyPartition "rdf:type void:Dataset"
    note for ClassPartition "rdf:type void:Dataset"
    note for ClassPropertyPartition "rdf:type void:Dataset"
    note for ObjectClassPartition "rdf:type void:Dataset"
    note for UntypedObjectPartition "rdf:type void:Dataset"
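
The nesting in the schema above (class partition, then property partition, then object class partition) can be mirrored by a nested counting structure. The sketch below is an illustration under stated assumptions, not the tool's code: triples are plain string tuples, the type map is held in memory, rdf:type triples are counted like any other property, and objects without an rdf:type fall into an untyped bucket keyed by `UNTYPED`. All names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Sequence

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"
UNTYPED = None  # key for the untyped object partition (assumption)

Triple = tuple[str, str, str]


def build_class_partitions(triples: Sequence[Triple]) -> dict[str, dict]:
    """Count entities, triples, properties, and object classes per class."""
    # Pass 1: record the classes of every typed resource.
    types: dict[str, set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    partitions: dict[str, dict] = {}
    for classes in types.values():
        for cls in classes:
            part = partitions.setdefault(
                cls, {"entities": 0, "triples": 0, "properties": {}}
            )
            part["entities"] += 1

    # Pass 2: attribute each triple to the class partitions of its subject.
    for s, p, o in triples:
        for cls in types.get(s, ()):
            part = partitions[cls]
            part["triples"] += 1
            prop = part["properties"].setdefault(
                p, {"triples": 0, "objectClasses": Counter()}
            )
            prop["triples"] += 1
            # Objects with no rdf:type are counted in the untyped bucket.
            for object_class in types.get(o) or {UNTYPED}:
                prop["objectClasses"][object_class] += 1
    return partitions


triples = [
    (EX + "alice", RDF_TYPE, EX + "Person"),
    (EX + "alice", EX + "name", '"Alice"'),
    (EX + "alice", EX + "worksFor", EX + "acme"),
    (EX + "bob", RDF_TYPE, EX + "Person"),
    (EX + "bob", EX + "worksFor", EX + "acme"),
    (EX + "acme", RDF_TYPE, EX + "Company"),
]
partitions = build_class_partitions(triples)
print(partitions[EX + "Person"]["entities"])
```

The sketch needs two passes over the data, so it takes a sequence rather than a one-shot iterator.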

Development

Requirements

  • Python 3.12+
  • uv for dependency management
  • rdflib-hdt for HDT file access
  • rdflib for VOID vocabulary generation
  • click for CLI
  • ty for type checking
  • ruff for formatting and linting

Running Tests

uv run pytest -v

Type Checking

uvx ty check

Formatting

uv run ruff format

Linting

uv run ruff check

License

MIT

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.
