void-hdt

A Python tool for efficiently processing RDF HDT files to generate VOID (Vocabulary of Interlinked Datasets) descriptions.

Overview

void-hdt analyzes HDT files and produces comprehensive metadata about RDF datasets using the VOID vocabulary. It builds on the compact, queryable HDT format and processes triples through iterators, so large datasets can be handled without loading them into memory all at once.

Features

  • Dataset Statistics: Total triples, distinct subjects, predicates, and objects
  • Dataset Property Partitions: Triple counts per property across all triples
  • Class Partitions: Identifies classes (via rdf:type) and counts instances
  • Property Partitions: For each class, documents property usage and triple counts
  • Object Class Partitions: Breakdown of object classes for each property partition
  • Efficient Processing: Iterator-based design for memory-efficient handling of large files
  • Turtle Output: Generates VOID descriptions in Turtle format
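
To make the iterator-based idea concrete, here is a minimal sketch of how the dataset-level statistics and per-property triple counts could be gathered in a single pass over a stream of triples. It is not the tool's actual implementation: `dataset_stats`, `DatasetStats`, and `sample_triples` are hypothetical names, and triples are modeled as plain string tuples rather than terms read from an HDT file.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple

# Assumption: a triple is modeled as three plain strings for illustration.
Triple = tuple[str, str, str]


class DatasetStats(NamedTuple):
    triples: int
    distinct_subjects: int
    properties: int
    distinct_objects: int
    property_partitions: Counter


def dataset_stats(triples: Iterable[Triple]) -> DatasetStats:
    """Consume the triples once, keeping only counters and sets of terms."""
    subjects: set[str] = set()
    objects: set[str] = set()
    per_property: Counter = Counter()
    total = 0
    for s, p, o in triples:
        total += 1
        subjects.add(s)
        objects.add(o)
        per_property[p] += 1  # triple count for the property partition
    return DatasetStats(
        total, len(subjects), len(per_property), len(objects), per_property
    )


def sample_triples() -> Iterator[Triple]:
    """Hypothetical stand-in for an iterator over an HDT file."""
    yield ("ex:alice", "rdf:type", "ex:Person")
    yield ("ex:alice", "ex:name", '"Alice"')
    yield ("ex:alice", "ex:worksFor", "ex:acme")
    yield ("ex:acme", "rdf:type", "ex:Company")


stats = dataset_stats(sample_triples())
print(stats.triples, stats.distinct_subjects, stats.properties, stats.distinct_objects)
```

The triples themselves are never materialized as a list, although the sets of distinct subjects and objects in this sketch still grow with the data.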

Installation

Using Docker (recommended for quick usage)

Build the Docker image:

docker build -t void-hdt .

Using uv (for development)

uv sync

Using pip

pip install -e .

Usage

Docker

Process an HDT file using Docker:

docker run --rm -v /path/to/data:/data void-hdt /data/input.hdt -o /data/output.ttl

Or with a custom dataset URI:

docker run --rm -v /path/to/data:/data void-hdt \
  /data/input.hdt \
  -o /data/output.ttl \
  --dataset-uri http://example.org/mydata

Local Installation

Command Line

void-hdt input.hdt -o output.ttl

Options

  • HDT_FILE: Path to the input HDT file (required)
  • -o, --output PATH: Output file path for VOID description (required)
  • --dataset-uri URI: URI for the dataset being described (default: http://example.org/dataset)
  • --use-blank-nodes: Use blank nodes for partition nodes instead of URI references (optional)

Example

void-hdt data/mydata.hdt -o void-description.ttl --dataset-uri http://example.org/mydata

Output Format

The tool generates a VOID description in Turtle format that includes:

  • Dataset-level statistics (triples, distinct subjects/predicates/objects)
  • Dataset-level property partitions showing triple counts per property
  • Class partitions with entity and triple counts
  • Property partitions within each class partition showing triple counts per property
  • Object class partitions showing the breakdown of object types for each property
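
As an illustration of the class-level counts listed above, the following sketch computes entity counts, triple counts, and per-property counts for each class using two passes over the data. It rests on assumptions that the README does not confirm: that a class partition's triple count covers every triple whose subject is an instance of the class (including the rdf:type triple itself), and that a resource with several types contributes to each of its classes. The names `class_partitions`, `scan`, and `DATA` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Iterable

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"
Triple = tuple[str, str, str]


def class_partitions(
    scan: Callable[[], Iterable[Triple]],
) -> tuple[Counter, Counter, dict[str, Counter]]:
    """`scan` must return a fresh iterator over all triples on each call."""
    # Pass 1: record the classes of every typed subject.
    classes_of: dict[str, set[str]] = defaultdict(set)
    for s, p, o in scan():
        if p == RDF_TYPE:
            classes_of[s].add(o)

    # Entities per class: each subject counts once per class it belongs to.
    entities: Counter = Counter()
    for classes in classes_of.values():
        entities.update(classes)

    # Pass 2: attribute each triple to the classes of its subject.
    triples: Counter = Counter()
    per_property: dict[str, Counter] = defaultdict(Counter)
    for s, p, o in scan():
        for cls in classes_of.get(s, ()):
            triples[cls] += 1
            per_property[cls][p] += 1
    return entities, triples, per_property


# Hypothetical in-memory data standing in for an HDT file.
DATA: list[Triple] = [
    (EX + "alice", RDF_TYPE, EX + "Person"),
    (EX + "alice", EX + "name", '"Alice"'),
    (EX + "alice", EX + "worksFor", EX + "acme"),
    (EX + "bob", RDF_TYPE, EX + "Person"),
    (EX + "bob", EX + "worksFor", EX + "acme"),
    (EX + "acme", RDF_TYPE, EX + "Company"),
]

entities, triples, per_property = class_partitions(lambda: iter(DATA))
print(entities[EX + "Person"], triples[EX + "Person"])
```

Passing a callable rather than a single iterator lets the data be scanned twice without holding all triples in memory; only the subject-to-class mapping is kept between passes.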

Example output structure:

@prefix void: <http://rdfs.org/ns/void#> .
@prefix voidext: <http://ldf.fi/void-ext#> .

<http://example.org/dataset> a void:Dataset ;
    void:triples 1000000 ;
    void:distinctSubjects 50000 ;
    void:properties 25 ;
    void:distinctObjects 75000 ;
    void:propertyPartition <http://example.org/dataset/property/abc123...> ;
    void:classPartition <http://example.org/dataset/class/def456...> .

# Dataset-level property partition
<http://example.org/dataset/property/abc123...> a void:Dataset ;
    void:property <http://example.org/name> ;
    void:triples 50000 .

# Class partition
<http://example.org/dataset/class/def456...> a void:Dataset ;
    void:class <http://example.org/Person> ;
    void:entities 10000 ;
    void:triples 30000 ;
    void:propertyPartition <http://example.org/dataset/class/def456.../property/789abc...> .

# Property partition within class
<http://example.org/dataset/class/def456.../property/789abc...> a void:Dataset ;
    void:property <http://example.org/worksFor> ;
    void:triples 8000 ;
    voidext:objectClassPartition <.../target/xyz789...> .

# Object class partition
<.../target/xyz789...> a void:Dataset ;
    void:class <http://example.org/Company> ;
    void:triples 8000 .

Note: Partition URIs use MD5 hashes of the original IRIs to ensure syntactically valid URIs. The original IRIs are preserved via void:class and void:property predicates.
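
The hashing described in the note can be sketched as follows. This is an assumption-laden illustration, not the tool's code: `partition_uri` is a hypothetical helper, the path layout simply mirrors the example output above, and whether the tool uses the full hex digest or a shortened form is not shown in this README.

```python
import hashlib


def partition_uri(parent_uri: str, kind: str, iri: str) -> str:
    """Mint a partition URI under `parent_uri` from the MD5 hash of `iri`.

    `kind` is a path segment such as "class" or "property" (assumed layout).
    """
    # usedforsecurity=False marks this as a non-cryptographic use of MD5.
    digest = hashlib.md5(iri.encode("utf-8"), usedforsecurity=False).hexdigest()
    return f"{parent_uri.rstrip('/')}/{kind}/{digest}"


class_uri = partition_uri(
    "http://example.org/dataset", "class", "http://example.org/Person"
)
# A property partition nested inside the class partition.
nested_uri = partition_uri(class_uri, "property", "http://example.org/worksFor")
print(class_uri)
print(nested_uri)
```

Because the hex digest contains only the characters 0-9 and a-f, the resulting URI stays valid even when the original IRI contains characters that would need escaping in a path segment.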

Output Schema

The following Mermaid diagram summarizes the VoID structure produced by this tool. The source for this diagram is also available in void-output-schema-class.mmd.

classDiagram
    class Dataset {
        void:triples : xsd:integer
        void:distinctSubjects : xsd:integer
        void:properties : xsd:integer
        void:distinctObjects : xsd:integer
    }

    class DatasetPropertyPartition {
        void:property : IRI
        void:triples : xsd:integer
    }

    class ClassPartition {
        void:class : IRI
        void:entities : xsd:integer
        void:triples : xsd:integer
    }

    class ClassPropertyPartition {
        void:property : IRI
        void:triples : xsd:integer
    }

    class ObjectClassPartition {
        void:class : IRI
        void:triples : xsd:integer
    }

    class UntypedObjectPartition {
        void:triples : xsd:integer
    }

    Dataset "1" --> "0..*" DatasetPropertyPartition : void#58;propertyPartition
    Dataset "1" --> "0..*" ClassPartition : void#58;classPartition
    ClassPartition "1" --> "0..*" ClassPropertyPartition : void#58;propertyPartition
    ClassPropertyPartition "1" --> "0..*" ObjectClassPartition : voidext#58;objectClassPartition
    ClassPropertyPartition "1" --> "0..*" UntypedObjectPartition : voidext#58;objectClassPartition

    note for Dataset "rdf:type void:Dataset"
    note for DatasetPropertyPartition "rdf:type void:Dataset"
    note for ClassPartition "rdf:type void:Dataset"
    note for ClassPropertyPartition "rdf:type void:Dataset"
    note for ObjectClassPartition "rdf:type void:Dataset"
    note for UntypedObjectPartition "rdf:type void:Dataset"

Development

Requirements

  • Python 3.12+
  • uv for dependency management
  • rdflib-hdt for HDT file access
  • rdflib for VOID vocabulary generation
  • click for CLI
  • ty for type checking
  • ruff for formatting and linting

Running Tests

uv run pytest -v

Type Checking

uvx ty check

Formatting

uv run ruff format

Linting

uv run ruff check

License

MIT

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.