Dataset set difference operation is very slow #3363

@edmondchuc

Description

The current Dataset set difference operation is very slow. On a reasonably small dataset with around 30k statements, it takes around 30 minutes to finish processing.

rdflib/rdflib/graph.py

Lines 878 to 888 in 91c8cbc

def __sub__(self, other: Graph) -> Graph:
    """Set-theoretic difference.
    BNode IDs are not changed."""
    try:
        retval = type(self)()
    except TypeError:
        retval = Graph()
    for x in self:
        if x not in other:
            retval.add(x)
    return retval

By using Python's built-in set, I was able to reduce the processing time to less than 1 second.

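import logging

logger = logging.getLogger(__name__)

# ds and previous_ds are the current and previous rdflib Dataset objects.
# Convert datasets to sets of quads for faster diff operations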
logger.info("Extracting quads from current dataset")
ds_quads = set(ds.quads())
logger.info(f"Quads: {len(ds_quads)}")
logger.info("Extracting quads from previous dataset")
previous_ds_quads = set(previous_ds.quads())
logger.info(f"Quads: {len(previous_ds_quads)}")

# Compute diffs using set operations
# Statements in previous but not in current = deletions
logger.info("Computing deletions (previous - current)")
to_delete_quads = previous_ds_quads - ds_quads
# Statements in current but not in previous = additions
logger.info("Computing additions (current - previous)")
to_add_quads = ds_quads - previous_ds_quads
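
The workaround above can be wrapped in a small reusable helper. This is a minimal sketch, not part of rdflib: `diff_quads` is a hypothetical name, and it only assumes that the quads are hashable, as the `set(ds.quads())` call above already requires. Plain tuples stand in for real quads so that the example runs without any dataset.

```python
from typing import Hashable, Iterable, Set, Tuple


def diff_quads(
    current: Iterable[Hashable],
    previous: Iterable[Hashable],
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Return (to_add, to_delete) for two collections of hashable quads.

    Hypothetical helper for illustration; not an rdflib API.
    """
    # Materialise both sides once so each membership test is a hash lookup.
    current_set = set(current)
    previous_set = set(previous)
    # In current but not in previous = additions
    to_add = current_set - previous_set
    # In previous but not in current = deletions
    to_delete = previous_set - current_set
    return to_add, to_delete


# Plain tuples stand in for (subject, predicate, object, graph) quads here.
previous = [("s1", "p", "o1", "g"), ("s2", "p", "o2", "g")]
current = [("s2", "p", "o2", "g"), ("s3", "p", "o3", "g")]

to_add, to_delete = diff_quads(current, previous)
```

With real data the call would presumably be `diff_quads(ds.quads(), previous_ds.quads())`. Like the existing `__sub__`, this compares blank nodes by their IDs as-is and does no graph isomorphism matching.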
