Dataset set difference operation is very slow #3363

The current `Dataset` set difference operation (`__sub__`, i.e. `ds_a - ds_b`) is very slow. On a fairly small dataset of around 30k statements, it takes around 30 minutes to complete.

https://github.com/RDFLib/rdflib/blob/91c8cbc5148397260a1d219e60f4d1051f636cd3/rdflib/graph.py#L878-L888

By using Python's built-in `set` instead, I was able to reduce the processing time to less than 1 second.

```python
import logging

logger = logging.getLogger(__name__)

# `ds` and `previous_ds` are existing rdflib Dataset instances.
# Convert datasets to sets of quads for faster diff operations
logger.info("Extracting quads from current dataset")
ds_quads = set(ds.quads())
logger.info(f"Quads: {len(ds_quads)}")
logger.info("Extracting quads from previous dataset")
previous_ds_quads = set(previous_ds.quads())
logger.info(f"Quads: {len(previous_ds_quads)}")

# Compute diffs using set operations
# Statements in previous but not in current = deletions
logger.info("Computing deletions (previous - current)")
to_delete_quads = previous_ds_quads - ds_quads
# Statements in current but not in previous = additions
logger.info("Computing additions (current - previous)")
to_add_quads = ds_quads - previous_ds_quads
```
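The same idea can be written as a small reusable helper. This is only a minimal sketch: `diff_quads` is a hypothetical name, not part of rdflib, and it assumes the items passed in are hashable (plain tuples stand in for RDF quads here so the example runs without rdflib).

```python
from typing import Hashable, Iterable, Set, Tuple, TypeVar

Q = TypeVar("Q", bound=Hashable)


def diff_quads(current: Iterable[Q], previous: Iterable[Q]) -> Tuple[Set[Q], Set[Q]]:
    """Return (to_add, to_delete) between two collections of quads.

    Hypothetical helper, not an rdflib API. Both inputs are materialised
    as sets once, so each difference is computed with hash lookups.
    """
    current_set = set(current)
    previous_set = set(previous)
    to_add = current_set - previous_set      # in current, not in previous
    to_delete = previous_set - current_set   # in previous, not in current
    return to_add, to_delete


# Plain tuples stand in for (subject, predicate, object, graph) quads.
previous = [("s1", "p", "o1", "g"), ("s2", "p", "o2", "g")]
current = [("s2", "p", "o2", "g"), ("s3", "p", "o3", "g")]

to_add, to_delete = diff_quads(current, previous)
```

With rdflib, the call would presumably look like `diff_quads(ds.quads(), previous_ds.quads())`, mirroring the snippet above.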

Current implementation (from the permalink above):

```python
def __sub__(self, other: Graph) -> Graph:
    """Set-theoretic difference.
    BNode IDs are not changed."""
    try:
        retval = type(self)()
    except TypeError:
        retval = Graph()
    for x in self:
        if x not in other:
            retval.add(x)
    return retval
```
