The current Dataset set difference operation is very slow. On a reasonably small dataset with around 30k statements, it takes around 30 minutes to finish processing.
Lines 878 to 888 in 91c8cbc
    def __sub__(self, other: Graph) -> Graph:
        """Set-theoretic difference.
        BNode IDs are not changed."""
        try:
            retval = type(self)()
        except TypeError:
            retval = Graph()
        for x in self:
            if x not in other:
                retval.add(x)
        return retval
By using Python's built-in set, I was able to reduce the processing time to less than 1 second.
# Convert datasets to sets of quads for faster diff operations
logger.info("Extracting quads from current dataset")
ds_quads = set(ds.quads())
logger.info(f"Quads: {len(ds_quads)}")
logger.info("Extracting quads from previous dataset")
previous_ds_quads = set(previous_ds.quads())
logger.info(f"Quads: {len(previous_ds_quads)}")
# Compute diffs using set operations
# Statements in previous but not in current = deletions
logger.info("Computing deletions (previous - current)")
to_delete_quads = previous_ds_quads - ds_quads
# Statements in current but not in previous = additions
logger.info("Computing additions (current - previous)")
to_add_quads = ds_quads - previous_ds_quads
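
For anyone who wants to try the same idea without the logging, here is a minimal sketch of the workaround as a reusable helper. The name `diff_quads` is hypothetical and not part of rdflib, and the sample data uses plain string tuples as stand-ins for quads so the snippet runs on its own. With rdflib, the two arguments would presumably be `previous_ds.quads()` and `ds.quads()`, as in the snippet above.

```python
def diff_quads(previous, current):
    """Return (to_delete, to_add) for two iterables of hashable quads.

    Hypothetical helper, not an rdflib API. Each input is materialized
    into a set once, so every membership check is a hash lookup.
    """
    previous_set = set(previous)
    current_set = set(current)
    # In previous but not in current = deletions
    to_delete = previous_set - current_set
    # In current but not in previous = additions
    to_add = current_set - previous_set
    return to_delete, to_add


# Plain tuples stand in for (subject, predicate, object, graph) quads.
previous = [
    ("ex:a", "ex:p", "1", "ex:g1"),
    ("ex:b", "ex:p", "2", "ex:g1"),
]
current = [
    ("ex:b", "ex:p", "2", "ex:g1"),
    ("ex:c", "ex:p", "3", "ex:g2"),
]

to_delete, to_add = diff_quads(previous, current)
```

Like the existing `__sub__`, this compares statements as they are, so blank node IDs are not changed or matched up in any way. The trade-off is that both quad sets are held in memory at once.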