-
Notifications
You must be signed in to change notification settings - Fork 587
Description
I was trying out pyLODE's supermodel profile and noticed it was extremely slow. After some debugging, I found the issue which can be demonstrated by this script:
from time import perf_counter
from rdflib import Dataset, Graph
file = "https://www.w3.org/ns/shacl.ttl"
source_data = Graph().parse(file)
test_graph = Graph()
start = perf_counter()
for s, p, o in source_data:
test_graph.add((s, p, o))
base_dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to graph in {base_dur:.3f} seconds")
test_ds = Dataset()
start = perf_counter()
for s, p, o in source_data:
test_ds.add((s, p, o, source_data.identifier))
dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to dataset by identifier in {dur:.3f} seconds ({dur / base_dur:.2f}x the first case)")
assert source_data in test_ds.graphs()
test_ds = Dataset()
start = perf_counter()
for s, p, o in source_data:
test_ds.add((s, p, o, source_data))
dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to dataset by graph in {dur:.3f} seconds ({dur / base_dur:.2f}x the first case)")
assert source_data in test_ds.graphs()I get the following output from this script:
Added 1128 triples to graph in 0.012 seconds
Added 1128 triples to dataset by identifier in 0.017 seconds (1.37x the first case)
Added 1128 triples to dataset by graph in 10.057 seconds (827.36x the first case)
When Dataset.add is called where the context is a Graph, it takes orders of magnitude longer to add the triples to the dataset.
pyLODE was so slow because it did that third case here: https://github.com/RDFLib/pyLODE/blob/eb7a1d680ae08df1da13d14e9ecea8b7a7b670e8/pylode/profiles/supermodel/dataset.py#L27
This slowdown was introduced at this line and commit:
Line 2296 in 1b77686
| _graph.__iadd__(c) |
Whenever Dataset.add (really ConjunctiveGraph.add) is called where the context is a Graph, before the triple gets added to the dataset, the entire content of that context graph is getting copied into the graph created by ConjunctiveGraph._graph, which is clearly unnecessary in this context. I'm not exactly sure why that above change was made.
Also note that the type annotation of ConjunctiveGraph.add is _TripleOrOptionalQuadType, which types the context element of the quad as Optional[_ContextType], and _ContextType is equivalent to Graph. However, whenever I use Dataset.add I always use an IdentifiedNode which doesn't suffer from this issue and works as expected (as shown in the above example).
Given that many of the functions mentioned here aren't fully documented and that the type annotations are potentially not quite right, I'm not quite sure what should be considered the actual issue. Presumably it is at least one of the following:
- The type annotation of
ConjunctiveGraph.addis incorrect and pyLODE should be loading the Dataset differently (i.e. pyLODE should not be providing aGraphas the context, but rather theGraph.identifier) ConjunctiveGraph._graphshould not be called fromConjunctiveGraph.add(viaConjunctiveGraph._spoc)ConjunctiveGraph._graphshould not be copying the full contents of the providedc
I'd be curious to know what others think about this and if/what changes are necessary.