Skip to content

Significant performance issue when adding triples to a Dataset by a Graph object #3392

@mgberg

Description

@mgberg

I was trying out pyLODE's supermodel profile and noticed it was extremely slow. After some debugging, I found the issue which can be demonstrated by this script:

from time import perf_counter

from rdflib import Dataset, Graph

file = "https://www.w3.org/ns/shacl.ttl"

source_data = Graph().parse(file)

test_graph = Graph()
start = perf_counter()
for s, p, o in source_data:
    test_graph.add((s, p, o))
base_dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to graph in {base_dur:.3f} seconds")

test_ds = Dataset()
start = perf_counter()
for s, p, o in source_data:
    test_ds.add((s, p, o, source_data.identifier))
dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to dataset by identifier in {dur:.3f} seconds ({dur / base_dur:.2f}x the first case)")
assert source_data in test_ds.graphs()

test_ds = Dataset()
start = perf_counter()
for s, p, o in source_data:
    test_ds.add((s, p, o, source_data))
dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to dataset by graph in {dur:.3f} seconds ({dur / base_dur:.2f}x the first case)")
assert source_data in test_ds.graphs()

I get the following output from this script:

Added 1128 triples to graph in 0.012 seconds
Added 1128 triples to dataset by identifier in 0.017 seconds (1.37x the first case)
Added 1128 triples to dataset by graph in 10.057 seconds (827.36x the first case)

When Dataset.add is called where the context is a Graph, it takes orders of magnitude longer to add the triples to the dataset.

pyLODE was so slow because it did that third case here: https://github.com/RDFLib/pyLODE/blob/eb7a1d680ae08df1da13d14e9ecea8b7a7b670e8/pylode/profiles/supermodel/dataset.py#L27

This slowdown was introduced at this line and commit:

_graph.__iadd__(c)

Whenever Dataset.add (really ConjunctiveGraph.add) is called where the context is a Graph, before the triple gets added to the dataset, the entire content of that context graph is getting copied into the graph created by ConjunctiveGraph._graph, which is clearly unnecessary in this context. I'm not exactly sure why that above change was made.

Also note that the type annotation of ConjunctiveGraph.add is _TripleOrOptionalQuadType, which types the context element of the quad as Optional[_ContextType], and _ContextType is equivalent to Graph. However, whenever I use Dataset.add I always use an IdentifiedNode which doesn't suffer from this issue and works as expected (as shown in the above example).

Given that many of the functions mentioned here aren't fully documented and that the type annotations are potentially not quite right, I'm not quite sure what should be considered the actual issue. Presumably it is at least one of the following:

  • The type annotation of ConjunctiveGraph.add is incorrect and pyLODE should be loading the Dataset differently (i.e. pyLODE should not be providing a Graph as the context, but rather the Graph.identifier)
  • ConjunctiveGraph._graph should not be called from ConjunctiveGraph.add (via ConjunctiveGraph._spoc)
  • ConjunctiveGraph._graph should not be copying the full contents of the provided c

I'd be curious to know what others think about this and if/what changes are necessary.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions