Significant performance issue when adding triples to a Dataset by a Graph object

I was trying out pyLODE's supermodel profile and noticed it was _extremely_ slow. After some debugging, I found the issue which can be demonstrated by this script:

```python
from time import perf_counter

from rdflib import Dataset, Graph

file = "https://www.w3.org/ns/shacl.ttl"

source_data = Graph().parse(file)

test_graph = Graph()
start = perf_counter()
for s, p, o in source_data:
    test_graph.add((s, p, o))
base_dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to graph in {base_dur:.3f} seconds")

test_ds = Dataset()
start = perf_counter()
for s, p, o in source_data:
    test_ds.add((s, p, o, source_data.identifier))
dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to dataset by identifier in {dur:.3f} seconds ({dur / base_dur:.2f}x the first case)")
assert source_data in test_ds.graphs()

test_ds = Dataset()
start = perf_counter()
for s, p, o in source_data:
    test_ds.add((s, p, o, source_data))
dur = perf_counter() - start
print(f"Added {len(test_graph)} triples to dataset by graph in {dur:.3f} seconds ({dur / base_dur:.2f}x the first case)")
assert source_data in test_ds.graphs()
```
I get the following output from this script:
```
Added 1128 triples to graph in 0.012 seconds
Added 1128 triples to dataset by identifier in 0.017 seconds (1.37x the first case)
Added 1128 triples to dataset by graph in 10.057 seconds (827.36x the first case)
```
When `Dataset.add` is called where the context is a `Graph`, it takes _orders of magnitude_ longer to add the triples to the dataset.

pyLODE was so slow because it did that third case here: https://github.com/RDFLib/pyLODE/blob/eb7a1d680ae08df1da13d14e9ecea8b7a7b670e8/pylode/profiles/supermodel/dataset.py#L27

This slowdown was introduced at this line and commit: https://github.com/RDFLib/rdflib/blob/1b77686fca4a0281bd14d23df2616149f31f5017/rdflib/graph.py#L2296

Whenever `Dataset.add` (really `ConjunctiveGraph.add`) is called where the context is a `Graph`, before the triple gets added to the dataset, _the entire content of that context graph is getting copied_ into the graph created by `ConjunctiveGraph._graph`, which is clearly unnecessary in this context. I'm not exactly sure why that above change was made.

Also note that the type annotation of `ConjunctiveGraph.add` is `_TripleOrOptionalQuadType`, which types the context element of the quad as `Optional[_ContextType]`, and `_ContextType` is equivalent to `Graph`. However, whenever I use `Dataset.add` I always use an `IdentifiedNode` which doesn't suffer from this issue and works as expected (as shown in the above example).

Given that many of the functions mentioned here aren't fully documented and that the type annotations are potentially not quite right, I'm not quite sure what should be considered the actual issue. Presumably it is at least one of the following:
- The type annotation of `ConjunctiveGraph.add` is incorrect and pyLODE should be loading the Dataset differently (i.e. pyLODE should not be providing a `Graph` as the context, but rather the `Graph.identifier`)
- `ConjunctiveGraph._graph` should not be called from `ConjunctiveGraph.add` (via `ConjunctiveGraph._spoc`)
- `ConjunctiveGraph._graph` should not be copying the full contents of the provided `c`

I'd be curious to know what others think about this and if/what changes are necessary.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Significant performance issue when adding triples to a Dataset by a Graph object #3392

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Significant performance issue when adding triples to a Dataset by a Graph object #3392

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions