Commit cf0daed

Changed schema for Value -> Term, majorly breaking change (#622)

* Changed schema for Value -> Term, majorly breaking change
* Following the schema change, Value -> Term into all processing
* Updated Cassandra for g, p, s, o index patterns (7 indexes)
* Reviewed and updated all tests
* Neo4j, Memgraph and FalkorDB remain broken, will look at once settled down

1 parent e061f2c commit cf0daed

86 files changed: +2447 -1753 lines changed

Makefile

Lines changed: 2 additions & 2 deletions
@@ -70,8 +70,8 @@ some-containers:
 -t ${CONTAINER_BASE}/trustgraph-base:${VERSION} .
 ${DOCKER} build -f containers/Containerfile.flow \
 -t ${CONTAINER_BASE}/trustgraph-flow:${VERSION} .
-${DOCKER} build -f containers/Containerfile.vertexai \
--t ${CONTAINER_BASE}/trustgraph-vertexai:${VERSION} .
+# ${DOCKER} build -f containers/Containerfile.vertexai \
+# -t ${CONTAINER_BASE}/trustgraph-vertexai:${VERSION} .
 # ${DOCKER} build -f containers/Containerfile.mcp \
 # -t ${CONTAINER_BASE}/trustgraph-mcp:${VERSION} .
 # ${DOCKER} build -f containers/Containerfile.vertexai \

docs/tech-specs/graph-contexts.md

Lines changed: 79 additions & 28 deletions
@@ -228,10 +228,15 @@ Following SPARQL conventions for backward compatibility:
 
 - **`g` omitted / None**: Query the default graph only
 - **`g` = specific IRI**: Query that named graph only
-- **`g` = wildcard / `*`**: Query across all graphs
+- **`g` = wildcard / `*`**: Query across all graphs (equivalent to SPARQL
+  `GRAPH ?g { ... }`)
 
 This keeps simple queries simple and makes named graph queries opt-in.
 
+Cross-graph queries (g=wildcard) are fully supported. The Cassandra schema
+includes dedicated tables (SPOG, POSG, OSPG) where g is a clustering column
+rather than a partition key, enabling efficient queries across all graphs.
+
 #### Temporal Queries
 
 **Find all facts discovered after a given date:**
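
The three `g` cases listed in this hunk amount to a small dispatch rule. The following is a minimal sketch of that rule, not code from the commit: `GraphScope`, `resolve_graph`, and the `WILDCARD` constant are hypothetical names, and the wildcard is assumed to be spelled `"*"`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed spelling of the wildcard, taken from the "wildcard / `*`" bullet.
WILDCARD = "*"


@dataclass(frozen=True)
class GraphScope:
    """Hypothetical description of which graphs a query should touch."""
    kind: str                  # "default", "named", or "all"
    iri: Optional[str] = None  # set only when kind == "named"


def resolve_graph(g: Optional[str]) -> GraphScope:
    """Map the g argument onto the three cases listed in the spec."""
    if g is None:
        # g omitted / None: default graph only
        return GraphScope("default")
    if g == WILDCARD:
        # g = wildcard: query across all graphs
        return GraphScope("all")
    # g = specific IRI: that named graph only
    return GraphScope("named", g)
```

Keeping `None` distinct from the wildcard is what makes named-graph queries opt-in: a caller who never passes `g` only ever sees the default graph.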
@@ -388,12 +393,78 @@ will proceed in phases:
 Cassandra requires multiple tables to support different query access patterns
 (each table efficiently queries by its partition key + clustering columns).
 
-**Challenge: Quads**
+##### Query Patterns
+
+With quads (g, s, p, o), each position can be specified or wildcard, giving
+16 possible query patterns:
+
+| # | g | s | p | o | Description |
+|---|---|---|---|---|-------------|
+| 1 | ? | ? | ? | ? | All quads |
+| 2 | ? | ? | ? | o | By object |
+| 3 | ? | ? | p | ? | By predicate |
+| 4 | ? | ? | p | o | By predicate + object |
+| 5 | ? | s | ? | ? | By subject |
+| 6 | ? | s | ? | o | By subject + object |
+| 7 | ? | s | p | ? | By subject + predicate |
+| 8 | ? | s | p | o | Full triple (which graphs?) |
+| 9 | g | ? | ? | ? | By graph |
+| 10 | g | ? | ? | o | By graph + object |
+| 11 | g | ? | p | ? | By graph + predicate |
+| 12 | g | ? | p | o | By graph + predicate + object |
+| 13 | g | s | ? | ? | By graph + subject |
+| 14 | g | s | ? | o | By graph + subject + object |
+| 15 | g | s | p | ? | By graph + subject + predicate |
+| 16 | g | s | p | o | Exact quad |
+
+##### Table Design
+
+Cassandra constraint: You can only efficiently query by partition key, then
+filter on clustering columns left-to-right. For g-wildcard queries, g must be
+a clustering column. For g-specified queries, g in the partition key is more
+efficient.
+
+**Two table families needed:**
+
+**Family A: g-wildcard queries** (g in clustering columns)
+
+| Table | Partition | Clustering | Supports patterns |
+|-------|-----------|------------|-------------------|
+| SPOG | (user, collection, s) | p, o, g | 5, 7, 8 |
+| POSG | (user, collection, p) | o, s, g | 3, 4 |
+| OSPG | (user, collection, o) | s, p, g | 2, 6 |
+
+**Family B: g-specified queries** (g in partition key)
+
+| Table | Partition | Clustering | Supports patterns |
+|-------|-----------|------------|-------------------|
+| GSPO | (user, collection, g, s) | p, o | 9, 13, 15, 16 |
+| GPOS | (user, collection, g, p) | o, s | 11, 12 |
+| GOSP | (user, collection, g, o) | s, p | 10, 14 |
+
+**Collection table** (for iteration and bulk deletion)
 
-For triples, typical indexes are SPO, POS, OSP (partition by first, cluster by
-rest). For quads, the graph dimension adds: SPOG, POSG, OSPG, GSPO, etc.
+| Table | Partition | Clustering | Purpose |
+|-------|-----------|------------|---------|
+| COLL | (user, collection) | g, s, p, o | Enumerate all quads in collection |
 
-**Challenge: Quoted Triples**
+##### Write and Delete Paths
+
+**Write path**: Insert into all 7 tables.
+
+**Delete collection path**:
+1. Iterate COLL table for `(user, collection)`
+2. For each quad, delete from all 6 query tables
+3. Delete from COLL table (or range delete)
+
+**Delete single quad path**: Delete from all 7 tables directly.
+
+##### Storage Cost
+
+Each quad is stored 7 times. This is the cost of flexible querying combined
+with efficient collection deletion.
+
+##### Quoted Triples in Storage
 
 Subject or object can be a triple itself. Options:
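
The "Supports patterns" columns in this hunk can be read as a lookup from the set of bound positions to a table. The sketch below transcribes those columns as written (pattern 9 is mapped to GSPO because the spec lists it there). It makes two assumptions that are not in the commit: pattern 1 is served by COLL, based only on that table's stated purpose, and `None` stands for the `?` wildcard of the pattern table rather than the "`g` omitted / default graph" case. `TABLE_FOR_BOUND` and `choose_table` are hypothetical names.

```python
from typing import Optional

# Bound positions (in g, s, p, o order) -> table, transcribed from the
# "Supports patterns" columns. The "" entry (pattern 1) is an assumption:
# the spec only says COLL enumerates all quads in a collection.
TABLE_FOR_BOUND = {
    "": "COLL",      # 1
    "o": "OSPG",     # 2
    "p": "POSG",     # 3
    "po": "POSG",    # 4
    "s": "SPOG",     # 5
    "so": "OSPG",    # 6
    "sp": "SPOG",    # 7
    "spo": "SPOG",   # 8
    "g": "GSPO",     # 9 (as listed in the spec)
    "go": "GOSP",    # 10
    "gp": "GPOS",    # 11
    "gpo": "GPOS",   # 12
    "gs": "GSPO",    # 13
    "gso": "GOSP",   # 14
    "gsp": "GSPO",   # 15
    "gspo": "GSPO",  # 16
}


def choose_table(
    g: Optional[str] = None,
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> str:
    """Return the table name for a pattern; None means the position is '?'."""
    positions = (("g", g), ("s", s), ("p", p), ("o", o))
    key = "".join(name for name, value in positions if value is not None)
    return TABLE_FOR_BOUND[key]
```

For example, `choose_table(s="http://example.com/s", o="x")` selects OSPG (pattern 6), and adding `g` to the same call selects GOSP (pattern 14).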

@@ -425,29 +496,9 @@ Metadata table:
 - Pro: Clean separation, can index triple IDs
 - Con: Requires computing/managing triple identity, two-phase lookups
 
-**Option C: Hybrid**
-- Store quads normally with serialized quoted triple strings for simple cases
-- Maintain a separate triple ID lookup for advanced queries
-- Pro: Flexibility
-- Con: Complexity
-
-**Recommendation**: TBD after prototyping. Option A is simplest for initial
-implementation; Option B may be needed for advanced query patterns.
-
-#### Indexing Strategy
-
-Indexes must support the defined query patterns:
-
-| Query Type | Access Pattern | Index Needed |
-|------------|----------------|--------------|
-| Facts by date | P=discoveredOn, O>date | POG (predicate, object, graph) |
-| Facts by source | P=supportedBy, O=source | POG |
-| Facts by asserter | P=assertedBy, O=person | POG |
-| Metadata for a fact | S=quotedTriple | SPO/SPOG |
-| All facts in graph | G=graphIRI | GSPO |
-
-For temporal range queries (dates), Cassandra clustering column ordering
-enables efficient scans when date is a clustering column.
+**Recommendation**: Start with Option A (serialized strings) for simplicity.
+Option B may be needed if advanced query patterns over quoted triple
+components are required.
 
 2. **Phase 2+: Other Backends**
 - Neo4j and other stores implemented in subsequent stages
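
The write and delete paths described in the "Write and Delete Paths" section of this file can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyQuadStore` is a hypothetical class, each "table" is a plain Python set rather than a Cassandra table, and no partitioning or clustering is modelled. It shows only which tables each path touches.

```python
# Table names taken from the spec: six query tables plus the collection table.
QUERY_TABLES = ("SPOG", "POSG", "OSPG", "GSPO", "GPOS", "GOSP")
ALL_TABLES = QUERY_TABLES + ("COLL",)


class ToyQuadStore:
    """Hypothetical in-memory stand-in for the seven tables."""

    def __init__(self):
        # table name -> set of (user, collection, g, s, p, o) rows
        self.tables = {name: set() for name in ALL_TABLES}

    def write(self, user, collection, g, s, p, o):
        """Write path: insert the quad into all 7 tables."""
        row = (user, collection, g, s, p, o)
        for name in ALL_TABLES:
            self.tables[name].add(row)

    def delete_quad(self, user, collection, g, s, p, o):
        """Delete single quad path: delete from all 7 tables directly."""
        row = (user, collection, g, s, p, o)
        for name in ALL_TABLES:
            self.tables[name].discard(row)

    def delete_collection(self, user, collection):
        """Delete collection path; returns the number of quads removed."""
        # Step 1: iterate COLL for (user, collection).
        doomed = [
            row for row in self.tables["COLL"]
            if row[:2] == (user, collection)
        ]
        # Step 2: for each quad, delete from all 6 query tables.
        for row in doomed:
            for name in QUERY_TABLES:
                self.tables[name].discard(row)
        # Step 3: delete from COLL itself.
        self.tables["COLL"].difference_update(doomed)
        return len(doomed)
```

The COLL table is what lets `delete_collection` find every quad without scanning the six query tables, which is the trade-off the spec describes as storing each quad 7 times.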

tests/contract/conftest.py

Lines changed: 19 additions & 23 deletions
@@ -15,10 +15,10 @@
     TextCompletionRequest, TextCompletionResponse,
     DocumentRagQuery, DocumentRagResponse,
     AgentRequest, AgentResponse, AgentStep,
-    Chunk, Triple, Triples, Value, Error,
+    Chunk, Triple, Triples, Term, Error,
     EntityContext, EntityContexts,
     GraphEmbeddings, EntityEmbeddings,
-    Metadata
+    Metadata, IRI, LITERAL
 )
 
 
@@ -43,7 +43,7 @@ def schema_registry():
         "Chunk": Chunk,
         "Triple": Triple,
         "Triples": Triples,
-        "Value": Value,
+        "Term": Term,
         "Error": Error,
         "EntityContext": EntityContext,
         "EntityContexts": EntityContexts,
@@ -98,26 +98,22 @@ def sample_message_data():
             "collection": "test_collection",
             "metadata": []
         },
-        "Value": {
-            "value": "http://example.com/entity",
-            "is_uri": True,
-            "type": ""
+        "Term": {
+            "type": IRI,
+            "iri": "http://example.com/entity"
         },
         "Triple": {
-            "s": Value(
-                value="http://example.com/subject",
-                is_uri=True,
-                type=""
+            "s": Term(
+                type=IRI,
+                iri="http://example.com/subject"
             ),
-            "p": Value(
-                value="http://example.com/predicate",
-                is_uri=True,
-                type=""
+            "p": Term(
+                type=IRI,
+                iri="http://example.com/predicate"
             ),
-            "o": Value(
-                value="Object value",
-                is_uri=False,
-                type=""
+            "o": Term(
+                type=LITERAL,
+                value="Object value"
             )
         }
     }
@@ -139,10 +135,10 @@ def invalid_message_data():
             {"query": "test", "user": "test", "collection": "test", "doc_limit": -1},  # Invalid doc_limit
             {"query": "test"},  # Missing required fields
         ],
-        "Value": [
-            {"value": None, "is_uri": True, "type": ""},  # Invalid value (None)
-            {"value": "test", "is_uri": "not_boolean", "type": ""},  # Invalid is_uri
-            {"value": 123, "is_uri": True, "type": ""},  # Invalid value (not string)
+        "Term": [
+            {"type": IRI, "iri": None},  # Invalid iri (None)
+            {"type": "invalid_type", "value": "test"},  # Invalid type
+            {"type": LITERAL, "value": 123},  # Invalid value (not string)
         ]
     }
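
The fixtures in this file suggest the shape of the new `Term`: a `type` discriminator plus either an `iri` or a `value` field. The sketch below is a stand-in for illustration, not the project's schema. The `IRI` and `LITERAL` values, the `Term` dataclass, and `is_valid_term` are all assumptions, since the real definitions live in the schema module and are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed placeholder values; the real IRI and LITERAL constants are imported
# from the project's schema module and their values are not shown here.
IRI = "iri"
LITERAL = "literal"


@dataclass
class Term:
    """Hypothetical stand-in mirroring the fields used in the fixtures."""
    type: str
    iri: Optional[str] = None    # used when type == IRI
    value: Optional[str] = None  # used when type == LITERAL


def is_valid_term(data: dict) -> bool:
    """Check a dict against the three invalid cases listed in the fixture."""
    kind = data.get("type")
    if kind == IRI:
        # Rejects {"type": IRI, "iri": None}
        return isinstance(data.get("iri"), str)
    if kind == LITERAL:
        # Rejects {"type": LITERAL, "value": 123}
        return isinstance(data.get("value"), str)
    # Rejects unknown discriminators such as "invalid_type"
    return False
```

Compared with the removed `Value(value, is_uri, type)` shape, the discriminator replaces the `is_uri` boolean, so an IRI and a literal no longer share a single `value` field.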
