Commit 02d4d82

Enable Structured Output for schema from text extractor (#466)
* Make NodeType stricter and add support for structured output in schema
* Refactor GraphSchema.patterns and run for SchemaFromTextExtractor
* Change prompt to match new GraphSchema.patterns
* Add example file
* Update pattern in graph pruning and related example
* Fix issues to satisfy min_length=1 for properties and new pattern type
* Fix mypy issues
* Fix unit tests
* Add supports_structured_output to LLM interfaces that have the feature enabled
* Update changelog
* Update docs
1 parent 57f80e6 commit 02d4d82

16 files changed: +797 -153 lines changed

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -10,12 +10,17 @@
 - Support for version 6.0.0 of the Neo4j Python driver
 - Support for structured output in `OpenAILLM` and `VertexAILLM` via `response_format` parameter. Accepts Pydantic models (requires `ConfigDict(extra="forbid")`) or JSON schemas.
 - Added `use_structured_output` parameter to `LLMEntityRelationExtractor` for improved entity extraction reliability with OpenAI/VertexAI LLMs.
+- Added `use_structured_output` parameter to `SchemaFromTextExtractor` for improved schema generation with OpenAI/VertexAI LLMs. Enforces `GraphSchema` structure via Pydantic model validation and includes automatic cleanup of invalid patterns/constraints.
+- Added `supports_structured_output` capability flag to `LLMInterface` for forward-compatible detection of structured output support across LLM implementations.
 
 ### Changed
 
 - Switched project/dependency management from Poetry to uv.
 - Dropped support for Python 3.9 (EOL)
 - Made `Neo4jNode`, `Neo4jRelationship`, and `Neo4jGraph` stricter: properties field now uses typed `PropertyValue` (Neo4j primitives, temporal values, lists, `GeoPoint`) and fixed mutable defaults with `Field(default_factory=...)`.
+- **Breaking**: `NodeType.properties` now requires at least one property (`min_length=1`). String-based node definitions (e.g., `NodeType("Person")`) automatically receive a default "name" property with `additional_properties=True`.
+- **Breaking**: `RelationshipType` with empty properties and `additional_properties=False` is now auto-corrected to `additional_properties=True` to prevent pruning of LLM-extracted properties.
+- Introduced `Pattern` Pydantic model for internal storage of graph patterns, replacing tuple format. Public APIs maintain backward compatibility by accepting both tuples and `Pattern` objects.
 
 ## 1.11.0
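
The changelog entry on `Pattern` says that patterns are stored as model objects internally while public APIs still accept tuples. One way such a normalization step could look is sketched below. It uses a frozen dataclass as a stand-in for the Pydantic model so that it runs with the standard library alone; `PatternSketch` and `normalize_patterns` are hypothetical names for illustration, not the library's code.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


@dataclass(frozen=True)
class PatternSketch:
    """Hypothetical stand-in for the library's Pattern model."""

    source: str
    relationship: str
    target: str


PatternInput = Union[PatternSketch, Tuple[str, str, str]]


def normalize_patterns(patterns: Iterable[PatternInput]) -> Tuple[PatternSketch, ...]:
    """Convert legacy 3-tuples to PatternSketch and pass existing objects through."""
    normalized = []
    for item in patterns:
        if isinstance(item, PatternSketch):
            normalized.append(item)
        elif isinstance(item, tuple) and len(item) == 3:
            normalized.append(PatternSketch(*item))
        else:
            raise ValueError(f"Invalid pattern: {item!r}")
    return tuple(normalized)


# Mixed input: one legacy tuple and one object, as a caller might supply.
mixed = [
    ("Person", "KNOWS", "Person"),
    PatternSketch(source="Person", relationship="WORKS_FOR", target="Organization"),
]
result = normalize_patterns(mixed)
```

Normalizing once at the boundary means downstream code can rely on attribute access (`p.source`) instead of positional indexing.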

docs/source/types.rst

Lines changed: 5 additions & 0 deletions
@@ -95,6 +95,11 @@ RelationshipType
 
 .. autoclass:: neo4j_graphrag.experimental.components.schema.RelationshipType
 
+Pattern
+=======
+
+.. autoclass:: neo4j_graphrag.experimental.components.schema.Pattern
+
 GraphSchema
 ===========

docs/source/user_guide_kg_builder.rst

Lines changed: 44 additions & 1 deletion
@@ -867,6 +867,49 @@ You can also save and reload the extracted schema:
     restored_schema = GraphSchema.from_file("my_schema.json")  # or my_schema.yaml
 
 
+Using Structured Output with Schema Extraction
+-----------------------------------------------
+
+For improved reliability with :ref:`OpenAILLM <openaillm>` or :ref:`VertexAILLM <vertexaillm>`, enable structured output mode. When ``use_structured_output=True``, the extractor passes the ``GraphSchema`` Pydantic model as ``response_format`` to the LLM, ensuring responses conform to the expected schema structure with automatic validation:
+
+.. code:: python
+
+    from neo4j_graphrag.experimental.components.schema import SchemaFromTextExtractor
+    from neo4j_graphrag.llm import OpenAILLM
+
+    llm = OpenAILLM(model_name="gpt-4o-mini", model_params={"temperature": 0})
+    schema_extractor = SchemaFromTextExtractor(
+        llm=llm,
+        use_structured_output=True
+    )
+
+    extracted_schema = await schema_extractor.run(text="Some text")
+
+.. note::
+
+    Using ``use_structured_output=True`` with other LLM providers will raise a ``ValueError``. Do not pass ``response_format`` in constructor parameters; the extractor automatically sets it when calling ``invoke()``.
+
+
+Schema Validation and Node Properties
+--------------------------------------
+
+**Important:** All node types must have at least one property defined. When using string shorthand for node types (e.g., ``"Person"``), a default ``"name"`` property is automatically added with ``additional_properties=True`` to allow flexible LLM extraction:
+
+.. code:: python
+
+    # String shorthand - automatically gets default property
+    NodeType("Person")  # Becomes: properties=[{"name": "name", "type": "STRING"}], additional_properties=True
+
+    # Explicit definition - must include at least one property
+    NodeType(
+        label="Person",
+        properties=[PropertyType(name="name", type="STRING")],
+        additional_properties=True  # Allow LLM to extract additional properties
+    )
+
+**Relationship types** with no properties automatically set ``additional_properties=True`` to preserve LLM-extracted properties during graph construction.
+
+
 Schema Visualization
 --------------------

@@ -949,7 +992,7 @@ For improved reliability and type safety with :ref:`OpenAILLM <openaillm>` or :r
 
 .. note::
 
-    Using `use_structured_output=True` with other LLM providers will raise a `ValueError`. Do not pass `response_format` in constructor parameters (`model_params` or `generation_config`); the extractor automatically sets it when calling `invoke()`.
+    Structured output is only available for LLMs with ``supports_structured_output=True`` (currently :ref:`OpenAILLM <openaillm>` and :ref:`VertexAILLM <vertexaillm>`). Using ``use_structured_output=True`` with other providers will raise a ``ValueError``. Do not pass ``response_format`` in constructor parameters (``model_params`` or ``generation_config``); the extractor automatically sets it when calling ``invoke()``.
 
 
 Error Behaviour

examples/customize/build_graph/components/pruners/graph_pruner.py

Lines changed: 3 additions & 2 deletions
@@ -6,6 +6,7 @@
 from neo4j_graphrag.experimental.components.schema import (
     GraphSchema,
     NodeType,
+    Pattern,
     PropertyType,
     RelationshipType,
 )
@@ -94,8 +95,8 @@
         ),
     ),
     patterns=(
-        ("Person", "KNOWS", "Person"),
-        ("Person", "WORKS_FOR", "Organization"),
+        Pattern(source="Person", relationship="KNOWS", target="Person"),
+        Pattern(source="Person", relationship="WORKS_FOR", target="Organization"),
     ),
     additional_node_types=False,
     additional_relationship_types=False,
Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
+"""
+Simple example demonstrating structured output with SchemaFromTextExtractor.
+
+This example shows how to use structured output for more reliable schema extraction
+with automatic validation against the GraphSchema Pydantic model.
+
+The GraphSchema is now compatible with both OpenAI and VertexAI structured output APIs,
+with strict validation and proper field definitions. With structured output enabled:
+- Uses LLMInterfaceV2 (list of messages)
+- Passes GraphSchema Pydantic model as response_format to ainvoke()
+- Ensures response conforms to expected schema structure
+- Provides automatic type validation
+- Reduces need for JSON repair and error handling
+- Enforces min_length=1 on node properties (nodes must have at least one property)
+
+Prerequisites:
+- Google Cloud credentials configured for VertexAI
+- Or OpenAI API key set in OPENAI_API_KEY environment variable
+"""
+
+import asyncio
+from dotenv import load_dotenv
+
+from neo4j_graphrag.experimental.components.schema import (
+    SchemaFromTextExtractor,
+    GraphSchema,
+)
+from neo4j_graphrag.llm import OpenAILLM
+
+
+# Sample text to extract schema from
+SAMPLE_TEXT = """
+Acme Corporation was founded in 1985 by John Smith in New York City.
+The company specializes in manufacturing high-quality widgets and gadgets
+for the consumer electronics industry.
+
+Sarah Johnson joined Acme in 2010 as a Senior Engineer and was promoted to
+Engineering Director in 2015. She oversees a team of 12 engineers working on
+next-generation products. Sarah holds a PhD in Electrical Engineering from MIT
+and has filed 5 patents during her time at Acme.
+
+The company expanded to international markets in 2012, opening offices in London,
+Tokyo, and Berlin. Each office is managed by a regional director who reports
+directly to the CEO, Michael Brown, who took over leadership in 2008.
+
+Acme's most successful product, the SuperWidget X1, was launched in 2018 and
+has sold over 2 million units worldwide. The product was developed by a team led
+by Robert Chen, who joined the company in 2016 after working at TechGiant for 8 years.
+"""
+
+
+def print_schema_summary(schema: GraphSchema, title: str) -> None:
+    """Print a formatted summary of the extracted schema."""
+    print(f"\n{'='*60}")
+    print(f"{title}")
+    print(f"{'='*60}")
+
+    print(f"\nNode Types ({len(schema.node_types)}):")
+    for node in schema.node_types:
+        props = [f"{p.name} ({p.type})" for p in node.properties]
+        print(f" - {node.label}")
+        if props:
+            print(f" Properties: {', '.join(props)}")
+        if node.description:
+            print(f" Description: {node.description}")
+
+    if schema.relationship_types:
+        print(f"\nRelationship Types ({len(schema.relationship_types)}):")
+        for rel in schema.relationship_types:
+            props = [f"{p.name} ({p.type})" for p in rel.properties]
+            print(f" - {rel.label}")
+            if props:
+                print(f" Properties: {', '.join(props)}")
+
+    if schema.patterns:
+        print(f"\nPatterns ({len(schema.patterns)}):")
+        for source, relationship, target in schema.patterns:
+            print(f" {source} --[{relationship}]--> {target}")
+
+    if schema.constraints:
+        print(f"\nConstraints ({len(schema.constraints)}):")
+        for constraint in schema.constraints:
+            print(
+                f" - {constraint.type} on {constraint.node_type}.{constraint.property_name}"
+            )
+
+
+async def test_v1_without_structured_output() -> GraphSchema:
+    """
+    Test V1 approach (default): Prompt-based JSON extraction with manual cleanup.
+
+    With use_structured_output=False (default):
+    - Uses LLMInterface V1 (plain string prompts)
+    - LLM returns JSON string that needs parsing and cleanup
+    - Extensive filtering and validation applied manually
+    - More forgiving of LLM errors
+    - Works with all LLM providers
+    """
+    print("\n" + "=" * 60)
+    print("Testing V1: Prompt-based JSON extraction (default)")
+    print("=" * 60)
+
+    # Initialize LLM with response_format for JSON mode (V1 approach)
+    llm = OpenAILLM(
+        model_name="gpt-4o-mini",
+        model_params={
+            "temperature": 0,
+            "response_format": {"type": "json_object"},  # JSON mode for V1
+        },
+    )
+
+    # For VertexAI V1, use:
+    # llm = VertexAILLM(
+    #     model_name="gemini-2.5-flash",
+    #     model_params={"temperature": 0}
+    # )
+
+    # Create extractor WITHOUT structured output (V1 default)
+    extractor = SchemaFromTextExtractor(
+        llm=llm,
+        use_structured_output=False,  # Default, can be omitted
+    )
+
+    # Extract schema
+    schema = await extractor.run(text=SAMPLE_TEXT)
+
+    print_schema_summary(schema, "V1 Result (Prompt-based)")
+
+    return schema
+
+
+async def test_v2_with_structured_output() -> GraphSchema:
+    """
+    Test V2 approach: Structured output with GraphSchema validation.
+
+    With use_structured_output=True:
+    - Uses LLMInterfaceV2 (list of messages)
+    - Passes GraphSchema as response_format to ainvoke()
+    - LLM returns properly structured data conforming to GraphSchema
+    - Automatic validation via Pydantic
+    - Less manual cleanup needed
+    - Only works with OpenAI and VertexAI
+    - Enforces min_length=1 on node properties
+    """
+    print("\n" + "=" * 60)
+    print("Testing V2: Structured output with GraphSchema")
+    print("=" * 60)
+
+    # Initialize LLM - NO response_format in constructor for V2!
+    llm = OpenAILLM(model_name="gpt-4o-mini", model_params={"temperature": 0})
+
+    # For VertexAI V2, use:
+    # llm = VertexAILLM(
+    #     model_name="gemini-2.5-flash",
+    #     model_params={"temperature": 0}
+    # )
+
+    # Create extractor WITH structured output (V2)
+    extractor = SchemaFromTextExtractor(
+        llm=llm,
+        use_structured_output=True,  # This is the key parameter!
+    )
+
+    # Extract schema
+    schema = await extractor.run(text=SAMPLE_TEXT)
+
+    print_schema_summary(schema, "V2 Result (Structured Output)")
+
+    return schema
+
+
+async def compare_approaches() -> None:
+    """Run both approaches and compare results."""
+    load_dotenv()
+
+    # Test V1 (default)
+    schema_v1 = await test_v1_without_structured_output()
+
+    # Test V2 (structured output)
+    schema_v2 = await test_v2_with_structured_output()
+
+    # Comparison
+    print("\n" + "=" * 60)
+    print("COMPARISON")
+    print("=" * 60)
+    print("V1 (Prompt-based):")
+    print(f" - Node types: {len(schema_v1.node_types)}")
+    print(f" - Relationship types: {len(schema_v1.relationship_types)}")
+    print(f" - Patterns: {len(schema_v1.patterns)}")
+    print(
+        f" - Total properties: {sum(len(n.properties) for n in schema_v1.node_types)}"
+    )
+
+    print("\nV2 (Structured Output):")
+    print(f" - Node types: {len(schema_v2.node_types)}")
+    print(f" - Relationship types: {len(schema_v2.relationship_types)}")
+    print(f" - Patterns: {len(schema_v2.patterns)}")
+    print(
+        f" - Total properties: {sum(len(n.properties) for n in schema_v2.node_types)}"
+    )
+
+
+if __name__ == "__main__":
+    # Run comparison between V1 and V2
+    asyncio.run(compare_approaches())

src/neo4j_graphrag/experimental/components/entity_relation_extractor.py

Lines changed: 7 additions & 9 deletions
@@ -37,8 +37,6 @@
 from neo4j_graphrag.experimental.pipeline.exceptions import InvalidJSONError
 from neo4j_graphrag.generation.prompts import ERExtractionTemplate, PromptTemplate
 from neo4j_graphrag.llm import LLMInterface
-from neo4j_graphrag.llm.openai_llm import OpenAILLM
-from neo4j_graphrag.llm.vertexai_llm import VertexAILLM
 from neo4j_graphrag.types import LLMMessage
 from neo4j_graphrag.utils.logging import prettify
 
@@ -217,10 +215,10 @@ def __init__(
         self.use_structured_output = use_structured_output
 
         # Validate that structured output is only used with supported LLMs
-        if use_structured_output and not isinstance(llm, (OpenAILLM, VertexAILLM)):
+        if use_structured_output and not llm.supports_structured_output:
             raise ValueError(
-                f"use_structured_output=True is only supported for OpenAILLM and VertexAILLM. "
-                f"Got {type(llm).__name__}."
+                f"Structured output is not supported by {type(llm).__name__}. "
+                f"Please use a model that supports structured output, or set use_structured_output=False."
             )
 
         if isinstance(prompt_template, str):
@@ -241,15 +239,15 @@ async def extract_for_chunk(
 
         # Use structured output (V2) if enabled
         if self.use_structured_output:
-            # Type narrowing with isinstance check
+            # Capability check
             # This should never happen due to __init__ validation
-            if not isinstance(self.llm, (OpenAILLM, VertexAILLM)):
+            if not self.llm.supports_structured_output:
                 raise RuntimeError(
-                    f"Structured output requires OpenAILLM or VertexAILLM, got {type(self.llm).__name__}"
+                    f"Structured output is not supported by {type(self.llm).__name__}"
                 )
 
             messages = [LLMMessage(role="user", content=prompt)]
-            llm_result = await self.llm.ainvoke(messages, response_format=Neo4jGraph)
+            llm_result = await self.llm.ainvoke(messages, response_format=Neo4jGraph)  # type: ignore[call-arg, arg-type]
             try:
                 chunk_graph = Neo4jGraph.model_validate_json(llm_result.content)
             except ValidationError as e:
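
The change in this file replaces `isinstance` checks against concrete LLM classes with a capability flag. The pattern can be shown in isolation with a small, self-contained sketch. The class names below (`BaseLLMSketch`, `StructuredLLMSketch`, `PlainLLMSketch`, `ExtractorSketch`) are hypothetical stand-ins, and only the `supports_structured_output` attribute and the error message come from the diff.

```python
class BaseLLMSketch:
    """Hypothetical base interface: structured output is off by default."""

    supports_structured_output: bool = False


class StructuredLLMSketch(BaseLLMSketch):
    """A provider that opts in by overriding the class attribute."""

    supports_structured_output = True


class PlainLLMSketch(BaseLLMSketch):
    """A provider that keeps the default (no structured output)."""


class ExtractorSketch:
    """Hypothetical component that validates the capability at construction."""

    def __init__(self, llm: BaseLLMSketch, use_structured_output: bool = False) -> None:
        # The check depends on the flag, not on the concrete class, so new
        # providers only need to set the attribute to become eligible.
        if use_structured_output and not llm.supports_structured_output:
            raise ValueError(
                f"Structured output is not supported by {type(llm).__name__}. "
                "Please use a model that supports structured output, "
                "or set use_structured_output=False."
            )
        self.llm = llm
        self.use_structured_output = use_structured_output


ok = ExtractorSketch(StructuredLLMSketch(), use_structured_output=True)

try:
    ExtractorSketch(PlainLLMSketch(), use_structured_output=True)
    rejected_message = ""
except ValueError as exc:
    rejected_message = str(exc)
```

A side effect visible in the diff is that the extractor module no longer imports `OpenAILLM` and `VertexAILLM`, so it does not depend on provider-specific modules.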

src/neo4j_graphrag/experimental/components/graph_pruning.py

Lines changed: 2 additions & 1 deletion
@@ -21,6 +21,7 @@
 
 from neo4j_graphrag.experimental.components.schema import (
     GraphSchema,
+    Pattern,
     PropertyType,
     NodeType,
     RelationshipType,
@@ -272,7 +273,7 @@ def _validate_relationship(
         pruning_stats: PruningStats,
         relationship_type: Optional[RelationshipType],
         additional_relationship_types: bool,
-        patterns: tuple[tuple[str, str, str], ...],
+        patterns: tuple[Pattern, ...],
         additional_patterns: bool,
     ) -> Optional[Neo4jRelationship]:
         if not rel.type:
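
The signature change above means pruning code receives `Pattern` objects rather than 3-tuples. The sketch below shows how a pattern check might read with attribute access. It is not the body of `_validate_relationship`, which this diff does not show: `PatternSketch` and `relationship_allowed` are hypothetical, and treating an empty pattern list as "no pattern constraints" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PatternSketch:
    """Hypothetical stand-in for the library's Pattern model."""

    source: str
    relationship: str
    target: str


def relationship_allowed(
    start_label: str,
    rel_type: str,
    end_label: str,
    patterns: Tuple[PatternSketch, ...],
    additional_patterns: bool,
) -> bool:
    """Return True if the (start, type, end) triple may be kept."""
    # Assumption: no patterns declared, or extra patterns allowed, means the
    # relationship is not restricted by patterns at all.
    if not patterns or additional_patterns:
        return True
    return any(
        p.source == start_label
        and p.relationship == rel_type
        and p.target == end_label
        for p in patterns
    )


patterns = (
    PatternSketch("Person", "KNOWS", "Person"),
    PatternSketch("Person", "WORKS_FOR", "Organization"),
)
kept = relationship_allowed("Person", "WORKS_FOR", "Organization", patterns, False)
pruned = relationship_allowed("Organization", "KNOWS", "Person", patterns, False)
```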
