diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 5 additions & 0 deletions
diff --git a/docs/source/types.rst b/docs/source/types.rst
Lines changed: 5 additions & 0 deletions
diff --git a/docs/source/user_guide_kg_builder.rst b/docs/source/user_guide_kg_builder.rst
Lines changed: 44 additions & 1 deletion
diff --git a/examples/customize/build_graph/components/pruners/graph_pruner.py b/examples/customize/build_graph/components/pruners/graph_pruner.py
Lines changed: 3 additions & 2 deletions
diff --git a/examples/customize/build_graph/components/schema_builders/schema_from_text_with_structured_output.py b/examples/customize/build_graph/components/schema_builders/schema_from_text_with_structured_output.py
Lines changed: 205 additions & 0 deletions
diff --git a/src/neo4j_graphrag/experimental/components/entity_relation_extractor.py b/src/neo4j_graphrag/experimental/components/entity_relation_extractor.py
Lines changed: 7 additions & 9 deletions
diff --git a/src/neo4j_graphrag/experimental/components/graph_pruning.py b/src/neo4j_graphrag/experimental/components/graph_pruning.py
Lines changed: 2 additions & 1 deletion
@@ -10,12 +10,17 @@
 - Support for version 6.0.0 of the Neo4j Python driver
 - Support for structured output in `OpenAILLM` and `VertexAILLM` via `response_format` parameter. Accepts Pydantic models (requires `ConfigDict(extra="forbid")`) or JSON schemas.
 - Added `use_structured_output` parameter to `LLMEntityRelationExtractor` for improved entity extraction reliability with OpenAI/VertexAI LLMs.
+- Added `use_structured_output` parameter to `SchemaFromTextExtractor` for improved schema generation with OpenAI/VertexAI LLMs. Enforces `GraphSchema` structure via Pydantic model validation and includes automatic cleanup of invalid patterns/constraints.
+- Added `supports_structured_output` capability flag to `LLMInterface` for forward-compatible detection of structured output support across LLM implementations.
 
 ### Changed
 
 - Switched project/dependency management from Poetry to uv.
 - Dropped support for Python 3.9 (EOL)
 - Made `Neo4jNode`, `Neo4jRelationship`, and `Neo4jGraph` stricter: properties field now uses typed `PropertyValue` (Neo4j primitives, temporal values, lists, `GeoPoint`) and fixed mutable defaults with `Field(default_factory=...)`.
+- **Breaking**: `NodeType.properties` now requires at least one property (`min_length=1`). String-based node definitions (e.g., `NodeType("Person")`) automatically receive a default "name" property with `additional_properties=True`.
+- **Breaking**: `RelationshipType` with empty properties and `additional_properties=False` is now auto-corrected to `additional_properties=True` to prevent pruning of LLM-extracted properties.
+- Introduced `Pattern` Pydantic model for internal storage of graph patterns, replacing tuple format. Public APIs maintain backward compatibility by accepting both tuples and `Pattern` objects.
 
 ## 1.11.0
 
 
@@ -95,6 +95,11 @@ RelationshipType
 
 .. autoclass:: neo4j_graphrag.experimental.components.schema.RelationshipType
 
+Pattern
+=======
+
+.. autoclass:: neo4j_graphrag.experimental.components.schema.Pattern
+
 GraphSchema
 ===========
 
 
@@ -867,6 +867,49 @@ You can also save and reload the extracted schema:
     restored_schema = GraphSchema.from_file("my_schema.json")  # or my_schema.yaml
 
 
+Using Structured Output with Schema Extraction
+-----------------------------------------------
+
+For improved reliability with :ref:`OpenAILLM <openaillm>` or :ref:`VertexAILLM <vertexaillm>`, enable structured output mode. When ``use_structured_output=True``, the extractor passes the ``GraphSchema`` Pydantic model as ``response_format`` to the LLM, ensuring responses conform to the expected schema structure with automatic validation:
+
+.. code:: python
+
+    from neo4j_graphrag.experimental.components.schema import SchemaFromTextExtractor
+    from neo4j_graphrag.llm import OpenAILLM
+
+    llm = OpenAILLM(model_name="gpt-4o-mini", model_params={"temperature": 0})
+    schema_extractor = SchemaFromTextExtractor(
+        llm=llm,
+        use_structured_output=True
+    )
+    
+    extracted_schema = await schema_extractor.run(text="Some text")
+
+.. note::
+
+    Using ``use_structured_output=True`` with other LLM providers will raise a ``ValueError``. Do not pass ``response_format`` in constructor parameters; the extractor automatically sets it when calling ``invoke()``.
+
+
+Schema Validation and Node Properties
+--------------------------------------
+
+**Important:** All node types must have at least one property defined. When using string shorthand for node types (e.g., ``"Person"``), a default ``"name"`` property is automatically added with ``additional_properties=True`` to allow flexible LLM extraction:
+
+.. code:: python
+
+    # String shorthand - automatically gets default property
+    NodeType("Person")  # Becomes: properties=[{"name": "name", "type": "STRING"}], additional_properties=True
+    
+    # Explicit definition - must include at least one property
+    NodeType(
+        label="Person",
+        properties=[PropertyType(name="name", type="STRING")],
+        additional_properties=True  # Allow LLM to extract additional properties
+    )
+
+**Relationship types** with no properties and ``additional_properties=False`` are auto-corrected to ``additional_properties=True``, so that LLM-extracted properties are not pruned during graph construction.
+
+
 Schema Visualization
 --------------------
 
@@ -949,7 +992,7 @@ For improved reliability and type safety with :ref:`OpenAILLM <openaillm>` or :r
 
 .. note::
 
-    Using `use_structured_output=True` with other LLM providers will raise a `ValueError`. Do not pass `response_format` in constructor parameters (`model_params` or `generation_config`); the extractor automatically sets it when calling `invoke()`.
+    Structured output is only available for LLMs with ``supports_structured_output=True`` (currently :ref:`OpenAILLM <openaillm>` and :ref:`VertexAILLM <vertexaillm>`). Using ``use_structured_output=True`` with other providers will raise a ``ValueError``. Do not pass ``response_format`` in constructor parameters (``model_params`` or ``generation_config``); the extractor automatically sets it when calling ``invoke()``.
 
 
 Error Behaviour
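
The defaulting rules described in the "Schema Validation and Node Properties" section of this hunk can be sketched without the library. This is a minimal illustration under stated assumptions, not the code in `schema.py`: `PropertySketch`, `NodeTypeSketch`, `RelationshipTypeSketch`, `node_type_from_input` and `autocorrect_relationship` are hypothetical names invented for this example, and plain dataclasses stand in for the Pydantic models.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class PropertySketch:
    """Hypothetical stand-in for PropertyType."""

    name: str
    type: str


@dataclass
class NodeTypeSketch:
    """Hypothetical stand-in for NodeType."""

    label: str
    properties: List[PropertySketch]
    additional_properties: bool


@dataclass
class RelationshipTypeSketch:
    """Hypothetical stand-in for RelationshipType."""

    label: str
    properties: List[PropertySketch]
    additional_properties: bool


def node_type_from_input(value: Union[str, NodeTypeSketch]) -> NodeTypeSketch:
    """Expand string shorthand and enforce at least one property."""
    if isinstance(value, str):
        # String shorthand: default "name" property, extra properties allowed.
        return NodeTypeSketch(
            label=value,
            properties=[PropertySketch(name="name", type="STRING")],
            additional_properties=True,
        )
    if not value.properties:
        raise ValueError(f"Node type {value.label!r} needs at least one property")
    return value


def autocorrect_relationship(rel: RelationshipTypeSketch) -> RelationshipTypeSketch:
    """Allow extra properties when none are declared and extras are disallowed."""
    if not rel.properties and not rel.additional_properties:
        return RelationshipTypeSketch(rel.label, [], True)
    return rel
```

With these stand-ins, `node_type_from_input("Person")` yields a node type with a single `name` property, and a relationship declared with no properties and `additional_properties=False` comes back with the flag set to `True`.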
 
@@ -6,6 +6,7 @@
 from neo4j_graphrag.experimental.components.schema import (
     GraphSchema,
     NodeType,
+    Pattern,
     PropertyType,
     RelationshipType,
 )
@@ -94,8 +95,8 @@
         ),
     ),
     patterns=(
-        ("Person", "KNOWS", "Person"),
-        ("Person", "WORKS_FOR", "Organization"),
+        Pattern(source="Person", relationship="KNOWS", target="Person"),
+        Pattern(source="Person", relationship="WORKS_FOR", target="Organization"),
     ),
     additional_node_types=False,
     additional_relationship_types=False,
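
The changelog states that public APIs keep accepting both tuples and `Pattern` objects. One way such backward compatibility could work is sketched below, assuming a normalization step at the API boundary. `PatternSketch` and `normalize_patterns` are hypothetical names for this illustration, and a frozen dataclass stands in for the Pydantic model; the library's actual mechanism may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


@dataclass(frozen=True)
class PatternSketch:
    """Hypothetical stand-in for the library's Pattern model."""

    source: str
    relationship: str
    target: str


PatternLike = Union[PatternSketch, Tuple[str, str, str]]


def normalize_patterns(patterns: Iterable[PatternLike]) -> Tuple[PatternSketch, ...]:
    """Convert legacy 3-tuples to PatternSketch and pass objects through."""
    normalized = []
    for item in patterns:
        if isinstance(item, PatternSketch):
            normalized.append(item)
        elif isinstance(item, tuple) and len(item) == 3:
            # Legacy format: (source, relationship, target)
            normalized.append(PatternSketch(*item))
        else:
            raise ValueError(f"Not a valid pattern: {item!r}")
    return tuple(normalized)
```

Callers can then mix both styles, for example `normalize_patterns([("Person", "KNOWS", "Person"), PatternSketch("Person", "WORKS_FOR", "Organization")])`, and downstream code only ever sees the object form.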
 
@@ -0,0 +1,205 @@
+"""
+Simple example demonstrating structured output with SchemaFromTextExtractor.
+
+This example shows how to use structured output for more reliable schema extraction
+with automatic validation against the GraphSchema Pydantic model.
+
+The GraphSchema is now compatible with both OpenAI and VertexAI structured output APIs,
+with strict validation and proper field definitions. With structured output enabled:
+- Uses LLMInterfaceV2 (list of messages)
+- Passes GraphSchema Pydantic model as response_format to ainvoke()
+- Ensures response conforms to expected schema structure
+- Provides automatic type validation
+- Reduces need for JSON repair and error handling
+- Enforces min_length=1 on node properties (nodes must have at least one property)
+
+Prerequisites:
+- Google Cloud credentials configured for VertexAI
+- Or OpenAI API key set in OPENAI_API_KEY environment variable
+"""
+
+import asyncio
+from dotenv import load_dotenv
+
+from neo4j_graphrag.experimental.components.schema import (
+    SchemaFromTextExtractor,
+    GraphSchema,
+)
+from neo4j_graphrag.llm import OpenAILLM
+
+
+# Sample text to extract schema from
+SAMPLE_TEXT = """
+Acme Corporation was founded in 1985 by John Smith in New York City.
+The company specializes in manufacturing high-quality widgets and gadgets
+for the consumer electronics industry.
+
+Sarah Johnson joined Acme in 2010 as a Senior Engineer and was promoted to
+Engineering Director in 2015. She oversees a team of 12 engineers working on
+next-generation products. Sarah holds a PhD in Electrical Engineering from MIT
+and has filed 5 patents during her time at Acme.
+
+The company expanded to international markets in 2012, opening offices in London,
+Tokyo, and Berlin. Each office is managed by a regional director who reports
+directly to the CEO, Michael Brown, who took over leadership in 2008.
+
+Acme's most successful product, the SuperWidget X1, was launched in 2018 and
+has sold over 2 million units worldwide. The product was developed by a team led
+by Robert Chen, who joined the company in 2016 after working at TechGiant for 8 years.
+"""
+
+
+def print_schema_summary(schema: GraphSchema, title: str) -> None:
+    """Print a formatted summary of the extracted schema."""
+    print(f"\n{'='*60}")
+    print(f"{title}")
+    print(f"{'='*60}")
+
+    print(f"\nNode Types ({len(schema.node_types)}):")
+    for node in schema.node_types:
+        props = [f"{p.name} ({p.type})" for p in node.properties]
+        print(f"  - {node.label}")
+        if props:
+            print(f"    Properties: {', '.join(props)}")
+        if node.description:
+            print(f"    Description: {node.description}")
+
+    if schema.relationship_types:
+        print(f"\nRelationship Types ({len(schema.relationship_types)}):")
+        for rel in schema.relationship_types:
+            props = [f"{p.name} ({p.type})" for p in rel.properties]
+            print(f"  - {rel.label}")
+            if props:
+                print(f"    Properties: {', '.join(props)}")
+
+    if schema.patterns:
+        print(f"\nPatterns ({len(schema.patterns)}):")
+        for p in schema.patterns:
+            print(f"  {p.source} --[{p.relationship}]--> {p.target}")
+
+    if schema.constraints:
+        print(f"\nConstraints ({len(schema.constraints)}):")
+        for constraint in schema.constraints:
+            print(
+                f"  - {constraint.type} on {constraint.node_type}.{constraint.property_name}"
+            )
+
+
+async def test_v1_without_structured_output() -> GraphSchema:
+    """
+    Test V1 approach (default): Prompt-based JSON extraction with manual cleanup.
+
+    With use_structured_output=False (default):
+    - Uses LLMInterface V1 (plain string prompts)
+    - LLM returns JSON string that needs parsing and cleanup
+    - Extensive filtering and validation applied manually
+    - More forgiving of LLM errors
+    - Works with all LLM providers
+    """
+    print("\n" + "=" * 60)
+    print("Testing V1: Prompt-based JSON extraction (default)")
+    print("=" * 60)
+
+    # Initialize LLM with response_format for JSON mode (V1 approach)
+    llm = OpenAILLM(
+        model_name="gpt-4o-mini",
+        model_params={
+            "temperature": 0,
+            "response_format": {"type": "json_object"},  # JSON mode for V1
+        },
+    )
+
+    # For VertexAI V1, import VertexAILLM from neo4j_graphrag.llm and use:
+    # llm = VertexAILLM(
+    #     model_name="gemini-2.5-flash",
+    #     model_params={"temperature": 0}
+    # )
+
+    # Create extractor WITHOUT structured output (V1 default)
+    extractor = SchemaFromTextExtractor(
+        llm=llm,
+        use_structured_output=False,  # Default, can be omitted
+    )
+
+    # Extract schema
+    schema = await extractor.run(text=SAMPLE_TEXT)
+
+    print_schema_summary(schema, "V1 Result (Prompt-based)")
+
+    return schema
+
+
+async def test_v2_with_structured_output() -> GraphSchema:
+    """
+    Test V2 approach: Structured output with GraphSchema validation.
+
+    With use_structured_output=True:
+    - Uses LLMInterfaceV2 (list of messages)
+    - Passes GraphSchema as response_format to ainvoke()
+    - LLM returns properly structured data conforming to GraphSchema
+    - Automatic validation via Pydantic
+    - Less manual cleanup needed
+    - Only works with OpenAI and VertexAI
+    - Enforces min_length=1 on node properties
+    """
+    print("\n" + "=" * 60)
+    print("Testing V2: Structured output with GraphSchema")
+    print("=" * 60)
+
+    # Initialize LLM - NO response_format in constructor for V2!
+    llm = OpenAILLM(model_name="gpt-4o-mini", model_params={"temperature": 0})
+
+    # For VertexAI V2, import VertexAILLM from neo4j_graphrag.llm and use:
+    # llm = VertexAILLM(
+    #     model_name="gemini-2.5-flash",
+    #     model_params={"temperature": 0}
+    # )
+
+    # Create extractor WITH structured output (V2)
+    extractor = SchemaFromTextExtractor(
+        llm=llm,
+        use_structured_output=True,  # This is the key parameter!
+    )
+
+    # Extract schema
+    schema = await extractor.run(text=SAMPLE_TEXT)
+
+    print_schema_summary(schema, "V2 Result (Structured Output)")
+
+    return schema
+
+
+async def compare_approaches() -> None:
+    """Run both approaches and compare results."""
+    load_dotenv()
+
+    # Test V1 (default)
+    schema_v1 = await test_v1_without_structured_output()
+
+    # Test V2 (structured output)
+    schema_v2 = await test_v2_with_structured_output()
+
+    # Comparison
+    print("\n" + "=" * 60)
+    print("COMPARISON")
+    print("=" * 60)
+    print("V1 (Prompt-based):")
+    print(f"  - Node types: {len(schema_v1.node_types)}")
+    print(f"  - Relationship types: {len(schema_v1.relationship_types)}")
+    print(f"  - Patterns: {len(schema_v1.patterns)}")
+    print(
+        f"  - Total properties: {sum(len(n.properties) for n in schema_v1.node_types)}"
+    )
+
+    print("\nV2 (Structured Output):")
+    print(f"  - Node types: {len(schema_v2.node_types)}")
+    print(f"  - Relationship types: {len(schema_v2.relationship_types)}")
+    print(f"  - Patterns: {len(schema_v2.patterns)}")
+    print(
+        f"  - Total properties: {sum(len(n.properties) for n in schema_v2.node_types)}"
+    )
+
+
+if __name__ == "__main__":
+    # Run comparison between V1 and V2
+    asyncio.run(compare_approaches())
@@ -37,8 +37,6 @@
 from neo4j_graphrag.experimental.pipeline.exceptions import InvalidJSONError
 from neo4j_graphrag.generation.prompts import ERExtractionTemplate, PromptTemplate
 from neo4j_graphrag.llm import LLMInterface
-from neo4j_graphrag.llm.openai_llm import OpenAILLM
-from neo4j_graphrag.llm.vertexai_llm import VertexAILLM
 from neo4j_graphrag.types import LLMMessage
 from neo4j_graphrag.utils.logging import prettify
 
@@ -217,10 +215,10 @@ def __init__(
         self.use_structured_output = use_structured_output
 
         # Validate that structured output is only used with supported LLMs
-        if use_structured_output and not isinstance(llm, (OpenAILLM, VertexAILLM)):
+        if use_structured_output and not llm.supports_structured_output:
             raise ValueError(
-                f"use_structured_output=True is only supported for OpenAILLM and VertexAILLM. "
-                f"Got {type(llm).__name__}."
+                f"Structured output is not supported by {type(llm).__name__}. "
+                f"Please use a model that supports structured output, or set use_structured_output=False."
             )
 
         if isinstance(prompt_template, str):
@@ -241,15 +239,15 @@ async def extract_for_chunk(
 
         # Use structured output (V2) if enabled
         if self.use_structured_output:
-            # Type narrowing with isinstance check
+            # Capability check
             # This should never happen due to __init__ validation
-            if not isinstance(self.llm, (OpenAILLM, VertexAILLM)):
+            if not self.llm.supports_structured_output:
                 raise RuntimeError(
-                    f"Structured output requires OpenAILLM or VertexAILLM, got {type(self.llm).__name__}"
+                    f"Structured output is not supported by {type(self.llm).__name__}"
                 )
 
             messages = [LLMMessage(role="user", content=prompt)]
-            llm_result = await self.llm.ainvoke(messages, response_format=Neo4jGraph)
+            llm_result = await self.llm.ainvoke(messages, response_format=Neo4jGraph)  # type: ignore[call-arg, arg-type]
             try:
                 chunk_graph = Neo4jGraph.model_validate_json(llm_result.content)
             except ValidationError as e:
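
The switch from `isinstance` checks to a capability flag in this hunk can be illustrated in isolation. The sketch below assumes the flag is a plain class attribute that defaults to `False`; `BaseLLM`, `StructuredLLM`, `PlainLLM` and `check_structured_output` are hypothetical names for this example and are not the library's classes.

```python
class BaseLLM:
    """Hypothetical stand-in for LLMInterface; not the library class."""

    # Providers opt in by overriding this class attribute (an assumption here).
    supports_structured_output: bool = False


class StructuredLLM(BaseLLM):
    """Hypothetical provider that advertises structured output."""

    supports_structured_output = True


class PlainLLM(BaseLLM):
    """Hypothetical provider without structured output."""


def check_structured_output(llm: BaseLLM, use_structured_output: bool) -> None:
    """Mirror the constructor check: fail fast when the capability is missing."""
    if use_structured_output and not llm.supports_structured_output:
        raise ValueError(
            f"Structured output is not supported by {type(llm).__name__}. "
            "Please use a model that supports structured output, "
            "or set use_structured_output=False."
        )
```

The design benefit is that the extractor no longer imports concrete provider classes: a new provider only has to set the flag, and the check keeps working without changes to the extractor.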
 
@@ -21,6 +21,7 @@
 
 from neo4j_graphrag.experimental.components.schema import (
     GraphSchema,
+    Pattern,
     PropertyType,
     NodeType,
     RelationshipType,
@@ -272,7 +273,7 @@ def _validate_relationship(
         pruning_stats: PruningStats,
         relationship_type: Optional[RelationshipType],
         additional_relationship_types: bool,
-        patterns: tuple[tuple[str, str, str], ...],
+        patterns: tuple[Pattern, ...],
         additional_patterns: bool,
     ) -> Optional[Neo4jRelationship]:
         if not rel.type: