feat(neo4j): add apoc_sample parameter for large database schema introspection #20859
Open
eureka928 wants to merge 11 commits into run-llama:main
Parameterize apoc.meta.data() queries with $config to support the sample parameter for large Neo4j databases. This allows users to control the sampling size used by APOC meta procedures.
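The parameterization described in this commit can be sketched roughly as follows. This is an illustration, not the PR's code: the `YIELD`/`RETURN` clause and the helper name `schema_call` are assumptions, and only `apoc.meta.data()`, `$config`, and the `sample` key come from the description.

```python
from typing import Any, Dict, Tuple

# Illustrative query text only; the store's real schema queries are longer.
# The point is the `$config` placeholder replacing the empty argument list.
SCHEMA_QUERY = (
    "CALL apoc.meta.data($config) "
    "YIELD label, property, type "
    "RETURN label, property, type"
)


def schema_call(sample: int) -> Tuple[str, Dict[str, Any]]:
    """Return the query text and the parameters a driver would bind to it.

    `schema_call` is a hypothetical helper, not part of llama-index.
    """
    return SCHEMA_QUERY, {"config": {"sample": sample}}


query, params = schema_call(1000)
print(query)
print(params)
```

Binding the map as a query parameter keeps the query text constant, so the sample size can change without rebuilding the Cypher string.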
Same change as Neo4jGraphStore: parameterize apoc.meta.data() queries with $config and add apoc_sample parameter to __init__. The config is merged into existing param_map dicts that already contain EXCLUDED_LABELS.
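A minimal sketch of the merge described above, assuming the existing parameter map is a plain dict keyed by `EXCLUDED_LABELS`. The helper `merge_config` and the label values are hypothetical; the source does not show how the PR performs the merge.

```python
from typing import Any, Dict, List

# Illustrative values; the real constant lives in the neo4j graph store package.
EXCLUDED_LABELS: List[str] = ["_Bloom_Perspective_", "_Bloom_Scene_"]


def merge_config(param_map: Dict[str, Any], config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `param_map` with the APOC config added under "config".

    Hypothetical helper used only to illustrate the merge.
    """
    return {**param_map, "config": config}


existing = {"EXCLUDED_LABELS": EXCLUDED_LABELS}
merged = merge_config(existing, {"sample": 1000})
print(merged)
```

Building a new dict rather than mutating the existing one means the label filter and the sampling config travel together without one call site affecting another.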
Tests cover both Neo4jGraphStore and Neo4jPropertyGraphStore:
- default empty config when apoc_sample is not provided
- config storage when apoc_sample is set
- config passed to all apoc.meta.data queries in refresh_schema
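The third check above could look something like the sketch below. It does not reproduce the PR's tests: `refresh_schema`, `SCHEMA_QUERIES`, and the `param_map` keyword are hypothetical stand-ins, and a `MagicMock` replaces the database call so that no Neo4j instance is needed.

```python
from typing import Any, Callable, Dict, List
from unittest.mock import MagicMock

# Hypothetical stand-ins for the schema queries a store might issue.
SCHEMA_QUERIES: List[str] = [
    "CALL apoc.meta.data($config) YIELD label, property RETURN label, property",
    "CALL apoc.meta.data($config) YIELD label, type RETURN label, type",
    "CALL apoc.meta.data($config) YIELD label, other RETURN label, other",
]


def refresh_schema(run_query: Callable[..., Any], config: Dict[str, Any]) -> None:
    """Issue every schema query with the same APOC config bound to $config."""
    for query in SCHEMA_QUERIES:
        run_query(query, param_map={"config": config})


def check_config_reaches_every_query() -> int:
    """Assert that each recorded call carried the config; return the call count."""
    run_query = MagicMock(return_value=[])
    refresh_schema(run_query, {"sample": 1000})
    assert run_query.call_count == len(SCHEMA_QUERIES)
    for call in run_query.call_args_list:
        assert call.kwargs["param_map"]["config"] == {"sample": 1000}
    return run_query.call_count


check_config_reaches_every_query()
```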
- Fix: apoc_sample=0 was silently ignored because 0 is falsy under `if apoc_sample`. Changed to `if apoc_sample is not None`.
- Add apoc_sample to Neo4jPropertyGraphStore docstring Args section.
- Add test for apoc_sample=0 edge case.
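The difference between the two checks can be seen in a small sketch. Both function names are hypothetical; only the two `if` conditions come from the commit message.

```python
from typing import Any, Dict, Optional


def config_truthy_check(apoc_sample: Optional[int]) -> Dict[str, Any]:
    """Buggy variant: 0 is falsy, so an explicit 0 is dropped."""
    return {"sample": apoc_sample} if apoc_sample else {}


def config_none_check(apoc_sample: Optional[int]) -> Dict[str, Any]:
    """Fixed variant: only a missing value (None) yields an empty config."""
    return {"sample": apoc_sample} if apoc_sample is not None else {}


for value in (None, 0, 1000):
    print(value, config_truthy_check(value), config_none_check(value))
```

The two variants agree for `None` and for positive integers, and differ only for `0`.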
The LLM | None and list[...] | None syntax in ChatPromptHelper method signatures requires Python 3.10+. Replace with Optional[LLM] and Optional[List[...]] to maintain Python 3.9 compatibility.
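For context, the `|` operator on types arrived in Python 3.10, so an annotation such as `LLM | None` raises `TypeError` when the function is defined on 3.9 (unless annotations are deferred with `from __future__ import annotations`). A minimal sketch of the `Optional` spelling follows; the `LLM` class and `describe_inputs` function are hypothetical stand-ins, not the `ChatPromptHelper` signatures.

```python
from typing import List, Optional


class LLM:
    """Hypothetical stand-in for the llama-index LLM class."""


def describe_inputs(
    llm: Optional[LLM] = None,
    messages: Optional[List[str]] = None,
) -> str:
    """Summarize which optional arguments were supplied.

    Written with Optional[...] and List[...] so the annotations also
    evaluate on Python 3.9.
    """
    llm_part = "no llm" if llm is None else "llm set"
    count = 0 if messages is None else len(messages)
    return f"{llm_part}, {count} messages"


print(describe_inputs())
print(describe_inputs(LLM(), ["hello", "world"]))
```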
test_legacy_json_to_doc.py used dict | None syntax which requires Python 3.10+. Replace with Optional[dict] for 3.9 compatibility.
Revert the Python 3.9 compat fixes in llama-index-core to keep this PR scoped to neo4j graph stores only. Core changes triggered the full 660-package test suite, causing pre-existing failures in unrelated packages to block this PR.
Revert the dependency constraint change in neo4j-query-engine pack to keep this PR scoped to llama-index-graph-stores-neo4j only. The pack has a pre-existing Python 3.9 CI failure from llama-index-core using LLM | None syntax. Pack maintainer can update the constraint separately.
The neo4j-query-engine pack cannot work on Python 3.9 because its transitive dependency llama-index-core uses LLM | None syntax (Python 3.10+) in prompt_helper.py. Update requires-python to >=3.10 so the 3.9 test runner correctly skips this package. Also update graph-stores-neo4j dependency constraint to allow 0.6.x.
Description
Adds an `apoc_sample` parameter to both `Neo4jGraphStore` and `Neo4jPropertyGraphStore` to control the sampling size used by `apoc.meta.data()` during schema introspection.
Fixes #18988
On large Neo4j databases, `apoc.meta.data()` can be very slow because it scans all nodes and relationships. The APOC procedure supports a `{sample: N}` config parameter to limit sampling, but llama-index was calling it without any config. This change parameterizes the queries with `$config` and exposes `apoc_sample` in the constructor.
Usage:
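A minimal sketch of the call shape, assuming the constructor takes `apoc_sample` alongside the usual connection arguments. `SketchGraphStore` and its `apoc_config` attribute are hypothetical stand-ins so the example runs without a database; the connection values are placeholders.

```python
from typing import Any, Dict, Optional


class SketchGraphStore:
    """Hypothetical stand-in that mirrors only the constructor argument added here.

    With the package installed, the real classes would be imported with
    `from llama_index.graph_stores.neo4j import Neo4jGraphStore, Neo4jPropertyGraphStore`.
    """

    def __init__(
        self,
        username: str,
        password: str,
        url: str,
        apoc_sample: Optional[int] = None,
    ) -> None:
        self.url = url
        # Empty config when apoc_sample is omitted; 0 is kept as an explicit value.
        self.apoc_config: Dict[str, Any] = (
            {"sample": apoc_sample} if apoc_sample is not None else {}
        )


# Limit APOC schema sampling to 1000 (placeholder credentials and URL).
store = SketchGraphStore(
    username="neo4j",
    password="password",
    url="bolt://localhost:7687",
    apoc_sample=1000,
)

# Omitting apoc_sample keeps the previous behavior: no sampling config is sent.
default_store = SketchGraphStore(
    username="neo4j",
    password="password",
    url="bolt://localhost:7687",
)

print(store.apoc_config, default_store.apoc_config)
```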
Changes:
- Parameterize `CALL apoc.meta.data()` queries with `$config` in both `base.py` and `neo4j_property_graph.py`
- Add `apoc_sample: Optional[int] = None` to `__init__` of both stores
- Update `refresh_schema()` to pass config to all schema queries
- Use an `is not None` check to correctly handle the `apoc_sample=0` edge case
- Document `apoc_sample` in the `Neo4jPropertyGraphStore` docstring
New Package?
Did I fill in the `tool.llamahub` section in the `pyproject.toml` and provide a detailed README.md for my new integration or package?
Version Bump?
Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package)
Type of Change
How Has This Been Tested?
8 unit tests covering both `Neo4jGraphStore` and `Neo4jPropertyGraphStore`:
- Default empty config when `apoc_sample` is not provided
- Config storage when `apoc_sample` is set
- `apoc_sample=0` edge case (not treated as falsy)
- Config passed to all `apoc.meta.data` queries in `refresh_schema()`
- `EXCLUDED_LABELS` still passed alongside config in property graph store
Suggested Checklist:
- `uv run make format; uv run make lint` to appease the lint gods