feat(neo4j): add apoc_sample parameter for large database schema introspection #20859

Open
eureka928 wants to merge 11 commits into run-llama:main from eureka928:feat/neo4j-apoc-meta-sample

@eureka928
Contributor

Description

Adds an apoc_sample parameter to both Neo4jGraphStore and Neo4jPropertyGraphStore to control the sampling size used by apoc.meta.data() during schema introspection.

Fixes #18988

On large Neo4j databases, apoc.meta.data() can be very slow because it scans all nodes and relationships. The APOC procedure supports a {sample: N} config parameter to limit sampling, but llama-index was calling it without any config. This change parameterizes the queries with $config and exposes apoc_sample in the constructor.

Usage:

from llama_index.graph_stores.neo4j import Neo4jGraphStore

# Without sampling (default, backward-compatible)
store = Neo4jGraphStore(username="neo4j", password="pass", url="bolt://localhost:7687")

# With sampling for large databases
store = Neo4jGraphStore(username="neo4j", password="pass", url="bolt://localhost:7687", apoc_sample=1000)

Changes:

  • Parameterized all CALL apoc.meta.data() queries with $config in both base.py and neo4j_property_graph.py
  • Added apoc_sample: Optional[int] = None to __init__ of both stores
  • Updated refresh_schema() to pass config to all schema queries
  • Used an `is not None` check so that apoc_sample=0 is passed through instead of being dropped as falsy
  • Added docstring for the new parameter in Neo4jPropertyGraphStore
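
A minimal sketch of the config handling described above, not the code from this PR: the helper names `build_apoc_config` and `schema_query_params` and the shortened query text are hypothetical, chosen only to show how an optional sample size could become the map bound to `$config`.

```python
from typing import Any, Dict, Optional

# Hypothetical, shortened query text; the real schema queries are longer.
SCHEMA_QUERY = (
    "CALL apoc.meta.data($config) "
    "YIELD label, property, type "
    "RETURN label, property, type"
)


def build_apoc_config(apoc_sample: Optional[int] = None) -> Dict[str, Any]:
    """Return the map bound to $config for apoc.meta.data()."""
    # `is not None` (rather than truthiness) keeps an explicit 0 in the config.
    if apoc_sample is not None:
        return {"sample": apoc_sample}
    return {}


def schema_query_params(
    apoc_sample: Optional[int] = None, **extra: Any
) -> Dict[str, Any]:
    """Merge the APOC config into parameters a query already takes."""
    return {**extra, "config": build_apoc_config(apoc_sample)}
```

With no `apoc_sample`, the query receives an empty map, which is what keeps the default behavior backward-compatible.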

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

8 unit tests covering both Neo4jGraphStore and Neo4jPropertyGraphStore:

  • Default empty config when apoc_sample is not provided
  • Config storage when apoc_sample is set
  • apoc_sample=0 edge case (not treated as falsy)
  • Config correctly passed to all 3 apoc.meta.data queries in refresh_schema()
  • EXCLUDED_LABELS still passed alongside config in property graph store

tests/test_apoc_sample.py::TestNeo4jGraphStoreApocSample::test_apoc_sample_default_empty_config PASSED
tests/test_apoc_sample.py::TestNeo4jGraphStoreApocSample::test_apoc_sample_sets_config PASSED
tests/test_apoc_sample.py::TestNeo4jGraphStoreApocSample::test_apoc_sample_zero_is_valid PASSED
tests/test_apoc_sample.py::TestNeo4jGraphStoreApocSample::test_refresh_schema_passes_config PASSED
tests/test_apoc_sample.py::TestNeo4jGraphStoreApocSample::test_refresh_schema_passes_empty_config_by_default PASSED
tests/test_apoc_sample.py::TestNeo4jPropertyGraphStoreApocSample::test_apoc_sample_default_empty_config PASSED
tests/test_apoc_sample.py::TestNeo4jPropertyGraphStoreApocSample::test_apoc_sample_sets_config PASSED
tests/test_apoc_sample.py::TestNeo4jPropertyGraphStoreApocSample::test_refresh_schema_passes_config PASSED
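
The pattern behind these tests can be sketched without a running database. The snippet below is an illustration only: `SketchStore` and `configs_seen` are hypothetical stand-ins, not the classes or helpers in tests/test_apoc_sample.py, and the query strings are placeholders.

```python
from typing import Any, Dict, List, Optional
from unittest.mock import MagicMock


class SketchStore:
    """Hypothetical stand-in for a graph store; not the llama-index class."""

    def __init__(self, apoc_sample: Optional[int] = None) -> None:
        self._apoc_config: Dict[str, Any] = (
            {"sample": apoc_sample} if apoc_sample is not None else {}
        )
        # The mock replaces the call that would hit the database.
        self.query = MagicMock(return_value=[])

    def refresh_schema(self) -> None:
        # Three placeholder introspection queries, each given the same config.
        for kind in ("node", "relationship", "pattern"):
            self.query(
                f"CALL apoc.meta.data($config) /* {kind} */",
                param_map={"config": self._apoc_config},
            )


def configs_seen(store: SketchStore) -> List[Dict[str, Any]]:
    """Collect the config passed to each mocked query call."""
    return [c.kwargs["param_map"]["config"] for c in store.query.call_args_list]
```

A test can then call `refresh_schema()` and assert that `configs_seen(store)` contains the expected config once per query.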

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Parameterize apoc.meta.data() queries with $config to support the
sample parameter for large Neo4j databases. This allows users to
control the sampling size used by APOC meta procedures.

Same change as Neo4jGraphStore: parameterize apoc.meta.data() queries
with $config and add apoc_sample parameter to __init__. The config is
merged into existing param_map dicts that already contain EXCLUDED_LABELS.

Tests cover both Neo4jGraphStore and Neo4jPropertyGraphStore:
- default empty config when apoc_sample is not provided
- config storage when apoc_sample is set
- config passed to all apoc.meta.data queries in refresh_schema

- Fix: apoc_sample=0 was silently ignored because `if apoc_sample`
  is falsy for 0. Changed to `if apoc_sample is not None`.
- Add apoc_sample to Neo4jPropertyGraphStore docstring Args section.
- Add test for apoc_sample=0 edge case.

@dosubot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Mar 3, 2026

The LLM | None and list[...] | None syntax in ChatPromptHelper method
signatures requires Python 3.10+. Replace with Optional[LLM] and
Optional[List[...]] to maintain Python 3.9 compatibility.

test_legacy_json_to_doc.py used dict | None syntax which requires
Python 3.10+. Replace with Optional[dict] for 3.9 compatibility.
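
For reference, a small sketch of the 3.9-compatible annotation style these commit messages describe. The function `first_n` is a made-up example, not code from llama-index-core; it only shows `Optional[...]` and `List[...]` from `typing` standing in for the `X | None` and `list[...] | None` forms.

```python
from typing import List, Optional


def first_n(items: Optional[List[str]] = None, n: Optional[int] = None) -> List[str]:
    """Return the first n items, or all of them when n is None."""
    # Spelled with `items: list[str] | None` this signature would need
    # Python 3.10+; the typing-based form also runs on 3.9.
    resolved = items if items is not None else []
    return resolved if n is None else resolved[:n]
```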

Revert the Python 3.9 compat fixes in llama-index-core to keep this
PR scoped to neo4j graph stores only. Core changes triggered the full
660-package test suite, causing pre-existing failures in unrelated
packages to block this PR.

Revert the dependency constraint change in neo4j-query-engine pack to
keep this PR scoped to llama-index-graph-stores-neo4j only. The pack
has a pre-existing Python 3.9 CI failure from llama-index-core using
LLM | None syntax. Pack maintainer can update the constraint separately.

The neo4j-query-engine pack cannot work on Python 3.9 because its
transitive dependency llama-index-core uses LLM | None syntax
(Python 3.10+) in prompt_helper.py. Update requires-python to >=3.10
so the 3.9 test runner correctly skips this package.

Also update graph-stores-neo4j dependency constraint to allow 0.6.x.

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Change the apoc.meta.data() calls to account for large graphs
