Commit 024b4d6

mimomenome and a-s-g93 authored
Adding a NEO4J_SCHEMA_SAMPLE_SIZE parameter to enable control of apoc.meta.schema sample size (#211)
* Add configurable sample size for schema operations

  Introduces a --sample CLI argument and NEO4J_SAMPLE environment variable to control the sample size used in APOC schema inspection queries. This allows limiting the number of nodes scanned for schema operations, improving performance on large graphs. Includes updates to config processing, server logic, and unit tests for sample precedence and validation.
* Updating readme to add in sample parameter

  Updating readme docs for sampling parameter
* replace NEO4J_SAMPLE with NEO4J_SCHEMA_SAMPLE_SIZE
* update config to use schema_sample_size key
* update `get_neo4j_schema` tool and changelog
* fix server constructor fn, update get schema docstring, test with claude
* update default values in utils, update docstring from testing with claude
* add Field to sample_size
* update get_schema docstring

---------

Co-authored-by: alex <[email protected]>
1 parent 7bf941b commit 024b4d6

6 files changed: +156 -13 lines changed

servers/mcp-neo4j-cypher/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@
55
### Changed
66

77
### Added
8+
* Add `NEO4J_SCHEMA_SAMPLE_SIZE` env variable and `schema-sample-size` cli argument to configure the `get_neo4j_schema` sample size
89
* Update write query detection to include `INSERT` in regex check
910

1011
## v0.4.1

servers/mcp-neo4j-cypher/README.md

Lines changed: 60 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -44,8 +44,10 @@ The server offers these core tools:
4444

4545
- `get_neo4j_schema`
4646
- Get a list of all node types in the graph database, their attributes (name and type), and their relationships to other node types
47-
- No input required
47+
- Input:
48+
- `sample_size` (integer, optional): Number of nodes to sample per label for schema analysis. Overrides the server default if provided.
4849
- Returns: JSON serialized list of node labels with two dictionaries: one for attributes and one for relationships
50+
- **Performance**: Uses sampling by default (1000 nodes per label). Lower the sample size for faster analysis on large databases, or set it to -1 to disable sampling and scan every node.
4951

5052
### 🏷️ Namespacing
5153

@@ -105,6 +107,62 @@ When a response exceeds the token limit, it will be automatically truncated to f
105107

106108
**Note**: Token limits only apply to `read_neo4j_cypher` responses. Schema queries and write operations return summary information and are not affected.
107109

110+
#### 🔍 Schema Sampling
111+
112+
Control the performance and scope of schema inspection by setting the sample size used by the `get_neo4j_schema` tool:
113+
114+
**Command Line:**
115+
```bash
116+
mcp-neo4j-cypher --schema-sample-size 1000 # Sample 1000 nodes per label
117+
```
118+
119+
**Environment Variable:**
120+
```bash
121+
export NEO4J_SCHEMA_SAMPLE_SIZE=1000
122+
```
123+
124+
**Docker:**
125+
```bash
126+
docker run -e NEO4J_SCHEMA_SAMPLE_SIZE=1000 mcp-neo4j-cypher:latest
127+
```
128+
129+
The schema sample size controls how many nodes are examined when generating the database schema:
130+
131+
- **Default**: `1000` nodes per label are sampled for schema analysis
132+
- **Performance**: Lower values (`100`, `500`) provide faster schema inspection on large databases
133+
- **Accuracy**: Higher values (`5000`, `10000`) provide more comprehensive schema coverage
134+
- **Full Scan**: Set to `-1` to examine all nodes (can be very slow on large databases)
135+
- **Per-Call Override**: The `get_neo4j_schema` tool accepts a `sample_size` parameter to override the server default
136+
137+
**How Sampling Works** (via [APOC's apoc.meta.schema](https://neo4j.com/docs/apoc/current/overview/apoc.meta/apoc.meta.schema/)):
138+
139+
- For each node label, a skip count is calculated: `totalNodesForLabel / sample ± 10%`
140+
- Every Nth node is examined based on the skip count
141+
- Higher sample numbers result in more nodes being examined
142+
- Results may vary between runs due to random sampling
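
Reviewer note: the skip-count formula in the list above can be illustrated with a small sketch. This is not APOC's implementation; the function name `approximate_skip_count`, the rounding, and the handling of small labels are assumptions made only to show the arithmetic `totalNodesForLabel / sample ± 10%`.

```python
import random


def approximate_skip_count(total_nodes_for_label: int, sample: int, rng: random.Random) -> int:
    """Illustrative only: total / sample, jittered by up to 10% in either direction."""
    # Assumption: when sampling is disabled (-1) or the label has fewer nodes
    # than the sample size, every node is examined (skip count of 1).
    if sample <= 0 or total_nodes_for_label <= sample:
        return 1
    base = total_nodes_for_label / sample
    jitter = rng.uniform(-0.1, 0.1) * base
    return max(1, round(base + jitter))


rng = random.Random(42)
skip = approximate_skip_count(1_000_000, 1000, rng)
# Roughly every `skip`-th node of this label would be examined.
print(f"skip count: {skip}, nodes examined: about {1_000_000 // skip}")
```

Under these assumptions, a label with one million nodes and a sample size of 1000 gives a skip count between 900 and 1100, which is why repeated runs can report slightly different schemas.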
143+
144+
**Example Scenarios:**
145+
146+
```bash
147+
# Fast schema inspection for large databases
148+
export NEO4J_SCHEMA_SAMPLE_SIZE=100
149+
150+
# Balanced performance and accuracy (default)
151+
export NEO4J_SCHEMA_SAMPLE_SIZE=1000
152+
153+
# Comprehensive schema analysis
154+
export NEO4J_SCHEMA_SAMPLE_SIZE=5000
155+
156+
# Full database scan (use with caution on large databases)
157+
export NEO4J_SCHEMA_SAMPLE_SIZE=-1
158+
```
159+
160+
**Performance Considerations:**
161+
162+
- **Large Databases**: Use lower sample values (`100-500`) to prevent timeouts
163+
- **Development**: Higher sample values (`1000-5000`) for thorough schema understanding
164+
- **Production**: Balance between performance and schema completeness based on your use case
165+
108166
## 🏗️ Local Development & Deployment
109167

110168
### 🐳 Local Docker Development
@@ -407,6 +465,7 @@ docker run --rm -p 8000:8000 \
407465
| `NEO4J_RESPONSE_TOKEN_LIMIT` | _(none)_ | Maximum tokens for read query responses |
408466
| `NEO4J_READ_TIMEOUT` | `30` | Timeout in seconds for read queries |
409467
| `NEO4J_READ_ONLY` | `false` | Allow only read-only queries (true/false) |
468+
| `NEO4J_SCHEMA_SAMPLE_SIZE` | `1000` | Number of nodes to sample for schema inspection (set to -1 for full scan) |
410469

411470
### 🌐 SSE Transport for Legacy Web Access
412471

servers/mcp-neo4j-cypher/src/mcp_neo4j_cypher/__init__.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,12 @@ def main():
4343
help="Allow only read-only queries (default: False)",
4444
)
4545
parser.add_argument("--token-limit", default=None, help="Response token limit")
46+
parser.add_argument(
47+
"--schema-sample-size",
48+
type=int,
49+
default=None,
50+
help="Default sample size for schema operations (default: 1000)",
51+
)
4652

4753
args = parser.parse_args()
4854
config = process_config(args)
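
Reviewer note: a minimal, standalone sketch of how the new flag parses, using only the `add_argument` call shown in this hunk. The `build_parser` helper is hypothetical and leaves out the server's other arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser containing only the flag added in this commit."""
    parser = argparse.ArgumentParser(prog="mcp-neo4j-cypher")
    parser.add_argument(
        "--schema-sample-size",
        type=int,
        default=None,
        help="Default sample size for schema operations (default: 1000)",
    )
    return parser


parser = build_parser()
# argparse exposes the dashed flag as the attribute `schema_sample_size`.
with_flag = parser.parse_args(["--schema-sample-size", "500"])
# Without the flag the value stays None, so process_config can fall back to the env var.
without_flag = parser.parse_args([])
# The `=` form avoids any ambiguity when passing a negative value.
full_scan = parser.parse_args(["--schema-sample-size=-1"])
print(with_flag.schema_sample_size, without_flag.schema_sample_size, full_scan.schema_sample_size)
```

Keeping `default=None` on the parser, rather than `1000`, is what lets `process_config` tell "flag not given" apart from "flag set to 1000".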

servers/mcp-neo4j-cypher/src/mcp_neo4j_cypher/server.py

Lines changed: 23 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,7 @@ def create_mcp_server(
4444
read_timeout: int = 30,
4545
token_limit: Optional[int] = None,
4646
read_only: bool = False,
47+
config_sample_size: int = 1000,
4748
) -> FastMCP:
4849
mcp: FastMCP = FastMCP(
4950
"mcp-neo4j-cypher", dependencies=["neo4j", "pydantic"], stateless_http=True
@@ -62,16 +63,26 @@ def create_mcp_server(
6263
openWorldHint=True,
6364
),
6465
)
65-
async def get_neo4j_schema() -> list[ToolResult]:
66-
"""
67-
List all nodes, their attributes and their relationships to other nodes in the neo4j database.
68-
This requires that the APOC plugin is installed and enabled.
66+
async def get_neo4j_schema(sample_size: int = Field(default=config_sample_size, description="The sample size used to infer the graph schema. Larger samples are slower, but more accurate. Smaller samples are faster, but might miss information.")) -> list[ToolResult]:
6967
"""
68+
Returns nodes, their properties (with types and indexed flags), and relationships
69+
using APOC's schema inspection.
70+
71+
You should only provide a `sample_size` value if the user requests it or when tuning retrieval performance.
7072
71-
get_schema_query = """
72-
CALL apoc.meta.schema();
73+
Performance Notes:
74+
- If `sample_size` is not provided, uses the server's default sample setting defined in the server configuration.
75+
- If retrieving the schema times out, try lowering the sample size, e.g. `sample_size=100`.
76+
- To sample the entire graph use `sample_size=-1`.
7377
"""
7478

79+
# Use the provided sample_size; otherwise fall back to the server default (config_sample_size, 1000 unless configured)
80+
effective_sample_size = sample_size if sample_size else config_sample_size
81+
82+
logger.info(f"Running `get_neo4j_schema` with sample size {effective_sample_size}.")
83+
84+
get_schema_query = f"CALL apoc.meta.schema({{sample: {effective_sample_size}}}) YIELD value RETURN value"
85+
7586
def clean_schema(schema: dict) -> dict:
7687
cleaned = {}
7788

@@ -132,16 +143,16 @@ def clean_schema(schema: dict) -> dict:
132143
return cleaned
133144

134145
try:
135-
results_json_str = await neo4j_driver.execute_query(
146+
results_json = await neo4j_driver.execute_query(
136147
get_schema_query,
137148
routing_control=RoutingControl.READ,
138149
database_=database,
139150
result_transformer_=lambda r: r.data(),
140151
)
152+
153+
logger.debug(f"Read query returned {len(results_json)} rows")
141154

142-
logger.debug(f"Read query returned {len(results_json_str)} rows")
143-
144-
schema_clean = clean_schema(results_json_str[0].get("value"))
155+
schema_clean = clean_schema(results_json[0].get("value"))
145156

146157
schema_clean_str = json.dumps(schema_clean, default=str)
147158

@@ -275,6 +286,7 @@ async def main(
275286
read_timeout: int = 30,
276287
token_limit: Optional[int] = None,
277288
read_only: bool = False,
289+
schema_sample_size: Optional[int] = None, # this is known as the config_sample_size in the create_mcp_server function
278290
) -> None:
279291
logger.info("Starting MCP neo4j Server")
280292

@@ -296,7 +308,7 @@ async def main(
296308
]
297309

298310
mcp = create_mcp_server(
299-
neo4j_driver, database, namespace, read_timeout, token_limit, read_only
311+
neo4j_driver, database, namespace, read_timeout, token_limit, read_only, schema_sample_size
300312
)
301313

302314
# Run the server with the specified transport
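
Reviewer note: the commit interpolates the sample size into the Cypher string with an f-string and uses a truthiness check for the fallback. A possible alternative, sketched below, passes the value as a query parameter and falls back only on `None`. The helper `build_schema_query` and the parameter name `$sample_size` are hypothetical and not part of this commit.

```python
from typing import Any, Optional

# Query text with a placeholder instead of an interpolated value.
SCHEMA_QUERY = "CALL apoc.meta.schema({sample: $sample_size}) YIELD value RETURN value"


def build_schema_query(
    sample_size: Optional[int], config_sample_size: int = 1000
) -> tuple[str, dict[str, Any]]:
    """Return the query text and its parameters (hypothetical helper)."""
    # `is None` keeps an explicit 0 or -1 instead of treating falsy values as "missing".
    effective = config_sample_size if sample_size is None else sample_size
    return SCHEMA_QUERY, {"sample_size": effective}


query, params = build_schema_query(None)
print(query, params)
```

With the Neo4j Python driver, the returned dictionary could be passed through the `parameters_` argument of `execute_query`. Whether that fits this server's call site is an assumption that has not been tested here.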

servers/mcp-neo4j-cypher/src/mcp_neo4j_cypher/utils.py

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -296,6 +296,30 @@ def process_config(args: argparse.Namespace) -> dict[str, Union[str, int, None]]
296296
)
297297
config["read_only"] = False
298298

299+
# parse schema sample size
300+
if args.schema_sample_size is not None:
301+
config["schema_sample_size"] = args.schema_sample_size
302+
logger.info(
303+
f"Info: Default sample size set to {config['schema_sample_size']} via command line argument."
304+
)
305+
else:
306+
if os.getenv("NEO4J_SCHEMA_SAMPLE_SIZE") is not None:
307+
try:
308+
config["schema_sample_size"] = int(os.getenv("NEO4J_SCHEMA_SAMPLE_SIZE"))
309+
logger.info(
310+
f"Info: Default sample size set to {config['schema_sample_size']} via environment variable."
311+
)
312+
except ValueError:
313+
logger.warning(
314+
"Warning: Invalid sample size provided in NEO4J_SCHEMA_SAMPLE_SIZE environment variable. No default sample will be used."
315+
)
316+
config["schema_sample_size"] = 1000
317+
else:
318+
logger.info(
319+
"Info: No default sample size provided. Using the default sample size of 1000."
320+
)
321+
config["schema_sample_size"] = 1000
322+
299323
return config
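
Reviewer note: the precedence implemented in this hunk (CLI argument, then environment variable, then the default of 1000) can be sketched as one small function. `resolve_schema_sample_size` and `DEFAULT_SCHEMA_SAMPLE_SIZE` are hypothetical names used for illustration. The environment is passed in as a mapping so the sketch can be exercised without touching `os.environ`.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

DEFAULT_SCHEMA_SAMPLE_SIZE = 1000  # matches the fallback value used in this hunk


def resolve_schema_sample_size(cli_value: Optional[int], env: Mapping[str, str]) -> int:
    """Resolve the sample size: CLI argument first, then env var, then default."""
    if cli_value is not None:
        return cli_value
    raw = env.get("NEO4J_SCHEMA_SAMPLE_SIZE")
    if raw is None:
        return DEFAULT_SCHEMA_SAMPLE_SIZE
    try:
        return int(raw)
    except ValueError:
        # An unparsable value falls back to the default rather than failing startup.
        logger.warning("Invalid NEO4J_SCHEMA_SAMPLE_SIZE value %r; using %d.", raw, DEFAULT_SCHEMA_SAMPLE_SIZE)
        return DEFAULT_SCHEMA_SAMPLE_SIZE


print(resolve_schema_sample_size(500, {"NEO4J_SCHEMA_SAMPLE_SIZE": "2000"}))
```

In the server this would be called as `resolve_schema_sample_size(args.schema_sample_size, os.environ)`. The warning text here differs from the message in the commit and is only a placeholder.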
300324

301325

servers/mcp-neo4j-cypher/tests/unit/test_utils.py

Lines changed: 42 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@ def clean_env():
3131
"NEO4J_READ_TIMEOUT",
3232
"NEO4J_RESPONSE_TOKEN_LIMIT",
3333
"NEO4J_READ_ONLY",
34+
"NEO4J_SCHEMA_SAMPLE_SIZE",
3435
]
3536
# Store original values
3637
original_values = {}
@@ -66,6 +67,7 @@ def _create_args(**kwargs):
6667
"read_timeout": None,
6768
"token_limit": None,
6869
"read_only": None,
70+
"schema_sample_size": None,
6971
}
7072
defaults.update(kwargs)
7173
return argparse.Namespace(**defaults)
@@ -741,4 +743,43 @@ def test_read_only_defaults_and_precedence(clean_env, args_factory):
741743

742744
# When CLI flag is absent (False), env var is used
743745
os.environ["NEO4J_READ_ONLY"] = "true"
744-
assert process_config(args_factory(read_only=False))["read_only"] is True
746+
assert process_config(args_factory())["read_only"] is True
747+
748+
749+
def test_sample_cli_args(clean_env, args_factory):
750+
"""Test sample configuration via CLI arguments."""
751+
assert process_config(args_factory(schema_sample_size=1000))["schema_sample_size"] == 1000
752+
assert process_config(args_factory(schema_sample_size=500))["schema_sample_size"] == 500
753+
assert process_config(args_factory(schema_sample_size=0))["schema_sample_size"] == 0
754+
755+
756+
def test_sample_env_vars(clean_env, args_factory):
757+
"""Test sample configuration via environment variables."""
758+
os.environ["NEO4J_SCHEMA_SAMPLE_SIZE"] = "2000"
759+
assert process_config(args_factory())["schema_sample_size"] == 2000
760+
761+
os.environ["NEO4J_SCHEMA_SAMPLE_SIZE"] = "100"
762+
assert process_config(args_factory())["schema_sample_size"] == 100
763+
764+
765+
def test_sample_defaults(clean_env, args_factory):
766+
"""Test sample defaults when not provided."""
767+
assert process_config(args_factory())["schema_sample_size"] == 1000
768+
769+
770+
def test_sample_cli_overrides_env(clean_env, args_factory):
771+
"""Test that CLI arguments override environment variables for sample."""
772+
os.environ["NEO4J_SCHEMA_SAMPLE_SIZE"] = "1000"
773+
assert process_config(args_factory(schema_sample_size=500))["schema_sample_size"] == 500
774+
775+
776+
def test_sample_invalid_env_var(clean_env, args_factory, mock_logger):
777+
"""Test sample with invalid environment variable value."""
778+
os.environ["NEO4J_SCHEMA_SAMPLE_SIZE"] = "not_a_number"
779+
config = process_config(args_factory())
780+
781+
# Should fall back to the default of 1000 and log a warning
782+
assert config["schema_sample_size"] == 1000
783+
mock_logger.warning.assert_called_with(
784+
"Warning: Invalid sample size provided in NEO4J_SCHEMA_SAMPLE_SIZE environment variable. No default sample will be used."
785+
)
