[Bug]: mh_lsh_band parameter lacks validation - accepts invalid values (0, -1, >num_hashes) #47748

@zhuwenxing

Is there an existing issue for this?

  • I have searched the existing issues

Environment

  • Milvus version: master
  • Deployment mode: standalone
  • MQ type: rocksmq
  • SDK version: pymilvus 2.7.0rc136
  • OS: Linux (K8s deployment)

Current Behavior

When creating a MINHASH_LSH index, the mh_lsh_band parameter accepts clearly invalid values without any error or warning:

  1. mh_lsh_band=0: Server accepts it, index is created, and search returns results (likely using a fallback/default)
  2. mh_lsh_band=-1: Server accepts it, index is created, and search returns results
  3. mh_lsh_band > num_hashes (e.g., band=26 with num_hashes=16): Server accepts it, index is created, and search returns results

All three cases should be rejected with clear validation errors.

Expected Behavior

The server should validate mh_lsh_band at index creation time:

  • mh_lsh_band must be a positive integer (> 0)
  • mh_lsh_band must not exceed num_hashes (from the MinHash function params)
  • mh_lsh_band should ideally be a divisor of num_hashes, so that the signature splits into equal-sized bands with no hash values left over

Invalid values should be rejected with a clear error message at index creation time.
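
To illustrate why these three constraints matter, here is a minimal sketch of the band arithmetic. It assumes the textbook LSH banding scheme, where a signature of num_hashes values is split into mh_lsh_band bands of equal size; I have not confirmed that Knowhere computes the layout this way, and band_layout is a hypothetical helper, not a Milvus or Knowhere function.

```python
def band_layout(num_hashes, band):
    """Return (rows_per_band, unused_hashes) under textbook LSH banding.

    Assumption: this mirrors the standard scheme only, not Knowhere's
    actual implementation. Returns None when no layout exists.
    """
    if band <= 0:
        # Zero or negative band counts cannot partition the signature.
        return None
    rows = num_hashes // band            # hash values per band
    unused = num_hashes - rows * band    # hash values that fit in no band
    return rows, unused


# The three values from this report, plus a divisor and a non-divisor.
for band in (0, -1, 26, 4, 5):
    print(f"num_hashes=16, mh_lsh_band={band}: {band_layout(16, band)}")
```

Under this assumption, band=26 with num_hashes=16 leaves zero rows per band, and a non-divisor such as 5 leaves one hash value unused.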

Steps To Reproduce

from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://<host>:19530")

schema = client.create_schema(enable_dynamic_field=False)
schema.add_field("id", DataType.INT64, is_primary=True, auto_id=False)
schema.add_field("text", DataType.VARCHAR, max_length=65535)
schema.add_field("minhash_sig", DataType.BINARY_VECTOR, dim=512)

schema.add_function(Function(
    name="text_to_minhash",
    function_type=FunctionType.MINHASH,
    input_field_names=["text"],
    output_field_names=["minhash_sig"],
    params={"num_hashes": 16, "shingle_size": 3},
))

# Case 1: mh_lsh_band = 0
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="minhash_sig",
    index_type="MINHASH_LSH",
    metric_type="MHJACCARD",
    params={"mh_lsh_band": 0},  # Should be rejected
)
client.create_collection("test_band_0", schema=schema, index_params=index_params)
# No error! Index created successfully.

# Case 2: mh_lsh_band = -1
index_params2 = client.prepare_index_params()
index_params2.add_index(
    field_name="minhash_sig",
    index_type="MINHASH_LSH",
    metric_type="MHJACCARD",
    params={"mh_lsh_band": -1},  # Should be rejected
)
client.create_collection("test_band_neg", schema=schema, index_params=index_params2)
# No error! Index created successfully.

# Case 3: mh_lsh_band = 26 > num_hashes = 16
index_params3 = client.prepare_index_params()
index_params3.add_index(
    field_name="minhash_sig",
    index_type="MINHASH_LSH",
    metric_type="MHJACCARD",
    params={"mh_lsh_band": 26},  # Should be rejected (> num_hashes=16)
)
client.create_collection("test_band_exceed", schema=schema, index_params=index_params3)
# No error! Index created successfully.

Anything else?

Root cause analysis:

There is no validation for mh_lsh_band at any layer:

  1. Proxy layer: No parameter range check when processing CreateIndex request
  2. Index node: No validation when building the MINHASH_LSH index
  3. Knowhere: The underlying index library silently accepts invalid band values

The fix should add validation at the proxy layer (index creation path) or in the Knowhere MINHASH_LSH index config to ensure:

  • mh_lsh_band > 0
  • mh_lsh_band <= num_hashes (requires cross-referencing the collection schema's MinHash function params)
  • Optionally warn if num_hashes % mh_lsh_band != 0 (non-divisor bands may cause suboptimal LSH behavior)
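
The three checks above could look roughly like the following. This is only a sketch in Python for readability: validate_mh_lsh_band is a hypothetical name, the real check would live in the proxy or in the Knowhere index config, and the error wording is made up for illustration.

```python
import warnings


def validate_mh_lsh_band(band, num_hashes):
    """Hypothetical validation of mh_lsh_band against num_hashes.

    Raises ValueError for values that should be rejected and emits a
    warning for a band count that does not divide num_hashes evenly.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(band, bool) or not isinstance(band, int):
        raise ValueError(f"mh_lsh_band must be an integer, got {band!r}")
    if band <= 0:
        raise ValueError(f"mh_lsh_band must be > 0, got {band}")
    if band > num_hashes:
        raise ValueError(
            f"mh_lsh_band ({band}) must not exceed num_hashes ({num_hashes})"
        )
    if num_hashes % band != 0:
        warnings.warn(
            f"mh_lsh_band ({band}) is not a divisor of num_hashes ({num_hashes})"
        )
    return band


# The three cases from the reproduction, plus one valid value.
for band in (0, -1, 26, 4):
    try:
        validate_mh_lsh_band(band, num_hashes=16)
        print(f"mh_lsh_band={band}: accepted")
    except ValueError as exc:
        print(f"mh_lsh_band={band}: rejected ({exc})")
```

The num_hashes value has to come from the MinHash function params in the collection schema, which is why the second check is easier to do at the proxy layer than inside the index library alone.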
