-
Notifications
You must be signed in to change notification settings - Fork 438
feat(lancedb): add VectorIndexMethod support for HNSW and IVF Flat indexes #1409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(lancedb): add VectorIndexMethod support for HNSW and IVF Flat indexes #1409
Conversation
…dexes Add support for configuring vector index methods in LanceDB target: - HNSW (Hierarchical Navigable Small World) with configurable m and ef_construction parameters, mapped to LanceDB's HnswPq index - IVF Flat with configurable lists (num_partitions) parameter, mapped to LanceDB's IvfFlat index When no method is specified, defaults to HNSW (HnswPq) for backward compatibility. Also adds unit tests for the _create_vector_index_config helper function. Fixes cocoindex-io#1054
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds support for configuring vector index methods (HNSW and IVF Flat) in the LanceDB target, allowing users to choose between different indexing strategies and customize their parameters.
Key Changes:
- Added
_create_vector_index_config()helper function that maps VectorIndexMethod to LanceDB index configurations (HnswPq or IvfFlat) - Extended
_VectorIndexdataclass with amethodfield to store the index method - Modified index naming to include the method type suffix for better identification
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| python/cocoindex/targets/lancedb.py | Added VectorIndexMethod support with helper function to create appropriate LanceDB index configs; updated index naming to include method suffix; integrated method configuration into apply_setup_change |
| python/cocoindex/tests/test_lancedb_index.py | Added unit tests for _create_vector_index_config with standalone copies of classes to avoid Rust engine dependency |
Comments suppressed due to low confidence (1)
python/cocoindex/targets/lancedb.py:482
- The RuntimeError constructor is being called with two positional arguments: the error message string and
index.name. However, RuntimeError (and most exception classes) typically only take a single message argument. This will cause theindex.nameto be stored as a separate tuple element in the exception's args, which is likely not the intended behavior.
Consider including index.name in the formatted error message string instead:
raise RuntimeError(
f"Exception in creating index '{index.name}' on field {index.field_name}. "
f"This may be caused by a limitation of LanceDB, "
f"which requires data existing in the table to train the index. "
f"See: https://github.com/lancedb/lance/issues/4034"
) from e raise RuntimeError(
f"Exception in creating index on field {index.field_name}. "
f"This may be caused by a limitation of LanceDB, "
f"which requires data existing in the table to train the index. "
f"See: https://github.com/lancedb/lance/issues/4034",
index.name,
) from e
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| from dataclasses import dataclass | ||
| from typing import Any, Union, Optional | ||
|
|
||
| # Skip all tests if lancedb is not installed | ||
| lancedb = pytest.importorskip("lancedb") | ||
|
|
||
| # Import lancedb index types for assertion | ||
| from lancedb.index import HnswPq, IvfFlat # type: ignore | ||
|
|
||
|
|
||
| # Standalone copies of VectorIndexMethod classes for testing | ||
| # These mirror the definitions in cocoindex/index.py | ||
| @dataclass | ||
| class HnswVectorIndexMethod: | ||
| """HNSW vector index parameters.""" | ||
| kind: str = "Hnsw" | ||
| m: Optional[int] = None | ||
| ef_construction: Optional[int] = None | ||
|
|
||
|
|
||
| @dataclass | ||
| class IvfFlatVectorIndexMethod: | ||
| """IVFFlat vector index parameters.""" | ||
| kind: str = "IvfFlat" | ||
| lists: Optional[int] = None | ||
|
|
||
|
|
||
| VectorIndexMethod = Union[HnswVectorIndexMethod, IvfFlatVectorIndexMethod] | ||
|
|
||
|
|
||
| def _create_vector_index_config( | ||
| method: Optional[VectorIndexMethod], | ||
| distance_type: str, | ||
| ) -> Union[HnswPq, IvfFlat]: | ||
| """ | ||
| Create the appropriate LanceDB index configuration based on the VectorIndexMethod. | ||
| This mirrors the function from lancedb.py for testing without engine dependency. | ||
| """ | ||
| if method is None: | ||
| return HnswPq(distance_type=distance_type) | ||
|
|
||
| if isinstance(method, HnswVectorIndexMethod): | ||
| kwargs: dict[str, Any] = {"distance_type": distance_type} | ||
| if method.m is not None: | ||
| kwargs["m"] = method.m | ||
| if method.ef_construction is not None: | ||
| kwargs["ef_construction"] = method.ef_construction | ||
| return HnswPq(**kwargs) | ||
|
|
||
| if isinstance(method, IvfFlatVectorIndexMethod): | ||
| kwargs = {"distance_type": distance_type} | ||
| if method.lists is not None: | ||
| kwargs["num_partitions"] = method.lists | ||
| return IvfFlat(**kwargs) | ||
|
|
||
| return HnswPq(distance_type=distance_type) | ||
|
|
||
|
|
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dataclass definitions and _create_vector_index_config function are duplicated from the main implementation. This creates a maintainability issue where changes to the actual implementation won't be reflected in the tests, potentially causing the tests to pass even when the actual code is broken.
Consider importing the actual classes from cocoindex.index and the function from cocoindex.targets.lancedb instead of duplicating them. If the Rust engine dependency is a concern for CI, you can use pytest.importorskip for the cocoindex module as well, or structure the tests differently to test the actual implementation.
| from dataclasses import dataclass | |
| from typing import Any, Union, Optional | |
| # Skip all tests if lancedb is not installed | |
| lancedb = pytest.importorskip("lancedb") | |
| # Import lancedb index types for assertion | |
| from lancedb.index import HnswPq, IvfFlat # type: ignore | |
| # Standalone copies of VectorIndexMethod classes for testing | |
| # These mirror the definitions in cocoindex/index.py | |
| @dataclass | |
| class HnswVectorIndexMethod: | |
| """HNSW vector index parameters.""" | |
| kind: str = "Hnsw" | |
| m: Optional[int] = None | |
| ef_construction: Optional[int] = None | |
| @dataclass | |
| class IvfFlatVectorIndexMethod: | |
| """IVFFlat vector index parameters.""" | |
| kind: str = "IvfFlat" | |
| lists: Optional[int] = None | |
| VectorIndexMethod = Union[HnswVectorIndexMethod, IvfFlatVectorIndexMethod] | |
| def _create_vector_index_config( | |
| method: Optional[VectorIndexMethod], | |
| distance_type: str, | |
| ) -> Union[HnswPq, IvfFlat]: | |
| """ | |
| Create the appropriate LanceDB index configuration based on the VectorIndexMethod. | |
| This mirrors the function from lancedb.py for testing without engine dependency. | |
| """ | |
| if method is None: | |
| return HnswPq(distance_type=distance_type) | |
| if isinstance(method, HnswVectorIndexMethod): | |
| kwargs: dict[str, Any] = {"distance_type": distance_type} | |
| if method.m is not None: | |
| kwargs["m"] = method.m | |
| if method.ef_construction is not None: | |
| kwargs["ef_construction"] = method.ef_construction | |
| return HnswPq(**kwargs) | |
| if isinstance(method, IvfFlatVectorIndexMethod): | |
| kwargs = {"distance_type": distance_type} | |
| if method.lists is not None: | |
| kwargs["num_partitions"] = method.lists | |
| return IvfFlat(**kwargs) | |
| return HnswPq(distance_type=distance_type) | |
| # Skip all tests if lancedb is not installed | |
| lancedb = pytest.importorskip("lancedb") | |
| # Skip tests if cocoindex (which requires the Rust engine) is not available | |
| pytest.importorskip("cocoindex") | |
| # Import lancedb index types for assertion | |
| from lancedb.index import HnswPq, IvfFlat # type: ignore | |
| # Import actual VectorIndexMethod classes and configuration function | |
| from cocoindex.index import HnswVectorIndexMethod, IvfFlatVectorIndexMethod | |
| from cocoindex.targets.lancedb import _create_vector_index_config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix this comment to avoid duplication
| m: Optional[int] = None | ||
| ef_construction: Optional[int] = None | ||
|
|
||
|
|
||
| @dataclass | ||
| class IvfFlatVectorIndexMethod: | ||
| """IVFFlat vector index parameters.""" | ||
| kind: str = "IvfFlat" | ||
| lists: Optional[int] = None |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type annotation inconsistency: This file uses Optional[int] (lines 29, 30, 37) while the actual implementation in cocoindex/index.py uses int | None (lines 17-18, 26). For consistency with the codebase, use int | None instead of Optional[int].
Note that Optional is not imported from typing either (line 14 shows only Any, Union, Optional but Optional should be used consistently across the codebase or not at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix this comment. Prefer int | None here.
| Note: This test file uses standalone copies of the VectorIndexMethod classes | ||
| and helper functions to avoid importing the cocoindex module, which requires | ||
| the Rust engine to be built. When running in CI, the full module will be tested. | ||
| """ |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states these are "standalone copies... to avoid importing the cocoindex module, which requires the Rust engine to be built. When running in CI, the full module will be tested." However, there's no evidence in the test file of any conditional testing for CI that would test the actual implementation. This means the actual _create_vector_index_config function in lancedb.py is never tested by these unit tests.
Either implement the CI-specific tests that import and test the actual implementation, or restructure the tests to import from the actual modules (using pytest.importorskip if needed).
python/cocoindex/targets/lancedb.py
Outdated
| [ | ||
| _VectorIndex( | ||
| name=f"__{index.field_name}__{_LANCEDB_VECTOR_METRIC[index.metric]}__idx", | ||
| name=f"__{index.field_name}__{_LANCEDB_VECTOR_METRIC[index.metric]}__{_get_method_suffix(index.method)}__idx", |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index naming strategy has changed to include the method suffix (e.g., __field__cosine__hnsw_pq__idx). This is a breaking change that will cause existing indexes to be dropped and recreated when upgrading, since the old index names don't include the method suffix.
While this may be intentional to support the new functionality, it's worth noting that:
- This could cause downtime during the migration as indexes are rebuilt
- For large datasets, index creation can be time-consuming and resource-intensive
- Users should be warned about this breaking change in the release notes
Consider documenting this breaking change clearly, or implementing a migration strategy that handles both old and new index naming conventions during a transition period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change. Let's avoid adding this suffix for the default method, so existing index name won't change.
Add blank lines after docstrings in dataclass definitions to comply with ruff format rules. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove duplicate class definitions in test file, import from actual modules - Avoid breaking change: no suffix for default HNSW method to maintain backward compatibility 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
@majiayu000 sorry for the late reply. Code change looks good! There's a format problem. Can you reformat using |
|
@georgeh0 TFTR!Have fixed all format use ruff |
|
thanks a lot @majiayu000 for the improving the lancedb target! |
Summary
Add support for configuring vector index methods in LanceDB target:
HnswPqindex with configurablemandef_constructionparametersIvfFlatindex with configurablelists(num_partitions) parameterWhen no method is specified, defaults to HNSW (HnswPq) for backward compatibility.
Changes
Modified
python/cocoindex/targets/lancedb.py:HnswPqandIvfFlatfromlancedb.indexmethodfield to_VectorIndexdataclass_create_vector_index_config()helper function to create appropriate index configget_setup_state()to include method in vector indexesapply_setup_change()to use the method-aware index configurationAdded
python/cocoindex/tests/test_lancedb_index.py:_create_vector_index_config()functionTest Plan
_create_vector_index_config()pass locallyFixes #1054