Skip to content

Conversation

@majiayu000
Copy link
Contributor

Summary

Add support for configuring vector index methods in LanceDB target:

  • HNSW: Mapped to LanceDB's HnswPq index with configurable m and ef_construction parameters
  • IVF Flat: Mapped to LanceDB's IvfFlat index with configurable lists (num_partitions) parameter

When no method is specified, defaults to HNSW (HnswPq) for backward compatibility.

Changes

  • Modified python/cocoindex/targets/lancedb.py:

    • Added imports for HnswPq and IvfFlat from lancedb.index
    • Added method field to _VectorIndex dataclass
    • Added _create_vector_index_config() helper function to create appropriate index config
    • Updated get_setup_state() to include method in vector indexes
    • Updated apply_setup_change() to use the method-aware index configuration
  • Added python/cocoindex/tests/test_lancedb_index.py:

    • Unit tests for _create_vector_index_config() function
    • Tests cover default behavior, HNSW params, IVF Flat params, and all distance types

Test Plan

  • Unit tests for _create_vector_index_config() pass locally
  • Integration tests with actual LanceDB (requires Rust engine build)
  • Verify existing tests still pass in CI

Fixes #1054

…dexes

Add support for configuring vector index methods in LanceDB target:

- HNSW (Hierarchical Navigable Small World) with configurable m and
  ef_construction parameters, mapped to LanceDB's HnswPq index
- IVF Flat with configurable lists (num_partitions) parameter, mapped
  to LanceDB's IvfFlat index

When no method is specified, defaults to HNSW (HnswPq) for backward
compatibility.

Also adds unit tests for the _create_vector_index_config helper function.

Fixes cocoindex-io#1054
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for configuring vector index methods (HNSW and IVF Flat) in the LanceDB target, allowing users to choose between different indexing strategies and customize their parameters.

Key Changes:

  • Added _create_vector_index_config() helper function that maps VectorIndexMethod to LanceDB index configurations (HnswPq or IvfFlat)
  • Extended _VectorIndex dataclass with a method field to store the index method
  • Modified index naming to include the method type suffix for better identification

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
python/cocoindex/targets/lancedb.py Added VectorIndexMethod support with helper function to create appropriate LanceDB index configs; updated index naming to include method suffix; integrated method configuration into apply_setup_change
python/cocoindex/tests/test_lancedb_index.py Added unit tests for _create_vector_index_config with standalone copies of classes to avoid Rust engine dependency
Comments suppressed due to low confidence (1)

python/cocoindex/targets/lancedb.py:482

  • The RuntimeError constructor is being called with two positional arguments: the error message string and index.name. However, RuntimeError (and most exception classes) typically only take a single message argument. This will cause the index.name to be stored as a separate tuple element in the exception's args, which is likely not the intended behavior.

Consider including index.name in the formatted error message string instead:

raise RuntimeError(
    f"Exception in creating index '{index.name}' on field {index.field_name}. "
    f"This may be caused by a limitation of LanceDB, "
    f"which requires data existing in the table to train the index. "
    f"See: https://github.com/lancedb/lance/issues/4034"
) from e
                    raise RuntimeError(
                        f"Exception in creating index on field {index.field_name}. "
                        f"This may be caused by a limitation of LanceDB, "
                        f"which requires data existing in the table to train the index. "
                        f"See: https://github.com/lancedb/lance/issues/4034",
                        index.name,
                    ) from e

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 13 to 70
from dataclasses import dataclass
from typing import Any, Union, Optional

# Skip all tests if lancedb is not installed
lancedb = pytest.importorskip("lancedb")

# Import lancedb index types for assertion
from lancedb.index import HnswPq, IvfFlat # type: ignore


# Standalone copies of VectorIndexMethod classes for testing
# These mirror the definitions in cocoindex/index.py
@dataclass
class HnswVectorIndexMethod:
"""HNSW vector index parameters."""
kind: str = "Hnsw"
m: Optional[int] = None
ef_construction: Optional[int] = None


@dataclass
class IvfFlatVectorIndexMethod:
"""IVFFlat vector index parameters."""
kind: str = "IvfFlat"
lists: Optional[int] = None


VectorIndexMethod = Union[HnswVectorIndexMethod, IvfFlatVectorIndexMethod]


def _create_vector_index_config(
method: Optional[VectorIndexMethod],
distance_type: str,
) -> Union[HnswPq, IvfFlat]:
"""
Create the appropriate LanceDB index configuration based on the VectorIndexMethod.
This mirrors the function from lancedb.py for testing without engine dependency.
"""
if method is None:
return HnswPq(distance_type=distance_type)

if isinstance(method, HnswVectorIndexMethod):
kwargs: dict[str, Any] = {"distance_type": distance_type}
if method.m is not None:
kwargs["m"] = method.m
if method.ef_construction is not None:
kwargs["ef_construction"] = method.ef_construction
return HnswPq(**kwargs)

if isinstance(method, IvfFlatVectorIndexMethod):
kwargs = {"distance_type": distance_type}
if method.lists is not None:
kwargs["num_partitions"] = method.lists
return IvfFlat(**kwargs)

return HnswPq(distance_type=distance_type)


Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dataclass definitions and _create_vector_index_config function are duplicated from the main implementation. This creates a maintainability issue where changes to the actual implementation won't be reflected in the tests, potentially causing the tests to pass even when the actual code is broken.

Consider importing the actual classes from cocoindex.index and the function from cocoindex.targets.lancedb instead of duplicating them. If the Rust engine dependency is a concern for CI, you can use pytest.importorskip for the cocoindex module as well, or structure the tests differently to test the actual implementation.

Suggested change
from dataclasses import dataclass
from typing import Any, Union, Optional
# Skip all tests if lancedb is not installed
lancedb = pytest.importorskip("lancedb")
# Import lancedb index types for assertion
from lancedb.index import HnswPq, IvfFlat # type: ignore
# Standalone copies of VectorIndexMethod classes for testing
# These mirror the definitions in cocoindex/index.py
@dataclass
class HnswVectorIndexMethod:
"""HNSW vector index parameters."""
kind: str = "Hnsw"
m: Optional[int] = None
ef_construction: Optional[int] = None
@dataclass
class IvfFlatVectorIndexMethod:
"""IVFFlat vector index parameters."""
kind: str = "IvfFlat"
lists: Optional[int] = None
VectorIndexMethod = Union[HnswVectorIndexMethod, IvfFlatVectorIndexMethod]
def _create_vector_index_config(
method: Optional[VectorIndexMethod],
distance_type: str,
) -> Union[HnswPq, IvfFlat]:
"""
Create the appropriate LanceDB index configuration based on the VectorIndexMethod.
This mirrors the function from lancedb.py for testing without engine dependency.
"""
if method is None:
return HnswPq(distance_type=distance_type)
if isinstance(method, HnswVectorIndexMethod):
kwargs: dict[str, Any] = {"distance_type": distance_type}
if method.m is not None:
kwargs["m"] = method.m
if method.ef_construction is not None:
kwargs["ef_construction"] = method.ef_construction
return HnswPq(**kwargs)
if isinstance(method, IvfFlatVectorIndexMethod):
kwargs = {"distance_type": distance_type}
if method.lists is not None:
kwargs["num_partitions"] = method.lists
return IvfFlat(**kwargs)
return HnswPq(distance_type=distance_type)
# Skip all tests if lancedb is not installed
lancedb = pytest.importorskip("lancedb")
# Skip tests if cocoindex (which requires the Rust engine) is not available
pytest.importorskip("cocoindex")
# Import lancedb index types for assertion
from lancedb.index import HnswPq, IvfFlat # type: ignore
# Import actual VectorIndexMethod classes and configuration function
from cocoindex.index import HnswVectorIndexMethod, IvfFlatVectorIndexMethod
from cocoindex.targets.lancedb import _create_vector_index_config

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix this comment to avoid duplication

Comment on lines 29 to 37
m: Optional[int] = None
ef_construction: Optional[int] = None


@dataclass
class IvfFlatVectorIndexMethod:
"""IVFFlat vector index parameters."""
kind: str = "IvfFlat"
lists: Optional[int] = None
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type annotation inconsistency: This file uses Optional[int] (lines 29, 30, 37) while the actual implementation in cocoindex/index.py uses int | None (lines 17-18, 26). For consistency with the codebase, use int | None instead of Optional[int].

Note that Optional is not imported from typing either (line 14 shows only Any, Union, Optional but Optional should be used consistently across the codebase or not at all).

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix this comment. Prefer int | None here.

Comment on lines 7 to 10
Note: This test file uses standalone copies of the VectorIndexMethod classes
and helper functions to avoid importing the cocoindex module, which requires
the Rust engine to be built. When running in CI, the full module will be tested.
"""
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states these are "standalone copies... to avoid importing the cocoindex module, which requires the Rust engine to be built. When running in CI, the full module will be tested." However, there's no evidence in the test file of any conditional testing for CI that would test the actual implementation. This means the actual _create_vector_index_config function in lancedb.py is never tested by these unit tests.

Either implement the CI-specific tests that import and test the actual implementation, or restructure the tests to import from the actual modules (using pytest.importorskip if needed).

Copilot uses AI. Check for mistakes.
[
_VectorIndex(
name=f"__{index.field_name}__{_LANCEDB_VECTOR_METRIC[index.metric]}__idx",
name=f"__{index.field_name}__{_LANCEDB_VECTOR_METRIC[index.metric]}__{_get_method_suffix(index.method)}__idx",
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index naming strategy has changed to include the method suffix (e.g., __field__cosine__hnsw_pq__idx). This is a breaking change that will cause existing indexes to be dropped and recreated when upgrading, since the old index names don't include the method suffix.

While this may be intentional to support the new functionality, it's worth noting that:

  1. This could cause downtime during the migration as indexes are rebuilt
  2. For large datasets, index creation can be time-consuming and resource-intensive
  3. Users should be warned about this breaking change in the release notes

Consider documenting this breaking change clearly, or implementing a migration strategy that handles both old and new index naming conventions during a transition period.

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. Let's avoid adding this suffix for the default method, so existing index name won't change.

majiayu000 and others added 2 commits December 27, 2025 08:37
Add blank lines after docstrings in dataclass definitions to comply with ruff format rules.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove duplicate class definitions in test file, import from actual modules
- Avoid breaking change: no suffix for default HNSW method to maintain backward compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@georgeh0
Copy link
Member

@majiayu000 sorry for the late reply.

Code change looks good! There's a format problem. Can you reformat using ruff format and commit the format fix? Thanks!

@majiayu000
Copy link
Contributor Author

@georgeh0 TFTR!Have fixed all format use ruff

@badmonster0
Copy link
Member

thanks a lot @majiayu000 for the improving the lancedb target!

@badmonster0 badmonster0 merged commit 6d701e3 into cocoindex-io:main Jan 30, 2026
11 checks passed
@majiayu000 majiayu000 deleted the feat/lancedb-vector-index-method branch January 30, 2026 05:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] support VectorIndexMethod in LanceDB

3 participants