Commit 536ebb0

feat: support FTS for LanceDB
1 parent 8cf9d96 commit 536ebb0

12 files changed, +189 -9 lines changed

docs/docs/core/flow_def.mdx

Lines changed: 38 additions & 0 deletions
@@ -327,6 +327,44 @@ Following metrics are supported:
 | L2Distance | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
 | InnerProduct | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |

+### Full-Text Search (FTS) Index
+
+A *full-text search index* is specified by `fts_indexes` (`Sequence[FtsIndexDef]`). `FtsIndexDef` has the following fields:
+
+* `field_name`: the field to create the FTS index on.
+* `parameters` (optional): a dictionary of parameters to pass to the target's FTS index creation. The supported parameters vary by target.
+
+For example, with LanceDB:
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+@cocoindex.flow_def(name="DemoFlow")
+def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
+    ...
+    demo_collector = data_scope.add_collector()
+    ...
+    demo_collector.export(
+        "demo_target", DemoTargetSpec(...),
+        primary_key_fields=["id"],
+        fts_indexes=[
+            # Basic FTS index with default tokenizer
+            cocoindex.FtsIndexDef("content"),
+            # FTS index with custom tokenizer
+            cocoindex.FtsIndexDef("description", parameters={"language": "English"})
+        ])
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+
+FTS indexes are currently supported only for the LanceDB target, and only on its enterprise edition. Other targets raise an error if FTS indexes are specified.
+
+:::
+
 ## Miscellaneous

 ### Getting App Namespace

docs/docs/targets/lancedb.md

Lines changed: 39 additions & 6 deletions
@@ -20,7 +20,6 @@ Here's how CocoIndex data elements map to LanceDB elements during export:
 | a collected row | a row |
 | a field | a column |

-
 ::::info Installation and import

 This target is provided via an optional dependency `[lancedb]`:
@@ -41,14 +40,15 @@ import cocoindex.targets.lancedb as coco_lancedb

 The spec `coco_lancedb.LanceDB` takes the following fields:

-* `db_uri` (`str`, required): The LanceDB database location (e.g. `./lancedb_data`).
-* `table_name` (`str`, required): The name of the table to export the data to.
-* `db_options` (`coco_lancedb.DatabaseOptions`, optional): Advanced database options.
-* `storage_options` (`dict[str, Any]`, optional): Passed through to LanceDB when connecting.
+* `db_uri` (`str`, required): The LanceDB database location (e.g. `./lancedb_data`).
+* `table_name` (`str`, required): The name of the table to export the data to.
+* `db_options` (`coco_lancedb.DatabaseOptions`, optional): Advanced database options.
+* `storage_options` (`dict[str, Any]`, optional): Passed through to LanceDB when connecting.

 Additional notes:

-* Exactly one primary key field is required for LanceDB targets. We create B-Tree index on this key column.
+* Exactly one primary key field is required for LanceDB targets. We create B-Tree index on this key column.
+* **Full-Text Search (FTS) indexes** are supported via the `fts_indexes` parameter. Note that FTS functionality requires [LanceDB Enterprise](https://lancedb.com/docs/indexing/fts-index/). You can pass any parameters supported by the target's FTS index creation API (e.g., `tokenizer_name` for LanceDB). See [LanceDB FTS documentation](https://lancedb.com/docs/indexing/fts-index/) for full parameter details.

 :::info

@@ -59,6 +59,38 @@ If you want to use vector indexes, you can run the flow once to populate the tar

 You can find an end-to-end example here: [examples/text_embedding_lancedb](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_lancedb).

+### FTS Index Example
+
+```python
+import cocoindex
+import cocoindex.targets.lancedb as coco_lancedb
+
+@cocoindex.flow_def(name="DocumentSearchFlow")
+def document_search_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
+    # ... source and transformations ...
+
+    doc_collector = data_scope.add_collector()
+    # ... collect document data ...
+
+    doc_collector.export(
+        "documents",
+        coco_lancedb.LanceDB(
+            db_uri="./lancedb_data",
+            table_name="documents"
+        ),
+        primary_key_fields=["id"],
+        # Add FTS indexes for full-text search
+        fts_indexes=[
+            # Basic FTS index with default tokenizer
+            cocoindex.FtsIndexDef("content"),
+            # FTS index with stemming for better search recall
+            cocoindex.FtsIndexDef("description", parameters={"tokenizer_name": "en_stem"}),
+            # FTS index with position tracking for phrase searches
+            cocoindex.FtsIndexDef("title", parameters={"tokenizer_name": "default", "with_position": True})
+        ]
+    )
+```
+
 ## `connect_async()` helper

 We provide a helper to obtain a shared `AsyncConnection` that is reused across your process and shared with CocoIndex's writer for strong read-after-write consistency:
@@ -85,6 +117,7 @@ Once `db_uri` matches, it automatically reuses the same connection instance with
 This achieves strong consistency between your indexing and querying logic, if they run in the same process.

 ## Example
+
 <ExampleButton
   href="https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_lancedb"
   text="Text Embedding LanceDB Example"

examples/text_embedding_lancedb/main.py

Lines changed: 5 additions & 0 deletions
@@ -76,6 +76,11 @@ def text_embedding_flow(
         coco_lancedb.LanceDB(db_uri=LANCEDB_URI, table_name=LANCEDB_TABLE),
         primary_key_fields=["id"],
         vector_indexes=vector_indexes,
+        fts_indexes=[
+            cocoindex.FtsIndexDef(
+                field_name="text", parameters={"tokenizer_name": "simple"}
+            )
+        ],
     )


python/cocoindex/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -27,6 +27,7 @@
 from .lib import settings, init, start_server, stop
 from .llm import LlmSpec, LlmApiType
 from .index import (
+    FtsIndexDef,
     VectorSimilarityMetric,
     VectorIndexDef,
     IndexOptions,
@@ -95,6 +96,7 @@
     # Index
     "VectorSimilarityMetric",
     "VectorIndexDef",
+    "FtsIndexDef",
     "IndexOptions",
     "HnswVectorIndexMethod",
     "IvfFlatVectorIndexMethod",

python/cocoindex/flow.py

Lines changed: 2 additions & 0 deletions
@@ -407,6 +407,7 @@ def export(
         primary_key_fields: Sequence[str],
         attachments: Sequence[op.TargetAttachmentSpec] = (),
         vector_indexes: Sequence[index.VectorIndexDef] = (),
+        fts_indexes: Sequence[index.FtsIndexDef] = (),
         vector_index: Sequence[tuple[str, index.VectorSimilarityMetric]] = (),
         setup_by_user: bool = False,
     ) -> None:
@@ -432,6 +433,7 @@ def export(
         index_options = index.IndexOptions(
             primary_key_fields=primary_key_fields,
             vector_indexes=vector_indexes,
+            fts_indexes=fts_indexes,
         )
         self._flow_builder_state.engine_flow_builder.export(
             target_name,

python/cocoindex/index.py

Lines changed: 15 additions & 1 deletion
@@ -1,6 +1,6 @@
 from enum import Enum
 from dataclasses import dataclass
-from typing import Sequence, Union
+from typing import Sequence, Union, Any


 class VectorSimilarityMetric(Enum):
@@ -40,6 +40,19 @@ class VectorIndexDef:
     method: VectorIndexMethod | None = None


+@dataclass
+class FtsIndexDef:
+    """
+    Define a full-text search index on a field.
+
+    The parameters field can contain any keyword arguments supported by the target's
+    FTS index creation API (e.g., tokenizer_name for LanceDB).
+    """
+
+    field_name: str
+    parameters: dict[str, Any] | None = None
+
+
 @dataclass
 class IndexOptions:
     """
@@ -48,3 +61,4 @@ class IndexOptions:

     primary_key_fields: Sequence[str]
     vector_indexes: Sequence[VectorIndexDef] = ()
+    fts_indexes: Sequence[FtsIndexDef] = ()
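
To illustrate how the new definition behaves, here is a minimal standalone sketch. It re-declares `FtsIndexDef` locally instead of importing `cocoindex`, so it is a copy for illustration only, and it spells the type as `Optional[Dict[str, Any]]` purely so the snippet also runs on older Python versions. The field names and the `tokenizer_name` value are taken from the diff above.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class FtsIndexDef:
    """Local stand-in for the dataclass added in this commit (illustration only)."""

    field_name: str
    parameters: Optional[Dict[str, Any]] = None


# The field name can be passed positionally; parameters default to None.
basic = FtsIndexDef("content")

# Keyword form with target-specific parameters, as in the LanceDB docs example.
stemmed = FtsIndexDef(
    field_name="description", parameters={"tokenizer_name": "en_stem"}
)

print(asdict(basic))
print(asdict(stemmed))
```

Because it is a plain dataclass, two definitions with the same field name and parameters compare equal.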

python/cocoindex/targets/lancedb.py

Lines changed: 45 additions & 1 deletion
@@ -20,7 +20,7 @@
     VectorTypeSchema,
     TableType,
 )
-from ..index import VectorIndexDef, IndexOptions, VectorSimilarityMetric
+from ..index import VectorIndexDef, FtsIndexDef, IndexOptions, VectorSimilarityMetric

 _logger = logging.getLogger(__name__)

@@ -48,11 +48,19 @@ class _VectorIndex:
     metric: VectorSimilarityMetric


+@dataclasses.dataclass
+class _FtsIndex:
+    name: str
+    field_name: str
+    parameters: dict[str, Any] | None = None
+
+
 @dataclasses.dataclass
 class _State:
     key_field_schema: FieldSchema
     value_fields_schema: list[FieldSchema]
     vector_indexes: list[_VectorIndex] | None = None
+    fts_indexes: list[_FtsIndex] | None = None
     db_options: DatabaseOptions | None = None


@@ -318,6 +326,18 @@ def get_setup_state(
                 if index_options.vector_indexes is not None
                 else None
             ),
+            fts_indexes=(
+                [
+                    _FtsIndex(
+                        name=f"__{index.field_name}__fts__idx",
+                        field_name=index.field_name,
+                        parameters=index.parameters,
+                    )
+                    for index in index_options.fts_indexes
+                ]
+                if index_options.fts_indexes is not None
+                else None
+            ),
         )

     @staticmethod
@@ -412,6 +432,30 @@ async def apply_setup_change(
             if vector_index_name in existing_vector_indexes:
                 await table.drop_index(vector_index_name)

+        # Handle FTS indexes
+        unseen_prev_fts_indexes = {
+            index.name for index in (previous and previous.fts_indexes) or []
+        }
+        existing_fts_indexes = {index.name for index in await table.list_indices()}
+
+        for fts_index in current.fts_indexes or []:
+            if fts_index.name in unseen_prev_fts_indexes:
+                unseen_prev_fts_indexes.remove(fts_index.name)
+            else:
+                try:
+                    # Create FTS index using create_fts_index() API
+                    # Pass parameters as kwargs to support any future FTS index options
+                    kwargs = fts_index.parameters if fts_index.parameters else {}
+                    await table.create_fts_index(fts_index.field_name, **kwargs)
+                except Exception as e:  # pylint: disable=broad-exception-caught
+                    raise RuntimeError(
+                        f"Exception in creating FTS index on field {fts_index.field_name}: {e}"
+                    ) from e
+
+        for fts_index_name in unseen_prev_fts_indexes:
+            if fts_index_name in existing_fts_indexes:
+                await table.drop_index(fts_index_name)
+
     @staticmethod
     async def prepare(
         spec: LanceDB,
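
The FTS reconciliation loop in `apply_setup_change` can be read as a small set computation: create what is newly declared, and drop what was declared before but is gone now. The sketch below restates that logic as a pure function over index names so it can be run without LanceDB. `plan_fts_index_changes` and its arguments are hypothetical names introduced here for illustration, not part of the commit, and the sketch tracks names only, whereas the diff creates indexes by `field_name`.

```python
from typing import Iterable, List, Optional, Tuple


def plan_fts_index_changes(
    previous: Optional[Iterable[str]],
    current: Optional[Iterable[str]],
    existing: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Return (names_to_create, names_to_drop), following the loop in the diff."""
    unseen_prev = set(previous or [])
    existing_names = set(existing)

    to_create: List[str] = []
    for name in current or []:
        if name in unseen_prev:
            # Declared in the previous state too: keep it and create nothing.
            unseen_prev.remove(name)
        else:
            to_create.append(name)

    # Names left over were declared before but are no longer wanted. They are
    # dropped only if the table actually reports an index with that name.
    to_drop = sorted(name for name in unseen_prev if name in existing_names)
    return to_create, to_drop


create, drop = plan_fts_index_changes(
    previous=["__title__fts__idx", "__body__fts__idx"],
    current=["__body__fts__idx", "__summary__fts__idx"],
    existing=["__title__fts__idx", "__body__fts__idx"],
)
print(create, drop)
```

One consequence of comparing names only, as far as this hunk shows: an index whose `parameters` change while its field name stays the same matches its previous entry and is not recreated by this loop.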

rust/cocoindex/src/base/spec.rs

Lines changed: 31 additions & 1 deletion
@@ -459,12 +459,33 @@ impl fmt::Display for VectorIndexDef {
     }
 }

+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
+pub struct FtsIndexDef {
+    pub field_name: FieldName,
+    #[serde(default, skip_serializing_if = "Option::is_none")]
+    pub parameters: Option<serde_json::Map<String, serde_json::Value>>,
+}
+
+impl fmt::Display for FtsIndexDef {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match &self.parameters {
+            None => write!(f, "{}", self.field_name),
+            Some(params) => {
+                let params_str = serde_json::to_string(params).unwrap_or_else(|_| "{}".to_string());
+                write!(f, "{}:{}", self.field_name, params_str)
+            }
+        }
+    }
+}
+
 #[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
 pub struct IndexOptions {
     #[serde(default, skip_serializing_if = "Option::is_none")]
     pub primary_key_fields: Option<Vec<FieldName>>,
     #[serde(default, skip_serializing_if = "Vec::is_empty")]
     pub vector_indexes: Vec<VectorIndexDef>,
+    #[serde(default, skip_serializing_if = "Vec::is_empty")]
+    pub fts_indexes: Vec<FtsIndexDef>,
 }

 impl IndexOptions {
@@ -490,7 +511,16 @@ impl fmt::Display for IndexOptions {
             .map(|v| v.to_string())
             .collect::<Vec<_>>()
             .join(",");
-        write!(f, "keys={primary_keys}, indexes={vector_indexes}")
+        let fts_indexes = self
+            .fts_indexes
+            .iter()
+            .map(|f| f.to_string())
+            .collect::<Vec<_>>()
+            .join(",");
+        write!(
+            f,
+            "keys={primary_keys}, vector_indexes={vector_indexes}, fts_indexes={fts_indexes}"
+        )
     }
 }
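
For readers who want to see what the new `Display` output looks like, the following Python sketch mirrors the formatting rule in the Rust hunk above: the bare field name when there are no parameters, otherwise `field:{json}`, with entries joined by commas. `format_fts_index` and `format_fts_indexes` are hypothetical helpers written for this illustration. Compact JSON separators are used to approximate `serde_json::to_string`; for maps with several keys, the key order may differ from the Rust side depending on how `serde_json` is configured.

```python
import json
from typing import Any, Mapping, Optional, Sequence, Tuple


def format_fts_index(
    field_name: str, parameters: Optional[Mapping[str, Any]] = None
) -> str:
    """Mirror the Display rule: 'field' or 'field:{compact json}'."""
    if parameters is None:
        return field_name
    params_str = json.dumps(dict(parameters), separators=(",", ":"))
    return f"{field_name}:{params_str}"


def format_fts_indexes(
    defs: Sequence[Tuple[str, Optional[Mapping[str, Any]]]]
) -> str:
    """Join the individual entries with ',' as IndexOptions' Display does."""
    return ",".join(format_fts_index(name, params) for name, params in defs)


rendered = format_fts_indexes(
    [
        ("content", None),
        ("description", {"tokenizer_name": "en_stem"}),
    ]
)
print(rendered)
```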

rust/cocoindex/src/ops/targets/kuzu.rs

Lines changed: 3 additions & 0 deletions
@@ -772,6 +772,9 @@ impl TargetFactoryBase for Factory {
                 if !data_coll.index_options.vector_indexes.is_empty() {
                     api_bail!("Vector indexes are not supported for Kuzu yet");
                 }
+                if !data_coll.index_options.fts_indexes.is_empty() {
+                    api_bail!("FTS indexes are not supported for Kuzu target");
+                }
                 fn to_dep_table(
                     field_mapping: &AnalyzedGraphElementFieldMapping,
                 ) -> Result<ReferencedNodeTable> {

rust/cocoindex/src/ops/targets/neo4j.rs

Lines changed: 3 additions & 0 deletions
@@ -557,6 +557,9 @@ impl SetupState {
             .iter()
             .map(|f| (f.name.as_str(), &f.value_type.typ))
             .collect::<HashMap<_, _>>();
+        if !index_options.fts_indexes.is_empty() {
+            api_bail!("FTS indexes are not supported for Neo4j target");
+        }
         for index_def in index_options.vector_indexes.iter() {
             sub_components.push(ComponentState {
                 object_label: schema.elem_type.clone(),
