Merged
135 changes: 85 additions & 50 deletions docs/integrations/data/duckdb.mdx
@@ -1,73 +1,108 @@
---
title: "DuckDB"
sidebarTitle: "DuckDB"

description: "Learn how to use the DuckDB-Lance extension to query Lance tables with SQL."
---

import {
PyPlatformsDuckdbCreateTable,
PyPlatformsDuckdbMeanPrice,
PyPlatformsDuckdbQueryTable,
} from '/snippets/integrations.mdx';
LanceDB integrates with [DuckDB](https://duckdb.org/) through the DuckDB Lance extension. On this page, we show how LanceDB manages the table lifecycle while DuckDB provides SQL analytics (including joins) and search over those tables.

Earlier versions of LanceDB recommended converting Lance tables to Arrow tables via `table.to_arrow()`. This method is still available, because DuckDB [natively scans Arrow tables](https://duckdb.org/2021/12/03/duck-arrow), but it is no longer the recommended workflow for working with Lance tables in DuckDB. This page shows how to use the Lance extension with namespace-attached LanceDB tables, which lets you push SQL queries down directly to the Lance layer.

<Badge color="purple">OSS-only</Badge>

In Python, LanceDB tables can also be queried with [DuckDB](https://duckdb.org/), an in-process SQL OLAP database.
This means you can write complex SQL queries to analyze your data in LanceDB.
## Install

The integration is done via [Apache Arrow](https://duckdb.org/docs/guides/python/sql_on_arrow), which provides
zero-copy data sharing between LanceDB and DuckDB. DuckDB is capable of passing down column selections and basic
filters to LanceDB, reducing the amount of data that needs to be scanned to perform your query. Finally, the
integration allows streaming data from LanceDB tables, allowing you to aggregate tables that don't fit into
memory.
Install the DuckDB CLI by following the [DuckDB installation docs](https://duckdb.org/install). Alternatively, install the DuckDB Python package with `pip install duckdb`.

<Tip>
**DuckDB quacks Arrow**
Then, open the DuckDB CLI and install and load the Lance extension as follows:

All of this uses the same mechanism described in DuckDB's [blog post](https://duckdb.org/2021/12/03/duck-arrow.html)
on how it integrates with Apache Arrow.
</Tip>
```sql SQL icon="database"
INSTALL lance;
LOAD lance;
```

We can demonstrate this by first installing `duckdb` and `lancedb`.
## Attach the directory namespace in DuckDB

<CodeBlock filename="bash" language="bash" icon="terminal">
pip install duckdb lancedb
</CodeBlock>
Attach the LanceDB root directory as a Lance namespace:

We will re-use the dataset [created previously](/integrations/data/pandas_and_pyarrow/):
```sql SQL icon="database"
ATTACH './local_lancedb' AS lance_ns (TYPE LANCE);
```
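
The install, load, and attach steps above can also be run from the `duckdb` Python package instead of the CLI. The following is a minimal sketch, assuming the Lance extension installs from Python the same way it does in the CLI. The helper names `lance_setup_statements` and `open_lance_namespace` are hypothetical and not part of DuckDB or LanceDB.

```python
from __future__ import annotations


def lance_setup_statements(root: str, alias: str = "lance_ns") -> list[str]:
    """Return the SQL statements from this page that install, load, and attach."""
    return [
        "INSTALL lance;",
        "LOAD lance;",
        f"ATTACH '{root}' AS {alias} (TYPE LANCE);",
    ]


def open_lance_namespace(root: str, alias: str = "lance_ns"):
    """Open an in-memory DuckDB connection with the Lance namespace attached.

    Hypothetical helper: it only replays the SQL statements shown on this page.
    """
    import duckdb  # imported lazily so the statement builder works without it

    con = duckdb.connect()
    for statement in lance_setup_statements(root, alias):
        con.execute(statement)
    return con


# Example usage (requires `pip install duckdb` and access to the extension):
# con = open_lance_namespace("./local_lancedb")
# rows = con.execute("SELECT * FROM lance_ns.main.lance_duck LIMIT 5").fetchall()
```

Keeping the statements in a plain list makes it easy to print or log exactly what is sent to DuckDB before running it.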

<CodeBlock filename="Python" language="Python" icon="python">
{PyPlatformsDuckdbCreateTable}
</CodeBlock>
On this page, tables are referenced as `lance_ns.main.<table_name>`, so the example table `lance_duck` is addressed as `lance_ns.main.lance_duck`.

The `to_lance` method converts the LanceDB table to a `LanceDataset`, which is accessible to DuckDB through the Arrow compatibility layer.
To query the resulting Lance dataset in DuckDB, all you need to do is reference the dataset by the same name in your SQL query.
## Write Lance table

<CodeBlock filename="Python" language="Python" icon="python">
{PyPlatformsDuckdbQueryTable}
</CodeBlock>
Create the `lance_duck` table using SQL and populate it with sample data:

```sql SQL icon="database"
CREATE OR REPLACE TABLE lance_ns.main.lance_duck AS
SELECT *
FROM (
VALUES
('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
) AS t(animal, noise, vector);
```

```
┌─────────────┬─────────┬────────┐
│ vector      │ item    │ price  │
│ float[]     │ varchar │ double │
├─────────────┼─────────┼────────┤
│ [3.1, 4.1]  │ foo     │ 10.0   │
│ [5.9, 26.5] │ bar     │ 20.0   │
└─────────────┴─────────┴────────┘
```

This table is the source of truth for all DuckDB queries below.

## Query the table with SQL

```sql SQL icon="database"
SELECT *
FROM lance_ns.main.lance_duck
LIMIT 5;
```
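
Because the attached table behaves like any other DuckDB relation, it can take part in joins. Below is a minimal sketch of joining `lance_ns.main.lance_duck` against a small inline relation from Python. It assumes `con` is a DuckDB connection (for example, from `duckdb.connect()`) on which the Lance extension is loaded and `lance_ns` is attached. The `habitat` values and the `run_join` helper are invented for illustration.

```python
# Join the Lance table against a small inline relation.
# The habitat values are made up for illustration only.
JOIN_QUERY = """
SELECT d.animal, d.noise, h.habitat
FROM lance_ns.main.lance_duck AS d
JOIN (
    VALUES ('duck', 'pond'), ('horse', 'field')
) AS h(animal, habitat)
  ON d.animal = h.animal
ORDER BY d.animal;
"""


def run_join(con):
    """Run JOIN_QUERY on an open DuckDB connection and return rows as tuples."""
    return con.execute(JOIN_QUERY).fetchall()
```

Since this is an inner join, animals without a matching `habitat` row are dropped from the result.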

You can very easily run any other DuckDB SQL queries on your data.
## Vector search

```sql SQL icon="database"
SELECT animal, noise, vector, _distance
FROM lance_vector_search(
'lance_ns.main.lance_duck',
'vector',
[0.8, 0.7, 0.2]::FLOAT[],
k = 1,
prefilter = true
)
ORDER BY _distance ASC;
```
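
When the query vector comes from application code rather than a hand-written literal, the SQL text above can be assembled programmatically. The sketch below builds the same `lance_vector_search` call from a Python list. The helper names `float_list_literal` and `vector_search_sql` are hypothetical, and plain string interpolation like this is only appropriate for trusted table and column names.

```python
from __future__ import annotations

from typing import Sequence


def float_list_literal(values: Sequence[float]) -> str:
    """Render a Python sequence as a DuckDB FLOAT[] literal."""
    return "[" + ", ".join(repr(float(v)) for v in values) + "]::FLOAT[]"


def vector_search_sql(table: str, column: str, query: Sequence[float], k: int = 1) -> str:
    """Build the vector search statement shown on this page for a given query vector.

    Table and column names are interpolated as-is, so pass trusted values only.
    """
    return (
        "SELECT animal, noise, vector, _distance\n"
        f"FROM lance_vector_search('{table}', '{column}', "
        f"{float_list_literal(query)}, k = {int(k)}, prefilter = true)\n"
        "ORDER BY _distance ASC;"
    )


# Example usage with an open DuckDB connection `con` (assumed, not created here):
# sql = vector_search_sql("lance_ns.main.lance_duck", "vector", [0.8, 0.7, 0.2])
# rows = con.execute(sql).fetchall()
```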

<CodeBlock filename="Python" language="Python" icon="python">
{PyPlatformsDuckdbMeanPrice}
</CodeBlock>
## Full-text search

```sql SQL icon="database"
SELECT animal, noise, vector, _score
FROM lance_fts(
'lance_ns.main.lance_duck',
'animal',
'the brave knight faced the dragon',
k = 1,
prefilter = true
)
ORDER BY _score DESC;
```

## Hybrid search

```sql SQL icon="database"
SELECT animal, noise, vector, _hybrid_score, _distance, _score
FROM lance_hybrid_search(
'lance_ns.main.lance_duck',
'vector',
[0.8, 0.7, 0.2]::FLOAT[],
'animal',
'the duck surprised the dragon',
k = 2,
prefilter = false,
alpha = 0.5,
oversample_factor = 4
)
ORDER BY _hybrid_score DESC;
```

```
┌─────────────┐
│ mean(price) │
│ double      │
├─────────────┤
│ 15.0        │
└─────────────┘
```

## Directory namespace model

A directory namespace maps a LanceDB catalog root to namespace-qualified table identifiers in DuckDB. This keeps table discovery and table naming stable as your project grows.

To learn more about the catalog and namespace model, see [Namespaces and the Catalog Model](/namespaces).