Commit a7ecf7f

Update duckdb extension docs (#174)
1 parent 5a45fbe commit a7ecf7f

1 file changed

+85
-50
lines changed

docs/integrations/data/duckdb.mdx

Lines changed: 85 additions & 50 deletions
@@ -1,73 +1,108 @@
11
---
22
title: "DuckDB"
33
sidebarTitle: "DuckDB"
4-
4+
description: "Learn how to use the DuckDB-Lance extension to query Lance tables with SQL."
55
---
66

7-
import {
8-
PyPlatformsDuckdbCreateTable,
9-
PyPlatformsDuckdbMeanPrice,
10-
PyPlatformsDuckdbQueryTable,
11-
} from '/snippets/integrations.mdx';
7+
LanceDB integrates with [DuckDB](https://duckdb.org/) through the DuckDB Lance extension. This page shows how the two fit together: LanceDB manages the table lifecycle, while DuckDB provides SQL analytics (including joins) and search over those tables.
8+
9+
Earlier versions of LanceDB recommended converting Lance tables to Arrow tables via `table.to_arrow()`. That method is still available (because DuckDB [natively scans Arrow tables](https://duckdb.org/2021/12/03/duck-arrow)), but it is no longer the recommended workflow for working with Lance tables in DuckDB. This page shows how to use the Lance extension with namespace-attached LanceDB tables, which lets you push SQL queries down directly to the Lance layer.
1210

13-
<Badge color="purple">OSS-only</Badge>
1411

15-
In Python, LanceDB tables can also be queried with [DuckDB](https://duckdb.org/), an in-process SQL OLAP database.
16-
This means you can write complex SQL queries to analyze your data in LanceDB.
12+
## Install
1713

18-
The integration is done via [Apache Arrow](https://duckdb.org/docs/guides/python/sql_on_arrow), which provides
19-
zero-copy data sharing between LanceDB and DuckDB. DuckDB is capable of passing down column selections and basic
20-
filters to LanceDB, reducing the amount of data that needs to be scanned to perform your query. Finally, the
21-
integration allows streaming data from LanceDB tables, allowing you to aggregate tables that don't fit into
22-
memory.
14+
Install the DuckDB CLI by following the [DuckDB installation docs](https://duckdb.org/install). Alternatively, install the DuckDB Python package with `pip install duckdb`.
2315

24-
<Tip>
25-
**DuckDB quacks Arrow**
16+
Then open the DuckDB CLI, and install and load the Lance extension:
2617

27-
All of this uses the same mechanism described in DuckDB's [blog post](https://duckdb.org/2021/12/03/duck-arrow.html)"
28-
on how it integrates with Apache Arrow.
29-
</Tip>
18+
```sql SQL icon="database"
INSTALL lance;
LOAD lance;
```
3022

31-
We can demonstrate this by first installing `duckdb` and `lancedb`.
23+
## Attach the directory namespace in DuckDB
3224

33-
<CodeBlock filename="bash" language="bash" icon="terminal">
34-
pip install duckdb lancedb
35-
</CodeBlock>
25+
Attach the LanceDB root directory as a Lance namespace:
3626

37-
We will re-use the dataset [created previously](/integrations/data/pandas_and_pyarrow/):
27+
```sql SQL icon="database"
ATTACH './local_lancedb' AS lance_ns (TYPE LANCE);
```
3830

39-
<CodeBlock filename="Python" language="Python" icon="python">
40-
{PyPlatformsDuckdbCreateTable}
41-
</CodeBlock>
31+
On this page, tables are referenced as `lance_ns.main.<table_name>`, so the `lance_duck` table used in the following sections is `lance_ns.main.lance_duck`.
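If you installed the Python package instead of the CLI, the same setup can be replayed through a connection object. The sketch below is only an illustration: the helper names `attach_lance_namespace` and `qualified_name` are hypothetical, and it assumes the Lance extension can be installed in your environment and that `./local_lancedb` is your LanceDB root directory.

```python
# Sketch only: replays the CLI statements from this page through a Python
# connection. The helper names below are hypothetical, not part of any library.
SETUP_STATEMENTS = [
    "INSTALL lance;",
    "LOAD lance;",
    "ATTACH './local_lancedb' AS lance_ns (TYPE LANCE);",
]


def qualified_name(table_name, namespace="lance_ns", schema="main"):
    """Build the namespace-qualified identifier used on this page."""
    return f"{namespace}.{schema}.{table_name}"


def attach_lance_namespace(con):
    """Run the install, load, and attach statements on an open connection.

    `con` is any object with an `execute(sql)` method, such as the
    connection returned by `duckdb.connect()`.
    """
    for statement in SETUP_STATEMENTS:
        con.execute(statement)
    return con
```

With the `duckdb` package installed, calling `attach_lance_namespace(duckdb.connect())` should leave you with a connection on which `qualified_name("lance_duck")` resolves to `lance_ns.main.lance_duck`.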
4232

43-
The `to_lance` method converts the LanceDB table to a `LanceDataset`, which is accessible to DuckDB through the Arrow compatibility layer.
44-
To query the resulting Lance dataset in DuckDB, all you need to do is reference the dataset by the same name in your SQL query.
33+
## Write a Lance table
4534

46-
<CodeBlock filename="Python" language="Python" icon="python">
47-
{PyPlatformsDuckdbQueryTable}
48-
</CodeBlock>
35+
Create the `lance_duck` table using SQL and populate it with sample data:
4936

37+
```sql SQL icon="database"
CREATE OR REPLACE TABLE lance_ns.main.lance_duck AS
SELECT *
FROM (
    VALUES
        ('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
        ('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
        ('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
) AS t(animal, noise, vector);
```
51-
┌─────────────┬─────────┬────────┐
52-
│ vector │ item │ price │
53-
│ float[] │ varchar │ double │
54-
├─────────────┼─────────┼────────┤
55-
│ [3.1, 4.1] │ foo │ 10.0 │
56-
│ [5.9, 26.5] │ bar │ 20.0 │
57-
└─────────────┴─────────┴────────┘
47+
48+
This table is the source of truth for all DuckDB queries below.
49+
50+
## Query the table with SQL
51+
52+
```sql SQL icon="database"
SELECT *
FROM lance_ns.main.lance_duck
LIMIT 5;
```
5957

60-
You can very easily run any other DuckDB SQL queries on your data.
58+
## Vector search
59+
60+
```sql SQL icon="database"
SELECT animal, noise, vector, _distance
FROM lance_vector_search(
    'lance_ns.main.lance_duck',
    'vector',
    [0.8, 0.7, 0.2]::FLOAT[],
    k = 1,
    prefilter = true
)
ORDER BY _distance ASC;
```
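To sanity-check which row this query should return, the nearest-neighbor ranking can be reproduced by hand. The sketch below uses plain Python and the three sample rows from the `lance_duck` table. It assumes Euclidean (L2) distance as the metric, which the source does not state. The `_distance` values reported by the extension may also be on a different scale (for example, squared), but that would not change the ranking.

```python
import math

# Sample rows copied from the `lance_duck` table created earlier on this page.
ROWS = {
    "duck": [0.9, 0.7, 0.1],
    "horse": [0.3, 0.1, 0.5],
    "dragon": [0.5, 0.2, 0.7],
}
QUERY = [0.8, 0.7, 0.2]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, rows, k=1):
    """Return the names of the k rows closest to `query`, nearest first."""
    ranked = sorted(rows, key=lambda name: l2_distance(query, rows[name]))
    return ranked[:k]


print(nearest(QUERY, ROWS, k=1))  # prints ['duck']
```

Under that assumption, the squared distances are 0.02 for `duck`, 0.59 for `dragon`, and 0.70 for `horse`, so a `k = 1` search should return the `duck` row.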
6171

62-
<CodeBlock filename="Python" language="Python" icon="python">
63-
{PyPlatformsDuckdbMeanPrice}
64-
</CodeBlock>
72+
## Full-text search
73+
74+
```sql SQL icon="database"
SELECT animal, noise, vector, _score
FROM lance_fts(
    'lance_ns.main.lance_duck',
    'animal',
    'the brave knight faced the dragon',
    k = 1,
    prefilter = true
)
ORDER BY _score DESC;
```
6585

86+
## Hybrid search
87+
88+
```sql SQL icon="database"
SELECT animal, noise, vector, _hybrid_score, _distance, _score
FROM lance_hybrid_search(
    'lance_ns.main.lance_duck',
    'vector',
    [0.8, 0.7, 0.2]::FLOAT[],
    'animal',
    'the duck surprised the dragon',
    k = 2,
    prefilter = false,
    alpha = 0.5,
    oversample_factor = 4
)
ORDER BY _hybrid_score DESC;
```
67-
┌─────────────┐
68-
│ mean(price) │
69-
│ double │
70-
├─────────────┤
71-
│ 15.0 │
72-
└─────────────┘
73-
```
103+
104+
## Directory namespace model
105+
106+
A directory namespace maps a LanceDB catalog root to namespace-qualified table identifiers in DuckDB. This keeps table discovery and table naming stable as your project grows.
107+
108+
To learn more about the catalog and namespace model, see [Namespaces and the Catalog Model](/namespaces).
