Commit 06a248d

author
Evert
committed
docs: comprehensive update to Python package development docs
This commit updates the Python package development documentation to provide complete guidance on building from source, setting up development environments, and troubleshooting common issues. Includes new sections for debugging, IDEs and building with extensions, and moves snippets previously in the Python package README. Related to duckdb/duckdb#17483
1 parent 4edcb9c commit 06a248d

File tree

1 file changed

+289
-30
lines changed

docs/stable/dev/building/python.md

Lines changed: 289 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -5,17 +5,291 @@ redirect_from:
55
title: Python
66
---
77

8-
In general, if you would like to build DuckDB from source, it's recommended to avoid using the `BUILD_PYTHON=1` flag unless you are actively developing the DuckDB Python client.
8+
The DuckDB Python package lives in the main [DuckDB source on GitHub](https://github.com/duckdb/duckdb/) under the `/tools/pythonpkg/` folder. It uses [pybind11](https://pybind11.readthedocs.io/en/stable/) to create Python bindings for DuckDB.
99

10-
## Python Package on macOS: Building the httpfs Extension Fails
10+
# Prerequisites
1111

12-
**Problem:**
13-
The build fails on macOS when both the [`httpfs` extension]({% link docs/stable/extensions/httpfs/overview.md %}) and the Python package are included:
12+
For everything described on this page we make the following assumptions:
1413

15-
```batch
16-
GEN=ninja BUILD_PYTHON=1 CORE_EXTENSIONS="httpfs" make
14+
1. You have a working copy of the DuckDB source (including the Git tags) and you run commands from the root of the source tree
15+
2. You have a suitable Python installation available in a dedicated virtual env
16+
17+
## 1. DuckDB code
18+
19+
Make sure you have checked out the [DuckDB source](https://github.com/duckdb/duckdb/) and that you are in its root. E.g.:
20+
21+
```bash
22+
$ git clone https://github.com/duckdb/duckdb.git
23+
...
24+
$ cd duckdb
25+
```
26+
27+
If you've _forked_ DuckDB, building the Python package may fail when your fork doesn't have the upstream tags. Pull them in as follows:
28+
29+
```bash
30+
# Check your remotes
31+
git remote -v
32+
33+
# If you don't see upstream git@github.com:duckdb/duckdb.git, then add it
34+
git remote add upstream git@github.com:duckdb/duckdb.git
35+
36+
# Now you can pull & push the tags
37+
git fetch --tags upstream
38+
git push --tags
39+
```
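
To check up front whether your working copy has any tags at all, a small script along these lines could help. This is a minimal sketch that assumes `git` is on your `PATH`; the helper names `parse_tags` and `list_tags` are made up for illustration and are not part of DuckDB's tooling.

```python
import subprocess


def parse_tags(git_output: str) -> list[str]:
    """Split the output of `git tag --list` into non-empty tag names."""
    return [line.strip() for line in git_output.splitlines() if line.strip()]


def list_tags(repo_dir: str = ".") -> list[str]:
    """Return the tags git knows about in repo_dir (hypothetical helper).

    Returns an empty list when git is missing or repo_dir is not a repository.
    """
    try:
        result = subprocess.run(
            ["git", "tag", "--list"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # git is not installed or not on PATH
        return []
    if result.returncode != 0:
        return []
    return parse_tags(result.stdout)


if __name__ == "__main__":
    tags = list_tags()
    if tags:
        print(f"Found {len(tags)} tags")
    else:
        print("No tags found: fetch them from upstream before building")
```

Run it from the root of the source tree. An empty result suggests the tags still need to be fetched.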
40+
41+
## 2. Python Virtual Env
42+
43+
For everything described here you will need a suitable Python installation. While you technically might be able to use your system Python, we **strongly** recommend you use a Python virtual environment. A virtual environment isolates dependencies and, depending on the tooling you use, gives you control over which Python interpreter you use. This way you don't pollute your system-wide Python with the different packages you need for your projects.
44+
45+
Our examples below use Python's built-in `venv` module, which might (or might not!) work for you. We also **strongly** recommend using a tool like [astral uv](https://docs.astral.sh/uv/) (or Poetry, conda, etc.) that lets you manage _both_ Python interpreter versions and virtual environments.
46+
47+
Create and activate a virtual env as follows:
48+
49+
```bash
50+
# Create a virtual environment in the .venv folder (in the duckdb source root)
51+
$ python3 -m venv --prompt duckdb .venv
52+
53+
# Activate the virtual env
54+
$ source .venv/bin/activate
55+
```
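
If you are unsure whether the virtual env is actually active, a quick check like the one below can confirm it. This is a sketch, not part of the DuckDB tooling: the function name `in_virtualenv` is hypothetical, and the comparison of `sys.prefix` with `sys.base_prefix` applies to environments created with `venv` (conda environments are not detected this way).

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"interpreter: {sys.executable}")
    print(f"virtual env active: {in_virtualenv()}")
```

With the env from the example above activated, the interpreter path should point into the `.venv` folder.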
56+
57+
Make sure you have a modern enough version of pip available in your virtual env:
58+
59+
```bash
60+
# Upgrade pip to the latest version
61+
$ python3 -m pip install --upgrade pip
62+
```
63+
64+
If that fails with `No module named pip` and you use `uv`, then run:
65+
66+
```bash
67+
# Install pip
68+
$ uv pip install pip
69+
```
70+
71+
# Building From Source
72+
73+
Below are a number of options to build the Python library from source, with or without debug symbols, and with a default or custom set of [extensions]({% link docs/stable/extensions/overview.md %}). Make sure to check out the [DuckDB build documentation]({% link docs/stable/dev/building/overview.md %}) if you run into trouble building the DuckDB main library.
74+
75+
## Default release, debug build or cloud storage
76+
77+
The following will build the package with the default set of extensions (`json`, `parquet`, `icu` and `core_functions`).
78+
79+
### Release build
80+
81+
```bash
82+
GEN=ninja BUILD_PYTHON=1 make release
83+
```
84+
85+
### Debug build
86+
87+
```bash
88+
GEN=ninja BUILD_PYTHON=1 make debug
89+
```
90+
91+
### Cloud Storage
92+
93+
You may need the package files to reside under the same prefix where the library is installed; e.g., when installing to cloud storage from a notebook.
94+
95+
First, get the repository based version number and extract the source distribution.
96+
97+
```bash
98+
python3 -m pip install build # required for PEP 517 compliant source dists
99+
cd tools/pythonpkg
100+
export SETUPTOOLS_SCM_PRETEND_VERSION=$(python3 -m setuptools_scm)
101+
pyproject-build . --sdist
102+
cd ../..
103+
```
104+
105+
Next, copy over the python package related files, and install the package.
106+
107+
```bash
108+
mkdir -p $DUCKDB_PREFIX/src/duckdb-pythonpkg
109+
tar --directory=$DUCKDB_PREFIX/src/duckdb-pythonpkg -xzpf tools/pythonpkg/dist/duckdb-${SETUPTOOLS_SCM_PRETEND_VERSION}.tar.gz
110+
pip install --prefix $DUCKDB_PREFIX -e $DUCKDB_PREFIX/src/duckdb-pythonpkg/duckdb-${SETUPTOOLS_SCM_PRETEND_VERSION}
111+
```
112+
113+
### Verify
114+
115+
```bash
116+
python3 -c "import duckdb; print(duckdb.sql('SELECT 42').fetchall())"
117+
```
118+
119+
## Adding extensions
120+
121+
Before you think about statically linking extensions, you should know that the Python package currently doesn't handle linked-in extensions very well. If you don't really need an extension baked in, then the advice is to stick to [installing them at runtime]({% link docs/stable/extensions/installing_extensions.md %}). See `tools/pythonpkg/duckdb_extension_config.cmake` for the default list of extensions that are built with the Python package. Any other extension should be considered problematic.
122+
123+
Having said that, if you do want to give it a try, here's how.
124+
125+
For more details on building DuckDB extensions look at the [documentation]({% link docs/stable/dev/building/building_extensions.md %}).
126+
127+
The DuckDB build process uses the following logic to decide which extensions to build:
128+
1. First compose the complete set of extensions that might be included in the build
129+
1. Then compose the complete set of extensions that should be excluded from the build
130+
1. Assemble the final set of extensions to be compiled by subtracting the set of excluded extensions from the set of included extensions.
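
The three steps above boil down to a set difference. The sketch below only illustrates that logic with made-up example values; it is not the actual CMake implementation, and `resolve_extensions` is a hypothetical name.

```python
def resolve_extensions(included: set[str], excluded: set[str]) -> list[str]:
    """Return the extensions to compile: included minus excluded, sorted."""
    return sorted(included - excluded)


# Example values only; the real sets come from the build mechanisms
# (config files, flags, skip lists) described on this page.
included = {"json", "parquet", "icu", "core_functions", "tpch"}
excluded = {"parquet"}
print(resolve_extensions(included, excluded))
# → ['core_functions', 'icu', 'json', 'tpch']
```

Note that excluding an extension that was never included has no effect, which is why the order of the first two steps doesn't matter.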
131+
132+
The following mechanisms add to the set of **_included_ extensions**:
133+
134+
| Mechanism | Syntax / Example |
135+
| --- | --- |
136+
| **Built-in extensions enabled by default** | `extension/extension_config.cmake` (≈30 built-ins) |
137+
| **Python package extensions enabled by default** | `tools/pythonpkg/duckdb_extension_config.cmake` (`json;parquet;icu`) |
138+
| **Semicolon-separated include list** | `DUCKDB_EXTENSIONS=fts;tpch;json` |
139+
| **Flags** | `BUILD_TPCH=1`, `BUILD_JEMALLOC=1`, `BUILD_FTS=1`, … |
140+
| **Presets** | `BUILD_ALL_EXT=1` - Build all in-tree extensions<br/>`BUILD_ALL_IT_EXT=1` - _Only_ build in-tree extensions<br/>`BUILD_ALL_OOT_EXT=1` - Build all out-of-tree extensions |
141+
| **Custom config file(s)** | `DUCKDB_EXTENSION_CONFIGS=path/to/my.cmake` |
142+
| **Core-only overrides** <br/>_only relevant with `DISABLE_BUILTIN_EXTENSIONS=1`_ | `CORE_EXTENSIONS=httpfs;fts` |
143+
144+
---
145+
146+
The following mechanisms add to the set of **_excluded_ extensions**:
147+
148+
| Mechanism | Syntax / Example |
149+
| --- | --- |
150+
| **Semicolon-separated skip list** | `SKIP_EXTENSIONS=parquet;jemalloc` |
151+
| **Flags** | `DISABLE_PARQUET=1`, `DISABLE_CORE_FUNCTIONS=1`, … |
152+
| **“No built-ins” switch** <br/>_Throws out *every* statically linked extension **except** `core_functions`. Use `CORE_EXTENSIONS=…` to whitelist a subset back in._ | `DISABLE_BUILTIN_EXTENSIONS=1` |
153+
154+
---
155+
156+
## Show all installed extensions
157+
158+
```bash
159+
python3 -c "import duckdb; print(duckdb.sql('SELECT extension_name, installed, description FROM duckdb_extensions();'))"
160+
```
161+
162+
# Development Environment
163+
164+
To set up the codebase for development, build DuckDB as follows:
165+
166+
```bash
167+
GEN=ninja BUILD_PYTHON=1 PYTHON_DEV=1 make debug
168+
```
169+
170+
This will take care of the following:
171+
* Builds both the main duckdb library and the python library with debug symbols
172+
* Generates a `compile_commands.json` file that includes CPython and pybind11 headers so that IntelliSense and clang-tidy checks work in your IDE
173+
* Installs the required Python dependencies in your virtual env
174+
175+
Once the build completes, do a sanity check to make sure everything works:
176+
177+
```bash
178+
python3 -c "import duckdb; print(duckdb.sql('SELECT 42').fetchall())"
179+
```
180+
181+
## Debugging
182+
183+
The basic recipe is to start `lldb` with your virtual env's Python interpreter and your script, then set a breakpoint and run your script.
184+
185+
For example, given a script `my_script.py` with the following contents:
186+
187+
```python
188+
import duckdb
189+
print(duckdb.sql("select * from range(1000)").df())
17190
```
18191

192+
The following should work:
193+
194+
```bash
195+
lldb -- .venv/bin/python3 my_script.py
196+
...
197+
# Set a breakpoint
198+
(lldb) br s -n duckdb::DuckDBPyRelation::FetchDF
199+
Breakpoint 1: no locations (pending).
200+
WARNING: Unable to resolve breakpoint to any actual locations.
201+
# The above warning is harmless - the library hasn't been imported yet
202+
203+
# Run the script
204+
(lldb) r
205+
...
206+
frame #0: 0x000000013025833c duckdb.cpython-310-darwin.so`duckdb::DuckDBPyRelation::FetchDF(this=0x00006000012f8d20, date_as_object=false) at pyrelation.cpp:808:7
207+
805 }
208+
806
209+
807 PandasDataFrame DuckDBPyRelation::FetchDF(bool date_as_object) {
210+
-> 808 if (!result) {
211+
809 if (!rel) {
212+
810 return py::none();
213+
811 }
214+
Target 0: (python3) stopped.
215+
```
216+
217+
## Debugging in an IDE / CLion
218+
219+
After creating a debug build with `PYTHON_DEV` enabled, you should be able to get debugging going in an IDE that supports `lldb`. Below are the instructions for CLion, but you should be able to get this going in e.g. VS Code as well.
220+
221+
### Configure the CMake Debug Profile
222+
223+
This is a prerequisite for debugging, and will enable IntelliSense and clang-tidy by generating a `compile_commands.json` file so your IDE knows how to inspect the source code. It also makes sure your Python virtual env can be found by your IDE's CMake.
224+
225+
Under `Settings | Build, Execution, Deployment | CMake` add the following CMake options:
226+
```console
227+
-DCMAKE_PREFIX_PATH=$CMakeProjectDir$/.venv;$CMAKE_PREFIX_PATH
228+
-DPython3_EXECUTABLE=$CMakeProjectDir$/.venv/bin/python3
229+
-DBUILD_PYTHON=1 -DPYTHON_DEV=1
230+
```
231+
232+
### Create a run config for debugging
233+
234+
Under Run -> Edit Configurations... create a new CMake Application. Use the following values:
235+
* Name: Python Debug
236+
* Target: `python_src` (it doesn't actually matter what you select here)
237+
* Program arguments: `$FilePath$`
238+
* Working directory: `$ProjectFileDir$`
239+
240+
That should be enough: Save and close.
241+
242+
Now you can set a breakpoint in a C++ file. You then open your Python script in your editor and use this config to start a debug session.
243+
244+
## Development and Stubs
245+
246+
`*.pyi` stubs in `duckdb-stubs` are manually maintained. The connection-related stubs are generated using dedicated scripts in `tools/pythonpkg/scripts/`:
247+
- `generate_connection_stubs.py`
248+
- `generate_connection_wrapper_stubs.py`
249+
250+
These stubs are important for autocomplete in many IDEs, as static-analysis based language servers can't introspect `duckdb`'s binary module.
251+
252+
To verify the stubs match the actual implementation:
253+
```bash
254+
python3 -m pytest tests/stubs
255+
```
256+
257+
If you add new methods to the DuckDB Python API, you'll need to manually add corresponding type hints to the stub files.
258+
259+
## What are `py::object` and `py::handle`?
260+
261+
These are classes provided by pybind11, the library we use to manage our interaction with the Python environment.
262+
`py::handle` is a thin wrapper around a raw `PyObject*` and does not manage any references.
263+
`py::object` wraps the same pointer but owns a reference: it decreases the refcount of the Python object when it goes out of scope.
264+
265+
How a `py::object` acquires that reference depends on how you create it. `py::reinterpret_borrow<py::object>(...)` treats the pointer as a borrowed reference: it increases the refcount on construction and decreases it again on destruction, so the net effect is zero. Use it when you hold a `py::handle` or a borrowed `PyObject*` and the prototype requires a `py::object`.
266+
267+
`py::reinterpret_steal<py::object>(...)` takes over a reference you already own: it does not increase the refcount on construction, but still decreases it when the `py::object` goes out of scope.
268+
269+
When directly calling CPython functions that return a `PyObject*`, check whether they return a new or a borrowed reference. Wrap new references in `py::reinterpret_steal` to take ownership of the returned object, and wrap borrowed references, such as the result of `PyDateTime_DATE_GET_TZINFO`, in `py::reinterpret_borrow`.
270+
271+
# Troubleshooting
272+
273+
## Pip fails with `No names found, cannot describe anything`
274+
275+
If you've forked DuckDB, building the Python package may fail when your fork doesn't have the upstream tags. Pull them in as follows:
276+
277+
```bash
278+
# Check your remotes
279+
git remote -v
280+
281+
# If you don't see upstream git@github.com:duckdb/duckdb.git, then add it
282+
git remote add upstream git@github.com:duckdb/duckdb.git
283+
284+
# Now you can pull & push the tags
285+
git fetch --tags upstream
286+
git push --tags
287+
```
288+
289+
## Building with the httpfs extension fails
290+
291+
The build fails on macOS when both the [`httpfs` extension]({% link docs/stable/extensions/httpfs/overview.md %}) and the Python package are included:
292+
19293
```console
20294
ld: library not found for -lcrypto
21295
clang: error: linker command failed with exit code 1 (use -v to see invocation)
@@ -24,34 +298,21 @@ ninja: build stopped: subcommand failed.
24298
make: *** [release] Error 1
25299
```
26300
27-
**Solution:**
28-
As stated above, avoid using the `BUILD_PYTHON` flag.
29-
Instead, first build the `httpfs` extension (if required), then build and install the Python package separately using pip:
301+
Linking in the httpfs extension is problematic. Please install it at runtime, if you can.
30302
31-
```batch
32-
GEN=ninja CORE_EXTENSIONS="httpfs" make
33-
python3 -m pip install tools/pythonpkg --use-pep517 --user
34-
```
303+
## Importing duckdb fails with `symbol not found in flat namespace`
35304
36-
If the second line complains about pybind11 being missing, or `--use-pep517` not being supported, make sure you're using a modern version of pip and setuptools.
37-
The default `python3-pip` on your OS may not be modern, so you may need to update it using:
305+
If you see an error that looks like this:
38306
39-
```batch
40-
python3 -m pip install pip -U
307+
```console
308+
ImportError: dlopen(/usr/bin/python3/site-packages/duckdb/duckdb.cpython-311-darwin.so, 0x0002): symbol not found in flat namespace '_MD5_Final'
41309
```
42310
43-
## `No module named 'duckdb.duckdb'` Build Error
44-
45-
**Problem:**
46-
Building the Python package succeeds but the package cannot be imported:
311+
... then you've probably tried to link in a problematic extension. As mentioned above, `tools/pythonpkg/duckdb_extension_config.cmake` contains the default list of extensions that are built with the Python package. Any other extension might cause problems.
47312
48-
```batch
49-
cd tools/pythonpkg/
50-
python3 -m pip install .
51-
python3 -c "import duckdb"
52-
```
313+
## Python fails with `No module named 'duckdb.duckdb'`
53314
54-
This returns the following error message:
315+
If you're in `tools/pythonpkg` and try to `import duckdb` you might see:
55316
56317
```console
57318
Traceback (most recent call last):
@@ -63,6 +324,4 @@ Traceback (most recent call last):
63324
ModuleNotFoundError: No module named 'duckdb.duckdb'
64325
```
65326
66-
**Solution:**
67-
The problem is caused by Python trying to import from the current working directory.
68-
To work around this, navigate to a different directory (e.g., `cd ..`) and try running Python import again.
327+
This is because Python imported from the `duckdb` directory (i.e. `tools/pythonpkg/duckdb/`), rather than from the installed package. You should start your interpreter from a different directory instead.
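
To see which location Python would pick up without actually importing the module, you could use something like the sketch below. The helper name `module_origin` is hypothetical; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # If this prints a path under tools/pythonpkg/duckdb/, Python is picking
    # up the source directory instead of the installed package.
    print(module_origin("duckdb"))
```

Running this from `tools/pythonpkg` and again from another directory should show two different paths if the working directory is shadowing the installed package.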
