pg_ducklake

PostgreSQL Extension for DuckLake


This project is under heavy development and is not yet ready for production use.

pg_ducklake brings a native datalake experience into PostgreSQL, powered by DuckLake (a DuckDB lakehouse format with SQL catalog metadata and open Parquet data files).

Key Features

  • Managed DuckLake tables: Create/Write/Query DuckLake tables in PostgreSQL via SQL (e.g., psql/JDBC).
  • DuckDB compatibility: tables created by pg_ducklake are directly queryable from DuckDB clients.
  • Cloud storage: store data files in AWS S3 (or GCS, R2) to decouple storage and compute for serverless analytics.
  • Fast analytics: columnar storage + DuckDB vectorized execution, with hybrid queries over PostgreSQL heap tables supported.
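The hybrid-query capability above can be sketched as a join across both storage formats in the same database; the table and column names here are illustrative:

```sql
-- A regular PostgreSQL heap table holding reference data.
CREATE TABLE customers (id INT PRIMARY KEY, name TEXT);

-- A DuckLake column-store table holding event data.
CREATE TABLE events (customer_id INT, amount NUMERIC) USING ducklake;

-- Hybrid query: join the heap table with the DuckLake table in one statement.
SELECT c.name, sum(e.amount) AS total
FROM customers c
JOIN events e ON e.customer_id = c.id
GROUP BY c.name;
```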

See it in action

Your first Data Lake in PostgreSQL

-- Or use AWS S3 as data storage.
-- SET ducklake.default_table_path = 's3://my-bucket/prefix/';

CREATE TABLE my_table (
    id INT,
    name TEXT,
    age INT
) USING ducklake;

INSERT INTO my_table VALUES (1, 'Alice', 25), (2, 'Bob', 30);

SELECT * FROM my_table;

Access your data with DuckDB

INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:postgres:dbname=postgres host=localhost' AS my_ducklake (METADATA_SCHEMA 'ducklake');
SELECT * FROM my_ducklake.public.my_table;

For cloud storage (AWS S3 or Azure Blob Storage), see the Secrets Management guide.

Quick Start

Docker

Run PostgreSQL with pg_ducklake pre-installed in a Docker container:

docker run -d -e POSTGRES_PASSWORD=duckdb --name pgducklake pgducklake/pgducklake:18-main

docker exec -it pgducklake psql

Compile from source

Requirements:

  • PostgreSQL: 14, 15, 16, 17, 18
  • Operating Systems: Ubuntu 22.04-24.04, macOS

git clone https://github.com/relytcloud/pg_ducklake
cd pg_ducklake
make install

See compilation guide for detailed instructions.

Use cases

Convert a PostgreSQL heap table into a DuckLake table

This pattern performs a one-time ETL copy from row-store (PostgreSQL heap) tables to DuckLake (column-store) tables for fast analytics, while OLTP continues to use the original heap tables.

-- Create a PostgreSQL row-store (heap) table.
CREATE TABLE row_store_table AS
SELECT i AS id, 'hello pg_ducklake' AS msg
FROM generate_series(1, 10000) AS i;

-- Create a DuckLake column-store table via ETL.
CREATE TABLE col_store_table USING ducklake AS
SELECT *
FROM row_store_table;

-- Run analytics against the converted table.
SELECT max(id) FROM col_store_table;

Load an external dataset

External datasets (e.g., CSV/Parquet) can be ingested with DuckDB readers and materialized as tables for analytics.

CREATE TABLE titanic USING ducklake AS
SELECT * FROM read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv');

SELECT "Pclass", "Sex", COUNT(*), AVG("Survived") AS survival_rate
FROM titanic
GROUP BY "Pclass", "Sex";

Query DuckDB-created DuckLake tables from PostgreSQL

DuckLake tables created by DuckDB clients (CLI, Python, etc.) that use the current PostgreSQL instance as their metadata catalog can be queried directly from PostgreSQL using the Foreign Data Wrapper (FDW). This enables scenarios where data engineers create and manage tables via DuckDB tools, while application developers query them seamlessly through PostgreSQL.

Example scenario: A data pipeline uses DuckDB CLI to create DuckLake tables from Parquet files in S3, storing metadata in PostgreSQL. Your PostgreSQL application can then query these tables using standard SQL without needing DuckDB clients.
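As an illustration only, the setup might look like the sketch below. The wrapper name `ducklake_fdw`, the server name, and the option names are assumptions, not the real API; the DuckLake FDW User Guide has the authoritative syntax.

```sql
-- Hypothetical sketch: all names and options below are illustrative.
-- A DuckDB pipeline has already stored DuckLake metadata in this PostgreSQL
-- instance; expose those tables to PostgreSQL queries via the FDW.
CREATE SERVER ducklake_srv
    FOREIGN DATA WRAPPER ducklake_fdw          -- assumed wrapper name
    OPTIONS (metadata_schema 'ducklake');      -- assumed option name

IMPORT FOREIGN SCHEMA main
    FROM SERVER ducklake_srv
    INTO public;

-- Application code then queries the DuckDB-created table with plain SQL.
SELECT count(*) FROM public.my_table;
```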

See the DuckLake FDW User Guide for setup and usage details.

Documentation

Roadmap

Features

  • CREATE TABLE / CREATE TABLE ... AS for DuckLake tables
  • INSERT / SELECT / DELETE / UPDATE for DuckLake tables
  • Online schema evolution (ADD COLUMN / DROP COLUMN / type promotion)
  • Time-travel queries
  • Partitioned tables
  • Read-only pg_ducklake tables referencing shared DuckLake datasets (e.g., frozen DuckLake)
  • Table maintenance (e.g., compaction / GC) via PostgreSQL (e.g., VACUUM or UDFs) [1]
  • HTAP support for incremental row-store → column-store conversion (PostgreSQL heap → DuckLake)
  • Complex types

[1]: Table maintenance can also be carried out by standalone DuckDB clients (preferable, since it is serverless and avoids burdening the PostgreSQL server); pg_ducklake still plans to expose these operations for ease of use.

Performance

  • Native inlined (heap) table for small writes
  • Better transaction concurrency model (based on PostgreSQL XID)
  • Faster metadata operations via PostgreSQL native functions (e.g., SPI)

Docs

  • Access control behavior for DuckLake tables [2]

[2]: DuckLake tables are exposed via PostgreSQL table access methods (AM), so PostgreSQL table/column privileges may already apply; the current behavior and gaps will be reviewed and documented. DuckLake itself relies on its metadata service for ACL management.

Contributing

We welcome contributions! Please see the contributing guide.

Acknowledgments

This project is forked from pg_duckdb and integrates with DuckLake.
