This guide provides a quick reference for the most common SQL patterns used with pg_duckdb, along with key behaviors and limitations to be aware of.
-- This is a standard PostgreSQL table
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
product_name TEXT,
amount NUMERIC,
order_date DATE
);
INSERT INTO orders (product_name, amount, order_date)
VALUES ('Laptop', 1200.00, '2024-07-01'),
('Keyboard', 75.50, '2024-07-01'),
('Mouse', 25.00, '2024-07-02');For analytical queries on your existing PostgreSQL tables, use standard SQL. No special syntax is needed. pg_duckdb automatically accelerates these queries when you configure duckdb.force_execution = true.
SET duckdb.force_execution = true;
-- Standard SELECT on a PostgreSQL table
SELECT
category,
AVG(price) as avg_price,
COUNT(*) as item_count
FROM
products -- This is a regular PostgreSQL table
GROUP BY
category
ORDER BY
avg_price DESC;To query files from a data lake (e.g., S3, local storage), use the read_* functions. You must use the r['column_name'] syntax to access columns.
-- Query a single Parquet file
SELECT
r['product_id'],
r['review_text']
FROM
read_parquet('s3://my-bucket/reviews.parquet') r -- 'r' is a required alias
LIMIT 100;
-- Query multiple CSV files using a glob pattern
SELECT
r['timestamp'],
r['event_type'],
COUNT(*) as event_count
FROM
read_csv('s3://my-datalake/logs/2024-*.csv') r
GROUP BY
r['timestamp'],
r['event_type'];You can seamlessly join PostgreSQL tables with external data sources in a single query.
-- Join a local PostgreSQL 'customers' table with a remote Parquet file of 'orders'
SELECT
c.customer_name,
c.signup_date,
SUM(r['order_total']) AS total_spent
FROM
customers c -- This is a PostgreSQL table
JOIN
read_parquet('s3://my-bucket/orders/*.parquet') r ON c.customer_id = r['customer_id']
WHERE
c.status = 'active'
GROUP BY
c.customer_name,
c.signup_date
ORDER BY
total_spent DESC;To persist the results of an analytical query, you can create a new table that uses MotherDuck to store the data in columnar format. Use the USING duckdb clause.
-- Create a new table 'sales_summary' with DuckDB storage
CREATE TABLE sales_summary USING duckdb AS
SELECT
r['region'],
r['product_category'],
SUM(r['sales_amount']) AS total_sales
FROM
read_parquet('s3://my-datalake/sales_data/year=2024/**/*.parquet') r
GROUP BY
r['region'],
r['product_category'];
-- Query the newly created analytical table
SELECT * FROM sales_summary WHERE region = 'North America';The duckdb.query() function is an advanced feature and is not needed for most queries. You should only use it when you need to run a query that uses DuckDB-specific syntax that is not valid in standard PostgreSQL.
Example (Using DuckDB's PIVOT statement):
-- This query uses DuckDB's PIVOT syntax, so it must be wrapped in duckdb.query()
SELECT * FROM duckdb.query($$
PIVOT sales_summary
ON product_category
USING SUM(total_sales)
GROUP BY region;
$$);For all standard analytical queries, prefer the direct SQL syntax shown in the sections above.
This section outlines behaviors and limitations in pg_duckdb, things to know to use the extension effectively.
-
Standard SQL is the default. For analytical queries on your PostgreSQL tables, use standard SQL.
pg_duckdbaccelerates them automatically as long as you configureduckdb.force_excecution=true. No special functions are needed. -
Use
r['column']for external files. When usingread_parquet(),read_csv(), etc., you must access columns with ther['column_name']syntax, whereris the alias for the function call. -
duckdb.query()is for special cases. Only useduckdb.query()when you need DuckDB-specific syntax that PostgreSQL does not support, likePIVOT. It is not needed for regular queries. -
CTE and subquery behavior. When using
read_parquet()inside a CTE or subquery, column references can behave differently. You may need to explicitly alias your columns (SELECT r['col'] AS col) or continue using ther['col']syntax outside the CTE. -
Data type limitations. There are bunch of limitations in how data types behave because of differences between PostgreSQL and DuckDB
-
Writes must be separate. You cannot write to both a PostgreSQL table and a DuckDB table in the same transaction. This will cause an error. Perform these operations in separate transactions.
-
DDL must be separate. You cannot mix
CREATE TABLEstatements for PostgreSQL andCREATE TABLE ... USING duckdbstatements in the same transaction. -
Avoid mixed transactions. There is a setting
duckdb.unsafe_allow_mixed_transactionsto bypass the separate-write rule. This is not recommended as it can lead to data inconsistency if one part of the transaction fails.
- Schema mapping is specific. MotherDuck databases and schemas are mapped to PostgreSQL schemas with a special naming convention. Your default database's
mainschema maps topublic. Other schemas are mapped to names likeddb$<db_name>$<schema_name>.
-
Forcing DuckDB execution. For a query that only involves PostgreSQL tables,
pg_duckdbwill not run it by default. To force it, you can setduckdb.force_execution = true. -
File system access is restricted. By default, non-superusers cannot access the local file system. This is a security measure.