Skip to content

Latest commit

 

History

History
165 lines (122 loc) · 5.88 KB

File metadata and controls

165 lines (122 loc) · 5.88 KB

pg_duckdb Syntax Guide & Gotchas

This guide provides a quick reference for the most common SQL patterns used with pg_duckdb, along with key behaviors and limitations to be aware of.

Syntax Reference

Create a table

-- This is a standard PostgreSQL table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    product_name TEXT,
    amount NUMERIC,
    order_date DATE
);

INSERT INTO orders (product_name, amount, order_date)
VALUES ('Laptop', 1200.00, '2024-07-01'),
       ('Keyboard', 75.50, '2024-07-01'),
       ('Mouse', 25.00, '2024-07-02');

Querying standard PostgreSQL tables

For analytical queries on your existing PostgreSQL tables, use standard SQL. No special syntax is needed. pg_duckdb automatically accelerates these queries when you configure duckdb.force_execution = true.

SET duckdb.force_execution = true;
-- Standard SELECT on a PostgreSQL table
SELECT
    category,
    AVG(price) as avg_price,
    COUNT(*) as item_count
FROM
    products -- This is a regular PostgreSQL table
GROUP BY
    category
ORDER BY
    avg_price DESC;

Querying external files (Parquet, CSV, etc.)

To query files from a data lake (e.g., S3, local storage), use the read_* functions. You must use the r['column_name'] syntax to access columns.

-- Query a single Parquet file
SELECT
    r['product_id'],
    r['review_text']
FROM
    read_parquet('s3://my-bucket/reviews.parquet') r -- 'r' is a required alias
LIMIT 100;

-- Query multiple CSV files using a glob pattern
SELECT
    r['timestamp'],
    r['event_type'],
    COUNT(*) as event_count
FROM
    read_csv('s3://my-datalake/logs/2024-*.csv') r
GROUP BY
    r['timestamp'],
    r['event_type'];

Hybrid queries (joining PostgreSQL and external data)

You can seamlessly join PostgreSQL tables with external data sources in a single query.

-- Join a local PostgreSQL 'customers' table with a remote Parquet file of 'orders'
SELECT
    c.customer_name,
    c.signup_date,
    SUM(r['order_total']) AS total_spent
FROM
    customers c -- This is a PostgreSQL table
JOIN
    read_parquet('s3://my-bucket/orders/*.parquet') r ON c.customer_id = r['customer_id']
WHERE
    c.status = 'active'
GROUP BY
    c.customer_name,
    c.signup_date
ORDER BY
    total_spent DESC;

Creating MotherDuck-backed tables

To persist the results of an analytical query, you can create a new table that uses MotherDuck to store the data in columnar format. Use the USING duckdb clause.

-- Create a new table 'sales_summary' with DuckDB storage
CREATE TABLE sales_summary USING duckdb AS
SELECT
    r['region'],
    r['product_category'],
    SUM(r['sales_amount']) AS total_sales
FROM
    read_parquet('s3://my-datalake/sales_data/year=2024/**/*.parquet') r
GROUP BY
    r['region'],
    r['product_category'];

-- Query the newly created analytical table
SELECT * FROM sales_summary WHERE region = 'North America';

When to use duckdb.query()

The duckdb.query() function is an advanced feature and is not needed for most queries. You should only use it when you need to run a query that uses DuckDB-specific syntax that is not valid in standard PostgreSQL.

Example (Using DuckDB's PIVOT statement):

-- This query uses DuckDB's PIVOT syntax, so it must be wrapped in duckdb.query()
SELECT * FROM duckdb.query($$
    PIVOT sales_summary
    ON product_category
    USING SUM(total_sales)
    GROUP BY region;
$$);

For all standard analytical queries, prefer the direct SQL syntax shown in the sections above.


Things to Know

This section outlines behaviors and limitations in pg_duckdb, things to know to use the extension effectively.

Querying

  • Standard SQL is the default. For analytical queries on your PostgreSQL tables, use standard SQL. pg_duckdb accelerates them automatically as long as you configure duckdb.force_excecution=true. No special functions are needed.

  • Use r['column'] for external files. When using read_parquet(), read_csv(), etc., you must access columns with the r['column_name'] syntax, where r is the alias for the function call.

  • duckdb.query() is for special cases. Only use duckdb.query() when you need DuckDB-specific syntax that PostgreSQL does not support, like PIVOT. It is not needed for regular queries.

  • CTE and subquery behavior. When using read_parquet() inside a CTE or subquery, column references can behave differently. You may need to explicitly alias your columns (SELECT r['col'] AS col) or continue using the r['col'] syntax outside the CTE.

  • Data type limitations. There are bunch of limitations in how data types behave because of differences between PostgreSQL and DuckDB

Transactions

  • Writes must be separate. You cannot write to both a PostgreSQL table and a DuckDB table in the same transaction. This will cause an error. Perform these operations in separate transactions.

  • DDL must be separate. You cannot mix CREATE TABLE statements for PostgreSQL and CREATE TABLE ... USING duckdb statements in the same transaction.

  • Avoid mixed transactions. There is a setting duckdb.unsafe_allow_mixed_transactions to bypass the separate-write rule. This is not recommended as it can lead to data inconsistency if one part of the transaction fails.

MotherDuck

  • Schema mapping is specific. MotherDuck databases and schemas are mapped to PostgreSQL schemas with a special naming convention. Your default database's main schema maps to public. Other schemas are mapped to names like ddb$<db_name>$<schema_name>.

Configuration

  • Forcing DuckDB execution. For a query that only involves PostgreSQL tables, pg_duckdb will not run it by default. To force it, you can set duckdb.force_execution = true.

  • File system access is restricted. By default, non-superusers cannot access the local file system. This is a security measure.