[clickpipes] Add FAQ section for Postgres CDC (#3006)

iskakaushik · web-flow · commit ac4ed65d023c · 2025-01-06T10:46:25.000-06:00
diff --git a/docs/en/integrations/data-ingestion/clickpipes/postgres/faq.md b/docs/en/integrations/data-ingestion/clickpipes/postgres/faq.md
@@ -0,0 +1,25 @@
+---
+sidebar_label: ClickPipes for Postgres FAQ
+description: Frequently asked questions about ClickPipes for Postgres.
+slug: /en/integrations/clickpipes/postgres/faq
+sidebar_position: 2
+---
+
+# ClickPipes for Postgres FAQ
+
+### How does idling affect my Postgres CDC ClickPipe?
+
+If your ClickHouse Cloud service is idling, your Postgres CDC ClickPipe will continue to sync data. Your service will wake up at the next sync interval to handle the incoming data. Once the sync is finished and the idle period is reached, your service will go back to idling.
+
+As an example, if your sync interval is set to 30 minutes and your service idle time is set to 10 minutes, your service will wake up every 30 minutes, stay active for about 10 minutes after the sync finishes, and then go back to idling.
+
+
+### How are TOAST columns handled in ClickPipes for Postgres?
+
+Please refer to the [Handling TOAST Columns](./toast) page for more information.
+
+
+### How are generated columns handled in ClickPipes for Postgres?
+
+Please refer to the [Postgres Generated Columns: Gotchas and Best Practices](./generated_columns) page for more information.
+
diff --git a/docs/en/integrations/data-ingestion/clickpipes/postgres/index.md b/docs/en/integrations/data-ingestion/clickpipes/postgres/index.md
@@ -134,6 +134,8 @@ Once the connection details are filled in, click on "Next".
 
 Once you've moved data from Postgres to ClickHouse, the next obvious question is how to model your data in ClickHouse to make the most of it. Please refer to this page on [ClickHouse Data Modeling Tips for Postgres users](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling) to help you model data in ClickHouse.
 
+Also, please refer to the [ClickPipes for Postgres FAQ](./postgres/faq) for more information about common issues and how to resolve them.
+
 :::info
 
 [This](https://docs.peerdb.io/bestpractices/clickhouse_datamodeling) is especially important as ClickHouse differs from Postgres, and you might encounter some surprises. This guide helps address potential pitfalls and ensures you can take full advantage of ClickHouse.
diff --git a/docs/en/integrations/data-ingestion/clickpipes/postgres/postgres_generated_columns.md b/docs/en/integrations/data-ingestion/clickpipes/postgres/postgres_generated_columns.md
@@ -0,0 +1,28 @@
+---
+title: "Postgres Generated Columns: Gotchas and Best Practices"
+slug: /en/integrations/clickpipes/postgres/generated_columns
+---
+
+When using PostgreSQL's generated columns in tables that are being replicated, there are some important considerations to keep in mind. These gotchas can affect the replication process and data consistency in your destination systems.
+
+## The Problem with Generated Columns
+
+1. **Not Published via pgoutput:** Generated columns are not published through the pgoutput logical replication plugin. This means that when you're replicating data from PostgreSQL to another system, the values of generated columns are not included in the replication stream.
+
+2. **Issues with Primary Keys:** If a generated column is part of your primary key, it can cause deduplication problems on the destination. Since the generated column values are not replicated, the destination system won't have the necessary information to properly identify and deduplicate rows.
+
+## Best Practices
+
+To work around these limitations, consider the following best practices:
+
+1. **Recreate Generated Columns on the Destination:** Instead of relying on the replication process to handle generated columns, it's recommended to recreate these columns on the destination using tools like dbt (data build tool) or other data transformation mechanisms.
+
+2. **Avoid Using Generated Columns in Primary Keys:** When designing tables that will be replicated, it's best to avoid including generated columns as part of the primary key.
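+
+As a starting point for the practices above, the following sketch lists the generated columns of a table on the source database. It assumes PostgreSQL 12 or later (where `pg_attribute.attgenerated` exists), and `your_table_name` is a placeholder:
+
+```sql
+-- List generated columns of a table (assumes PostgreSQL 12+).
+-- attgenerated is '' for regular columns and 's' for stored generated columns.
+SELECT a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) AS data_type
+FROM pg_attribute a
+JOIN pg_class c ON a.attrelid = c.oid
+WHERE c.relname = 'your_table_name'  -- placeholder table name
+  AND a.attnum > 0
+  AND NOT a.attisdropped
+  AND a.attgenerated <> '';
+```
+
+Any column returned here will not arrive through CDC, so it is a candidate for being recomputed on the destination and should be kept out of the primary key.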
+
+## Upcoming improvements to UI
+
+In upcoming versions, we are planning to add a UI to help users with the following:
+
+1. **Identify Tables with Generated Columns:** The UI will have a feature to identify tables that contain generated columns. This will help users understand which tables are affected by this issue.
+
+2. **Documentation and Best Practices:** The UI will include best practices for using generated columns in replicated tables, including guidance on how to avoid common pitfalls.
diff --git a/docs/en/integrations/data-ingestion/clickpipes/postgres/toast.md b/docs/en/integrations/data-ingestion/clickpipes/postgres/toast.md
@@ -0,0 +1,63 @@
+---
+title: "ClickPipes for Postgres: Handling TOAST Columns"
+description: Learn how to handle TOAST columns when replicating data from PostgreSQL to ClickHouse.
+slug: /en/integrations/clickpipes/postgres/toast
+---
+
+When replicating data from PostgreSQL to ClickHouse, it's important to understand the limitations and special considerations for TOAST (The Oversized-Attribute Storage Technique) columns. This guide will help you identify and properly handle TOAST columns in your replication process.
+
+## What are TOAST columns in PostgreSQL?
+
+TOAST (The Oversized-Attribute Storage Technique) is PostgreSQL's mechanism for handling large field values. When a row exceeds the TOAST threshold (typically about 2 kB, though this can vary depending on the PostgreSQL version and build settings), PostgreSQL automatically compresses large field values or moves them into a separate TOAST table, storing only a pointer in the main table.
+
+It's important to note that during Change Data Capture (CDC), unchanged TOAST columns are not included in the replication stream. This can lead to incomplete data replication if not handled properly.
+
+During the initial load (snapshot), all column values, including TOAST columns, will be replicated correctly regardless of their size. The limitations described in this guide primarily affect the ongoing CDC process after the initial load.
+
+You can read more about TOAST and its implementation in PostgreSQL here: https://www.postgresql.org/docs/current/storage-toast.html
+
+## Identifying TOAST columns in a table
+
+To identify if a table has TOAST columns, you can use the following SQL query:
+
+```sql
+SELECT a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) as data_type
+FROM pg_attribute a
+JOIN pg_class c ON a.attrelid = c.oid
+WHERE c.relname = 'your_table_name'
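+  AND a.attlen = -1            -- variable-length types only
+  AND a.attstorage != 'p'      -- 'p' (plain) storage is never TOASTed
+  AND a.attnum > 0 AND NOT a.attisdropped;  -- skip system and dropped columns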
+```
+
+This query returns the names and data types of columns that could potentially be TOASTed. Note that it only identifies columns that are eligible for TOAST storage based on their data type and storage attributes. Whether a given value is actually TOASTed depends on the content stored: only rows that exceed the TOAST threshold have values moved out of line.
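+
+One rough way to check whether a table currently holds any out-of-line data is to look at the size of its associated TOAST relation. This is a sketch: `your_table_name` is a placeholder, and a non-zero size can also reflect dead rows that have not yet been vacuumed.
+
+```sql
+-- Show the TOAST relation backing a table and its on-disk size.
+-- reltoastrelid is 0 when the table has no TOAST relation at all.
+SELECT c.relname,
+       c.reltoastrelid::regclass AS toast_relation,
+       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
+FROM pg_class c
+WHERE c.relname = 'your_table_name'  -- placeholder table name
+  AND c.reltoastrelid <> 0;
+```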
+
+## Ensuring proper handling of TOAST columns
+
+To ensure that TOAST columns are handled correctly during replication, you should set the `REPLICA IDENTITY` of the table to `FULL`. This tells PostgreSQL to include the full old row in the WAL for UPDATE and DELETE operations, ensuring that all column values (including TOAST columns) are available for replication.
+
+You can set the `REPLICA IDENTITY` to `FULL` using the following SQL command:
+
+```sql
+ALTER TABLE your_table_name REPLICA IDENTITY FULL;
+```
+
+Refer to [this blog post](https://xata.io/blog/replica-identity-full-performance) for performance considerations when setting `REPLICA IDENTITY FULL`.
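+
+To confirm which setting is currently in effect, the `relreplident` column of `pg_class` can be inspected. A minimal sketch, with `your_table_name` as a placeholder:
+
+```sql
+-- relreplident: 'd' = default (primary key), 'f' = full,
+--               'i' = index, 'n' = nothing
+SELECT relname, relreplident
+FROM pg_class
+WHERE relname = 'your_table_name';  -- placeholder table name
+```
+
+A value of `f` indicates that `REPLICA IDENTITY FULL` is set.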
+
+## Replication behavior when REPLICA IDENTITY FULL is not set
+
+If `REPLICA IDENTITY FULL` is not set for a table with TOAST columns, replication to ClickHouse behaves as follows:
+
+1. For INSERT operations, all columns (including TOAST columns) will be replicated correctly.
+
+2. For UPDATE operations:
+   - If a TOAST column is not modified, its value will appear as NULL or empty in ClickHouse.
+   - If a TOAST column is modified, it will be replicated correctly.
+
+3. For DELETE operations, TOAST column values will appear as NULL or empty in ClickHouse.
+
+These behaviors can lead to data inconsistencies between your PostgreSQL source and ClickHouse destination. Therefore, it's crucial to set `REPLICA IDENTITY FULL` for tables with TOAST columns to ensure accurate and complete data replication.
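+
+As an illustration of the UPDATE case above, consider a hypothetical `articles` table (the table, its columns, and the values are made up for this sketch) that still uses the default replica identity:
+
+```sql
+-- Hypothetical table. EXTERNAL storage disables compression, so the
+-- large body value inserted below is stored out of line in the TOAST table.
+CREATE TABLE articles (id bigint PRIMARY KEY, title text, body text);
+ALTER TABLE articles ALTER COLUMN body SET STORAGE EXTERNAL;
+
+-- INSERT: every column, including the large body, is in the stream.
+INSERT INTO articles VALUES (1, 'Draft', repeat('x', 100000));
+
+-- UPDATE that leaves body untouched: the stream carries the new title
+-- but not the unchanged body value.
+UPDATE articles SET title = 'Final' WHERE id = 1;
+```
+
+Per the list above, the row in ClickHouse would show `body` as NULL or empty after the second statement unless `REPLICA IDENTITY FULL` is set on `articles`.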
+
+## Conclusion
+
+Properly handling TOAST columns is essential for maintaining data integrity when replicating from PostgreSQL to ClickHouse. By identifying TOAST columns and setting the appropriate `REPLICA IDENTITY`, you can ensure that your data is replicated accurately and completely.
diff --git a/sidebars.js b/sidebars.js
@@ -603,6 +603,7 @@ const sidebars = {
           collapsible: true,
           items: [
             "en/integrations/data-ingestion/clickpipes/postgres/index",
+            "en/integrations/data-ingestion/clickpipes/postgres/faq",
             {
               type: "category",
               label: "Source",
