Conversation

@sfc-gh-agedemenli commented Dec 11, 2025

Description

Use the return_stats option with COPY commands to collect column statistics, instead of issuing multiple queries.
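For context, DuckDB's COPY supports a return_stats option that makes the command return per-file statistics (file name, row count, per-column min/max) in its result set, so no follow-up queries against the just-written files are needed. A minimal sketch of how the generated COPY command might look with the option enabled; the variable names and exact command shape are illustrative, not the PR's actual code:

StringInfoData copyCommand;

initStringInfo(&copyCommand);
appendStringInfo(&copyCommand,
                 "COPY (%s) TO %s (FORMAT parquet, RETURN_STATS)",
                 subquerySql,
                 quote_literal_cstr(destinationPath));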


Checklist

  • I have tested my changes and added tests if necessary
  • I updated documentation if needed
  • I confirm that all my commits are signed off (DCO)

DCO Reminder (important)

This project uses the Developer Certificate of Origin (DCO).
DCO is a simple way for you to confirm that you wrote your code and that you have the right to contribute it.

If the DCO check fails, please sign off your commits.

How to sign off

For your last commit:
git commit --amend -s
git push --force

For multiple commits:
git rebase --signoff main
git push --force

More info: https://developercertificate.org/

@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch 4 times, most recently from 9900257 to 6a162c8 on December 12, 2025 10:01
copyModification->partition = modification->partition;
if (modification->fileStats != NULL)
{
copyModification->fileStats = DeepCopyDataFileStats(modification->fileStats);
@sfc-gh-abozkurt Dec 15, 2025
FYI: this is required because we reset the child dest receiver's memory context; modifications are persisted in the parent context.
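A rough sketch of the pattern being described; apart from the identifiers visible in the diff above (copyModification, modification, DeepCopyDataFileStats), the names here, notably parentContext, are assumptions:

/*
 * The child dest receiver's memory context is reset between flushes, so the
 * stats must be deep-copied into the parent context that owns the
 * modification list before that reset happens.
 */
MemoryContext oldContext = MemoryContextSwitchTo(parentContext);

copyModification->fileStats = DeepCopyDataFileStats(modification->fileStats);

MemoryContextSwitchTo(oldContext);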

&dataFileStats);

/* find which files were generated by DuckDB COPY */
List *dataFiles = NIL;
We should get rid of ListRemoteFileNames() here, as we do in the other place. We do not need to list the path anymore since we keep the file names in the DataFileStats list.

Yes, that's very important indeed, I just came here to write this, thanks @sfc-gh-abozkurt :)

So, one of the motivations for this PR is to avoid re-accessing data files that we have just written. It is especially critical for deployments with small cache sizes, as it can cause cache thrashing (e.g., some processes keep writing new files while the cache manager evicts unused ones; if we then re-access the files via ListRemoteFileNames() below, the cache manager has to fetch them all over again).



static void
ParseDuckdbColumnMinMaxFromText(const char *input, List **names, List **mins, List **maxs)
Can we use the infrastructure in pg_lake_engine/src/pgduck/type.c to parse the output value of RETURN_STATS, which is a map(varchar, map(varchar, varchar))?

if (useReturnStats && dataFileStats != NULL)
{
/* DuckDB returns COPY 0 when return_stats is used. */
*dataFileStats = GetDataFileStatsListFromPGResult(result, leafFields, schema);
Might be good to pass an int **rowCount to GetDataFileStatsListFromPGResult(result, leafFields, schema, rowCount) and remove the duplicate loop below.
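A sketch of the suggested signature; the exact out-parameter type and name are a judgement call here, not taken from the diff:

/*
 * Parse per-file stats out of the RETURN_STATS result and, in the same pass,
 * accumulate the total row count so the caller does not need a second loop
 * over the PGresult.
 */
static List *
GetDataFileStatsListFromPGResult(PGresult *result, List *leafFields,
                                 DataFileSchema *schema, int64 *totalRowCount);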

List *dataFileStats = GetDataFileStatsListFromPGResult(result, leafFields, schema);

Assert(dataFileStats != NIL);
*newFileStats = DeepCopyDataFileStats((DataFileStats *) linitial(dataFileStats));
Do we need a deep copy here? (If not, better to remove it.)

Yes, it doesn't seem necessary; we are copying from/to the same memory context anyway.

@sfc-gh-abozkurt commented Dec 15, 2025
We should not throw an error when min/max is missing; instead we should skip this column.

Now we should be able to get stats for uuid (previously we were unable to do so), but stats are no longer returned for the boolean type and for numeric with precision > 18 (stats are returned for all other numerics). See https://github.com/duckdb/duckdb/blob/main/extension/parquet/parquet_writer.cpp#L874

@sfc-gh-abozkurt
We should call ApplyColumnStatsMode on the returned column stats, as we do for remote stats.

@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch 4 times, most recently from 8d060da to 581168b on December 15, 2025 17:44
DataFileSchema * schema);
DataFileSchema * schema,
List *leafFields,
List **dataFileStats);
minor: there are a lot of arguments here, maybe slightly cleaner to have something like a ColumnStatsCollector struct that contains both of these lists.

+1 on this comment; we can also use it in other places such as WriteQueryResultTo and PerformDeleteFromParquet.
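A rough sketch of what such a collector could look like; the field names mirror the statsCollector->totalRowCount and statsCollector->dataFileStats usages that appear later in the diff, but the exact definition is an assumption:

/* bundles everything the COPY path collects about the written files */
typedef struct ColumnStatsCollector
{
    int64 totalRowCount;   /* total rows written by the COPY */
    List *dataFileStats;   /* list of DataFileStats *, one entry per written file */
} ColumnStatsCollector;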

@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch 3 times, most recently from b4faf92 to 47a17f3 on December 18, 2025 09:52
@sfc-gh-abozkurt
float +/-inf returns wrong stats. That is fixed in DuckDB 1.4.3, so it should be resolved after merging PR #119.

@sfc-gh-abozkurt

The duckdb patch PR #114 should add stats for the missing types (boolean and numeric with precision > 18) noted above.

@sfc-gh-abozkurt
return_stats returns stats for bytea, uuid, and geometry. Let's keep skipping those in this PR as well; we can try to enable them in a separate PR.

@sfc-gh-abozkurt
pg_lake_iceberg.enable_stats_collection_for_nested_types is used to write stats for nested types. return_stats also returns stats for nested types, but the column names are dot-separated. Either we need to fall back to remote parquet stats (the old method) or properly parse the column names.

@sfc-gh-abozkurt
The return_stats parser currently parses the result from the raw text output. It first strips backslashes and quotes, which breaks when values contain a quote or a backslash. Either we need to parse the result carefully, or better, parse it into a pg_map.

@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch 5 times, most recently from 71cee23 to 7230806 on December 23, 2025 09:51
Signed-off-by: Ahmet Gedemenli <ahmet.gedemenli@snowflake.com>
@sfc-gh-abozkurt commented Jan 7, 2026
It is impressive to see all the tests pass. We need a few more iterations to refine the implementation, but overall this is in the right direction.
I have done some scale tests with a 1600-column table generating hundreds of files per insert; all seems to work as expected.

OK, I have updated my test results. Maybe there is no memory leak, but there is definitely a jump when we get to processing the column stats. Is there a possibility to release memory earlier in the loop? Definitely worth checking: I have seen +5GB memory utilization while doing this, and the memory is not released until the query finishes. Sure, the query/table I used is an extreme case (1600 columns, thousands of files generated per insert), but it still shows there is room for improvement.

If you want to test with 1600 columns and many files, here is a starting point:

-- to generate more files per insert
SET pg_lake_table.target_file_size_mb TO '1MB';
SET pg_lake_iceberg.default_avro_writer_block_size_kb TO  '32MB';


-- create table with 1600 columns
DO $$
DECLARE
    i        INT;
    sql      TEXT;
    col_name TEXT;
BEGIN
    sql := 'CREATE TABLE long_table (';

    FOR i IN 1..1600 LOOP
        col_name := 'col' || i;
        sql := sql || col_name || ' TEXT ';

        IF i < 1600 THEN
            sql := sql || ',';
        END IF;
    END LOOP;

    sql := sql || ') USING iceberg WITH (column_stats_mode = ''full'');';

    RAISE NOTICE 'Creating table with SQL: %', sql;
    EXECUTE sql;
END
$$;

-- insert ~1.2M rows; change the count as you like
DO $$
DECLARE
    i   INT;
    sql TEXT;
BEGIN
    sql := 'INSERT INTO long_table SELECT ';

    FOR i IN 1..1600 LOOP
        -- Generate a random text value. This example uses MD5 of a random
        -- number; change this part if you want different random text.
        sql := sql || quote_literal(repeat(md5(random()::text), 10));

        IF i < 1600 THEN
            sql := sql || ',';
        END IF;
    END LOOP;

    sql := sql || ' FROM generate_series(0, 1200000) i';

    EXECUTE sql;
END
$$;

When I inspect pgduck_server's memory usage during the insertions, DuckDB actively uses 2-3GB of memory and goes down to 0 at the end (also verified via FROM duckdb_memory();). But the allocator allocates some big chunks that show up as resident memory in htop or top. Those are reusable memory chunks, allocated by the allocator and not given back to the OS immediately.

I think this could be common for queries that operate on big result sets. IMHO, such cases, where there are many fragmented (unusable) big chunks, call for a graceful shutdown and restart of pgduck_server.

Another observation though: our cache worker allocates up to ~15GB after the insertions have generated a bunch of files (~1500); that might need a fix.

Signed-off-by: Ahmet Gedemenli <ahmet.gedemenli@snowflake.com>
char *commandTuples = PQcmdTuples(result);
statsCollector = palloc0(sizeof(StatsCollector));
statsCollector->totalRowCount = atoll(commandTuples);
statsCollector->dataFileStats = NIL;
We are sure that a single file is generated when the destination format != DATA_FORMAT_PARQUET. That means we can fill statsCollector->dataFileStats here as a single-item list; we just need to pass the file path as an argument. Then we do not need to check and create dataFileStats in FlushChildDestReceiver, right?
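A rough sketch of that suggestion; apart from the identifiers visible in the diff above, the DataFileStats field names and the destinationPath argument are assumptions:

char *commandTuples = PQcmdTuples(result);

statsCollector = palloc0(sizeof(StatsCollector));
statsCollector->totalRowCount = atoll(commandTuples);

/* non-Parquet formats write exactly one file, so record it right here */
DataFileStats *fileStats = palloc0(sizeof(DataFileStats));
fileStats->path = pstrdup(destinationPath);
fileStats->rowCount = statsCollector->totalRowCount;

statsCollector->dataFileStats = list_make1(fileStats);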

@sfc-gh-okalaci left a comment
This looks pretty good to me; some minor comments before merging.


if (returnStatsMapId == InvalidOid)
ereport(ERROR, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("unexpected return_stats result %s", input)));
nit: I think we can have a better error message, something like: "Cannot find required map type for parsing return stats".

This is almost like an assert that no one would ever hit, but it is still better for the readability of the code.


if (leafField == NULL)
{
ereport(DEBUG3, (errmsg("leaf field with id %d not found in leaf fields, skipping", fieldId)));
nit: maybe use fieldName instead of fieldId here and below in the error message


if (minText != NULL && maxText != NULL)
{
*names = lappend(*names, pstrdup(colName));
pstrdup is not needed; TextDatumGetCString already returns a freshly palloc'd string.
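i.e., roughly (colNameDatum is an assumed variable name, not from the diff):

/* TextDatumGetCString already allocates a fresh copy, so no pstrdup needed */
*names = lappend(*names, TextDatumGetCString(colNameDatum));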

statsList = lappend(statsList, fileStats);
}

ColumnStatsCollector *statsCollector = palloc0(sizeof(ColumnStatsCollector));
We have a concept in the code, EnableHeavyAsserts, which we enable only in CI and use for any complex assertions we want.

So, I think we should probably add a heavy-assert function / code block here, which asserts that the old way of collecting the stats yields exactly the same stats.

Why do that? Because we still heavily rely on that method for external tables, and it is tested very lightly compared to this one. Given that both internal and external tables used the same logic, we refrained from adding enough tests for external table stats collection. If we ever see a divergence between this method and the old one, we can react accordingly.

Note that at this point we are sure they return the same results, otherwise the tests would have failed. But let's still be robust to future DuckDB changes -- which happen a lot.
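A rough sketch of what such a heavy-assert block could look like; apart from EnableHeavyAsserts, the helper names here are assumptions, not existing functions:

#ifdef USE_ASSERT_CHECKING
if (EnableHeavyAsserts)
{
    /* recompute the stats with the old (remote parquet footer) path ... */
    List *legacyStats = GetDataFileStatsFromRemoteParquet(dataFiles, schema);

    /* ... and verify RETURN_STATS produced exactly the same values */
    Assert(DataFileStatsListsEqual(legacyStats, statsCollector->dataFileStats));
}
#endif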

-- we prefer to create in the extension script to avoid concurrent attempts to create
-- the same map, which may throw errors
SELECT map_type.create('TEXT','TEXT');
SELECT map_type.create('TEXT','map_type.key_text_val_text');
Hmm, we cannot assume the type name for the first one is map_type.key_text_val_text. We should get the name from the output of map_type.create, something like:

WITH text_text_map_name AS (SELECT map_type.create('TEXT','TEXT') AS name) SELECT map_type.create('TEXT', name) AS text_map_of_text FROM text_text_map_name;

}
else
{
copyModification->fileStats =
maybe assert format != PARQUET here, within an assertion block; we should never call this for parquet. Roughly (the syntax is approximate):

#ifdef USE_ASSERT_CHECKING
PgLakeTableType tableType = GetPgLakeTableType(self->relationId);
FindDataFormatAndCompression(..., &format);

Assert(format != DATA_FORMAT_PARQUET);
#endif

Partition *partition = GetDataFilePartition(relationId, transforms, sourcePath,
&partitionSpecId);

Assert(statsCollector->dataFileStats != NIL);
maybe asserting len == 1 is better/safer; we should never have more than one entry.
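i.e., something like:

Assert(list_length(statsCollector->dataFileStats) == 1);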

* of type map(text,text).
*/
static void
ExtractMinMaxForAllColumns(Datum map, List **names, List **mins, List **maxs)
nit: maybe rename map -> returnStatsMap or similar to make it slightly easier to follow

#include "pg_lake/parquet/leaf_field.h"

extern bool EnableStatsCollectionForNestedTypes;
extern bool DeprecatedEnableStatsCollectionForNestedTypes;
instead of extern, we could make this variable static in pg_lake_iceberg/src/init.c; that's a more common pattern we have used in the past.
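i.e., roughly (just the variable declaration; the GUC registration around it is omitted and the initial value here is a placeholder):

/* file-local in pg_lake_iceberg/src/init.c instead of extern in the header */
static bool DeprecatedEnableStatsCollectionForNestedTypes = false;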

Signed-off-by: Ahmet Gedemenli <ahmet.gedemenli@snowflake.com>
@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch from 24877d4 to fa8ae41 on January 9, 2026 12:35
Signed-off-by: Aykut Bozkurt <aykut.bozkurt@snowflake.com>
Signed-off-by: Aykut Bozkurt <aykut.bozkurt@snowflake.com>
Signed-off-by: Ahmet Gedemenli <ahmet.gedemenli@snowflake.com>
@sfc-gh-agedemenli merged commit 455c4f2 into main on Jan 9, 2026
63 checks passed
@sfc-gh-agedemenli deleted the return-stats-for-column-stats branch on January 9, 2026 13:45