Conversation

@sfc-gh-agedemenli commented Dec 11, 2025

Description

Use the return_stats option with COPY commands to collect column statistics, instead of issuing multiple queries.
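For context, DuckDB's COPY supports a return_stats option that makes the command return per-file statistics (file name, row count, per-column min/max) in its result set, so no follow-up queries against the just-written files are needed. A minimal sketch of how the generated COPY command might look with the option enabled; the variable names and exact command shape are illustrative, not the PR's actual code:

StringInfoData copyCommand;

initStringInfo(&copyCommand);
appendStringInfo(&copyCommand,
                 "COPY (%s) TO %s (FORMAT parquet, RETURN_STATS)",
                 subquerySql,
                 quote_literal_cstr(destinationPath));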


Checklist

  • I have tested my changes and added tests if necessary
  • I updated documentation if needed
  • I confirm that all my commits are signed off (DCO)

DCO Reminder (important)

This project uses the Developer Certificate of Origin (DCO).
DCO is a simple way for you to confirm that you wrote your code and that you have the right to contribute it.

If the DCO check fails, please sign off your commits.

How to sign off

For your last commit:
git commit --amend -s
git push --force

For multiple commits:
git rebase --signoff main
git push --force

More info: https://developercertificate.org/

@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch 4 times, most recently from 9900257 to 6a162c8 on December 12, 2025 10:01
copyModification->partition = modification->partition;
if (modification->fileStats != NULL)
{
copyModification->fileStats = DeepCopyDataFileStats(modification->fileStats);
@sfc-gh-abozkurt Dec 15, 2025
FYI: this is required because we reset the child dest receiver's memory context; modifications are persisted in the parent context.
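A rough sketch of the pattern being described; apart from the identifiers visible in the diff above (copyModification, modification, DeepCopyDataFileStats), the names here, notably parentContext, are assumptions:

/*
 * The child dest receiver's memory context is reset between flushes, so the
 * stats must be deep-copied into the parent context that owns the
 * modification list before that reset happens.
 */
MemoryContext oldContext = MemoryContextSwitchTo(parentContext);

copyModification->fileStats = DeepCopyDataFileStats(modification->fileStats);

MemoryContextSwitchTo(oldContext);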

&dataFileStats);

/* find which files were generated by DuckDB COPY */
List *dataFiles = NIL;
We should get rid of ListRemoteFileNames() here, as we do in the other place. We do not need to list the path anymore since we keep the file names in the DataFileStats list.

Yes, that's very important indeed, I just came here to write this, thanks @sfc-gh-abozkurt :)

So, one of the motivations for this PR is to avoid re-accessing data files that we have just written. It is especially critical for deployments with small cache sizes, as it can cause cache thrashing (e.g., some processes keep writing new files while the cache manager evicts unused ones; if we then re-access the files via ListRemoteFileNames() below, the cache manager has to fetch them all over again).



static void
ParseDuckdbColumnMinMaxFromText(const char *input, List **names, List **mins, List **maxs)
Can we use the infrastructure in pg_lake_engine/src/pgduck/type.c to parse the output value of RETURN_STATS, which is a map(varchar, map(varchar, varchar))?

if (useReturnStats && dataFileStats != NULL)
{
/* DuckDB returns COPY 0 when return_stats is used. */
*dataFileStats = GetDataFileStatsListFromPGResult(result, leafFields, schema);
Might be good to pass an int **rowCount to GetDataFileStatsListFromPGResult(result, leafFields, schema, rowCount) and remove the duplicate loop below.
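A sketch of the suggested signature; the exact out-parameter type and name are a judgement call here, not taken from the diff:

/*
 * Parse per-file stats out of the RETURN_STATS result and, in the same pass,
 * accumulate the total row count so the caller does not need a second loop
 * over the PGresult.
 */
static List *
GetDataFileStatsListFromPGResult(PGresult *result, List *leafFields,
                                 DataFileSchema *schema, int64 *totalRowCount);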

List *dataFileStats = GetDataFileStatsListFromPGResult(result, leafFields, schema);

Assert(dataFileStats != NIL);
*newFileStats = DeepCopyDataFileStats((DataFileStats *) linitial(dataFileStats));
Do we need a deep copy here? (If not, better to remove it.)

Yes, it doesn't seem necessary; we are copying from/to the same memory context anyway.

@sfc-gh-abozkurt commented Dec 15, 2025
We should not throw an error when min/max is missing; instead we should skip this column.

Now we should be able to get stats for uuid (previously we were unable to do so), but stats are no longer returned for the boolean type and for numeric with precision > 18 (stats are returned for all other numerics). See https://github.com/duckdb/duckdb/blob/main/extension/parquet/parquet_writer.cpp#L874

@sfc-gh-abozkurt
We should call ApplyColumnStatsMode on the returned column stats, as we do for remote stats.

@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch 4 times, most recently from 8d060da to 581168b on December 15, 2025 17:44
DataFileSchema * schema);
DataFileSchema * schema,
List *leafFields,
List **dataFileStats);
minor: there are a lot of arguments here, maybe slightly cleaner to have something like a ColumnStatsCollector struct that contains both of these lists.

+1 on this comment; we can also use it in other places such as WriteQueryResultTo and PerformDeleteFromParquet.
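A rough sketch of what such a collector could look like; the field names mirror the statsCollector->totalRowCount and statsCollector->dataFileStats usages that appear later in the diff, but the exact definition is an assumption:

/* bundles everything the COPY path collects about the written files */
typedef struct ColumnStatsCollector
{
    int64 totalRowCount;   /* total rows written by the COPY */
    List *dataFileStats;   /* list of DataFileStats *, one entry per written file */
} ColumnStatsCollector;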

@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch 3 times, most recently from b4faf92 to 47a17f3 on December 18, 2025 09:52
@sfc-gh-abozkurt
float +/-inf returns wrong stats. That is fixed in DuckDB 1.4.3, so it should be resolved after merging PR #119.

@sfc-gh-abozkurt

The duckdb patch PR #114 should add stats for the missing types (boolean and numeric with precision > 18) noted above.

@sfc-gh-abozkurt
return_stats returns stats for bytea, uuid, and geometry. Let's keep skipping those in this PR as well; we can try to enable them in a separate PR.

@sfc-gh-abozkurt
pg_lake_iceberg.enable_stats_collection_for_nested_types is used to write stats for nested types. return_stats also returns stats for nested types, but the column names are dot-separated. Either we need to fall back to remote parquet stats (the old method) or properly parse the column names.

@sfc-gh-abozkurt
The return_stats parser currently parses the result from the raw text output. It first strips backslashes and quotes, which breaks when values contain a quote or a backslash. Either we need to parse the result carefully, or better, parse it into a pg_map.

@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch 5 times, most recently from 71cee23 to 7230806 on December 23, 2025 09:51
Signed-off-by: Ahmet Gedemenli <ahmet.gedemenli@snowflake.com>
@sfc-gh-abozkurt commented Jan 7, 2026
It is impressive to see all the tests pass. We need a few more iterations to refine the implementation, but overall this is in the right direction.
I have done some scale tests with a 1600-column table generating hundreds of files per insert; all seems to work as expected.

OK, I have updated my test results. Maybe there is no memory leak, but there is definitely a jump when we get to processing the column stats. Is there a possibility to release memory earlier in the loop? Definitely worth checking: I have seen +5GB memory utilization while doing this, and the memory is not released until the query finishes. Sure, the query/table I used is an extreme case (1600 columns, thousands of files generated per insert), but it still shows there is room for improvement.

If you want to test with 1600 columns and many files, here is a starting point:

-- to generate more files per insert
SET pg_lake_table.target_file_size_mb TO '1MB';
SET pg_lake_iceberg.default_avro_writer_block_size_kb TO  '32MB';


-- create table with 1600 columns
DO $$
DECLARE
    i        INT;
    sql      TEXT;
    col_name TEXT;
BEGIN
    sql := 'CREATE TABLE long_table (';

    FOR i IN 1..1600 LOOP
        col_name := 'col' || i;
        sql := sql || col_name || ' TEXT ';

        IF i < 1600 THEN
            sql := sql || ',';
        END IF;
    END LOOP;

    sql := sql || ') USING iceberg WITH (column_stats_mode = ''full'');';

    RAISE NOTICE 'Creating table with SQL: %', sql;
    EXECUTE sql;
END
$$;

-- insert ~1.2M rows; change the count as you like
DO $$
DECLARE
    i   INT;
    sql TEXT;
BEGIN
    sql := 'INSERT INTO long_table SELECT ';

    FOR i IN 1..1600 LOOP
        -- Generate a random text value. This example uses MD5 of a random
        -- number; change this part if you want different random text.
        sql := sql || quote_literal(repeat(md5(random()::text), 10));

        IF i < 1600 THEN
            sql := sql || ',';
        END IF;
    END LOOP;

    sql := sql || ' FROM generate_series(0, 1200000) i';

    EXECUTE sql;
END
$$;

When I inspect pgduck_server's memory usage during the insertions, DuckDB actively uses 2-3GB of memory and goes down to 0 at the end (also verified via FROM duckdb_memory();). But the allocator allocates some big chunks that show up as resident memory in htop or top. Those are reusable memory chunks, allocated by the allocator and not given back to the OS immediately.

I think this could be common for queries that operate on big result sets. IMHO, such cases, where there are many fragmented (unusable) big chunks, call for a graceful shutdown and restart of pgduck_server.

Another observation though: our cache worker allocates up to ~15GB after the insertions have generated a bunch of files (~1500); that might need a fix.

Signed-off-by: Ahmet Gedemenli <ahmet.gedemenli@snowflake.com>
char *commandTuples = PQcmdTuples(result);
statsCollector = palloc0(sizeof(StatsCollector));
statsCollector->totalRowCount = atoll(commandTuples);
statsCollector->dataFileStats = NIL;
We are sure that a single file is generated when the destination format != DATA_FORMAT_PARQUET. That means we can fill statsCollector->dataFileStats here as a single-item list; we just need to pass the file path as an argument. Then we do not need to check and create dataFileStats in FlushChildDestReceiver, right?
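A rough sketch of that suggestion; apart from the identifiers visible in the diff above, the DataFileStats field names and the destinationPath argument are assumptions:

char *commandTuples = PQcmdTuples(result);

statsCollector = palloc0(sizeof(StatsCollector));
statsCollector->totalRowCount = atoll(commandTuples);

/* non-Parquet formats write exactly one file, so record it right here */
DataFileStats *fileStats = palloc0(sizeof(DataFileStats));
fileStats->path = pstrdup(destinationPath);
fileStats->rowCount = statsCollector->totalRowCount;

statsCollector->dataFileStats = list_make1(fileStats);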

@sfc-gh-okalaci left a comment
This looks pretty good to me; some minor comments before merging.


if (returnStatsMapId == InvalidOid)
ereport(ERROR, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("unexpected return_stats result %s", input)));
nit: I think we can have a better error message, something like: "Cannot find required map type for parsing return stats".

This is almost like an assert that no one would ever hit, but it is still better for the readability of the code.


if (leafField == NULL)
{
ereport(DEBUG3, (errmsg("leaf field with id %d not found in leaf fields, skipping", fieldId)));
nit: maybe use fieldName instead of fieldId here and below in the error message


if (minText != NULL && maxText != NULL)
{
*names = lappend(*names, pstrdup(colName));
pstrdup is not needed; TextDatumGetCString already returns a freshly palloc'd string.
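i.e., roughly (colNameDatum is an assumed variable name, not from the diff):

/* TextDatumGetCString already allocates a fresh copy, so no pstrdup needed */
*names = lappend(*names, TextDatumGetCString(colNameDatum));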

statsList = lappend(statsList, fileStats);
}

ColumnStatsCollector *statsCollector = palloc0(sizeof(ColumnStatsCollector));
We have a concept in the code, EnableHeavyAsserts, which we enable only in CI and use for any complex assertions we want.

So, I think we should probably add a heavy-assert function / code block here, which asserts that the old way of collecting the stats yields exactly the same stats.

Why do that? Because we still heavily rely on that method for external tables, and it is tested very lightly compared to this one. Given that both internal and external tables used the same logic, we refrained from adding enough tests for external table stats collection. If we ever see a divergence between this method and the old one, we can react accordingly.

Note that at this point we are sure they return the same results, otherwise the tests would have failed. But let's still be robust to future DuckDB changes -- which happen a lot.
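A rough sketch of what such a heavy-assert block could look like; apart from EnableHeavyAsserts, the helper names here are assumptions, not existing functions:

#ifdef USE_ASSERT_CHECKING
if (EnableHeavyAsserts)
{
    /* recompute the stats with the old (remote parquet footer) path ... */
    List *legacyStats = GetDataFileStatsFromRemoteParquet(dataFiles, schema);

    /* ... and verify RETURN_STATS produced exactly the same values */
    Assert(DataFileStatsListsEqual(legacyStats, statsCollector->dataFileStats));
}
#endif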

-- we prefer to create in the extension script to avoid concurrent attempts to create
-- the same map, which may throw errors
SELECT map_type.create('TEXT','TEXT');
SELECT map_type.create('TEXT','map_type.key_text_val_text');
Hmm, we cannot assume the type name for the first one is map_type.key_text_val_text. We should get the name from the output of map_type.create, something like:

WITH text_text_map_name AS (SELECT map_type.create('TEXT','TEXT') AS name) SELECT map_type.create('TEXT', name) AS text_map_of_text FROM text_text_map_name;

}
else
{
copyModification->fileStats =
maybe assert format != PARQUET here, within an assertion block; we should never call this for parquet. Roughly (the syntax is approximate):

#ifdef USE_ASSERT_CHECKING
PgLakeTableType tableType = GetPgLakeTableType(self->relationId);
FindDataFormatAndCompression(..., &format);

Assert(format != DATA_FORMAT_PARQUET);
#endif

Partition *partition = GetDataFilePartition(relationId, transforms, sourcePath,
&partitionSpecId);

Assert(statsCollector->dataFileStats != NIL);
maybe asserting len == 1 is better/safer; we should never have more than one entry.
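i.e., something like:

Assert(list_length(statsCollector->dataFileStats) == 1);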

* of type map(text,text).
*/
static void
ExtractMinMaxForAllColumns(Datum map, List **names, List **mins, List **maxs)
nit: maybe rename map -> returnStatsMap or similar to make it slightly easier to follow

#include "pg_lake/parquet/leaf_field.h"

extern bool EnableStatsCollectionForNestedTypes;
extern bool DeprecatedEnableStatsCollectionForNestedTypes;
instead of extern, we could make this variable static in pg_lake_iceberg/src/init.c; that's a more common pattern we have used in the past.
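i.e., roughly (just the variable declaration; the GUC registration around it is omitted and the initial value here is a placeholder):

/* file-local in pg_lake_iceberg/src/init.c instead of extern in the header */
static bool DeprecatedEnableStatsCollectionForNestedTypes = false;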

Signed-off-by: Ahmet Gedemenli <ahmet.gedemenli@snowflake.com>
@sfc-gh-agedemenli force-pushed the return-stats-for-column-stats branch from 24877d4 to fa8ae41 on January 9, 2026 12:35
Signed-off-by: Aykut Bozkurt <aykut.bozkurt@snowflake.com>
Signed-off-by: Aykut Bozkurt <aykut.bozkurt@snowflake.com>
Signed-off-by: Ahmet Gedemenli <ahmet.gedemenli@snowflake.com>
@sfc-gh-agedemenli merged commit 455c4f2 into main on Jan 9, 2026
63 checks passed
@sfc-gh-agedemenli deleted the return-stats-for-column-stats branch on January 9, 2026 13:45