Use return_stats option to collect column statistics #108
Conversation
Force-pushed from 9900257 to 6a162c8
    copyModification->partition = modification->partition;
    if (modification->fileStats != NULL)
    {
        copyModification->fileStats = DeepCopyDataFileStats(modification->fileStats);
FYI: this is required because we reset the child dest receiver's memory context; modifications are persisted in the parent context.
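A minimal sketch of the pattern being described, assuming a hypothetical parentContext variable (the actual context names in the dest receiver are not shown here):

    /* Sketch: copy the stats into the parent context so they survive the
     * reset of the child dest receiver's memory context. */
    MemoryContext oldContext = MemoryContextSwitchTo(parentContext); /* assumed name */
    copyModification->fileStats = DeepCopyDataFileStats(modification->fileStats);
    MemoryContextSwitchTo(oldContext);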
        &dataFileStats);

    /* find which files were generated by DuckDB COPY */
    List *dataFiles = NIL;
we should get rid of ListRemoteFileNames() here as we do in other places. We do not need to list the path anymore since we keep the file names in the DataFileStats list.
Yes, that's very important indeed, I just came here to write this, thanks @sfc-gh-abozkurt :)
So, one of the motivations of this PR is to avoid re-accessing data files that we have just written. It is especially critical for deployments with small cache sizes, as it can cause churn in the cache (e.g., some processes keep writing new files while the cache manager is evicting unused ones, and if we then re-access the files via ListRemoteFileNames() below, the cache manager needs to fetch them again, causing a mess).
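A minimal sketch of the suggestion, assuming DataFileStats carries the file path (field name is an assumption):

    /* Sketch: build the file list from the stats DuckDB already returned,
     * instead of re-listing the remote path via ListRemoteFileNames(). */
    List *dataFiles = NIL;
    ListCell *cell;
    foreach(cell, dataFileStats)
    {
        DataFileStats *stats = (DataFileStats *) lfirst(cell);
        dataFiles = lappend(dataFiles, stats->path); /* field name assumed */
    }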
    static void
    ParseDuckdbColumnMinMaxFromText(const char *input, List **names, List **mins, List **maxs)
Can we use the infrastructure in pg_lake_engine/src/pgduck/type.c to parse the output value of RETURN_STATS, which is map(varchar, map(varchar, varchar))?
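A hypothetical sketch of the shape this could take; ParseDuckdbValueToDatum is a made-up name for illustration, not the actual type.c API:

    /* Sketch: parse the RETURN_STATS text into a typed map Datum, then walk
     * it, instead of hand-stripping quotes and backslashes from raw text. */
    Datum statsMap = ParseDuckdbValueToDatum(input, returnStatsMapId); /* hypothetical */
    ExtractMinMaxForAllColumns(statsMap, names, mins, maxs);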
    if (useReturnStats && dataFileStats != NULL)
    {
        /* DuckDB returns COPY 0 when return_stats is used. */
        *dataFileStats = GetDataFileStatsListFromPGResult(result, leafFields, schema);
It might be good to pass int **rowCount to GetDataFileStatsListFromPGResult(result, leafFields, schema, rowCount) and remove the duplicate loop below.
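A minimal sketch of the suggested signature (the exact pointer type for the row count is an assumption):

    /* Sketch: return the total row count from the same pass over the result,
     * so the caller does not need a second loop. */
    static List *
    GetDataFileStatsListFromPGResult(PGresult *result, List *leafFields,
                                     DataFileSchema *schema, int64 *rowCount);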
    List *dataFileStats = GetDataFileStatsListFromPGResult(result, leafFields, schema);

    Assert(dataFileStats != NIL);
    *newFileStats = DeepCopyDataFileStats((DataFileStats *) linitial(dataFileStats));
Do we need a deep copy here? (If not, better to remove it.)
Yes, it doesn't seem necessary; we are copying from/to the same memory context anyway.
We should not throw an error when min/max is missing; instead we should skip that column. Now we should be able to get stats for uuid (previously we were unable to do so), but the stats are not returned for the boolean type and

We should call
Force-pushed from 8d060da to 581168b
    DataFileSchema * schema);
    DataFileSchema * schema,
    List *leafFields,
    List **dataFileStats);
minor: there are a lot of arguments here; it might be slightly cleaner to have something like a ColumnStatsCollector struct that contains both of these lists.
+1 on this comment; we could use it in other places too, such as WriteQueryResultTo and PerformDeleteFromParquet.
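A minimal sketch of such a struct; the exact layout is an assumption, with fields named after the totalRowCount/dataFileStats usage elsewhere in this PR:

    /* Sketch: bundle per-file stats and the total row count so callers pass
     * one pointer instead of several list arguments. */
    typedef struct ColumnStatsCollector
    {
        int64 totalRowCount;  /* rows reported by COPY */
        List *dataFileStats;  /* list of DataFileStats, one per written file */
    } ColumnStatsCollector;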
Force-pushed from b4faf92 to 47a17f3
float ±inf returns wrong stats. That is fixed in DuckDB 1.4.3, so it should be resolved after merging PR #119.

DuckDB patch PR #114 should add stats for the missing types.

return_stats returns stats for bytea, uuid and geometry. Let's keep skipping those in this PR as well; we can try to enable them in a separate PR.

The return_stats parser currently parses the raw text result. It first removes backslashes and quotes, which breaks when the values themselves contain a quote or a backslash (e.g., a text minimum like a\"b would be mangled). Either we need to parse the result carefully or, better, try to parse the result into a pg_map.
Force-pushed from 71cee23 to 7230806
When I inspect pgduck_server's memory usage during the insertions, DuckDB actively uses 2-3GB of memory and goes down to 0 at the end (which is also verified by …). I think this could be common for queries that operate on big result sets. IMHO, such cases, where there are many fragmented (unusable) big chunks, need a graceful shutdown and restart of pgduck_server. Another observation, though: our cache worker allocates up to ~15GB after the insertions generated a bunch of files (~1500); that might need a fix.
    char *commandTuples = PQcmdTuples(result);
    statsCollector = palloc0(sizeof(StatsCollector));
    statsCollector->totalRowCount = atoll(commandTuples);
    statsCollector->dataFileStats = NIL;
We are sure a single file is generated when the destination format != DATA_FORMAT_PARQUET. That means we can fill statsCollector->dataFileStats here as a single-item list; we would need to pass the file path as an argument. Then we do not need to check and create dataFileStats in FlushChildDestReceiver, right?
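A minimal sketch of that idea, assuming a filePath argument and path/rowCount fields on DataFileStats (all assumptions):

    /* Sketch: non-parquet formats write exactly one file, so build the
     * single-item stats list here instead of in FlushChildDestReceiver. */
    DataFileStats *fileStats = palloc0(sizeof(DataFileStats));
    fileStats->path = pstrdup(filePath);                  /* assumed argument */
    fileStats->rowCount = statsCollector->totalRowCount;  /* assumed field */
    statsCollector->dataFileStats = list_make1(fileStats);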
sfc-gh-okalaci left a comment
This looks pretty good to me; some minor comments before merging.
    if (returnStatsMapId == InvalidOid)
        ereport(ERROR, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                        errmsg("unexpected return_stats result %s", input)));
nit: I think we can have a better error message, something like: "cannot find required map type for parsing return stats".
This is almost like an assert that no one would ever hit, but it is still better for the readability of the code.
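For instance, keeping the errcode from the original hunk, the suggestion could look like:

    ereport(ERROR, (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                    errmsg("cannot find required map type for parsing return stats")));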
    if (leafField == NULL)
    {
        ereport(DEBUG3, (errmsg("leaf field with id %d not found in leaf fields, skipping", fieldId)));
nit: maybe use fieldName instead of fieldId here and below in the error message
    if (minText != NULL && maxText != NULL)
    {
        *names = lappend(*names, pstrdup(colName));
pstrdup is not needed; TextDatumGetCString already returns a freshly palloc'd copy.
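i.e., the fix is just (sketch):

    *names = lappend(*names, colName); /* TextDatumGetCString already palloc'd this */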
        statsList = lappend(statsList, fileStats);
    }

    ColumnStatsCollector *statsCollector = palloc0(sizeof(ColumnStatsCollector));
We have a concept in the code, EnableHeavyAsserts, which we enable only in CI, and where we add any complex assertions we want.
So I think we should probably add a heavy assert function / code block that asserts that the old way of collecting the stats yields exactly the same stats here.
Why do that? Because we still heavily rely on that method for external tables, and it is tested very lightly compared to this one. Given that both internal and external tables used the same logic, we refrained from adding enough tests for external table stats collection. If we ever see a divergence between this method and the old one, we can react accordingly.
Note that at this point we are sure they return the same results, otherwise tests would have failed. But let's still be robust to any future DuckDB changes -- which happen a lot.
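A minimal sketch of what such a block could look like; the legacy-path and comparison helpers below are made-up names for illustration:

    /* Sketch: under heavy asserts, recompute the stats the old way (separate
     * queries) and check that the return_stats path agrees. */
    #ifdef USE_ASSERT_CHECKING
    if (EnableHeavyAsserts)
    {
        List *legacyStats = CollectDataFileStatsViaQueries(relationId, dataFiles);   /* hypothetical */
        Assert(DataFileStatsListsEqual(legacyStats, statsCollector->dataFileStats)); /* hypothetical */
    }
    #endif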
    -- we prefer to create in the extension script to avoid concurrent attempts to create
    -- the same map, which may throw errors
    SELECT map_type.create('TEXT','TEXT');
    SELECT map_type.create('TEXT','map_type.key_text_val_text');
hmmm, we cannot assume the type name for the first one is map_type.key_text_val_text. We should get the name from the output of map_type.create, something like:

    WITH text_text_map_name AS (SELECT map_type.create('TEXT','TEXT') AS name)
    SELECT map_type.create('TEXT', name) AS text_map_of_text FROM text_text_map_name;

    }
    else
    {
        copyModification->fileStats =
maybe assert format != PARQUET here, within an assertion block; we should never call this for parquet. Roughly:

    #ifdef USE_ASSERT_CHECKING
    PgLakeTableType tableType = GetPgLakeTableType(self->relationId);
    FindDataFormatAndCompression(..., &format);
    Assert(format != DATA_FORMAT_PARQUET);
    #endif
    Partition *partition = GetDataFilePartition(relationId, transforms, sourcePath,
                                                &partitionSpecId);

    Assert(statsCollector->dataFileStats != NIL);
maybe asserting len == 1 is better/safer; we should never have more than one entry.
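For example (sketch, using the standard list API):

    Assert(list_length(statsCollector->dataFileStats) == 1);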
     * of type map(text,text).
     */
    static void
    ExtractMinMaxForAllColumns(Datum map, List **names, List **mins, List **maxs)
nit: maybe rename map -> returnStatsMap or some such to make it slightly easier to follow
    #include "pg_lake/parquet/leaf_field.h"

    extern bool EnableStatsCollectionForNestedTypes;
    extern bool DeprecatedEnableStatsCollectionForNestedTypes;
Instead of extern, we could make this variable static in pg_lake_iceberg/src/init.c; that's a more common pattern we have used in the past.
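i.e., something like this in init.c (sketch; the initial value is an assumption):

    /* Keep the GUC backing variable private to this translation unit. */
    static bool DeprecatedEnableStatsCollectionForNestedTypes = false;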
Force-pushed from 24877d4 to fa8ae41
Description

Use the return_stats option with COPY commands to collect column statistics, instead of issuing multiple queries.

Checklist
DCO Reminder (important)
This project uses the Developer Certificate of Origin (DCO).
DCO is a simple way for you to confirm that you wrote your code and that you have the right to contribute it.
If the DCO check fails, please sign off your commits.
How to sign off
For your last commit:
git commit --amend -s
git push --force
For multiple commits:
git rebase --signoff main
git push --force
More info: https://developercertificate.org/