feat: Lance schema evolution (add column, type promotion) #17904
base: master
Conversation
@the-other-tim-brown @voonhous Sharing some recent findings to give context to the current implementation and to why we maintain separate classes instead of reusing the existing ones.

Currently, for our Parquet schema evolution support, we rely on Spark's native Parquet reader to handle columns that do not exist in the file, as well as the null padding. When running a basic schema evolution test with Parquet as the base file format, I see it enter the following code:
The `missingColumns` variable documents the following behavior around null padding: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spar[…]xecution/datasources/parquet/VectorizedParquetRecordReader.java

Unfortunately, the Lance file reader does not behave this way and does not handle null padding. For example, if you give the Lance file reader a column that does not exist in the file, it will immediately error out, unlike Parquet. For a quick test I tried passing the … I also do not believe that the … After I explicitly provided a …
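As a rough illustration of the padding behavior described above (all names here are hypothetical sketches, not Spark or Hudi APIs), a reader can track which required columns are absent from the file and emit nulls for them instead of erroring out, which mirrors what Spark's vectorized Parquet reader does via its missing-columns bookkeeping:

```java
import java.util.*;

// Hypothetical sketch: pad columns absent from the file with nulls instead
// of failing, analogous to Spark's VectorizedParquetRecordReader behavior.
public class NullPaddingSketch {
  // Required column names that the file does not contain.
  static Set<String> missingColumns(List<String> fileSchema, List<String> requiredSchema) {
    Set<String> missing = new LinkedHashSet<>(requiredSchema);
    fileSchema.forEach(missing::remove);
    return missing;
  }

  // Project a file row (keyed by column name) into the required schema,
  // emitting null for every missing column.
  static Object[] project(Map<String, Object> fileRow, List<String> requiredSchema) {
    Object[] out = new Object[requiredSchema.size()];
    for (int i = 0; i < requiredSchema.size(); i++) {
      out[i] = fileRow.getOrDefault(requiredSchema.get(i), null); // null padding
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> fileSchema = List.of("id", "name");
    List<String> requiredSchema = List.of("id", "name", "age"); // "age" added later
    System.out.println(missingColumns(fileSchema, requiredSchema)); // [age]
    Map<String, Object> row = Map.of("id", 1, "name", "a");
    System.out.println(Arrays.toString(project(row, requiredSchema))); // [1, a, null]
  }
}
```

This is only a name-based sketch; the real readers operate on columnar batches, but the missing-column detection and null padding follow the same shape.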
Yeap, the Parquet reader does null padding. This is why …

This occurs during an operation like this:

schema_v1: …
schema operation: …
schema_v2: …

When reading the latest snapshot, Hudi will pass this schema to the Parquet reader for the older filegroups of the latest snapshot that were written before the schema evolution operation: …

This will cause … The unsafe projection makes sense to me: it returns null literals of the column's latest snapshot data type.
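A minimal sketch of the null-literal projection idea mentioned above (all names hypothetical; Spark's actual unsafe projections operate on InternalRow, not maps): each output column is either a bound reference into the file row, or a constant null for a column that did not exist when the file was written:

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch: build per-column expressions for reading a file written
// with an older schema under the latest snapshot schema. Columns missing from
// the file schema are bound to a null literal (of the latest type, in reality).
public class NullLiteralProjection {
  static List<Function<Map<String, Object>, Object>> bind(List<String> fileSchema,
                                                          List<String> readSchema) {
    List<Function<Map<String, Object>, Object>> exprs = new ArrayList<>();
    for (String col : readSchema) {
      if (fileSchema.contains(col)) {
        exprs.add(row -> row.get(col)); // bound reference into the file row
      } else {
        exprs.add(row -> null);         // null literal for the evolved column
      }
    }
    return exprs;
  }

  public static void main(String[] args) {
    // File written under schema_v1 (id, name); read under schema_v2 (id, name, age).
    List<Function<Map<String, Object>, Object>> exprs =
        bind(List.of("id", "name"), List.of("id", "name", "age"));
    Map<String, Object> row = Map.of("id", 7, "name", "x");
    Object[] out = exprs.stream().map(e -> e.apply(row)).toArray();
    System.out.println(Arrays.toString(out)); // [7, x, null]
  }
}
```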
```scala
// Create lance schema evolution helper
val evolution = new LanceBasicSchemaEvolution(
  fileSchema,
  requiredSchema,
```
Just a nit: is there a difference between requestSchema and requiredSchema? We should keep the nomenclature the same. I traced the schema code two years ago, and the number of *Schema variables got me seeing stars.

Although there are LLMs that can help us read through and break down code now, it is still good practice not to rename variables that mean the same thing, and to keep nomenclature consistent across the entire codebase as much as possible.
Was able to dig further into this memory leak issue yesterday, which helped expose some bugs in the existing code as well.
Currently, if an exception occurs (such as the reader not padding with null), the actual root-cause exception is thrown in … However, this exception ends up getting suppressed due to the try/catch logic we have.

Now, coming to why we still get a memory leak exception instead of eventually printing the original exception stacktrace here: https://github.com/apache/hudi/pull/17904/files#diff-bdccaaaeb061abdf550efec86661f9d3790c66d53e04b1ed2e9cf9a61ea06e13R142

This is because we prematurely close the …, even though technically the lifecycle of the allocator should be handled by the … Since the … I have a fix for this flow that I will try to put out, so that resources are closed correctly when exceptions occur.
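A small sketch of a resource-closing pattern that avoids masking the original exception (names are hypothetical; in the real code the resource would be an Arrow allocator): with try-with-resources, the resource is closed exactly once on both the success and failure paths, and any exception thrown during close is attached as a suppressed exception instead of replacing the root cause:

```java
// Hypothetical sketch, not the actual Hudi code: close the allocator-like
// resource so a read failure is not hidden by close-time errors or leak checks.
public class SafeCloseSketch {
  static class Allocator implements AutoCloseable {
    boolean closed = false;
    @Override public void close() { closed = true; } // released exactly once
  }

  static String readWithAllocator(boolean fail) {
    // try-with-resources: close() runs after the body, on every path; if it
    // threw while the body was failing, its exception would be suppressed
    // onto the original one rather than replacing it.
    try (Allocator allocator = new Allocator()) {
      if (fail) {
        throw new IllegalStateException("original read failure");
      }
      return "ok";
    }
  }

  public static void main(String[] args) {
    System.out.println(readWithAllocator(false)); // ok
    try {
      readWithAllocator(true);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // the real cause is preserved
    }
  }
}
```

The same idea applies when the close must happen in a catch/finally by hand: attach the secondary failure via `addSuppressed` on the original exception rather than letting it escape.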

Describe the issue this Pull Request addresses
Issue: #17627
Summary and Changelog
To rebase on master once the following Lance MOR support PR has landed: #17768
- Add NULL padding handling in `SparkLanceReaderBase` for when an older file schema does not contain a newer column from a recent table schema change.
- Add `testSchemaEvolutionAddColumn` in `TestLanceDataSource`.
None
Risk Level
Low
Documentation Update
None
Contributor's checklist