Commit 523bb6a
feat: add functions for adding and replacing data directly with datafiles (apache#723)
If you want to write your own Parquet files and use Iceberg only to
handle the metadata, your main option is the `ReplaceDataFiles`
function.
This function takes a list of existing file paths and a list of new
file paths that replace the existing data.
This function works fine for the most part, but it includes a scan,
which means it does not take your word that your new Parquet files
match the table schema.
This scan is problematic when you are writing files very quickly and
using multipart uploads. You know the location of every file and know
that each one is a valid Parquet file, but the commit can still return
an error because a file might not be fully available at commit time.
The error looks something like this at commit time: `failed to replace
data files: error encountered during file conversion: parquet: could not
read 8 bytes from end of file`.
We have tested this out in vendor code and opened a fork that adds a new
function.
`ReplaceDataFiles` scans your file paths to ensure that the schema of
those files matches the schema of the table you are adding them to.
We, and I would assume a lot of people writing their own Parquet files,
don't need this. Our ingestion framework guarantees we will never get an
incorrect Parquet file, and we also have access to our Parquet schema
and Arrow schema for the entirety of the ingestion.
So I can build data files directly and would much rather pass my own
data files to this function, as I know the files will eventually be
available and will be correct. All the commit does is tell the metadata
where to find each file, so there is no real harm in committing before a
file is actually available unless you query it right away and it happens
not to be available yet.
This also speeds up commits tremendously, as the library no longer needs
to scan all of the files for every single commit.
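
To illustrate why a commit built from caller-supplied data files needs
no file I/O, here is a minimal Go sketch of a metadata-only replace. It
is a simplified model, not the library's implementation: `DataFile` and
`replaceDataFiles` are hypothetical stand-ins, and the snapshot is
reduced to a map keyed by file path.

```go
package main

import (
	"fmt"
	"sort"
)

// DataFile is a hypothetical, simplified stand-in for a data file entry
// in table metadata. The caller supplies the statistics up front.
type DataFile struct {
	Path        string
	RecordCount int64
	SizeBytes   int64
}

// replaceDataFiles returns a new file set with toDelete removed and
// toAdd registered. It only touches in-memory metadata and never opens
// the files, so it cannot fail on a file that is still uploading.
func replaceDataFiles(current map[string]DataFile, toDelete, toAdd []DataFile) (map[string]DataFile, error) {
	next := make(map[string]DataFile, len(current))
	for p, f := range current {
		next[p] = f
	}
	for _, f := range toDelete {
		if _, ok := next[f.Path]; !ok {
			return nil, fmt.Errorf("cannot delete %q: not in current snapshot", f.Path)
		}
		delete(next, f.Path)
	}
	for _, f := range toAdd {
		if _, ok := next[f.Path]; ok {
			return nil, fmt.Errorf("cannot add %q: already in snapshot", f.Path)
		}
		next[f.Path] = f
	}
	return next, nil
}

func main() {
	current := map[string]DataFile{
		"s3://bucket/a.parquet": {Path: "s3://bucket/a.parquet", RecordCount: 100, SizeBytes: 4096},
	}
	old := []DataFile{current["s3://bucket/a.parquet"]}
	repl := []DataFile{{Path: "s3://bucket/b.parquet", RecordCount: 100, SizeBytes: 2048}}

	next, err := replaceDataFiles(current, old, repl)
	if err != nil {
		fmt.Println("commit failed:", err)
		return
	}
	paths := make([]string, 0, len(next))
	for p := range next {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	fmt.Println(paths)
}
```

The trade-off in this model is that correctness of the schema and
statistics becomes the caller's responsibility, which matches the
guarantee described above for the ingestion framework.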
Co-authored-by: Adam Gaddis <adamtyler@cloudflare.com>
1 parent 414db82 commit 523bb6a
2 files changed, +757 -19 lines changed