Commit c15e010

Delta Tables in Amazon S3 destination connector: new output format (#674)
1 parent a665298 commit c15e010

1 file changed: +25 -2 lines changed

snippets/general-shared-text/delta-table.mdx

Lines changed: 25 additions & 2 deletions
@@ -10,7 +10,7 @@ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; pic
 allowfullscreen
 ></iframe>
 
-The preceding video does not show how to create an AWS account or an S3 bucket.
+The preceding video does not show how to create an AWS account.
 
 For more information about requirements, see the following:
 
@@ -88,4 +88,27 @@ import S3BucketCloudFormation from '/snippets/general-shared-text/s3-cf-setup.md
 
 import S3BucketCLI from '/snippets/general-shared-text/s3-cli-setup.mdx';
 
-<S3BucketCLI />
+<S3BucketCLI />
+
+## Delta table output format
+
+A Delta table consists of Parquet files that contain data and a transaction log that stores metadata about the transactions.
+[Learn more](https://delta-io.github.io/delta-rs/how-delta-lake-works/architecture-of-delta-table/).
+
+The Delta Tables in Amazon S3 destination connector generates the following output within the specified path to the S3 bucket (or the specified folder within the bucket):
+
+- Initially, one Parquet (`.parquet`) file per file in the source location. For example, for a file in the source location named `my-file.pdf`, an associated
+file with the extension `.parquet` is generated. Various kinds of file transactions can result in additional Parquet files being generated. These Parquet filenames are automatically generated by the Delta Lake engine and are not meant to be manually modified.
+- A folder named `_delta_log` that contains metadata and change history about the `.parquet` files. As Parquet files are added to, changed, or removed from
+the specified bucket or folder path, the `_delta_log` folder is updated with any related metadata and change history details.
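The output layout described above can be checked with a short script. The following is a minimal sketch, not part of the commit: it assumes you already have a list of object keys from the bucket (for example, from a listing tool of your choice), and the function name `summarize_delta_layout` and the sample keys are hypothetical.

```python
from pathlib import PurePosixPath


def summarize_delta_layout(keys, table_prefix):
    """Group object keys under table_prefix into Parquet data files,
    _delta_log entries, and anything else found in the table folder."""
    prefix = table_prefix.strip("/")
    data_files, log_entries, other = [], [], []
    for key in keys:
        path = PurePosixPath(key.strip("/"))
        try:
            rel = path.relative_to(prefix) if prefix else path
        except ValueError:
            continue  # key lies outside the table folder
        if not rel.parts:
            continue  # the key is the folder marker itself
        if rel.parts[0] == "_delta_log":
            log_entries.append(str(rel))
        elif rel.suffix == ".parquet":
            data_files.append(str(rel))
        else:
            other.append(str(rel))
    return {
        "data_files": sorted(data_files),
        "log_entries": sorted(log_entries),
        "other": sorted(other),
    }


if __name__ == "__main__":
    # Hypothetical keys, for illustration only.
    sample_keys = [
        "my-table/part-00000-abc.snappy.parquet",
        "my-table/_delta_log/00000000000000000000.json",
        "another-folder/unrelated.parquet",
    ]
    print(summarize_delta_layout(sample_keys, "my-table"))
```

An empty `log_entries` list, or unexpected entries under `other`, would suggest that the folder is not a complete Delta table or that something other than the connector has written to it.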
+
+Together, this set of Parquet files and their associated `_delta_log` folder (and its contents) describe a single, versioned Delta table. Because of this, Unstructured recommends the following usage best practices:
+
+- In the source location, each set of source files that is to be considered as a unit for change management purposes should be controlled by a unique, dedicated
+Delta Tables in Amazon S3 destination connector. This connector should reference a unique, dedicated output folder within the bucket. Having
+multiple workflows refer to different sets of source files, yet all share the same Delta table, could result in data loss or table corruption.
+- Avoid directly modifying, adding, or deleting Parquet data files or the `_delta_log` folder within a Delta table's directory. This can lead to data loss or table corruption.
+- If you need to copy or move a Delta table to a different location,
+you must move or copy its entire set of Parquet files and its associated `_delta_log` folder (and its contents) together as a unit.
+Note that the copied or moved Delta table will
+no longer be controlled by the original Delta Tables in Amazon S3 destination connector.
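The "copy as a unit" recommendation can be illustrated as follows. This is a hedged sketch that assumes the Delta table has been downloaded to a local directory; the helper name `copy_delta_table` is hypothetical, and copying between S3 locations would need S3 tooling rather than `shutil`.

```python
import shutil
from pathlib import Path


def copy_delta_table(src, dst):
    """Copy a local Delta table directory (Parquet files plus _delta_log)
    to dst as a single unit and return the relative paths that were copied."""
    src, dst = Path(src), Path(dst)
    if not (src / "_delta_log").is_dir():
        raise ValueError(f"{src} has no _delta_log folder; not a Delta table root")
    # copytree copies the whole directory tree. By default it raises
    # FileExistsError if dst already exists, which avoids merging two tables.
    shutil.copytree(src, dst)
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())
```

Copying the whole directory, rather than selected Parquet files, keeps the data files and the transaction log that describes them together.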
