Developers can also use Spark Streaming to perform cloud ETL on their continuous data streams.

However, a Spark Structured Streaming application can produce thousands of small files (depending on the micro-batch interval and the number of executors), which leads to performance degradation.

![small files in datalake](https://github.com/oracle-devrel/technology-engineering/blob/sylwesterdec-patch-6/data-platform/open-source-data-platforms/oci-data-flow/code-examples/DeltaLake_Optimize/files_in_datalake.png)
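
To see where the small files come from, here is a minimal sketch of a Structured Streaming write (the rate source is Spark's built-in test source; the bucket and namespace in the paths are hypothetical). Every trigger writes at least one new file per output partition, so a short trigger interval quickly accumulates thousands of files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-files-demo").getOrCreate()

# generate a continuous stream of test rows
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 100)
          .load())

# each micro-batch lands as new small files under the output path
query = (events.writeStream
         .format("parquet")
         .option("path", "oci://my_bucket@my_namespace/events")
         .option("checkpointLocation", "oci://my_bucket@my_namespace/events_chk")
         .trigger(processingTime="10 seconds")
         .start())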

That's why the most crucial decision is the file format for your data lake. Small files are a problem because they slow down query reads: listing, opening, and closing many small files incurs expensive overhead. This is known as “the Small File Problem”.
You can reduce this overhead by compacting the data into bigger, more efficient files. Instead of doing it manually, pick a data lake table format (Delta Lake, Iceberg) and use its built-in functions; a sketch of the manual alternative follows below for comparison.
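
Here is roughly what manual compaction of a plain Parquet directory looks like (the paths are hypothetical). Without a table format there is no atomic swap, so concurrent readers can observe a partially overwritten directory - exactly what Delta Lake's transactional OPTIMIZE avoids:

# read the fragmented dataset and rewrite it as a small number of large files
df = spark.read.parquet("oci://my_bucket@my_namespace/events")
(df.coalesce(16)
   .write.mode("overwrite")
   .parquet("oci://my_bucket@my_namespace/events_compacted"))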

Delta Lake enables building a Lakehouse architecture on top of existing data lakes. It provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing.
For Spark streaming applications and real-time processing, Delta Lake has one significant advantage - [built-in optimization](https://delta.io/blog/delta-lake-optimize/).

OCI Data Flow supports Delta Lake by default when your Applications run Spark 3.2.1 or later - see the [documentation](https://docs.oracle.com/en-us/iaas/data-flow/using/delta-lake-about.htm).
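
A quick way to confirm this from a Data Flow job (a minimal sketch; the bucket and namespace are hypothetical):

# write a small Delta table to OCI Object Storage - no extra packages needed
spark.range(100).write.format("delta").mode("overwrite").save("oci://my_bucket@my_namespace/demo_delta")

# read it back through the Delta reader
spark.read.format("delta").load("oci://my_bucket@my_namespace/demo_delta").count()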

How to optimize your data lake using Delta Lake's built-in functions:
Configure your preferences (see the Delta Lake documentation for details):

# allow VACUUM retention periods below the default 7-day safety threshold (demo only)
spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', 'False')
# let OPTIMIZE repartition data instead of coalescing, for better-sized output files
spark.conf.set('spark.databricks.delta.optimize.repartition.enabled','True')
# do not preserve insertion order when rewriting files, which speeds up OPTIMIZE
spark.conf.set('spark.databricks.delta.optimize.preserveInsertionOrder', 'False')

Log VACUUM operations in the table history:
# records VACUUM START and VACUUM END events in the Delta transaction log
spark.conf.set('spark.databricks.delta.vacuum.logging.enabled','True')

Set the retention time for superseded files, i.e. how long they must be kept before VACUUM may delete them:
# 0 hours is convenient for a demo, but it disables time travel to older versions
# and can break concurrent readers; keep the 7-day default in production
spark.conf.set("spark.databricks.delta.deletedFileRetentionDuration","0")

Check the existing table details (look for numFiles):
spark.sql("describe detail atm").show(truncate=False)
+------+------------------------------------+-------------------------+-----------+-------------------------------------+-----------------------+-------------------+------------------+--------+-----------+-----------------------------------------------+----------------+----------------+------------------------+
|format|id                                  |name                     |description|location                             |createdAt              |lastModified       |partitionColumns  |numFiles|sizeInBytes|properties                                     |minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+-------------------------+-----------+-------------------------------------+-----------------------+-------------------+------------------+--------+-----------+-----------------------------------------------+----------------+----------------+------------------------+
|delta |81336019-7998-4b1d-b4da-1b7ca9d5c745|spark_catalog.default.atm|NULL       |oci://atm_data@fro8fl9kuqli/atm_delta|2024-07-15 12:57:15.812|2024-08-21 07:57:05|[year, month, day]|1822    |286807670  |{delta.deletedFileRetentionDuration -> 0 hours}|1               |2               |[appendOnly, invariants]|
+------+------------------------------------+-------------------------+-----------+-------------------------------------+-----------------------+-------------------+------------------+--------+-----------+-----------------------------------------------+----------------+----------------+------------------------+

Run the optimization:
spark.sql("OPTIMIZE atm").show(truncate=False)

Check which files can be deleted (DRY RUN only lists them, nothing is removed):
spark.sql("vacuum atm RETAIN 0 HOURS DRY RUN")
42+

Delete the small files that OPTIMIZE has already consolidated:
spark.sql("vacuum atm RETAIN 0 HOURS")

Finally, check the table details again - the same data (sizeInBytes is unchanged) is now stored in 21 files instead of 1822:
spark.sql("describe detail atm").show(truncate=False)
+------+------------------------------------+-------------------------+-----------+-------------------------------------+-----------------------+-------------------+------------------+--------+-----------+-----------------------------------------------+----------------+----------------+------------------------+
|format|id                                  |name                     |description|location                             |createdAt              |lastModified       |partitionColumns  |numFiles|sizeInBytes|properties                                     |minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+-------------------------+-----------+-------------------------------------+-----------------------+-------------------+------------------+--------+-----------+-----------------------------------------------+----------------+----------------+------------------------+
|delta |81336019-7998-4b1d-b4da-1b7ca9d5c745|spark_catalog.default.atm|NULL       |oci://atm_data@fro8fl9kuqli/atm_delta|2024-07-15 12:57:15.812|2024-09-06 08:26:45|[year, month, day]|21      |286807670  |{delta.deletedFileRetentionDuration -> 0 hours}|1               |2               |[appendOnly, invariants]|
+------+------------------------------------+-------------------------+-----------+-------------------------------------+-----------------------+-------------------+------------------+--------+-----------+-----------------------------------------------+----------------+----------------+------------------------+