# Delta Lake Optimization

Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service that performs processing tasks on extremely large datasets, without infrastructure to deploy or manage.
Developers can also use Spark Streaming to perform cloud ETL on continuously produced streaming data.
However, a Spark Structured Streaming application can produce thousands of small files (depending on the microbatch interval and the number of executors), which leads to performance degradation.

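To see where the small files come from, consider a minimal sketch of such a streaming job (the source, object storage paths, and trigger interval below are illustrative assumptions, not taken from the OCI documentation):

```python
# Minimal sketch of a Structured Streaming write to a Delta table.
# Source, paths, and trigger interval are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

events = (spark.readStream
          .format("rate")            # toy source; in practice Kafka, OCI Streaming, etc.
          .option("rowsPerSecond", 100)
          .load())

# Each microbatch commits at least one new file per writing task, so a short
# trigger interval multiplied by many executors quickly yields thousands of files.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "oci://bucket@namespace/checkpoints/events")
         .trigger(processingTime="10 seconds")
         .start("oci://bucket@namespace/events_delta"))
```
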
That's why the choice of file format for your data lake is one of the most crucial decisions. Small files are a problem because they slow down your query reads: listing, opening, and closing many small files incurs expensive overhead. This is called “the Small File Problem”.
You can reduce this overhead by combining the data into bigger, more efficient files. Instead of doing it manually, pick a data lake table format (Delta Lake, Iceberg) and use its built-in functions.

Delta Lake enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes.
For Spark streaming applications and real-time processing, Delta Lake has one significant advantage: [built-in optimization](https://delta.io/blog/delta-lake-optimize/).
OCI Data Flow supports Delta Lake by default when your Applications run Spark 3.2.1 or later; see the [documentation](https://docs.oracle.com/en-us/iaas/data-flow/using/delta-lake-about.htm).
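If you want to reproduce the steps below outside OCI Data Flow (for example, on a local Spark build), here is a minimal sketch using the standard open-source Delta Lake session settings; on OCI Data Flow with Spark 3.2.1+ this is not required:

```python
# Sketch: enabling Delta Lake on a plain Spark installation for local testing.
# Not needed on OCI Data Flow, where Delta Lake support is built in.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-local-test")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())
```
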
How to optimize a data lake using Delta Lake functions:
Configure your preferences (please check the Delta Lake documentation):

spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', 'False')
spark.conf.set('spark.databricks.delta.optimize.repartition.enabled', 'True')
spark.conf.set('spark.databricks.delta.optimize.preserveInsertionOrder', 'False')

Preserve vacuum history:
spark.conf.set('spark.databricks.delta.vacuum.logging.enabled', 'True')

Set the retention time for optimized files (ready to delete):
spark.conf.set("spark.databricks.delta.deletedFileRetentionDuration", "0")

Note: disabling retentionDurationCheck and using a zero retention period is only safe when no concurrent jobs read or write the files being removed.

Check the existing table details (look for numFiles and sizeInBytes):
spark.sql("describe detail atm").show(truncate=False)
+------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
|format|id                                  |name                            |description|location                                            |createdAt              |lastModified       |partitionColumns  |numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
|delta |c15ad4ca-8c0f-4747-b064-1492d7b4b3c4|spark_catalog.default.hsl_trains|NULL       |oci://dataflow_app@fro8fl9kuqli/hsl_trains_data_part|2024-09-05 10:19:10.057|2024-09-06 08:45:01|[year, month, day]|2024    |16333676   |{}        |1               |2               |[appendOnly, invariants]|
+------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+

Run optimization:
spark.sql("OPTIMIZE atm").show(truncate=False)
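On large tables you may not want to rewrite everything at once. OPTIMIZE also accepts a partition predicate, and newer Delta Lake releases (2.0+) support Z-ordering; the partition values and column name below are hypothetical:

```python
# Compact only a subset of partitions (predicate values are illustrative).
spark.sql("OPTIMIZE atm WHERE year = 2024 AND month = 9").show(truncate=False)

# Co-locate data by a frequently filtered column (hypothetical column name;
# ZORDER BY requires Delta Lake 2.0 or later).
spark.sql("OPTIMIZE atm ZORDER BY (stop_id)").show(truncate=False)
```
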

Check which files you can delete:
spark.sql("vacuum atm RETAIN 0 HOURS DRY RUN").show(truncate=False)

Delete the old files that OPTIMIZE consolidated (no longer referenced by the table):
spark.sql("vacuum atm RETAIN 0 HOURS")
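Because vacuum logging was enabled earlier, the VACUUM START and VACUUM END operations are recorded in the Delta transaction log and can be verified:

```python
# Show recent operations on the table; with vacuum logging enabled,
# VACUUM START and VACUUM END entries appear in the history.
spark.sql("DESCRIBE HISTORY atm") \
     .select("version", "timestamp", "operation") \
     .show(truncate=False)
```
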

Finally, check the details of your table again:
spark.sql("describe detail atm").show(truncate=False)
+------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
|format|id                                  |name                            |description|location                                            |createdAt              |lastModified       |partitionColumns  |numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|tableFeatures           |
+------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+
|delta |c15ad4ca-8c0f-4747-b064-1492d7b4b3c4|spark_catalog.default.hsl_trains|NULL       |oci://dataflow_app@fro8fl9kuqli/hsl_trains_data_part|2024-09-05 10:19:10.057|2024-09-06 08:47:48|[year, month, day]|7       |1583521    |{}        |1               |2               |[appendOnly, invariants]|
+------+------------------------------------+--------------------------------+-----------+----------------------------------------------------+-----------------------+-------------------+------------------+--------+-----------+----------+----------------+----------------+------------------------+

In this example, OPTIMIZE and VACUUM reduced the table from 2024 small files to 7 larger ones. Enjoy the increased performance of your queries!

# License
Copyright (c) 2024 Oracle and/or its affiliates.
Licensed under the Universal Permissive License (UPL), Version 1.0.
See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.