
Commit e203605

Added detect_small_files notebook (#70)
1 parent bed14da commit e203605


2 files changed: +54 −0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ Operations are applied concurrently across multiple tables
 * **Maintenance**
   * [VACUUM all tables](docs/Vacuum.md) ([example notebook](examples/vacuum_multiple_tables.py))
   * OPTIMIZE with z-order on tables having specified columns
+  * Detect tables having too many small files ([example notebook](examples/detect_small_files.py))
   * Visualise quantity of data written per table per period
 * **Governance**
   * PII detection with Presidio ([example notebook](examples/pii_detection_presidio.py))

examples/detect_small_files.py

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Databricks notebook source
# MAGIC %md
# MAGIC # Detect tables with too many small files
# MAGIC
# MAGIC Delta tables are composed of multiple `parquet` files. A table with too many small files might lead to performance degradation. The optimal file size depends on the workload, but it generally ranges between `10 MB` and `1000 MB`.
# MAGIC
# MAGIC As a rule of thumb, if a table has more than `100` files and an average file size smaller than `10 MB`, we can consider it to have too many small files.
# MAGIC
# MAGIC Some common causes of too many small files are:
# MAGIC * Overpartitioning: the cardinality of the partition columns is too high
# MAGIC * Lack of scheduled maintenance operations like `OPTIMIZE`
# MAGIC * Missing auto optimize on write
# MAGIC
# MAGIC This notebook helps you identify tables that might require a review.

# COMMAND ----------

# MAGIC %pip install dbl-discoverx

# COMMAND ----------

dbutils.widgets.text("from_tables", "*.*.*")
from_tables = dbutils.widgets.get("from_tables")

# Define how small is too small
small_file_max_size_MB = 10

# It's okay to have small files as long as there are not too many
min_file_number = 100

# COMMAND ----------

from discoverx import DX

dx = DX()

# COMMAND ----------

from pyspark.sql.functions import col, lit

dx.from_tables(from_tables)\
    .apply_sql("DESCRIBE DETAIL {full_table_name}")\
    .to_union_dataframe()\
    .withColumn("average_file_size_MB", col("sizeInBytes") / col("numFiles") / 1024 / 1024)\
    .withColumn("has_too_many_small_files",
                (col("average_file_size_MB") < small_file_max_size_MB) &
                (col("numFiles") > min_file_number))\
    .filter("has_too_many_small_files")\
    .display()

# COMMAND ----------
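Note (not part of the commit): the chained query above works because `DESCRIBE DETAIL` on a Delta table returns, among other columns, `numFiles` and `sizeInBytes`, which the notebook combines into an average file size. For intuition, a 400 MB table split across 200 files averages 2 MB per file, below the 10 MB threshold and above the 100-file threshold, so it would be flagged. A minimal sanity check against a single table, assuming it runs in a Databricks notebook where `spark` is available; the table name is a hypothetical placeholder:

# Sketch only: inspect the two columns the notebook aggregates, for one table.
# "main.default.my_table" is a placeholder table name, not part of this commit.
detail = spark.sql("DESCRIBE DETAIL main.default.my_table")
detail.select("numFiles", "sizeInBytes").show()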

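Note (not part of the commit): if the notebook flags a table, the causes listed in its introduction point at the usual remediations, compacting the existing files and enabling auto optimize on write. A minimal follow-up sketch, assuming a flagged Delta table named `main.default.my_table` (a placeholder) and permission to modify it:

# Compact existing small files into larger ones.
spark.sql("OPTIMIZE main.default.my_table")

# Make future writes produce fewer, larger files
# (addresses the "missing auto optimize on write" cause).
spark.sql("""
    ALTER TABLE main.default.my_table
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

Scheduling OPTIMIZE (and VACUUM, covered by the linked VACUUM notebook) as a periodic job addresses the "lack of scheduled maintenance operations" cause.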