
Commit a00e2af

Delta housekeeping initial version (#101)
* delta housekeeping initial commit
* debugging initial version
* convert output to pandas
* debugging -convert output to pandas
* DeltaHousekeepingActions object and tests
* added more insights to housekeeping and refactored tests
* regression and cleanup
* move implementation of map_chunked to a separated branch + improved unit tests
* readability, cleanup, follow discoverx patterns
* debugging on cluster + adding spark session to `DeltaHousekeepingActions`
* simplify scan implementation & remove dependency to BeautifulSoup
* faster implementation + unit tests
* cleanup
* cleanup and PR comments
* proper use of dbwidgets
* refactoring apply to return a single dataframe
* add test datasets for all housekeeping checks + bug fixes
* fix explain / apply methods
* refactoring to control output column names
* refactoring to spark API -intermediate commit
* tests with DBR -nan's & timestamps
* failing test + cleanup
* cleanup
* cleanup
* remove 'reason' column from the output dfs

Co-authored-by: lorenzorubi-db <lorenzorubi-db>
1 parent e632b63 commit a00e2af
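The commit message describes a `DeltaHousekeepingActions` object whose `explain` method surfaces recommendations and whose `apply` method was refactored to return a single dataframe. The following is a minimal, hypothetical sketch of that API shape; the class internals, field names, and thresholds are illustrative stand-ins, not the real discoverx implementation (which operates on Spark/pandas dataframes):

```python
# Hypothetical sketch of the DeltaHousekeepingActions API shape named in the
# commit message. Plain lists of dicts stand in for dataframes; the thresholds
# and column names below are invented for illustration.

class DeltaHousekeepingActions:
    def __init__(self, stats):
        # stats: one dict per scanned table (stand-in for a scan result dataframe)
        self._stats = stats

    def explain(self):
        """Return human-readable housekeeping recommendations, one per finding."""
        recs = []
        for row in self._stats:
            if row["small_files"] > row["total_files"] * 0.5:
                recs.append(f"{row['table']}: consider OPTIMIZE (many small files)")
            if row["days_since_vacuum"] > 30:
                recs.append(f"{row['table']}: consider VACUUM "
                            f"(last run {row['days_since_vacuum']} days ago)")
        return recs

    def apply(self):
        """Return a single combined result table (stand-in for one dataframe)."""
        return [
            {
                "table": row["table"],
                "needs_optimize": row["small_files"] > row["total_files"] * 0.5,
                "needs_vacuum": row["days_since_vacuum"] > 30,
            }
            for row in self._stats
        ]


stats = [
    {"table": "sales", "total_files": 100, "small_files": 80, "days_since_vacuum": 45},
    {"table": "users", "total_files": 10, "small_files": 1, "days_since_vacuum": 3},
]
actions = DeltaHousekeepingActions(stats)
print(actions.explain())
print(actions.apply())
```

The split mirrors the pattern the commit converges on: `explain` for human review, `apply` for a single machine-consumable result set.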

23 files changed: +1060 −10 lines

README.md

Lines changed: 6 additions & 2 deletions
@@ -59,6 +59,11 @@ The properties available in table_info are
 * **Maintenance**
   * [VACUUM all tables](docs/Vacuum.md) ([example notebook](examples/vacuum_multiple_tables.py))
   * Detect tables having too many small files ([example notebook](examples/detect_small_files.py))
+  * Delta housekeeping analysis ([example notebook](examples/exec_delta_housekeeping.py)), which provides:
+    * stats (size of tables and number of files, timestamps of the latest OPTIMIZE & VACUUM operations, stats of OPTIMIZE)
+    * recommendations on tables that need to be OPTIMIZED/VACUUM'ed
+    * whether tables are OPTIMIZED/VACUUM'ed often enough
+    * tables that have small files / tables for which ZORDER is not being effective
   * Deep clone a catalog ([example notebook](examples/deep_clone_schema.py))
 * **Governance**
   * PII detection with Presidio ([example notebook](examples/pii_detection_presidio.py))
@@ -91,7 +96,7 @@ from discoverx import DX
 dx = DX(locale="US")
 ```
 
-You can now run operations across multiple tables.
+You can now run operations across multiple tables.
 
 ## Available functionality
 
@@ -128,4 +133,3 @@ After a `with_sql` or `unpivot_string_columns` command, you can apply the following operations
 Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
 
 Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
-