Create a Delta table doctor that analyzes the health and wellness of a Delta table #7

@MrPowers

A levi.delta_doctor(delta_table) command could be a nice way for users to identify issues in their Delta table that could cause slow query performance.

There are several known problems that can cause poor performance of Delta tables:

  • too many small files
  • large files
  • file stats not being collected on the right columns/file stats missing for certain files
  • tables that are over-partitioned
  • tables that are not Z ORDERed
  • tables that should have constraints, but do not

The levi.delta_doctor(delta_table) command could return a string with warnings like the following:

  • SmallFileWarning: Your table contains 456 files with less than 1MB of data. You could consider optimizing the table to compact the small files.
  • LargeFileWarning: Your table contains 32 files with more than 1.5GB of data. You should split up these files.
  • FileStatsWarning: You are only collecting stats for col1 and col2 in some files.
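
A rough sketch of how these three warnings could be assembled is below. It is only an illustration, not levi's implementation: `FileInfo`, `doctor_report`, and `expected_stats_columns` are hypothetical names, the 1MB and 1.5GB thresholds are taken from the example messages above and interpreted as binary units, and the per-file sizes and stats columns are assumed to have already been read from the table's transaction log.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

MB = 1024 * 1024
# Thresholds mirror the example warnings in this issue (assumed binary units).
SMALL_FILE_BYTES = 1 * MB
LARGE_FILE_BYTES = int(1.5 * 1024 * MB)


@dataclass
class FileInfo:
    """Hypothetical per-file summary; not a levi or Delta Lake type."""
    size_bytes: int
    stats_columns: FrozenSet[str] = frozenset()


def doctor_report(files: Iterable[FileInfo],
                  expected_stats_columns: Iterable[str] = ()) -> str:
    """Return one warning per line, or an empty string if nothing looks wrong."""
    files = list(files)
    expected = set(expected_stats_columns)
    warnings: List[str] = []

    # Count files below the small-file threshold.
    small = sum(1 for f in files if f.size_bytes < SMALL_FILE_BYTES)
    if small:
        warnings.append(
            f"SmallFileWarning: Your table contains {small} files with less "
            "than 1MB of data. Consider compacting the small files."
        )

    # Count files above the large-file threshold.
    large = sum(1 for f in files if f.size_bytes > LARGE_FILE_BYTES)
    if large:
        warnings.append(
            f"LargeFileWarning: Your table contains {large} files with more "
            "than 1.5GB of data. Consider splitting up these files."
        )

    # Count files that lack stats for at least one expected column.
    incomplete = sum(1 for f in files if not expected <= f.stats_columns)
    if expected and incomplete:
        cols = ", ".join(sorted(expected))
        warnings.append(
            f"FileStatsWarning: {incomplete} files are missing stats for at "
            f"least one of: {cols}."
        )

    return "\n".join(warnings)


if __name__ == "__main__":
    example = [
        FileInfo(500_000, frozenset({"col1", "col2"})),
        FileInfo(2 * 1024 * MB, frozenset({"col1"})),
    ]
    print(doctor_report(example, expected_stats_columns=["col1", "col2"]))
```

Keeping the checks as a pure function over per-file metadata would make them easy to unit test without a real table. The other problems listed above (over-partitioning, missing Z ORDER, missing constraints) would need additional inputs and are not covered by this sketch.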

We should make it really easy for users to see if there are any obvious problems in their Delta table. Ideally, we will also give them really easy solutions to fix these problems!
