
LHP generate to databrick bundle - shim code feedback #75

@JJediny

Description


First of all, thank you for the great project.

During our use of LHP we had to develop some shim scripts to facilitate our LHP -> Databricks Bundle -> Deployment workflow. This project has been a great tool for maintaining version control of our pipelines and has otherwise been perfect for our deployment. Here are some of the real-world issues we've encountered.

Below are the specific issues our shim scripts work around, along with the recommended fixes for the upstream LakehousePlumber project.

1. Python String Interpolation (Syntax Error)

  • The Issue: LHP generates Python code with bare placeholder strings like database: "{catalog}.{schema}" inside decorators. This is not valid Python variable interpolation.
  • The Shim Fix: fix_generated_pipelines.py uses regex to wrap these strings in f"...".
  • Upstream Fix:
    • Update Template: The LHP Jinja2 templates for Python generation must produce f-strings (f"{...}") instead of bare strings.
    • Variable Scope: Ensure the generator explicitly outputs the configuration block (e.g., catalog = spark.conf.get(...)) at the top of the file so these f-strings have variables to reference.
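The shim's approach can be sketched roughly as follows; the function name and the exact regex are illustrative, not the actual contents of fix_generated_pipelines.py. The idea is to prefix any string literal containing `{placeholder}` braces with `f`, while skipping strings that are already f-strings:

```python
import re

# Match a double-quoted string containing a {placeholder}, unless it is
# already prefixed with f/F (negative lookbehind keeps the fix idempotent).
PLACEHOLDER_STRING = re.compile(r'(?<![fF])"([^"]*\{[A-Za-z_]\w*\}[^"]*)"')

def wrap_in_fstrings(source: str) -> str:
    return PLACEHOLDER_STRING.sub(r'f"\1"', source)

line = '@dlt.table(name="{catalog}.{schema}.orders")'
fixed = wrap_in_fstrings(line)
# fixed == '@dlt.table(name=f"{catalog}.{schema}.orders")'
```

Running the fix twice leaves the file unchanged, which matters when the shim is re-run over already-patched output.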

2. Decorator Ordering (Runtime Error)

  • The Issue: LHP generates code where expectation decorators (e.g., @dlt.expect) appear before the table decorator (e.g., @dlt.table). DLT requires the table/view decorator to be the outermost (top) decorator to register the dataset correctly.
  • The Shim Fix: Regex logic swaps the order of these decorators.
  • Upstream Fix:
    • Generator Logic: Adjust the generation loop to ensure the @dlt.table / @dlt.view decorator is always emitted first (at the top), followed by any expectations or access control decorators.
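A minimal sketch of the reordering, assuming the generator holds the decorators for one dataset as a list of lines (the function name is ours, not LHP's):

```python
def reorder_decorators(decorators: list[str]) -> list[str]:
    """Move @dlt.table / @dlt.view to the front so it is the outermost
    decorator, preserving the relative order of the remaining decorators."""
    def is_dataset(d: str) -> bool:
        return d.lstrip().startswith(("@dlt.table", "@dlt.view"))
    return ([d for d in decorators if is_dataset(d)]
            + [d for d in decorators if not is_dataset(d)])

decorators = [
    '@dlt.expect_or_drop("valid_id", "id IS NOT NULL")',
    '@dlt.table(name="orders")',
]
# reorder_decorators(decorators) puts @dlt.table first
```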

3. Import & API Standardization

  • The Issue: LHP generates from pyspark import pipelines as dp and uses @dp.temporary_view.
    • The pipelines module is often deprecated or not available in standard DLT runtimes (which prefer import dlt).
    • temporary_view is not a standard DLT decorator (it's @dlt.view or just a function without a decorator for pure temporary scope).
  • The Shim Fix: Replaces imports with import dlt as dp and maps temporary_view to view.
  • Upstream Fix:
    • Modernize API: Update the generator to use the standard import dlt library.
    • Deprecate Custom Types: Remove LHP-specific types like temporary_view in favor of standard DLT view definitions.
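The shim's import/decorator rewrite amounts to roughly the following (a sketch; the real script may differ). Keeping the `dp` alias means no other line in the generated file has to change:

```python
import re

def standardize_dlt_api(source: str) -> str:
    # Swap the pipelines-module import for the standard dlt import,
    # keeping the `dp` alias so the rest of the file is untouched.
    source = re.sub(
        r"from\s+pyspark\s+import\s+pipelines\s+as\s+dp",
        "import dlt as dp",
        source,
    )
    # Map the non-standard temporary_view decorator onto @dp.view.
    return source.replace("@dp.temporary_view", "@dp.view")
```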

4. Data Quality Visibility (temporary=True)

  • The Issue: LHP defaults validation tables to temporary=True. In DLT, temporary tables do not record Expectation metrics to the event log, making Data Quality invisible in the DLT UI.
  • The Shim Fix: Regex removes temporary=True from generated table definitions.
  • Upstream Fix:
    • Configurable Default: Change the default for validation actions to temporary=False or expose this as a configurable YAML property (visible: true/false).
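The removal the shim performs can be sketched as below (illustrative regexes, not the exact ones in the script). Two patterns are needed so the comma that attaches `temporary=True` is removed whether the argument appears mid-list or first:

```python
import re

def drop_temporary_flag(source: str) -> str:
    # Remove `temporary=True` together with its attaching comma so that
    # expectation metrics show up in the DLT event log.
    source = re.sub(r",\s*temporary\s*=\s*True", "", source)   # ", temporary=True"
    source = re.sub(r"temporary\s*=\s*True\s*,\s*", "", source)  # "temporary=True, "
    return source

# drop_temporary_flag('@dlt.table(name="check_orders", temporary=True)')
# -> '@dlt.table(name="check_orders")'
```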

5. Spark Configuration Injection

  • The Issue: The pipelines fail with AnalysisException without pipelines.incompatibleViewCheck.enabled="false". LHP provides no native way to inject top-level spark.conf.set calls into the generated Python file.
  • The Shim Fix: Manually injects this line at the top of every file.
  • Upstream Fix:
    • Schema Extension: Add a spark_configuration: section to the LHP YAML schema that generates corresponding spark.conf.set() calls in the Python output.
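The injection step can be sketched roughly as follows (our own heuristic, not necessarily the shim's exact logic): place the `spark.conf.set` call after the last top-level import so it runs before any dataset definitions, and skip files that already contain it:

```python
CONF_LINE = 'spark.conf.set("pipelines.incompatibleViewCheck.enabled", "false")\n'

def inject_spark_conf(source: str) -> str:
    lines = source.splitlines(keepends=True)
    if CONF_LINE in lines:          # already patched, leave the file alone
        return source
    last_import = max(
        (i for i, line in enumerate(lines) if line.startswith(("import ", "from "))),
        default=-1,
    )
    lines.insert(last_import + 1, CONF_LINE)
    return "".join(lines)
```

Note that `spark` is resolved at runtime inside the DLT pipeline, where it is a predefined global.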

6. DLT Dependency Resolution

  • The Issue: When using spark.sql(...) inside a DLT function, the lineage graph often fails to detect dependencies.
  • The Shim Fix: Scans the SQL string for known table names and injects dummy dlt.read("...") calls to "hint" the dependency to DLT.
  • Upstream Fix:
    • Explicit Reads: Instead of generating raw spark.sql(...), LHP should generate dlt.read("upstream_table") calls or explicit spark.readStream.table(...) references which DLT can track natively.
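The hinting trick can be sketched like this, assuming a registry of table names is available (`KNOWN_TABLES` and the function name are hypothetical; the real shim builds the set by scanning the generated files):

```python
import re

# Hypothetical registry of tables the pipeline defines.
KNOWN_TABLES = {"bronze_orders", "bronze_customers"}

def dependency_hints(sql: str) -> list[str]:
    # Emit a dlt.read("...") call for every known table referenced in the
    # SQL string, so DLT's lineage analysis records the dependency even
    # though the query itself runs via spark.sql(...).
    referenced = [t for t in sorted(KNOWN_TABLES) if re.search(rf"\b{t}\b", sql)]
    return [f'dlt.read("{t}")' for t in referenced]

# dependency_hints("SELECT o.id FROM bronze_orders o")
# -> ['dlt.read("bronze_orders")']
```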
