Foremost, thank you for the great project!

During our use of LHP we had to develop some shim scripts to facilitate our LHP -> Databricks Bundle -> Deployment workflow. The project has been a great tool for maintaining version control of our pipelines and is otherwise a perfect fit for our deployment. Below are the real-world issues we've encountered: the specific problems found in the generated scripts and the recommended fixes for the upstream LakehousePlumber project.
1. Python String Interpolation (Syntax Error)
- The Issue: LHP generates Python code with bare placeholder strings like `database: "{catalog}.{schema}"` inside decorators. This is not valid Python variable interpolation.
- The Shim Fix: `fix_generated_pipelines.py` uses regex to wrap these strings in `f"..."`.
- Upstream Fix:
  - Update Template: The LHP Jinja2 templates for Python generation must produce f-strings (`f"{...}"`) instead of bare strings.
  - Variable Scope: Ensure the generator explicitly outputs the configuration block (e.g., `catalog = spark.conf.get(...)`) at the top of the file so these f-strings have variables to reference.
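As a sketch of the kind of rewrite our shim performs (the function name and regex here are illustrative, not the actual `fix_generated_pipelines.py` code):

```python
import re

def wrap_placeholders_in_fstrings(source: str) -> str:
    """Wrap bare strings containing {placeholder} patterns in an f-prefix,
    e.g. name="{catalog}.{schema}.orders" -> name=f"{catalog}.{schema}.orders".
    Illustrative sketch only; the real shim handles more edge cases."""
    # Match a double-quoted string containing at least one {...} placeholder,
    # skipping strings already prefixed with f/F (keeps the rewrite idempotent).
    pattern = re.compile(r'(?<![fF])("(?:[^"{}\\]*\{[^{}]+\}[^"\\]*)+")')
    return pattern.sub(r'f\1', source)

print(wrap_placeholders_in_fstrings('name="{catalog}.{schema}.orders"'))
```

The upstream fix is simpler still: emit the `f` prefix directly from the Jinja2 template so no post-processing is needed.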
2. Decorator Ordering (Runtime Error)
- The Issue: LHP generates code where expectation decorators (e.g., `@dlt.expect`) appear before the table decorator (e.g., `@dlt.table`). DLT requires the table/view decorator to be the outermost (top) decorator to register the dataset correctly.
- The Shim Fix: Regex logic swaps the order of these decorators.
- Upstream Fix:
  - Generator Logic: Adjust the generation loop to ensure the `@dlt.table`/`@dlt.view` decorator is always yielded first (at the top), followed by any expectation or access control decorators.
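An illustrative sketch of the reordering (not the shim's actual regex): within each decorator stack, hoist `@dlt.table`/`@dlt.view` to the top so DLT registers the dataset before applying expectations. This assumes one decorator per line; a real implementation also has to cope with multi-line decorator arguments.

```python
def reorder_decorators(source: str) -> str:
    """Move @dlt.table / @dlt.view to the top of each decorator stack."""
    def is_table(line: str) -> bool:
        return line.lstrip().startswith(("@dlt.table", "@dlt.view"))

    out, stack = [], []
    for line in source.splitlines():
        if line.lstrip().startswith("@"):
            stack.append(line)          # buffer the decorator stack
        else:
            if stack:                   # flush: table/view first, rest after
                out += [l for l in stack if is_table(l)]
                out += [l for l in stack if not is_table(l)]
                stack = []
            out.append(line)
    return "\n".join(out + stack)
```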
3. Import & API Standardization
- The Issue: LHP generates `from pyspark import pipelines as dp` and uses `@dp.temporary_view`.
  - The `pipelines` module is often deprecated or not available in standard DLT runtimes (which prefer `import dlt`).
  - `temporary_view` is not a standard DLT decorator (it's `@dlt.view`, or just a function without a decorator for purely temporary scope).
- The Shim Fix: Replaces imports with `import dlt as dp` and maps `temporary_view` to `view`.
- Upstream Fix:
  - Modernize API: Update the generator to use the standard `import dlt` library.
  - Deprecate Custom Types: Remove LHP-specific types like `temporary_view` in favor of standard DLT view definitions.
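A minimal sketch of the shim's rewrite (illustrative, not the actual shim code). Keeping the `dp` alias via `import dlt as dp` means all existing `dp.*` call sites keep working without further edits:

```python
def modernize_dlt_api(source: str) -> str:
    """Swap the deprecated pipelines import for standard dlt (keeping the
    dp alias) and map the non-standard temporary_view decorator to view."""
    source = source.replace("from pyspark import pipelines as dp",
                            "import dlt as dp")
    return source.replace("@dp.temporary_view", "@dp.view")
```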
4. Data Quality Visibility (temporary=True)
- The Issue: LHP defaults validation tables to `temporary=True`. In DLT, temporary tables do not record expectation metrics to the event log, making data quality invisible in the DLT UI.
- The Shim Fix: Regex removes `temporary=True` from generated table definitions.
- Upstream Fix:
  - Configurable Default: Change the default for validation actions to `temporary=False`, or expose this as a configurable YAML property (`visible: true/false`).
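An illustrative sketch of the stripping step (the real shim's regex may differ). It handles the flag with a leading comma, a trailing comma, or on its own:

```python
import re

def drop_temporary_flag(source: str) -> str:
    """Remove temporary=True from generated @dlt.table(...) calls so
    expectation metrics are recorded in the event log."""
    return re.sub(r",\s*temporary=True|temporary=True\s*,?\s*", "", source)
```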
5. Spark Configuration Injection
- The Issue: The pipelines fail with `AnalysisException` without `pipelines.incompatibleViewCheck.enabled = "false"`. LHP provides no native way to inject top-level `spark.conf.set` calls into the generated Python file.
- The Shim Fix: Manually injects this line at the top of every file.
- Upstream Fix:
  - Schema Extension: Add a `spark_configuration:` section to the LHP YAML schema that generates corresponding `spark.conf.set()` calls in the Python output.
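A sketch of the injection our shim performs (the conf key is the one from this issue; the insertion heuristic, after the last top-level import, is our own assumption):

```python
CONF_LINE = 'spark.conf.set("pipelines.incompatibleViewCheck.enabled", "false")'

def inject_spark_conf(source: str) -> str:
    """Insert the spark.conf.set call after the last top-level import so it
    runs before any dataset function is defined."""
    lines = source.splitlines()
    last_import = max(
        (i for i, line in enumerate(lines)
         if line.startswith(("import ", "from "))),
        default=-1,
    )
    lines.insert(last_import + 1, CONF_LINE)
    return "\n".join(lines)
```

Upstream, the same result could come from a `spark_configuration:` mapping in the YAML whose entries the generator renders as `spark.conf.set()` calls.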
6. DLT Dependency Resolution
- The Issue: When using `spark.sql(...)` inside a DLT function, the lineage graph often fails to detect dependencies.
- The Shim Fix: Scans the SQL string for known table names and injects dummy `dlt.read("...")` calls to "hint" the dependency to DLT.
- Upstream Fix:
  - Explicit Reads: Instead of generating raw `spark.sql(...)`, LHP should generate `dlt.read("upstream_table")` calls or explicit `spark.readStream.table(...)` references, which DLT can track natively.
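An illustrative sketch of the lineage "hinting" (the `KNOWN_TABLES` set and output shape are assumptions, not the shim's actual code):

```python
import re

# Hypothetical registry of upstream tables the shim knows about.
KNOWN_TABLES = {"bronze_orders", "bronze_customers", "bronze_products"}

def dependency_hints(sql: str) -> list:
    """Scan a raw SQL string for known table names and emit dummy
    dlt.read(...) lines so DLT records the dependency edge."""
    referenced = [t for t in sorted(KNOWN_TABLES)
                  if re.search(rf"\b{re.escape(t)}\b", sql)]
    return [f'dlt.read("{t}")  # lineage hint only; result is discarded'
            for t in referenced]
```

Generating `dlt.read(...)` directly upstream would make these heuristic hints unnecessary.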