
Commit 8783f20

tlgnr, Copilot, and mwojtyczka authored
Adding Lakebase checks storage backend (#550)
## Changes

Adding Lakebase checks storage backend.

### Linked issues

Resolves #444

### Tests

- [x] manually tested
- [ ] added unit tests
- [x] added integration tests
- [ ] added end-to-end tests

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Marcin Wojtyczka <[email protected]>
1 parent 9276de1 commit 8783f20

File tree

13 files changed (+1013, -20 lines)


docs/dqx/docs/guide/data_profiling.mdx

Lines changed: 1 addition & 1 deletion
@@ -262,7 +262,7 @@ When running the profiler workflow using Databricks API or UI, you have the same
  - If the `checks_location` in the run config points to a table, the checks will be saved to that table.
    If the `checks_location` in the run config points to a file, file name is replaced with "<input_table>.yml". In addition, if the location is specified as a relative path, it is prefixed with the workspace installation folder.
    For example:
- - If "checks_location=catalog.schema.table", the location will be resolved to "catalog.schema.table".
+ - If "checks_location=catalog.schema.table", the location will be resolved to "catalog.schema.table" or "database.schema.table" in case of using Lakebase to store checks.
  - If "checks_location=folder/checks.yml", the location will be resolved to "install_folder/folder/<input_table>.yml".
  - If "checks_location=/App/checks.yml", the location will be resolved to "/App/<input_table>.yml".
  - If "checks_location=/Volume/catalog/schema/folder/checks.yml", the location will be resolved to "/Volume/catalog/schema/folder/<input_table>.yml".
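The resolution rules in the hunk above (repeated for the quality checker and e2e workflows below) can be summarised in a short sketch. This is not DQX's actual implementation; the helper name and signature are hypothetical and only mirror the documented behavior.

```python
def resolve_checks_location(checks_location: str, input_table: str, install_folder: str) -> str:
    """Hypothetical illustration of the documented checks_location resolution rules."""
    if not checks_location.endswith((".yml", ".yaml", ".json")):
        # Table reference: Unity Catalog "catalog.schema.table", or Lakebase
        # "database.schema.table" when Lakebase is used to store checks.
        return checks_location
    # File reference: the file name is replaced with "<input_table>.yml".
    folder, _, _ = checks_location.rpartition("/")
    resolved = f"{folder}/{input_table}.yml" if folder else f"{input_table}.yml"
    if not resolved.startswith("/"):
        # Relative paths are prefixed with the workspace installation folder.
        resolved = f"{install_folder}/{resolved}"
    return resolved

# Examples matching the documentation (install folder is illustrative):
# resolve_checks_location("catalog.schema.table", "sales", "/Workspace/dqx")  -> "catalog.schema.table"
# resolve_checks_location("folder/checks.yml", "sales", "/Workspace/dqx")     -> "/Workspace/dqx/folder/sales.yml"
# resolve_checks_location("/App/checks.yml", "sales", "/Workspace/dqx")       -> "/App/sales.yml"
```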

docs/dqx/docs/guide/quality_checks_apply.mdx

Lines changed: 2 additions & 2 deletions
@@ -575,7 +575,7 @@ When running the quality checker workflow using Databricks API or UI, you have t
  - If the `checks_location` in the run config points to a table, the checks will be directly loaded from that table.
    If the `checks_location` in the run config points to a file, file name is replaced with "<input_table>.yml". In addition, if the location is specified as a relative path, it is prefixed with the workspace installation folder.
    For example:
- - If "checks_location=catalog.schema.table", the location will be resolved to "catalog.schema.table".
+ - If "checks_location=catalog.schema.table", the location will be resolved to "catalog.schema.table" or "database.schema.table" in case of using Lakebase to store checks.
  - If "checks_location=folder/checks.yml", the location will be resolved to "install_folder/folder/<input_table>.yml".
  - If "checks_location=/App/checks.yml", the location will be resolved to "/App/<input_table>.yml".
  - If "checks_location=/Volume/catalog/schema/folder/checks.yml", the location will be resolved to "/Volume/catalog/schema/folder/<input_table>.yml".
@@ -690,7 +690,7 @@ When running the e2e workflow using Databricks API or UI, you have the same exec
  - If the `checks_location` in the run config points to a table, the checks will be directly loaded from that table.
    If the `checks_location` in the run config points to a file, file name is replaced with <input_table>.yml. In addition, if the location is specified as a relative path, it is prefixed with the workspace installation folder.
    For example:
- - If "checks_location=catalog.schema.table", the location will be resolved to "catalog.schema.table".
+ - If "checks_location=catalog.schema.table", the location will be resolved to "catalog.schema.table" or "database.schema.table" in case of using Lakebase to store checks.
  - If "checks_location=folder/checks.yml", the location will be resolved to "install_folder/folder/<input_table>.yml".
  - If "checks_location=/App/checks.yml", the location will be resolved to "/App/<input_table>.yml".
  - If "checks_location=/Volume/catalog/schema/folder/checks.yml", the location will be resolved to "/Volume/catalog/schema/folder/<input_table>.yml".

docs/dqx/docs/guide/quality_checks_storage.mdx

Lines changed: 18 additions & 2 deletions
@@ -22,12 +22,20 @@ Saving and loading methods accept a storage backend configuration as input. The
    * `mode`: (optional) write mode for saving checks (`overwrite` or `append`, default is `overwrite`). The `overwrite` mode will only replace checks for the specific run config and not all checks in the table.
  * `VolumeFileChecksStorageConfig`: Unity Catalog Volume (JSON/YAML file). Containing fields:
    * `location`: Unity Catalog Volume file path (JSON or YAML).
+ * `LakebaseChecksStorageConfig`: Lakebase table. Containing fields:
+   * `instance_name`: name of the Lakebase instance, e.g., "my-instance".
+   * `user`: user to connect to the Lakebase instance, e.g., "[email protected]" or Databricks service principal client ID.
+   * `location`: fully-qualified table name in the format "database.schema.table".
+   * `port`: (optional) port on which to connect to the Lakebase instance (use 5432 if not provided).
+   * `run_config_name`: (optional) run configuration name to load (use "default" if not provided).
+   * `mode`: (optional) write mode for saving checks (`overwrite` or `append`, default is `overwrite`). The `overwrite` mode will only replace checks for the specific run config and not all checks in the table.
  * `InstallationChecksStorageConfig`: installation-managed location from the run config, ignores location and infers it from `checks_location` in the run config. Containing fields:
    * `location`: (optional) automatically set based on the `checks_location` field from the run configuration.
    * `install_folder`: (optional) installation folder where DQX is installed, only required when custom installation folder is used.
    * `run_config_name`: (optional) run configuration name to load (it can be any string), e.g. input table or job name (use "default" if not provided).
    * `product_name`: (optional) name of the product (use "dqx" if not provided).
    * `assume_user`: (optional) if True, assume user installation, otherwise global installation (skipped if `install_folder` is provided).
+   * the config inherits from the specific configs such as `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, `VolumeFileChecksStorageConfig`, and `LakebaseChecksStorageConfig`, so relevant fields from these specific configs can be provided (e.g. `instance_name` and `user` for Lakebase).

  You can find details on how to define checks [here](/docs/guide/quality_checks_definition).

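The `LakebaseChecksStorageConfig` fields listed in the hunk above map directly onto keyword arguments. Below is a hedged sketch showing all of them, including the optional ones with their documented defaults; the import path is an assumption (the diffs below show only the class name) and the values are illustrative.

```python
from databricks.labs.dqx.config import LakebaseChecksStorageConfig  # assumed module path

config = LakebaseChecksStorageConfig(
    instance_name="my-instance",         # name of the Lakebase instance
    user="[email protected]",         # user or service principal client ID
    location="dqx.config.checks",        # fully-qualified "database.schema.table"
    port=5432,                           # optional, 5432 if not provided
    run_config_name="default",           # optional, "default" if not provided
    mode="overwrite",                    # optional, "overwrite" or "append"
)
```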
@@ -49,7 +57,8 @@ If you create checks as a list of `DQRule` objects, you can convert them to meta
      WorkspaceFileChecksStorageConfig,
      InstallationChecksStorageConfig,
      TableChecksStorageConfig,
-     VolumeFileChecksStorageConfig
+     VolumeFileChecksStorageConfig,
+     LakebaseChecksStorageConfig,
  )
  from databricks.sdk import WorkspaceClient

@@ -81,6 +90,9 @@ If you create checks as a list of `DQRule` objects, you can convert them to meta
  # save checks as a YAML in a Unity Catalog Volume location (overwrite the file)
  dq_engine.save_checks(checks, config=VolumeFileChecksStorageConfig(location="/Volumes/dq/config/checks_volume/App1/checks.yml"))

+ # save checks as a Lakebase table using a Databricks service principal
+ dq_engine.save_checks(checks, config=LakebaseChecksStorageConfig(instance_name="my-instance", user="00000000-0000-0000-0000-000000000000", location="dqx.config.checks"))
+
  # save checks as a YAML file or table defined in 'checks_location' of the run config
  # only works if DQX is installed in the workspace
  # the run config name can be any string, e.g. input table or job name
@@ -195,7 +207,8 @@ If you create checks as a list of DQRule objects, you can convert them using the
      WorkspaceFileChecksStorageConfig,
      InstallationChecksStorageConfig,
      TableChecksStorageConfig,
-     VolumeFileChecksStorageConfig
+     VolumeFileChecksStorageConfig,
+     LakebaseChecksStorageConfig,
  )
  from databricks.sdk import WorkspaceClient

@@ -217,6 +230,9 @@ If you create checks as a list of DQRule objects, you can convert them using the
  # load checks from a Unity Catalog Volume
  checks: list[dict] = dq_engine.load_checks(config=VolumeFileChecksStorageConfig(location="/Volumes/dq/config/checks_volume/App1/checks.yml"))

+ # load checks from a Lakebase table using a Databricks service principal
+ checks: list[dict] = dq_engine.load_checks(config=LakebaseChecksStorageConfig(instance_name="my-instance", user="00000000-0000-0000-0000-000000000000", location="dqx.config.checks"))
+
  # load checks from a file or table defined in the run config ('checks_location' field)
  # only works if DQX is installed in the workspace
  # the run config name is a string (e.g. input table or job name)

docs/dqx/docs/installation.mdx

Lines changed: 7 additions & 1 deletion
@@ -212,6 +212,12 @@ run_configs: # <- list of run configurations, each run co

    checks_location: iot_checks.yml # <- Quality rules (checks) can be stored in a table or defined in JSON or YAML files, located at absolute or relative path within the installation folder or volume file path.

+   # if wanting to store checks in lakebase table
+   # checks_location: dqx.config.checks # <- fully qualified Lakebase table for storing quality rules (checks)
+   # lakebase_instance_name: my-lakebase-instance # <- the name of the lakebase instance to use for storing checks
+   # lakebase_user: 00000000-0000-0000-0000-000000000000 # <- the user to connect to the lakebase, e.g., [email protected] or a Databricks service principal client ID
+   # lakebase_port: 5432 # <- optional port to connect to Lakebase, default is 5432
+
    custom_check_functions: # <- optional mapping of custom check function name to Python file (module) containing check function definition
      my_func: custom_checks/my_funcs.py # relative workspace path (installation folder prefix applied)
      my_other: /Workspace/Shared/MyApp/my_funcs.py # or absolute workspace path
@@ -236,7 +242,7 @@ run_configs: # <- list of run configurations, each run co

    warehouse_id: your-warehouse-id # <- warehouse id for refreshing dashboard

- - name: another_run_config # <- unique name of the run config
+ - name: another_run_config # <- unique name of the run config
    ...
  ```

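The commented-out Lakebase keys in the run config above line up with the `LakebaseChecksStorageConfig` fields documented in quality_checks_storage.mdx. The following is a hedged illustration of that mapping; the import paths and engine construction are assumptions, not taken from this commit.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine                      # assumed module path
from databricks.labs.dqx.config import LakebaseChecksStorageConfig   # assumed module path

dq_engine = DQEngine(WorkspaceClient())

# Explicit equivalent of the run-config keys shown above:
#   checks_location        -> location
#   lakebase_instance_name -> instance_name
#   lakebase_user          -> user
#   lakebase_port          -> port
checks = dq_engine.load_checks(
    config=LakebaseChecksStorageConfig(
        instance_name="my-lakebase-instance",
        user="00000000-0000-0000-0000-000000000000",
        location="dqx.config.checks",
        port=5432,
    )
)
```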
pyproject.toml

Lines changed: 4 additions & 2 deletions
@@ -29,8 +29,9 @@ classifiers = [
    "Topic :: Utilities",
  ]
  dependencies = ["databricks-labs-blueprint>=0.9.1,<0.10",
-   "databricks-sdk~=0.57",
+   "databricks-sdk~=0.67",
    "databricks-labs-lsql>=0.5,<=0.16",
+   "sqlalchemy>=1.4,<3.0",
  ]

  [project.optional-dependencies]
@@ -101,6 +102,7 @@ dependencies = [
    "dbldatagen~=0.4.0",
    "pyparsing~=3.2.3",
    "jmespath~=1.0.1",
+   "psycopg2-binary~=2.9.10",
  ]

  python="3.12" # must match the version required by databricks-connect and python version on the test clusters
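The new `sqlalchemy` and `psycopg2-binary` dependencies suggest the Lakebase backend talks to the instance as a regular Postgres database. Below is a minimal connectivity sketch under that assumption; the host format and credential handling are illustrative placeholders, not how DQX actually resolves them.

```python
from sqlalchemy import create_engine, text

# Illustrative values only: in practice the instance host and a short-lived
# OAuth token for the given user would be obtained from the workspace.
host = "my-instance.database.cloud.databricks.com"   # assumed host naming
user = "00000000-0000-0000-0000-000000000000"        # service principal client ID
token = "<oauth-token>"                              # placeholder credential
database = "dqx"                                     # first part of "database.schema.table"

# Lakebase is reached through the standard Postgres driver (psycopg2).
engine = create_engine(f"postgresql+psycopg2://{user}:{token}@{host}:5432/{database}")
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # basic connectivity check
```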
@@ -111,7 +113,7 @@ path = ".venv"
  [tool.hatch.envs.default.scripts]
  test = "pytest tests/unit/ -n 10 --cov --cov-report=xml:coverage-unit.xml --timeout 30 --durations 20"
  coverage = "pytest tests/ -n 10 --cov --cov-report=html --timeout 600 --durations 20"
- integration = "pytest tests/integration/ -n 10 --cov --cov-report=xml --timeout 1200 --durations 20"
+ integration = "pytest tests/integration/ -n 5 --cov --cov-report=xml --timeout 1200 --durations 20"
  e2e = "pytest tests/e2e/ -n 10 --cov --cov-report=xml --timeout 600 --durations 20"
  perf = "pytest tests/perf/ -n 10 --cov --cov-report=xml --timeout 600 --durations 20"
  fmt = ["black .",
