
Commit a6398a3

Update the lakeflow-pipelines template according to the latest Lakeflow conventions (#3558)
This adds a new version of the `default-python` template. To enable the first integration and evaluation of this template, this first pull request replaces the internal `lakeflow-pipelines` template. Replacing `default-python` will be the next milestone.

## Changes
* Update `default-python` to follow the Lakeflow conventions (per-pipeline sources go into `resources/`; shared sources go into `lib/`)
* Adapt this template so it can first be used in place of the `lakeflow-pipelines` template (this required disabling some of the template questions and components)

## Why
Lakeflow comes with new conventions that emphasize modularity. This PR represents an initial milestone towards updating the DABs `default-python` template. `default-sql` will follow in a later milestone.

## Tests
Existing tests were updated.
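As an illustration of these conventions (not code from this commit): a per-pipeline transformation under `resources/<pipeline>/transformations/` imports shared code from `lib/`. In the sketch below, the module path `shared.helpers`, the function `load_trips`, and the dataset name are hypothetical, and it assumes `lib/` is on the Python path, as the template's editor settings suggest.

```python
# Hypothetical file: resources/<pipeline>/transformations/trips.py
import dlt  # Python API for Lakeflow Declarative Pipelines (DLT)

from shared.helpers import load_trips  # hypothetical helper defined under lib/shared/


@dlt.table(comment="Example dataset built from a helper shared via lib/.")
def trips():
    # Shared loading logic lives under lib/ so multiple pipelines can reuse it.
    return load_trips()
```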
1 parent 22779c0 commit a6398a3

72 files changed (+1108, -486 lines)

acceptance/bundle/templates/lakeflow-pipelines/python/output.txt

Lines changed: 7 additions & 3 deletions
@@ -1,12 +1,16 @@
 
 >>> [CLI] bundle init lakeflow-pipelines --config-file ./input.json --output-dir output
-
 Welcome to the template for Lakeflow Declarative Pipelines!
 
+Please answer the below to tailor your project to your preferences.
+You can always change your mind and change your configuration in the databricks.yml file later.
+
+Note that [DATABRICKS_URL] is used for initialization
+(see https://docs.databricks.com/dev-tools/cli/profiles.html for how to change your profile).
 
-Your new project has been created in the 'my_lakeflow_pipelines' directory!
+Your new project has been created in the 'my_lakeflow_pipelines' directory!
 
-Refer to the README.md file for "getting started" instructions!
+Please refer to the README.md file for "getting started" instructions.
 
 >>> [CLI] bundle validate -t dev
 Name: my_lakeflow_pipelines

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 {
   "recommendations": [
     "databricks.databricks",
-    "ms-python.vscode-pylance",
-    "redhat.vscode-yaml"
+    "redhat.vscode-yaml",
+    "ms-python.black-formatter"
   ]
 }

acceptance/bundle/templates/lakeflow-pipelines/python/output/my_lakeflow_pipelines/.vscode/settings.json

Lines changed: 23 additions & 5 deletions
@@ -1,19 +1,37 @@
 {
-  "python.analysis.stubPath": ".vscode",
-  "databricks.python.envFile": "${workspaceFolder}/.env",
   "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
   "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
   "python.testing.pytestArgs": [
     "."
   ],
-  "python.testing.unittestEnabled": false,
-  "python.testing.pytestEnabled": true,
-  "python.analysis.extraPaths": ["resources/my_lakeflow_pipelines_pipeline"],
   "files.exclude": {
     "**/*.egg-info": true,
     "**/__pycache__": true,
     ".pytest_cache": true,
+    "dist": true,
+  },
+  "files.associations": {
+    "**/.gitkeep": "markdown"
   },
+
+  // Pylance settings (VS Code)
+  // Set typeCheckingMode to "basic" to enable type checking!
+  "python.analysis.typeCheckingMode": "off",
+  "python.analysis.extraPaths": ["src", "lib", "resources"],
+  "python.analysis.diagnosticMode": "workspace",
+  "python.analysis.stubPath": ".vscode",
+
+  // Pyright settings (Cursor)
+  // Set typeCheckingMode to "basic" to enable type checking!
+  "cursorpyright.analysis.typeCheckingMode": "off",
+  "cursorpyright.analysis.extraPaths": ["src", "lib", "resources"],
+  "cursorpyright.analysis.diagnosticMode": "workspace",
+  "cursorpyright.analysis.stubPath": ".vscode",
+
+  // General Python settings
+  "python.defaultInterpreterPath": "./.venv/bin/python",
+  "python.testing.unittestEnabled": false,
+  "python.testing.pytestEnabled": true,
   "[python]": {
     "editor.defaultFormatter": "ms-python.black-formatter",
     "editor.formatOnSave": true,

acceptance/bundle/templates/lakeflow-pipelines/python/output/my_lakeflow_pipelines/README.md

Lines changed: 32 additions & 17 deletions
@@ -2,38 +2,53 @@
 
 The 'my_lakeflow_pipelines' project was generated by using the Lakeflow Pipelines template.
 
-## Setup
+* `lib/`: Python source code for this project.
+  * `lib/shared`: Shared source code across all jobs/pipelines/etc.
+* `resources/lakeflow_pipelines_etl`: Pipeline code and assets for the lakeflow_pipelines_etl pipeline.
+* `resources/`: Resource configurations (jobs, pipelines, etc.)
 
-1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html
+## Getting started
 
-2. Authenticate to your Databricks workspace, if you have not done so already:
-    ```
-    $ databricks auth login
-    ```
+Choose how you want to work on this project:
+
+(a) Directly in your Databricks workspace, see
+    https://docs.databricks.com/dev-tools/bundles/workspace.
 
-3. Optionally, install developer tools such as the Databricks extension for Visual Studio Code from
-   https://docs.databricks.com/dev-tools/vscode-ext.html. Or the PyCharm plugin from
-   https://www.databricks.com/blog/announcing-pycharm-integration-databricks.
+(b) Locally with an IDE like Cursor or VS Code, see
+    https://docs.databricks.com/vscode-ext.
 
+(c) With command line tools, see https://docs.databricks.com/dev-tools/cli/databricks-cli.html
 
-## Deploying resources
+# Using this project using the CLI
 
-1. To deploy a development copy of this project, type:
+The Databricks workspace and IDE extensions provide a graphical interface for working
+with this project. It's also possible to interact with it directly using the CLI:
+
+1. Authenticate to your Databricks workspace, if you have not done so already:
+    ```
+    $ databricks configure
+    ```
+
+2. To deploy a development copy of this project, type:
    ```
    $ databricks bundle deploy --target dev
    ```
    (Note that "dev" is the default target, so the `--target` parameter
    is optional here.)
 
-2. Similarly, to deploy a production copy, type:
-   ```
-   $ databricks bundle deploy --target prod
-   ```
+   This deploys everything that's defined for this project.
+   For example, the default template would deploy a pipeline called
+   `[dev yourname] lakeflow_pipelines_etl` to your workspace.
+   You can find that resource by opening your workpace and clicking on **Jobs & Pipelines**.
 
-3. Use the "summary" comand to review everything that was deployed:
+3. Similarly, to deploy a production copy, type:
    ```
-   $ databricks bundle summary
+   $ databricks bundle deploy --target prod
   ```
+   Note the default template has a includes a job that runs the pipeline every day
+   (defined in resources/lakeflow_pipelines_etl/lakeflow_pipelines_job.job.yml). The schedule
+   is paused when deploying in development mode (see
+   https://docs.databricks.com/dev-tools/bundles/deployment-modes.html).
 
 4. To run a job or pipeline, use the "run" command:
    ```

acceptance/bundle/templates/lakeflow-pipelines/python/output/my_lakeflow_pipelines/databricks.yml

Lines changed: 3 additions & 8 deletions
@@ -14,8 +14,6 @@ variables:
     description: The catalog to use
   schema:
     description: The schema to use
-  notifications:
-    description: The email addresses to use for failure notifications
 
 targets:
   dev:
@@ -30,18 +28,15 @@ targets:
     variables:
       catalog: main
       schema: ${workspace.current_user.short_name}
-      notifications: []
-
   prod:
     mode: production
     workspace:
       host: [DATABRICKS_URL]
       # We explicitly deploy to /Workspace/Users/[USERNAME] to make sure we only have a single copy.
       root_path: /Workspace/Users/[USERNAME]/.bundle/${bundle.name}/${bundle.target}
+    variables:
+      catalog: main
+      schema: prod
     permissions:
       - user_name: [USERNAME]
         level: CAN_MANAGE
-    variables:
-      catalog: main
-      schema: default
-      notifications: [[USERNAME]]

acceptance/bundle/templates/lakeflow-pipelines/python/output/my_lakeflow_pipelines/lib/shared/__init__.py

Whitespace-only changes.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+from databricks.sdk.runtime import spark
+from pyspark.sql import DataFrame
+
+
+def find_all_taxis() -> DataFrame:
+    """Find all taxi data."""
+    return spark.read.table("samples.nyctaxi.trips")
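As a usage sketch (not part of this commit): assuming the helper above is importable as `shared.taxis` (with `lib/` on the Python path) and a Spark session is available, pipeline or exploration code could call it like this:

```python
# Hypothetical caller of the shared helper added above; the import path is an assumption.
from shared.taxis import find_all_taxis

taxis = find_all_taxis()   # PySpark DataFrame over samples.nyctaxi.trips
print(taxis.count())       # sanity-check the row count
taxis.limit(5).show()      # preview a few rows
```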

acceptance/bundle/templates/lakeflow-pipelines/python/output/my_lakeflow_pipelines/out.gitignore

Lines changed: 2 additions & 0 deletions
@@ -4,5 +4,7 @@ dist/
 __pycache__/
 *.egg-info
 .venv/
+scratch/**
+!scratch/README.md
 **/explorations/**
 **/!explorations/README.md
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+This folder is reserved for Databricks Asset Bundles resource definitions.

acceptance/bundle/templates/lakeflow-pipelines/python/output/my_lakeflow_pipelines/resources/my_lakeflow_pipelines_pipeline/README.md renamed to acceptance/bundle/templates/lakeflow-pipelines/python/output/my_lakeflow_pipelines/resources/lakeflow_pipelines_etl/README.md

Lines changed: 6 additions & 6 deletions
@@ -1,11 +1,11 @@
-# my_lakeflow_pipelines_pipeline
+# my_lakeflow_pipelines
 
-This folder defines all source code for the my_lakeflow_pipelines_pipeline pipeline:
+This folder defines all source code for the my_lakeflow_pipelines pipeline:
 
-- `explorations`: Ad-hoc notebooks used to explore the data processed by this pipeline.
-- `transformations`: All dataset definitions and transformations.
-- `utilities` (optional): Utility functions and Python modules used in this pipeline.
-- `data_sources` (optional): View definitions describing the source data for this pipeline.
+- `explorations/`: Ad-hoc notebooks used to explore the data processed by this pipeline.
+- `transformations/`: All dataset definitions and transformations.
+- `utilities/` (optional): Utility functions and Python modules used in this pipeline.
+- `data_sources/` (optional): View definitions describing the source data for this pipeline.
 
 ## Getting Started
 