Binary file added .DS_Store
Binary file not shown.
9,846 changes: 9,846 additions & 0 deletions bundle_config_schema.json

Large diffs are not rendered by default.

Binary file added product_demos/.DS_Store
Binary file not shown.
@@ -0,0 +1 @@
{}
@@ -0,0 +1 @@
{}
Binary file added product_demos/cdc-pipeline/.DS_Store
Binary file not shown.
697 changes: 610 additions & 87 deletions product_demos/cdc-pipeline/01-CDC-CDF-simple-pipeline.py

Large diffs are not rendered by default.

705 changes: 647 additions & 58 deletions product_demos/cdc-pipeline/02-CDC-CDF-full-multi-tables.py

Large diffs are not rendered by default.

8 changes: 8 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/.gitignore
@@ -0,0 +1,8 @@
.databricks/
build/
dist/
__pycache__/
*.egg-info
.venv/
scratch/**
!scratch/README.md
@@ -0,0 +1,3 @@
# Typings for Pylance in Visual Studio Code
# see https://github.com/microsoft/pyright/blob/main/docs/builtins.md
from databricks.sdk.runtime import *
@@ -0,0 +1,7 @@
{
"recommendations": [
"databricks.databricks",
"ms-python.vscode-pylance",
"redhat.vscode-yaml"
]
}
16 changes: 16 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/.vscode/settings.json
@@ -0,0 +1,16 @@
{
"python.analysis.stubPath": ".vscode",
"jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
"jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
"python.testing.pytestArgs": [
"."
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"python.analysis.extraPaths": ["src"],
"files.exclude": {
"**/*.egg-info": true,
"**/__pycache__": true,
".pytest_cache": true,
},
}
51 changes: 51 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/README.md
@@ -0,0 +1,51 @@
# cdc_dabs

The 'cdc_dabs' project was generated using the default-python template.

## Getting started

0. Install uv: https://docs.astral.sh/uv/getting-started/installation/

1. Install the Databricks CLI from https://docs.databricks.com/dev-tools/cli/databricks-cli.html

2. Authenticate to your Databricks workspace, if you have not done so already:
```
$ databricks configure
```

3. To deploy a development copy of this project, type:
```
$ databricks bundle deploy --target dev
```
(Note that "dev" is the default target, so the `--target` parameter
is optional here.)

This deploys everything that's defined for this project.
For example, the default template would deploy a job called
`[dev yourname] cdc_dabs_job` to your workspace.
You can find that job by opening your workspace and clicking on **Workflows**.

4. Similarly, to deploy a production copy, type:
```
$ databricks bundle deploy --target prod
```

Note that the default job from the template has a schedule that runs every day
(defined in resources/cdc_dabs.job.yml). The schedule
is paused when deploying in development mode (see
https://docs.databricks.com/dev-tools/bundles/deployment-modes.html).

5. To run a job or pipeline, use the "run" command:
```
$ databricks bundle run
```
6. Optionally, install the Databricks extension for Visual Studio Code for local development from
https://docs.databricks.com/dev-tools/vscode-ext.html. It can configure your
virtual environment and set up Databricks Connect for running unit tests locally.
If you are not using these tools, consult your development environment's documentation
and the Databricks Connect documentation for setting up your environment manually
(https://docs.databricks.com/en/dev-tools/databricks-connect/python/index.html).
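For example, assuming Databricks Connect is configured and unit tests live in a
`tests/` folder (an assumption; that folder is not shown in this diff), a minimal
local test for the wheel's `get_taxis` helper could look like this sketch:
```
# tests/main_test.py -- hypothetical sketch; file name and layout are assumptions
from databricks.connect import DatabricksSession

from cdc_dabs.main import get_taxis


def test_get_taxis():
    # Databricks Connect creates a Spark session against the configured workspace
    spark = DatabricksSession.builder.getOrCreate()
    # The sample trips table should return at least a handful of rows
    assert get_taxis(spark).count() > 5
```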

7. For documentation on the Databricks Asset Bundles format used
for this project, and for CI/CD configuration, see
https://docs.databricks.com/dev-tools/bundles/index.html.
35 changes: 35 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/databricks.yml
@@ -0,0 +1,35 @@
# This is a Databricks asset bundle definition for cdc_dabs.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
name: cdc_dabs
uuid: 4258650d-ac2f-4d40-b813-f64d1ea65d7b

artifacts:
python_artifact:
type: whl
build: uv build --wheel

include:
- resources/*.yml
- resources/*/*.yml

targets:
dev:
# The default target uses 'mode: development' to create a development copy.
# - Deployed resources get prefixed with '[dev my_user_name]'
# - Any job schedules and triggers are paused by default.
# See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html.
mode: development
default: true
workspace:
host: https://e2-demo-field-eng.cloud.databricks.com

prod:
mode: production
workspace:
host: https://e2-demo-field-eng.cloud.databricks.com
# We explicitly deploy to /Workspace/Users/[email protected] to make sure we only have a single copy.
root_path: /Workspace/Users/[email protected]/.bundle/${bundle.name}/${bundle.target}
permissions:
- user_name: [email protected]
level: CAN_MANAGE
22 changes: 22 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/fixtures/.gitkeep
@@ -0,0 +1,22 @@
# Fixtures

This folder is reserved for fixtures, such as CSV files.

Below is an example of how to load fixtures as a pandas DataFrame:

```
import pandas as pd
import os

def get_absolute_path(*relative_parts):
if 'dbutils' in globals():
base_dir = os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()) # type: ignore
path = os.path.normpath(os.path.join(base_dir, *relative_parts))
return path if path.startswith("/Workspace") else "/Workspace" + path
else:
return os.path.join(*relative_parts)

csv_file = get_absolute_path("..", "fixtures", "mycsv.csv")
df = pd.read_csv(csv_file)
display(df)
```
41 changes: 41 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/pyproject.toml
@@ -0,0 +1,41 @@
[project]
name = "cdc_dabs"
version = "0.0.1"
authors = [{ name = "[email protected]" }]
requires-python = ">= 3.11"

[project.optional-dependencies]
dev = [
"pytest",

# Code completion support for Lakeflow Declarative Pipelines; also installs databricks-connect
"databricks-dlt",

# databricks-connect can be used to run parts of this project locally.
# See https://docs.databricks.com/dev-tools/databricks-connect.html.
#
# Note: databricks-connect is installed automatically if you're using the Databricks
# extension for Visual Studio Code
# (https://docs.databricks.com/dev-tools/vscode-ext/dev-tasks/databricks-connect.html).
#
# To manually install databricks-connect, uncomment the line below to install a version
# of db-connect that corresponds to the Databricks Runtime version used for this project.
# See https://docs.databricks.com/dev-tools/databricks-connect.html
# "databricks-connect>=15.4,<15.5",
]

[tool.pytest.ini_options]
pythonpath = "src"
testpaths = [
"tests",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/cdc_dabs"]

[project.scripts]
main = "cdc_dabs.main:main"
@@ -0,0 +1,45 @@
# The main job for cdc_dabs.
resources:
jobs:
cdc_dabs_job:
name: cdc_dabs_job

trigger:
# Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
periodic:
interval: 1
unit: DAYS

#email_notifications:
# on_failure:
# - [email protected]

tasks:
- task_key: notebook_task
notebook_task:
notebook_path: ../src/notebook.ipynb

- task_key: refresh_pipeline
depends_on:
- task_key: notebook_task
pipeline_task:
pipeline_id: ${resources.pipelines.cdc_dabs_pipeline.id}

- task_key: main_task
depends_on:
- task_key: refresh_pipeline
environment_key: default
python_wheel_task:
package_name: cdc_dabs
entry_point: main

# A list of task execution environment specifications that can be referenced by tasks of this job.
environments:
- environment_key: default

# Full documentation of this spec can be found at:
# https://docs.databricks.com/api/workspace/jobs/create#environments-spec
spec:
client: "2"
dependencies:
- ../dist/*.whl
@@ -0,0 +1,14 @@
# The main pipeline for cdc_dabs
resources:
pipelines:
cdc_dabs_pipeline:
name: cdc_dabs_pipeline
catalog: dbacademy
schema: cdc_dabs_${bundle.target}
serverless: true
libraries:
- notebook:
path: ../src/pipeline.ipynb

configuration:
bundle.sourcePath: ${workspace.file_path}/src
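
The pipeline notebook ../src/pipeline.ipynb referenced above is not rendered in this diff. As a rough sketch (names below are assumptions, not taken from this change), such a notebook can read the bundle.sourcePath configuration set above to import the bundle's Python sources:

```
# Hypothetical sketch of src/pipeline.ipynb; `spark` is provided by the pipeline
# runtime, and the view/table names here are assumptions.
import sys

import dlt
from pyspark.sql.functions import expr

# Make the bundle's src/ folder importable, using the configuration set above
sys.path.append(spark.conf.get("bundle.sourcePath", "."))

from cdc_dabs import main


@dlt.view
def taxi_raw():
    # Reuse the helper from src/cdc_dabs/main.py
    return main.get_taxis(spark)


@dlt.table
def filtered_taxis():
    # Example downstream table; the filter is illustrative only
    return dlt.read("taxi_raw").filter(expr("fare_amount < 30"))
```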
4 changes: 4 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/scratch/README.md
@@ -0,0 +1,4 @@
# scratch

This folder is reserved for personal, exploratory notebooks.
By default these are not committed to Git, as 'scratch' is listed in .gitignore.
Empty file.
25 changes: 25 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/src/cdc_dabs/main.py
@@ -0,0 +1,25 @@
from pyspark.sql import SparkSession, DataFrame


def get_taxis(spark: SparkSession) -> DataFrame:
return spark.read.table("samples.nyctaxi.trips")


# Create a new Databricks Connect session. If this fails,
# check that you have configured Databricks Connect correctly.
# See https://docs.databricks.com/dev-tools/databricks-connect.html.
def get_spark() -> SparkSession:
try:
from databricks.connect import DatabricksSession

return DatabricksSession.builder.getOrCreate()
except ImportError:
return SparkSession.builder.getOrCreate()


def main():
get_taxis(get_spark()).show(5)


if __name__ == "__main__":
main()
75 changes: 75 additions & 0 deletions product_demos/cdc-pipeline/test2/cdc_dabs/src/notebook.ipynb
@@ -0,0 +1,75 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "ee353e42-ff58-4955-9608-12865bd0950e",
"showTitle": false,
"title": ""
}
},
"source": [
"# Default notebook\n",
"\n",
"This default notebook is executed using Databricks Workflows as defined in resources/cdc_dabs.job.yml."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {
"byteLimit": 2048000,
"rowLimit": 10000
},
"inputWidgets": {},
"nuid": "6bca260b-13d1-448f-8082-30b60a85c9ae",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"from cdc_dabs import main\n",
"\n",
"main.get_taxis(spark).show(10)"
]
}
],
"metadata": {
"application/vnd.databricks.v1+notebook": {
"dashboards": [],
"language": "python",
"notebookMetadata": {
"pythonIndentUnit": 2
},
"notebookName": "notebook",
"widgets": {}
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}