
Commit ca9d173

Add UDTF example with improved documentation
- Add k-means clustering UDTF example for Unity Catalog
- Focus documentation on UDTF pattern and SQL accessibility
- Include Python implementation and SQL usage examples
- Add CI/CD integration instructions
1 parent e7fa87a commit ca9d173

File tree: 16 files changed, +1793 -0 lines changed
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
.databricks/
build/
dist/
__pycache__/
*.egg-info
.venv/
scratch/**
!scratch/README.md
**/explorations/**
!**/explorations/README.md
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Typings for Pylance in Visual Studio Code
# see https://github.com/microsoft/pyright/blob/main/docs/builtins.md
from databricks.sdk.runtime import *
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
{
    "recommendations": [
        "databricks.databricks",
        "redhat.vscode-yaml",
        "ms-python.black-formatter"
    ]
}
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
{
    "jupyter.interactiveWindow.cellMarker.codeRegex": "^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])",
    "jupyter.interactiveWindow.cellMarker.default": "# COMMAND ----------",
    "python.testing.pytestArgs": [
        "."
    ],
    "files.exclude": {
        "**/*.egg-info": true,
        "**/__pycache__": true,
        ".pytest_cache": true,
        "dist": true,
    },
    "files.associations": {
        "**/.gitkeep": "markdown"
    },

    // Pylance settings (VS Code)
    // Set typeCheckingMode to "basic" to enable type checking!
    "python.analysis.typeCheckingMode": "off",
    "python.analysis.extraPaths": ["src", "lib", "resources"],
    "python.analysis.diagnosticMode": "workspace",
    "python.analysis.stubPath": ".vscode",

    // Pyright settings (Cursor)
    // Set typeCheckingMode to "basic" to enable type checking!
    "cursorpyright.analysis.typeCheckingMode": "off",
    "cursorpyright.analysis.extraPaths": ["src", "lib", "resources"],
    "cursorpyright.analysis.diagnosticMode": "workspace",
    "cursorpyright.analysis.stubPath": ".vscode",

    // General Python settings
    "python.defaultInterpreterPath": "./.venv/bin/python",
    "python.testing.unittestEnabled": false,
    "python.testing.pytestEnabled": true,
    "[python]": {
        "editor.defaultFormatter": "ms-python.black-formatter",
        "editor.formatOnSave": true,
    },
}
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# User-Defined Table Functions in Unity Catalog

This project demonstrates how to create and register a Python User-Defined Table Function (UDTF) in Unity Catalog using Databricks Asset Bundles. Once registered, the UDTF becomes available to analysts and other users across your Databricks workspace, callable directly from SQL queries.

**Learn more:** [Introducing Python UDTFs in Unity Catalog](https://www.databricks.com/blog/introducing-python-user-defined-table-functions-udtfs-unity-catalog)

## Concrete Example: Definition and Usage

This project includes a k-means clustering algorithm as a UDTF.

### Python Implementation

The UDTF is defined in [`src/kmeans_udtf.py`](src/kmeans_udtf.py) as follows:
```python
from pyspark.sql import Row


class SklearnKMeans:
    def __init__(self, id_column: str, columns: list, k: int):
        self.id_column = id_column
        self.columns = columns
        self.k = k
        self.data = []

    def eval(self, row: Row):
        # Called once per input row: buffer the row for clustering
        self.data.append(row)

    def terminate(self):
        # Called once after the last row: perform the clustering and yield results
        # ... clustering logic ...
        for record in results:
            yield (record.id, record.cluster)
```
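
Registration itself is handled by `src/main.py`. For orientation, here is a minimal sketch of the Unity Catalog `CREATE FUNCTION` DDL pattern for Python UDTFs, where the `HANDLER` clause names the class above. The table-argument declaration and return schema shown here are assumptions, so consult the project's actual DDL and the Python UDTF documentation:

```sql
-- Illustrative sketch only; the project's real registration DDL lives in src/main.py
CREATE OR REPLACE FUNCTION main.your_schema.k_means(
    input_data TABLE,        -- assumed declaration for the table argument
    id_column STRING,
    columns ARRAY<STRING>,
    k INT
)
RETURNS TABLE (id STRING, cluster INT)  -- assumed return schema
LANGUAGE PYTHON
HANDLER 'SklearnKMeans'
AS $$
# body: the SklearnKMeans class shown above
$$;
```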

### SQL Usage

Once registered, any analyst or SQL user in your workspace can call the UDTF from SQL queries:

```sql
SELECT * FROM main.your_schema.k_means(
    input_data => TABLE(SELECT * FROM my_data),
    id_column => 'id',
    columns => array('feature1', 'feature2', 'feature3'),
    k => 3
)
```

The UDTF integrates seamlessly with:

- SQL queries in notebooks
- Databricks SQL dashboards
- Any tool that connects to your Databricks workspace via SQL

See [`src/sample_notebook.ipynb`](src/sample_notebook.ipynb) for complete examples.
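
The same call also works from Python. A minimal sketch of a notebook cell, with the table and column names as placeholders taken from the SQL example above:

```python
# Run the registered UDTF through Spark SQL and show the cluster assignments
df = spark.sql("""
    SELECT * FROM main.your_schema.k_means(
        input_data => TABLE(SELECT * FROM my_data),
        id_column => 'id',
        columns => array('feature1', 'feature2', 'feature3'),
        k => 3
    )
""")
display(df)
```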

## Getting Started With This Project

### Prerequisites

* Databricks workspace with Unity Catalog enabled
* Databricks CLI installed and configured
* Python with the `uv` package manager

### Setup and Testing

1. Install dependencies:
   ```bash
   uv sync --dev
   ```

2. Run tests (registers and executes the UDTF):
   ```bash
   uv run pytest
   ```
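
The tests run against your workspace through Databricks Connect (see `pyproject.toml`). As a sketch of what such a test can look like, with the catalog, schema, and assertion as assumptions rather than the project's actual test code:

```python
# test_kmeans_udtf.py -- illustrative sketch; the project's real tests may differ
from databricks.connect import DatabricksSession


def test_k_means_assigns_every_row_to_a_cluster():
    spark = DatabricksSession.builder.getOrCreate()
    rows = spark.sql("""
        SELECT * FROM main.your_schema.k_means(
            input_data => TABLE(
                SELECT 1 AS id, 1.0 AS x UNION ALL
                SELECT 2, 1.1 UNION ALL
                SELECT 3, 9.0
            ),
            id_column => 'id',
            columns => array('x'),
            k => 2
        )
    """).collect()
    assert len(rows) == 3  # one output row per input row
```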

### Deployment

Deploy to dev:

```bash
databricks bundle deploy --target dev
databricks bundle run register_udtf_job --target dev
```

Deploy to production:

```bash
databricks bundle deploy --target prod
databricks bundle run register_udtf_job --target prod
```

The UDTF will be registered at `main.your_username.k_means` (dev) or `main.prod.k_means` (prod).
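
To confirm the registration succeeded, you can inspect the function from any SQL editor (catalog and schema names as above):

```sql
-- Shows the signature, language, and body of the registered UDTF
DESCRIBE FUNCTION EXTENDED main.prod.k_means;
```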

## Advanced Topics

**CI/CD Integration:**

- Set up CI/CD for Databricks Asset Bundles following the [CI/CD documentation](https://docs.databricks.com/dev-tools/bundles/ci-cd.html).
- To register the UDTF automatically on deployment, add `databricks bundle run -t prod register_udtf_job` to your deployment script after `databricks bundle deploy -t prod`. Alternatively, the job in `resources/udtf_job.yml` can run on a schedule; a sketch of a deployment workflow follows this list.
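
A minimal sketch of a GitHub Actions workflow wiring the two commands together; the trigger, secret names, and runner are assumptions to adapt to your setup:

```yaml
# .github/workflows/deploy.yml -- hypothetical; adjust for your CI system
name: Deploy UDTF bundle
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}    # assumed secret
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}  # assumed secret
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy -t prod
      - run: databricks bundle run -t prod register_udtf_job
```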

**Serverless compute vs. clusters:** The job uses serverless compute by default. Customize catalog/schema in `databricks.yml` or via job parameters.

## Learn More

- [Introducing Python UDTFs in Unity Catalog](https://www.databricks.com/blog/introducing-python-user-defined-table-functions-udtfs-unity-catalog) - Blog post covering UDTF concepts and use cases
- [Python UDTFs Documentation](https://docs.databricks.com/udf/udtf-unity-catalog.html) - Official documentation
- [Databricks Asset Bundles](https://docs.databricks.com/dev-tools/bundles/index.html) - CI/CD and deployment framework
- [Unity Catalog Functions](https://docs.databricks.com/udf/unity-catalog.html) - Governance and sharing
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# This is a Databricks asset bundle definition for udtf_operator.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: udtf_operator
  uuid: 67b02248-6ca2-4230-8ce5-fc7f5e37c7a5

include:
  - resources/*.yml
  - resources/*/*.yml

# Variable declarations. These variables are assigned in the dev/prod targets below.
variables:
  catalog:
    description: The catalog to use
  schema:
    description: The schema to use

targets:
  dev:
    # The default target uses 'mode: development' to create a development copy.
    # - Deployed resources get prefixed with '[dev my_user_name]'
    # - Any job schedules and triggers are paused by default.
    # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html.
    mode: development
    default: true
    workspace:
      host: https://company.databricks.com
    variables:
      catalog: main
      schema: ${workspace.current_user.short_name}
  prod:
    mode: production
    workspace:
      host: https://company.databricks.com
      # We explicitly deploy to /Workspace/Users/[email protected] to make sure we only have a single copy.
      root_path: /Workspace/Users/[email protected]/.bundle/${bundle.name}/${bundle.target}
    variables:
      catalog: main
      schema: prod
    permissions:
      - user_name: [email protected]
        level: CAN_MANAGE
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
# Test fixtures directory

Add JSON or CSV files here. In tests, use them with `load_fixture()`:

```python
def test_using_fixture(load_fixture):
    data = load_fixture("my_data.json")
    assert len(data) >= 1
```
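
For reference, one way the `load_fixture` pytest fixture could be implemented in `conftest.py`; this is a sketch under assumed conventions, not necessarily the project's actual implementation:

```python
# conftest.py -- hypothetical implementation of the load_fixture fixture
import csv
import json
from pathlib import Path

import pytest

FIXTURES_DIR = Path(__file__).parent / "fixtures"


@pytest.fixture
def load_fixture():
    def _load(name: str):
        path = FIXTURES_DIR / name
        if path.suffix == ".json":
            # JSON fixtures come back as parsed objects
            return json.loads(path.read_text())
        # CSV fixtures come back as a list of dicts, one per row
        with path.open(newline="") as f:
            return list(csv.DictReader(f))

    return _load
```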
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
PassengerId,Age,Pclass,Survived
1,22.0,3,0
2,38.0,1,1
3,26.0,3,1
4,35.0,1,1
5,35.0,3,0
6,54.0,1,0
7,2.0,3,0
8,27.0,1,0
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
[project]
name = "udtf_operator"
version = "0.0.1"
authors = [{ name = "[email protected]" }]
requires-python = ">=3.10,<3.13"
dependencies = [
    "scikit-learn>=1.3.0",
    "numpy>=1.24.0",
]

[dependency-groups]
dev = [
    "pytest",
    "databricks-dlt",
    "databricks-connect>=15.4,<15.5",
    "ipykernel",
]

[project.scripts]
main = "udtf_operator.main:main"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src"]

[tool.black]
line-length = 125
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
resources:
  jobs:
    register_udtf_job:
      name: register_udtf_job
      schedule:
        quartz_cron_expression: '0 0 8 * * ?' # daily at 8am
        timezone_id: "UTC"
      parameters:
        - name: catalog
          default: ${var.catalog}
        - name: schema
          default: ${var.schema}
      tasks:
        - task_key: create_udtf
          spark_python_task:
            python_file: ../src/main.py
            parameters:
              - "--catalog"
              - "{{job.parameters.catalog}}"
              - "--schema"
              - "{{job.parameters.schema}}"
          environment_key: default
      environments:
        - environment_key: default
          spec:
            environment_version: "4"
