# User-Defined Table Functions in Unity Catalog

This project demonstrates how to create and register a Python User-Defined Table Function (UDTF) in Unity Catalog using Databricks Asset Bundles. Once registered, the UDTF becomes available to analysts and other users across your Databricks workspace, callable directly from SQL queries.

**Learn more:** [Introducing Python UDTFs in Unity Catalog](https://www.databricks.com/blog/introducing-python-user-defined-table-functions-udtfs-unity-catalog)

## Concrete Example: Definition and Usage

This project implements a k-means clustering algorithm as a UDTF.

### Python Implementation

The UDTF is defined in [`src/kmeans_udtf.py`](src/kmeans_udtf.py) as follows:

```python
from pyspark.sql import Row


class SklearnKMeans:
    def __init__(self, id_column: str, columns: list, k: int):
        self.id_column = id_column
        self.columns = columns
        self.k = k
        self.data = []  # rows buffered until terminate()

    def eval(self, row: Row):
        # Called for each input row: buffer it for later
        self.data.append(row)

    def terminate(self):
        # Called after all input rows have been processed:
        # perform the computation and yield results
        # ... clustering logic (elided here) builds `results` ...
        for record in results:
            yield (record.id, record.cluster)
```

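The clustering step is elided in the excerpt above. As a rough illustration of the `eval`/`terminate` lifecycle only, here is a minimal pure-Python sketch. `TinyKMeans`, its dict-based rows, the first-k seeding, and the 20-iteration cap are all assumptions made for this example; the project's class name suggests the real implementation relies on scikit-learn instead.

```python
from math import dist


class TinyKMeans:
    """Hypothetical stand-in that mirrors the eval/terminate shape above."""

    def __init__(self, id_column: str, columns: list, k: int):
        self.id_column = id_column
        self.columns = columns
        self.k = k
        self.data = []

    def eval(self, row: dict):
        # Buffer each input row; nothing is emitted until terminate().
        self.data.append(row)

    def terminate(self):
        points = [tuple(float(r[c]) for c in self.columns) for r in self.data]
        if not points:
            return
        # Assumption: seed the centroids with the first k points (deterministic).
        centroids = points[: self.k]
        labels = []
        for _ in range(20):  # arbitrary iteration cap for this sketch
            # Assignment step: each point goes to its nearest centroid.
            labels = [
                min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
                for p in points
            ]
            # Update step: move each centroid to the mean of its members.
            new_centroids = []
            for j, old in enumerate(centroids):
                members = [p for p, lab in zip(points, labels) if lab == j]
                if members:
                    new_centroids.append(
                        tuple(sum(v) / len(members) for v in zip(*members))
                    )
                else:
                    new_centroids.append(old)  # keep an empty cluster in place
            if new_centroids == centroids:
                break  # converged
            centroids = new_centroids
        # Emit one (id, cluster) tuple per buffered row.
        for row, label in zip(self.data, labels):
            yield (row[self.id_column], label)
```

Feeding rows such as `{"id": 1, "x": 0, "y": 0}` through `eval` and then iterating over `terminate()` yields one `(id, cluster)` tuple per buffered row, which matches the output shape of the excerpt.
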
### SQL Usage

Once registered, any analyst or SQL user in your workspace can call the UDTF from SQL queries:

```sql
SELECT * FROM main.your_schema.k_means(
  input_data => TABLE(SELECT * FROM my_data),
  id_column => 'id',
  columns => array('feature1', 'feature2', 'feature3'),
  k => 3
)
```

The UDTF can be called from:
- SQL queries in notebooks
- Databricks SQL dashboards
- Any tool that connects to your Databricks workspace via SQL

See [`src/sample_notebook.ipynb`](src/sample_notebook.ipynb) for complete examples.

## Getting Started With This Project

### Prerequisites

* A Databricks workspace with Unity Catalog enabled
* The Databricks CLI, installed and configured
* Python with the `uv` package manager

### Setup and Testing

1. Install dependencies:
   ```bash
   uv sync --dev
   ```

2. Run tests (registers and executes the UDTF):
   ```bash
   uv run pytest
   ```

### Deployment

Deploy to dev:
```bash
databricks bundle deploy --target dev
databricks bundle run register_udtf_job --target dev
```

Deploy to production:
```bash
databricks bundle deploy --target prod
databricks bundle run register_udtf_job --target prod
```

The UDTF will be registered at `main.your_username.k_means` (dev) or `main.prod.k_means` (prod).

## Advanced Topics

**CI/CD Integration:**
- Set up CI/CD for Databricks Asset Bundles following the [CI/CD documentation](https://docs.databricks.com/dev-tools/bundles/ci-cd.html)
- To register the UDTF automatically on deployment, add `databricks bundle run -t prod register_udtf_job` to your deployment script after `databricks bundle deploy -t prod`. Alternatively, the job in `resources/udtf_job.yml` can run on a schedule to handle registration.

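The deploy-then-register ordering described above can be sketched as a small Python deployment script. This is a minimal sketch, not part of the project: `bundle_commands` and `deploy_and_register` are hypothetical names, and the script assumes the `databricks` CLI is on the `PATH` and already authenticated. A CI step would call `deploy_and_register("prod")`.

```python
import subprocess


def bundle_commands(target: str) -> list:
    """Return the two CLI invocations for one target, in the order they must run."""
    return [
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", "-t", target, "register_udtf_job"],
    ]


def deploy_and_register(target: str) -> None:
    """Deploy the bundle, then run the registration job (hypothetical helper)."""
    for command in bundle_commands(target):
        # check=True raises on a non-zero exit code, so the registration job
        # never runs after a failed deploy.
        subprocess.run(command, check=True)
```
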
**Serverless compute vs. clusters:** The job uses serverless compute by default. Customize the catalog and schema in `databricks.yml` or via job parameters.

## Learn More

- [Introducing Python UDTFs in Unity Catalog](https://www.databricks.com/blog/introducing-python-user-defined-table-functions-udtfs-unity-catalog) - Blog post covering UDTF concepts and use cases
- [Python UDTFs Documentation](https://docs.databricks.com/udf/udtf-unity-catalog.html) - Official documentation
- [Databricks Asset Bundles](https://docs.databricks.com/dev-tools/bundles/index.html) - CI/CD and deployment framework
- [Unity Catalog Functions](https://docs.databricks.com/udf/unity-catalog.html) - Governance and sharing