dbxmetagen is a utility for generating high-quality metadata for Unity Catalog. It can identify and classify data and metadata in various ways to enrich Unity Catalog and quickly build up data catalogs for governance purposes.
Options:
- Generate descriptions for tables and columns in Databricks, enhancing enterprise search, governance, Databricks Genie performance, and any other tooling that benefits significantly from high-quality metadata.
- Identify and classify columns and tables containing personal information (PI) as PII, PHI, or PCI.
- Predict domains and subdomains for tables.
The tool is highly configurable, supporting bulk operations, SDLC integration, and fine-grained control over privacy and output formats.
1. Clone the repo into a Git Folder in your Databricks workspace:
   Create Git Folder → Clone `https://github.com/databricks-industry-solutions/dbxmetagen`
2. Open the notebook `notebooks/generate_metadata.py`.
3. Run the first few cells to set up widgets, then fill in the widgets:
   - `catalog_name` (required): Your Unity Catalog name
   - `table_names` (required): Comma-separated list (e.g., `catalog.schema.table1, catalog.schema.table2`). You can provide entire schemas as `catalog.schema.*`. Use test tables or a small schema for initial runs.
   - `metadata_type`: Choose `comment`, `pi`, or `domain`.
   - Adjust other widgets as needed (all have sensible defaults).
4. Run the notebook and metadata will be generated. If you change `apply_ddl` to `true`, the DDL will be applied to your tables.
That's it! No YAML files to edit, no deployment scripts to run.
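For example, a first test run might fill in the widgets like this (catalog and table names are illustrative):

```
catalog_name:   main
table_names:    main.sales.orders, main.sales.customers
metadata_type:  comment
```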
If you want to take the next step with this same quickstart approach:
- Update `notebooks/table_names.csv` instead of the table names widget.
- Explore the review functionality in the `sync_reviewed_ddl` notebook.
- Review some of the advanced options in `variables.yml`.
- Manually create a Databricks job and run `notebooks/generate_metadata.py` as a task.
- Move to full deployment using Databricks Asset Bundles. This is only necessary for the app; everything else can be done with a simple git clone into the workspace.
For a web UI with job management, metadata review, and team collaboration:
1. Prerequisites:
   - Databricks CLI installed and configured: `databricks configure --profile <your-profile>`. You may need to use `databricks auth login --profile <your-profile>`.
   - Python 3.10+ and Poetry (for building a wheel when using Databricks asset bundles).
2. Configure environment:
   `cp example.env dev.env`, then edit `dev.env` with your workspace URL and permission groups/users, and update the `databricks.yml` host for the asset bundle deploy.
3. Deploy:
   `./deploy.sh --profile <your-profile> --target <your-dab-target>`
4. Access: Go to Databricks Workspace → Apps → dbxmetagen-app,
   OR run through automation:
   `databricks bundle run metadata_generator_job -t <your-dab-target> -p <your-profile> --params table_names='<catalog.schema>.*',mode=domain`,
   OR use the notebook: Go to Databricks Workspace → generate_metadata notebook.
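For instance, a scheduled run that generates comments for every table in a schema might look like the following (target, profile, and schema names are placeholders; `mode=comment` mirrors the `metadata_type` widget options):

```sh
databricks bundle run metadata_generator_job \
  -t dev -p my-profile \
  --params table_names='main.sales.*',mode=comment
```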
- Comments: AI-generated descriptions for tables and columns
- PI Classification: Identify and tag PII, PHI, and PCI with Unity Catalog tags
- Domain Classification: Automatically categorize tables into business domains and subdomains with Unity Catalog tags
- AI-generated metadata must be human-reviewed for full compliance.
- Generated comments may include data samples or metadata, depending on settings.
- Compliance (e.g., HIPAA) is the user's responsibility.
- Unless configured otherwise, dbxmetagen inspects data and sends it to the specified model endpoint. There are a wide variety of options to control this behavior in detail.
- Configuration-driven: Basic required settings can be managed via widgets and in the app config. Advanced settings are managed via `variables.yml`.
- AI-assisted: Comment generation and PI identification and classification both combine AI-based and deterministic data engineering approaches to produce quality metadata.
- Data Sampling: Controls over sample size, inclusion of data, and metadata.
- Validation: Uses `Pydantic` and structured outputs for schema enforcement.
- Logging: Tracks processed tables and results.
- DDL Generation: Produces `ALTER TABLE` statements for integration (see the example below).
- Manual Overrides: Supports CSV-based overrides for granular control.
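To illustrate, the generated SQL contains Unity Catalog comment DDL along these lines (table, column, and comment text are hypothetical):

```sql
-- Hypothetical output; actual statements depend on your tables and settings
COMMENT ON TABLE main.sales.orders IS 'Order-level sales transactions, one row per order.';
ALTER TABLE main.sales.orders ALTER COLUMN order_id COMMENT 'Unique identifier for each order.';
```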
The two primary entry points for this application are Databricks notebooks.
- `dbxmetagen/src/notebooks/generate_metadata`: the primary entry point for the application, supporting both comment generation and PI identification and classification.
- `dbxmetagen/src/notebooks/sync_reviewed_ddl`: a utility that re-integrates reviewed and edited run logs in TSV or Excel format so the resulting DDL can be applied to tables.
For detailed information on how different team members use dbxmetagen, see docs/PERSONAS_AND_WORKFLOWS.md.
- Simple Workflow: Clone the repo, configure widgets, update `notebooks/table_names.csv`, run the notebook.
- Advanced Workflow: Adjust PI definitions, acronyms, and secondary options; use asset bundles or the web terminal for deployment; leverage manual overrides.
- Clone the repo into Databricks or locally.
- If cloned into Repos in Databricks, you can run the notebook on an all-purpose cluster (tested on 14.3 ML LTS, 15.4 ML LTS, 16.4 ML) without further deployment, simply adjusting variables.yml and the widgets in the notebook.
- If cloned locally, we recommend using a Databricks asset bundle build to create and run a workflow.
- Either create a catalog or use an existing one. If using the app, the app's service principal will need permissions on the catalog.
- The same applies to the schema and volume; defaults are in variables.yml.
- Whether using asset bundles or the notebook run, adjust the host URLs, catalog name, and (if desired) schema name in resources/variables/variables.yml.
- Review the settings in src/dbxmetagen/config.py. If you want to change variables in your project, change them in the notebook widgets.
- Make sure to check the options for add_metadata and apply_ddl and set them correctly. add_metadata runs a DESCRIBE EXTENDED on every column and uses the metadata in table descriptions, though ANALYZE ... COLUMNS will need to have been run first to get useful information from this.
- You can also adjust sample_size, columns_per_call, and ACRO_CONTENT, as well as many other variables in variables.yml.
- Point to a test table to start. By default DDL will not be applied; it will only be generated and written to .sql files in the generated_metadata volume.
- Settings in the notebook widgets override settings in config.py, so make sure the widgets in the main notebook are updated appropriately.
- In notebooks/table_names.csv, keep the first row as table_name and add the list of tables you want metadata generated for. Add them as `schema.table` if they are in the catalog you define in variables.yml, or use three-level names (`catalog.schema.table`). You can also provide `catalog.schema.*` to run against all tables in a schema. An example appears below.
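A minimal `notebooks/table_names.csv` might look like this (table names are illustrative):

```
table_name
my_schema.my_table
main.sales.orders
main.sales.*
```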
Most configuration that users should change lives in variables.yml. There are a variety of useful options; please read the descriptions there rather than relying on this summary.
Key settings in `variables.yml`:

Privacy:
- `allow_data: false`: No data sent to LLMs (maximum privacy)
- `sample_size`: Number of rows sampled (0 = no sampling)
- `allow_data_in_comments`: Control data in generated comments

Model:
- `model`: LLM endpoint (newer versions of Claude Sonnet or GPT-OSS recommended)
- `columns_per_call`: Columns per LLM call (5-10 recommended)
- `temperature`: Model creativity (0.1 for consistency)

Workflow:
- `apply_ddl`: Apply changes directly to Unity Catalog (false = generate only)
- `ddl_output_format`: Output format (excel, tsv, or sql)

PI Detection:
- `include_deterministic_pi`: Use Presidio for rule-based detection
- `disable_medical_information_value`: Treat all medical data as PHI
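Put together, a conservative configuration might look like the following sketch (the keys are those listed above; values are illustrative, and the exact layout of variables.yml may differ):

```yaml
# Illustrative values only; see variables.yml and docs/CONFIGURATION.md
allow_data: false               # maximum privacy: no data sent to LLMs
sample_size: 0                  # no row sampling
allow_data_in_comments: false   # keep sampled values out of generated comments
columns_per_call: 5             # 5-10 recommended
temperature: 0.1                # low, for consistency
apply_ddl: false                # generate DDL only; do not apply it
ddl_output_format: tsv
include_deterministic_pi: true  # add Presidio rule-based detection
```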
For complete configuration reference, see docs/CONFIGURATION.md.
- Tested on DBR 17.4 LTS, 16.4 LTS, and 15.4 LTS, as well as the ML versions.
- Serverless runtimes have been tested extensively, but behavior is less consistent than on standard runtimes.
- Views only work on 16.4. Pre-16.4, alternative DDL is used that only works on tables.
- Excel writes for metadata generator or sync_reviewed_ddl only work on ML runtimes. If you must use a standard runtime, leverage tsv.
- Throttling: Default endpoints may throttle during large or concurrent jobs.
- Sampling: Balance sample size for accuracy and performance.
- Chunking: Fewer columns per call yield richer comments but may increase cost/time.
- Security: Default endpoints are not HIPAA-compliant; configure secure endpoints as needed.
- PI Mode: Use more rows and smaller chunks for better PI detection.
Automatically categorizes tables into business domains and subdomains by analyzing table data and schema metadata. Tags are applied to Unity Catalog for better data organization and discovery.
Configuration: Edit `configurations/domain_config.yaml` to match your organization's business structure. Default domains include Finance, Clinical, CRM, HR, Operations, Marketing, Product, IT, and Legal.

Customization:
- Copy and modify `domain_config.yaml`
- Add or remove domains and subdomains
- Update descriptions and keywords for your industry
- Test on sample tables and refine
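As a rough sketch of the kind of structure to expect (field names here are assumptions; the shipped `domain_config.yaml` defines the real schema):

```yaml
# Hypothetical structure; consult the shipped domain_config.yaml for the real schema
domains:
  - name: Finance
    description: Accounting, billing, and financial reporting data
    keywords: [ledger, invoice, payment, revenue]
    subdomains:
      - name: Accounts Payable
        description: Outgoing payments and vendor invoices
```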
Review domain assignments in the Streamlit app or exported run logs before applying to Unity Catalog.
This project is licensed under the Databricks DB License.
All packages are open source with permissive licenses (Apache 2.0, MIT, BSD 3-Clause) that allow commercial use, modification, and redistribution.
Other libraries used come from the Databricks Runtime.
Thanks to James McCall, Diego Malaver, Aaron Zavora, and Charles Linville for discussions around dbxmetagen.
For detailed configuration options, see docs/CONFIGURATION.md.
For user workflows and team roles, see docs/PERSONAS_AND_WORKFLOWS.md.
