diff --git a/.claude/skills/EXAMPLES.md b/.claude/skills/EXAMPLES.md new file mode 100644 index 0000000..ab6533c --- /dev/null +++ b/.claude/skills/EXAMPLES.md @@ -0,0 +1,167 @@ +# Databricks Asset Bundle Examples Reference + +This document provides GitHub URLs for fetching bundle examples when the skills are used outside the bundle-examples repository. + +## Base GitHub Repository +https://github.com/databricks/bundle-examples + +## Core Bundle Examples + +### Minimal Bundle +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_minimal/databricks.yml +- **Use for**: Simplest bundle structure, getting started + +### Standard Python Project +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_python/databricks.yml +- **Job example**: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_python/resources/sample_job.job.yml +- **Pipeline example**: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_python/resources/default_python_etl.pipeline.yml +- **pyproject.toml**: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_python/pyproject.toml +- **Use for**: Python projects with jobs and pipelines + +### Python-Based Resources (pydabs) +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/pydabs/databricks.yml +- **resources/__init__.py**: https://raw.githubusercontent.com/databricks/bundle-examples/main/pydabs/resources/__init__.py +- **Job example**: https://raw.githubusercontent.com/databricks/bundle-examples/main/pydabs/resources/sample_job.py +- **Use for**: Python-defined resources, dynamic generation + +### SQL Project +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_sql/databricks.yml +- **Use for**: SQL-focused analytics projects + +### DLT Python Pipelines +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/lakeflow_pipelines_python/databricks.yml +- **Use for**: Delta Live Tables with Python + +### DLT SQL Pipelines +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/lakeflow_pipelines_sql/databricks.yml +- **Use for**: Delta Live Tables with SQL + +### MLOps Stack +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/mlops_stacks/databricks.yml +- **ML resources**: https://raw.githubusercontent.com/databricks/bundle-examples/main/mlops_stacks/mlops_stacks/resources/ml-artifacts-resource.yml +- **Use for**: Full ML lifecycle workflows + +## Knowledge Base Examples + +### Serverless Job +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/serverless_job/databricks.yml +- **Use for**: Serverless compute patterns + +### Job with Multiple Wheels +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/job_with_multiple_wheels/databricks.yml +- **Use for**: Multiple library dependencies + +### Job with Run Job Tasks +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/job_with_run_job_tasks/databricks.yml +- **Use for**: Job orchestration patterns + +### Job with SQL Notebook +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/job_with_sql_notebook/databricks.yml +- **Use for**: SQL notebook tasks + +### Job Read Secret +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/job_read_secret/databricks.yml +- **Use for**: Secret 
management patterns + +### Pipeline with Schema +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/pipeline_with_schema/databricks.yml +- **Use for**: Pipelines with Unity Catalog integration + +### Development Cluster +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/development_cluster/databricks.yml +- **Use for**: Custom cluster configurations + +### Alerts +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/alerts/databricks.yml +- **Alert example**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/alerts/resources/nyc_taxi_daily_revenue.alert.yml +- **Use for**: SQL alert configurations + +### Databricks App +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/databricks_app/databricks.yml +- **Use for**: App deployment + +### App with Database +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/app_with_database/databricks.yml +- **App resource**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/app_with_database/resources/myapp.app.yml +- **Use for**: Apps with database connections + +### Dashboard +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/dashboard_nyc_taxi/databricks.yml +- **Dashboard resource**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/dashboard_nyc_taxi/resources/nyc_taxi_trip_analysis.dashboard.yml +- **Use for**: AI/BI Dashboard deployment + +### Database with Catalog +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/database_with_catalog/databricks.yml +- **Use for**: Database instances and catalogs + +### Write from Job to Volume +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/write_from_job_to_volume/databricks.yml +- **Schema**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/write_from_job_to_volume/resources/hello_world.schema.yml +- **Volume**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/write_from_job_to_volume/resources/my_volume.volume.yml +- **Use for**: Volume management patterns + +### Share Files Across Bundles +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/share_files_across_bundles/databricks.yml +- **Use for**: Code sharing patterns + +### Target Includes +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/target_includes/databricks.yml +- **Use for**: Environment-specific resource inclusion + +### Private Wheel Packages +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/private_wheel_packages/databricks.yml +- **Use for**: Private package repositories + +### Python Wheel with Poetry +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/python_wheel_poetry/databricks.yml +- **Use for**: Poetry-based wheel building + +### Spark JAR Task +- **URL**: https://raw.githubusercontent.com/databricks/bundle-examples/main/knowledge_base/spark_jar_task/databricks.yml +- **Use for**: JAR artifacts and Scala/Java jobs + +## Usage Pattern in Skills + +Skills follow a three-tier approach for maximum portability: + +### Tier 1: Try Local Files First +``` +Use Glob to search: 
Glob("**/databricks.yml") or Glob("**/*.job.yml") +If found, use Read to examine local examples +``` + +### Tier 2: Fetch from GitHub +``` +If no local files, use WebFetch with URLs from this document +Example: WebFetch(url="https://raw.githubusercontent.com/databricks/bundle-examples/main/default_python/databricks.yml", + prompt="Extract the configuration pattern") +``` + +### Tier 3: Use Inline Templates +``` +All skills contain comprehensive inline YAML/Python templates +Works even without local files or network access +``` + +## Example Usage in Skill + +```markdown +## Instructions + +1. **Try local examples first** + - Use `Glob("**/*.job.yml")` to find local job files + - If found, use Read to examine them + +2. **Fetch from GitHub if needed** + - Use WebFetch with URL: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_python/resources/sample_job.job.yml + - Extract relevant patterns for user's needs + +3. **Provide inline templates** + - Show comprehensive example from skill templates + - Customize based on user requirements +``` + +This three-tier approach ensures skills work everywhere: +- ✅ In the bundle-examples repo (local files) +- ✅ In any other repo (GitHub fetch) +- ✅ Offline (inline templates) diff --git a/.claude/skills/README.md b/.claude/skills/README.md new file mode 100644 index 0000000..eaac470 --- /dev/null +++ b/.claude/skills/README.md @@ -0,0 +1,129 @@ +# Databricks Asset Bundle Claude Skills + +This folder contains specialized Claude skills for working with Databricks Asset Bundles (DABs). These skills provide expert guidance on creating, configuring, validating, and deploying bundles. + +## Skills Overview + +### Core Bundle Skills +- **create-bundle** - Create new bundles from scratch with proper structure +- **validate-bundle** - Validate configurations and troubleshoot errors +- **optimize-bundle** - Design patterns, architecture guidance, and optimization +- **secure-bundle** - Permissions, grants, secrets, and security + +### Resource Configuration Skills +- **configure-job** - Configure jobs with tasks, schedules, and orchestration +- **configure-pipeline** - Set up Delta Live Tables (DLT) pipelines +- **configure-app** - Deploy Databricks Apps (Dash, Streamlit, etc.) 
+- **configure-dashboard** - Configure AI/BI dashboards and snapshots +- **configure-alert** - Set up SQL alerts and monitoring +- **configure-schema** - Manage Unity Catalog schemas and catalogs +- **configure-volume** - Configure Unity Catalog volumes for file storage +- **configure-cluster** - Configure clusters for jobs and pipelines +- **configure-ml-resources** - Set up ML resources (models, experiments) + +### Advanced Configuration Skills +- **use-python-resources** - Python-based resource definitions (pydabs pattern) +- **configure-environments** - Configure dev/staging/prod deployment environments +- **manage-variables** - Variables, interpolation, and resource references +- **manage-dependencies** - Build and manage Python wheels, JARs, and libraries + +## How Skills Work + +### Three-Tier Approach for Maximum Portability + +These skills are designed to work in any repository, not just the bundle-examples repo: + +**Tier 1: Local Examples** (if available) +- Skills first try to find local example files using Glob +- If found, they Read and reference them +- Works when skills are in the bundle-examples repo + +**Tier 2: GitHub Examples** (fetch remotely) +- If no local files, skills use WebFetch to get examples from: + - https://github.com/databricks/bundle-examples +- Ensures skills work in any repository +- Always fetches latest examples + +**Tier 3: Inline Templates** (always available) +- All skills contain comprehensive inline YAML/Python examples +- Works even without local files or network access +- Self-contained and complete + +### Documentation Integration + +Every skill automatically fetches the latest Databricks documentation using WebFetch: +- Official docs: https://docs.databricks.com/aws/en/dev-tools/bundles/ +- Always up-to-date guidance +- Combines official docs with practical examples + +## Usage Examples + +### Creating a New Bundle +``` +"Help me create a new Python ETL bundle" +→ Triggers create-bundle skill +→ Fetches latest docs +→ Shows relevant examples +→ Generates complete databricks.yml +``` + +### Configuring a Job +``` +"Create a job that runs daily at 2 AM" +→ Triggers configure-job skill +→ Fetches job task documentation +→ Shows job configuration patterns +→ Generates complete job YAML +``` + +### Troubleshooting +``` +"Getting error: variable 'catalog' not defined" +→ Triggers validate-bundle skill +→ Diagnoses the issue +→ Shows fix with correct configuration +``` + +### Best Practices +``` +"How should I structure my production deployment?" 
+→ Triggers optimize-bundle skill +→ Fetches deployment mode docs +→ Shows production patterns +→ Provides security checklist +``` + +## Portability Features + +These skills are fully portable and can be copied to any repository: + +✅ **Work in bundle-examples repo** - Uses local example files +✅ **Work in any other repo** - Fetches examples from GitHub +✅ **Work offline** - Uses comprehensive inline templates +✅ **Always current** - Fetches latest Databricks documentation +✅ **Self-contained** - No external dependencies required + +## References + +- **EXAMPLES.md** - Complete list of GitHub URLs for all example files +- **Databricks Docs**: https://docs.databricks.com/aws/en/dev-tools/bundles/ +- **GitHub Repo**: https://github.com/databricks/bundle-examples + +## Skill Development Notes + +Each skill follows Claude Agent best practices: +- YAML frontmatter with `name` and `description` +- Clear procedural instructions for Claude +- Comprehensive inline examples +- Progressive disclosure (local → GitHub → inline) +- Related skills cross-references +- Common issues and solutions + +## Contributing + +When updating skills: +1. Ensure three-tier approach is maintained (local/GitHub/inline) +2. Keep inline examples comprehensive and current +3. Update GitHub URLs if example locations change +4. Test skills work outside bundle-examples repo +5. Update EXAMPLES.md if adding new example references diff --git a/.claude/skills/configure-alert.md b/.claude/skills/configure-alert.md new file mode 100644 index 0000000..0da1e59 --- /dev/null +++ b/.claude/skills/configure-alert.md @@ -0,0 +1,112 @@ +--- +name: configure-alert +description: Expert assistance for Alert resources. Use when users want to create SQL alerts, configure thresholds, set up monitoring, or configure alert notifications. +--- + +# Resource: Alert - SQL Alert Configuration + +## Instructions + +1. **Understand alert needs** + - What metric to monitor? + - Threshold values? + - Notification requirements? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/resources (alerts section) + +3. **Find example** + - knowledge_base/alerts/ + +4. **Provide configuration** + - Alert resource with query + - Evaluation criteria + - Schedule configuration + +## Key Patterns + +### Basic Alert +```yaml +resources: + alerts: + revenue_alert: + name: "Daily Revenue Alert" + query_text: | + SELECT SUM(revenue) as total_revenue + FROM ${var.catalog}.${var.schema}.sales + WHERE date = current_date() + warehouse_id: ${var.warehouse_id} + evaluation: + comparison: GREATER_THAN + threshold_value: 100000 + source: MAX # MAX, MIN, AVG, SUM + schedule: + quartz_cron_schedule: "0 0 8 * * ?" + timezone_id: "UTC" +``` + +### Alert with Retrigger +```yaml +resources: + alerts: + error_alert: + name: "Error Rate Alert" + query_text: | + SELECT COUNT(*) as error_count + FROM ${var.catalog}.${var.schema}.logs + WHERE level = 'ERROR' + AND timestamp > current_timestamp() - INTERVAL 1 HOUR + warehouse_id: ${var.warehouse_id} + evaluation: + comparison: GREATER_THAN + threshold_value: 10 + schedule: + quartz_cron_schedule: "0 */15 * * * ?" 
# Every 15 minutes + timezone_id: "UTC" + retrigger_interval_seconds: 3600 # Don't retrigger within 1 hour +``` + +### Alert with File Query +```yaml +resources: + alerts: + performance_alert: + name: "Performance Alert" + query_file: ./src/queries/performance_check.sql + warehouse_id: ${var.warehouse_id} + evaluation: + comparison: LESS_THAN + threshold_value: 95 # Alert if < 95% + source: MIN + schedule: + quartz_cron_schedule: "0 0 * * * ?" +``` + +## Comparison Operators + +- `GREATER_THAN` +- `GREATER_THAN_OR_EQUAL` +- `LESS_THAN` +- `LESS_THAN_OR_EQUAL` +- `EQUAL` +- `NOT_EQUAL` + +## Source Aggregations + +- `MAX` - Maximum value +- `MIN` - Minimum value +- `AVG` - Average value +- `SUM` - Sum of values + +## Examples + +``` +User: "Create an alert when daily sales drop below threshold" + +Steps: +1. Read knowledge_base/alerts/ example +2. Create alert with query_text +3. Configure LESS_THAN comparison +4. Set up daily schedule +5. Add notification configuration +``` diff --git a/.claude/skills/configure-app.md b/.claude/skills/configure-app.md new file mode 100644 index 0000000..7d48daa --- /dev/null +++ b/.claude/skills/configure-app.md @@ -0,0 +1,82 @@ +--- +name: configure-app +description: Expert assistance for Databricks Apps resources. Use when users want to deploy apps (Dash, Streamlit, Gradio, etc.) or configure app integrations with databases. +--- + +# Resource: App - Databricks Apps Configuration + +## Instructions + +1. **Understand app type** + - What framework? (Dash, Streamlit, Gradio) + - Database connections needed? + - Permissions required? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/resources (apps section) + +3. **Find examples** + - knowledge_base/databricks_app/ + - knowledge_base/app_with_database/ + +4. **Provide configuration** + - App resource YAML + - source_code_path + - Database connections if needed + +## Key Patterns + +### Basic App +```yaml +resources: + apps: + my_app: + name: "My Dashboard App" + source_code_path: ./src/app +``` + +### App with Database +```yaml +resources: + database_instances: + app_db: + name: "app-database" + instance_size: SMALL + + apps: + my_app: + name: "App with Database" + source_code_path: ./src/app + resources: + - name: main_db + database_instance: + database_instance_source: + database_instance_id: ${resources.database_instances.app_db.id} + permission: CAN_CONNECT_AND_CREATE +``` + +### App with Permissions +```yaml +resources: + apps: + team_app: + name: "Team Dashboard" + source_code_path: ./src/app + permissions: + - group_name: analysts + level: CAN_VIEW + - group_name: engineers + level: CAN_MANAGE +``` + +## Examples + +``` +User: "Deploy a Streamlit app with database access" + +Steps: +1. Read knowledge_base/app_with_database/ +2. Configure database_instances resource +3. Configure app with database reference +4. Explain app structure and requirements +``` diff --git a/.claude/skills/configure-cluster.md b/.claude/skills/configure-cluster.md new file mode 100644 index 0000000..1108c3f --- /dev/null +++ b/.claude/skills/configure-cluster.md @@ -0,0 +1,145 @@ +--- +name: configure-cluster +description: Expert assistance for cluster configurations in jobs and pipelines. Use when users need help configuring job_clusters, compute specifications, or choosing between serverless and cluster-based compute. +--- + +# Resource: Cluster - Cluster Configuration Expert + +## Instructions + +1. **Understand compute needs** + - Serverless or custom cluster? 
+ - Size and scaling requirements? + - Spark configuration needs? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/resources (clusters section) + +3. **Find examples** + - knowledge_base/development_cluster/ + - knowledge_base/serverless_job/ + +4. **Provide configuration** + - job_clusters specification + - Or recommend serverless + - Spark configuration if needed + +## Key Patterns + +### Serverless (Recommended) +```yaml +resources: + jobs: + serverless_job: + name: "Serverless Job" + tasks: + - task_key: process + python_wheel_task: + package_name: my_project + entry_point: main + libraries: + - whl: ./dist/*.whl + # No cluster configuration needed - serverless by default +``` + +### Job with Custom Cluster +```yaml +resources: + jobs: + cluster_job: + name: "Job with Cluster" + job_clusters: + - job_cluster_key: main_cluster + new_cluster: + spark_version: "15.4.x-scala2.12" + node_type_id: ${var.cluster_node_type} + num_workers: ${var.cluster_workers} + spark_conf: + "spark.databricks.delta.optimizeWrite.enabled": "true" + "spark.databricks.delta.autoCompact.enabled": "true" + spark_env_vars: + "ENV": "${bundle.target}" + tasks: + - task_key: process + job_cluster_key: main_cluster + spark_python_task: + python_file: ./src/process.py +``` + +### Single-Node Cluster +```yaml +job_clusters: + - job_cluster_key: single_node + new_cluster: + spark_version: "15.4.x-scala2.12" + node_type_id: "i3.xlarge" + num_workers: 0 # Single node + spark_conf: + "spark.databricks.cluster.profile": "singleNode" + "spark.master": "local[*]" + custom_tags: + "ResourceClass": "SingleNode" +``` + +### Autoscaling Cluster +```yaml +job_clusters: + - job_cluster_key: autoscale + new_cluster: + spark_version: "15.4.x-scala2.12" + node_type_id: ${var.cluster_node_type} + autoscale: + min_workers: 2 + max_workers: 10 +``` + +### Cluster with Init Scripts +```yaml +job_clusters: + - job_cluster_key: custom_init + new_cluster: + spark_version: "15.4.x-scala2.12" + node_type_id: ${var.cluster_node_type} + num_workers: ${var.cluster_workers} + init_scripts: + - workspace: + destination: "/init-scripts/setup.sh" +``` + +## Serverless vs Cluster Decision + +**Use Serverless when:** +- Standard Python/SQL workloads +- Want simplest configuration +- Cost optimization important +- Fast startup needed + +**Use Custom Cluster when:** +- Special Spark configuration required +- Init scripts needed +- Specific library versions required +- GPU workloads +- Very large data volumes + +## Common spark_conf Settings + +```yaml +spark_conf: + "spark.databricks.delta.optimizeWrite.enabled": "true" + "spark.databricks.delta.autoCompact.enabled": "true" + "spark.sql.adaptive.enabled": "true" + "spark.databricks.photon.enabled": "true" +``` + +## Examples + +``` +User: "My job needs 4 workers and custom Spark settings" + +Steps: +1. Read knowledge_base/development_cluster/ +2. Configure job_clusters with num_workers: 4 +3. Add spark_conf with required settings +4. Use variables for node_type_id +5. Link task to cluster via job_cluster_key +``` diff --git a/.claude/skills/configure-dashboard.md b/.claude/skills/configure-dashboard.md new file mode 100644 index 0000000..f0a3edf --- /dev/null +++ b/.claude/skills/configure-dashboard.md @@ -0,0 +1,87 @@ +--- +name: configure-dashboard +description: Expert assistance for AI/BI Dashboard resources. Use when users want to deploy dashboards, configure dashboard snapshots, or integrate dashboards with jobs. 
+--- + +# Resource: Dashboard - AI/BI Dashboard Configuration + +## Instructions + +1. **Understand dashboard needs** + - Existing dashboard or new? + - Warehouse requirements + - Snapshot needs? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/resources (dashboards section) + +3. **Find example** + - knowledge_base/dashboard_nyc_taxi/ + +4. **Provide configuration** + - Dashboard resource YAML + - file_path to .lvdash.json + - Warehouse ID reference + +## Key Patterns + +### Basic Dashboard +```yaml +resources: + dashboards: + my_dashboard: + name: "Analytics Dashboard" + file_path: ./dashboards/analytics.lvdash.json + warehouse_id: ${var.warehouse_id} +``` + +### Dashboard with Snapshots in Job +```yaml +resources: + dashboards: + report_dashboard: + name: "Daily Report" + file_path: ./dashboards/report.lvdash.json + warehouse_id: ${var.warehouse_id} + + jobs: + snapshot_job: + name: "Dashboard Snapshot Job" + schedule: + quartz_cron_expression: "0 0 6 * * ?" + tasks: + - task_key: create_snapshot + dashboard_task: + dashboard_id: ${resources.dashboards.report_dashboard.id} + warehouse_id: ${var.warehouse_id} +``` + +### Dashboard with embed_credentials +```yaml +resources: + dashboards: + embedded_dashboard: + name: "Embedded Dashboard" + file_path: ./dashboards/embed.lvdash.json + warehouse_id: ${var.warehouse_id} + embed_credentials: true +``` + +## Commands + +Generate dashboard file: +``` +databricks bundle generate dashboard --resource my_dashboard +``` + +## Examples + +``` +User: "Deploy a dashboard that snapshots daily" + +Steps: +1. Read knowledge_base/dashboard_nyc_taxi/ +2. Create dashboard resource +3. Add job with dashboard_task for snapshots +4. Explain .lvdash.json file generation +``` diff --git a/.claude/skills/configure-environments.md b/.claude/skills/configure-environments.md new file mode 100644 index 0000000..ecd288a --- /dev/null +++ b/.claude/skills/configure-environments.md @@ -0,0 +1,195 @@ +--- +name: configure-environments +description: Expert guidance on deployment modes and target configurations. Use when users need help with dev/prod environments, target setup, or environment-specific configuration. +--- + +# Deployment Modes - Target Configuration Expert + +## Instructions + +1. **Understand environment needs** + - How many environments? (dev, staging, prod) + - Multiple developers? + - Deployment strategy? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/deployment-modes + +3. **Find examples** + - All databricks.yml files show target patterns + - Look for development vs production mode usage + +4. 
**Provide guidance** + - Target configurations + - Mode selection (development vs production) + - Variable overrides per target + - Permission strategies + +## Key Concepts + +### Development Mode +```yaml +targets: + dev: + mode: development + default: true + workspace: + host: https://workspace.databricks.com + variables: + catalog: dev_catalog + schema: ${workspace.current_user.short_name} +``` + +**Behavior:** +- Auto-prefixes resources: `[dev username] Job Name` +- Pauses job schedules automatically +- Multiple developers can deploy simultaneously +- Each user gets isolated resources +- Default permission is deploying user + +### Production Mode +```yaml +targets: + prod: + mode: production + workspace: + host: https://workspace.databricks.com + root_path: /Workspace/Users/user@company.com/.bundle/${bundle.name}/${bundle.target} + variables: + catalog: prod_catalog + schema: prod + permissions: + - user_name: deployer@company.com + level: CAN_MANAGE + - group_name: data_engineers + level: CAN_RUN + run_as: + service_principal_name: sp-prod-bundle +``` + +**Behavior:** +- No resource prefixing (clean names) +- Schedules run as configured +- Requires explicit permissions +- Should use service principal for execution +- Single deployment (not per-user) + +## Multi-Environment Setup + +### Standard Pattern +```yaml +targets: + dev: + mode: development + default: true + workspace: + host: https://dev-workspace.databricks.com + + staging: + mode: production # Treat staging like prod + workspace: + host: https://staging-workspace.databricks.com + permissions: + - user_name: deployer@company.com + level: CAN_MANAGE + + prod: + mode: production + workspace: + host: https://prod-workspace.databricks.com + permissions: + - user_name: deployer@company.com + level: CAN_MANAGE + run_as: + service_principal_name: sp-prod +``` + +### Target-Specific Variables +```yaml +variables: + catalog: + description: "Catalog name" + cluster_size: + description: "Cluster node type" + +targets: + dev: + mode: development + variables: + catalog: dev + cluster_size: i3.xlarge + + prod: + mode: production + variables: + catalog: prod + cluster_size: i3.4xlarge +``` + +### Target-Specific Resources +```yaml +targets: + dev: + mode: development + resources: + include: + - resources/dev_*.yml + + prod: + mode: production + resources: + include: + - resources/prod_*.yml +``` + +## Best Practices + +1. **Always use development mode for dev target** + - Enables safe multi-developer workflows + - Automatic resource isolation + +2. **Production mode requires explicit permissions** + - Never deploy to prod without permissions + - Use service principals for execution + +3. **Use variables for environment differences** + - Catalog names + - Schema names + - Cluster sizes + - Warehouse IDs + +4. **Set default: true on one target** + - Usually dev target + - Allows `databricks bundle deploy` without `-t` flag + +5. **Use service principals in production** + - More secure than user accounts + - Better for automation + +## Deployment Commands + +```bash +# Deploy to default target (dev) +databricks bundle deploy + +# Deploy to specific target +databricks bundle deploy -t staging +databricks bundle deploy -t prod + +# Run job in specific target +databricks bundle run my_job -t prod +``` + +## Examples + +``` +User: "Set up dev, staging, and prod environments" + +Steps: +1. Show multi-environment pattern +2. Configure dev with mode: development +3. Configure staging and prod with mode: production +4. 
Set up target-specific variables +5. Explain deployment workflow +6. Guide on service principal setup for prod +``` diff --git a/.claude/skills/configure-job.md b/.claude/skills/configure-job.md new file mode 100644 index 0000000..15c3f48 --- /dev/null +++ b/.claude/skills/configure-job.md @@ -0,0 +1,226 @@ +--- +name: configure-job +description: Expert assistance for configuring Databricks Job resources in bundles. Use when users want to create/modify jobs, configure tasks, set up schedules, or work with job orchestration. +--- + +# Resource: Job - Job Configuration Expert + +## Instructions + +When helping users with job resources: + +1. **Understand requirements** + - Determine task types needed (notebook, Python wheel, SQL, dbt, pipeline, etc.) + - Identify dependencies between tasks + - Check if schedule/trigger needed + - Determine compute needs (serverless vs cluster) + +2. **Fetch documentation** + - Use WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/job-task-types + - This covers all task types and configurations + +3. **Find relevant examples** + - Use Glob to find job examples: `Glob("**/*.job.yml")` + - Key examples: + - default_python/resources/sample_job.job.yml - Comprehensive + - knowledge_base/serverless_job/ - Serverless pattern + - knowledge_base/job_with_multiple_wheels/ - Multiple libraries + - knowledge_base/job_with_run_job_tasks/ - Job orchestration + - pydabs/resources/sample_job.py - Python-defined + +4. **Read examples** + - Use Read to examine 2-3 relevant job files + - Show patterns matching user's needs + +5. **Provide configuration** + - Generate complete job YAML or Python code + - Include task dependencies if multi-task + - Configure schedule if needed + - Set up notifications + +## Key Patterns + +### Basic Serverless Job +```yaml +resources: + jobs: + my_job: + name: "ETL Job" + tasks: + - task_key: process + python_wheel_task: + package_name: my_project + entry_point: main + libraries: + - whl: ./dist/*.whl +``` + +### Multi-Task with Dependencies +```yaml +resources: + jobs: + pipeline_job: + name: "Data Pipeline" + tasks: + - task_key: extract + notebook_task: + notebook_path: ./src/notebooks/extract.py + + - task_key: transform + depends_on: + - task_key: extract + python_wheel_task: + package_name: my_project + entry_point: transform + libraries: + - whl: ./dist/*.whl + + - task_key: load + depends_on: + - task_key: transform + spark_python_task: + python_file: ./src/load.py +``` + +### Job with Schedule +```yaml +resources: + jobs: + daily_job: + name: "Daily ETL" + schedule: + quartz_cron_expression: "0 0 2 * * ?" 
# 2 AM daily
+        timezone_id: "America/Los_Angeles"
+        pause_status: UNPAUSED
+      email_notifications:
+        on_failure:
+          - team@company.com
+      tasks:
+        - task_key: run
+          python_wheel_task:
+            package_name: my_project
+            entry_point: main
+          libraries:
+            - whl: ./dist/*.whl
+```
+
+### Pipeline Refresh Task
+```yaml
+resources:
+  jobs:
+    orchestrator:
+      name: "Pipeline Orchestrator"
+      tasks:
+        - task_key: refresh_pipeline
+          pipeline_task:
+            pipeline_id: ${resources.pipelines.my_pipeline.id}
+```
+
+### SQL Task
+```yaml
+resources:
+  jobs:
+    sql_job:
+      name: "SQL Job"
+      tasks:
+        - task_key: query
+          sql_task:
+            warehouse_id: ${var.warehouse_id}
+            file:
+              path: ./src/queries/report.sql
+```
+
+### dbt Task
+```yaml
+resources:
+  jobs:
+    dbt_job:
+      name: "dbt Transformation"
+      tasks:
+        - task_key: dbt_run
+          dbt_task:
+            project_directory: dbt_project
+            commands:
+              - "dbt run"
+              - "dbt test"
+            warehouse_id: ${var.warehouse_id}
+```
+
+### Job with Custom Cluster
+```yaml
+resources:
+  jobs:
+    cluster_job:
+      name: "Job with Cluster"
+      job_clusters:
+        - job_cluster_key: main
+          new_cluster:
+            spark_version: "15.4.x-scala2.12"
+            node_type_id: ${var.cluster_node_type}
+            num_workers: ${var.cluster_workers}
+      tasks:
+        - task_key: process
+          job_cluster_key: main
+          spark_python_task:
+            python_file: ./src/process.py
+```
+
+## Task Types Quick Reference
+
+| Task Type | Use When | Example |
+|-----------|----------|---------|
+| `notebook_task` | Running notebooks | Exploratory work, parameterized notebooks |
+| `python_wheel_task` | Running packaged Python | Production Python code |
+| `spark_python_task` | Running Python files | Simple scripts without packaging |
+| `sql_task` | Running SQL queries | SQL analytics |
+| `dbt_task` | Running dbt | dbt transformations |
+| `pipeline_task` | Triggering DLT | Refreshing pipelines |
+| `run_job_task` | Orchestrating jobs | Master orchestrator pattern |
+
+## Common Issues
+
+- **Missing libraries**: Add them to the libraries section
+- **Task dependency cycle**: Check that depends_on doesn't create a loop
+- **Schedule not running**: Verify pause_status: UNPAUSED and production mode
+- **Serverless not available**: Some tasks require clusters (check docs)
+- **Permission errors**: Add permissions in the prod target
+
+## Related Skills
+
+- `configure-pipeline` - For pipeline_task configuration
+- `configure-cluster` - For job_clusters setup
+- `manage-dependencies` - For library management
+- `secure-bundle` - For job permissions
+- `manage-variables` - For parameterization
+
+## Examples
+
+### Example 1: Simple Python Job
+```
+User: "Create a job that runs my Python wheel daily at 2 AM"
+
+Steps:
+1. Read default_python/resources/sample_job.job.yml
+2. Generate job with:
+   - python_wheel_task
+   - schedule with quartz_cron_expression
+   - email_notifications
+3. Explain how to reference the wheel artifact
+```
+
+### Example 2: Multi-Stage Pipeline
+```
+User: "Job with extract, transform, load tasks in sequence"
+
+Steps:
+1. Show multi-task pattern with depends_on
+2. Configure each task with the appropriate type
+3. Set up task dependencies
+4. 
Add error notifications +``` + +## CLI Commands + +- `databricks bundle validate` - Validate job config +- `databricks bundle deploy` - Deploy job +- `databricks bundle run my_job` - Run job manually diff --git a/.claude/skills/configure-ml-resources.md b/.claude/skills/configure-ml-resources.md new file mode 100644 index 0000000..7132484 --- /dev/null +++ b/.claude/skills/configure-ml-resources.md @@ -0,0 +1,76 @@ +--- +name: configure-ml-resources +description: Expert assistance for ML resources (registered models, experiments). Use when users want to configure MLflow models, experiments, or set up ML workflows in bundles. +--- + +# Resource: ML - ML Resources Configuration + +## Instructions + +1. **Understand ML needs** + - Model registration? + - Experiment tracking? + - Unity Catalog integration? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/resources (ML resources) + +3. **Find example** + - mlops_stacks/ + +4. **Provide configuration** + - registered_models resource + - experiments resource + - Unity Catalog integration + +## Key Patterns + +### Registered Model (Unity Catalog) +```yaml +resources: + registered_models: + my_model: + name: ${var.catalog}.${var.schema}.my_model + catalog_name: ${var.catalog} + schema_name: ${var.schema} + comment: "Production ML model" + grants: + - principal: account users + privileges: + - EXECUTE +``` + +### Experiment +```yaml +resources: + experiments: + my_experiment: + name: /Users/${workspace.current_user.userName}/experiments/my_model + description: "Model training experiments" +``` + +### Model with Grants +```yaml +resources: + registered_models: + my_model: + name: ${var.catalog}.${var.schema}.my_model + catalog_name: ${var.catalog} + schema_name: ${var.schema} + grants: + - principal: account users + privileges: + - EXECUTE +``` + +## Examples + +``` +User: "Create resources for ML model deployment" + +Steps: +1. Read mlops_stacks/ example +2. Create registered_model resource +3. Configure Unity Catalog integration +4. Set up grants for model access +``` diff --git a/.claude/skills/configure-pipeline.md b/.claude/skills/configure-pipeline.md new file mode 100644 index 0000000..0ce04a4 --- /dev/null +++ b/.claude/skills/configure-pipeline.md @@ -0,0 +1,118 @@ +--- +name: configure-pipeline +description: Expert assistance for configuring Delta Live Tables (DLT) pipeline resources. Use when users want to create/modify pipelines, set up data transformations, or configure streaming pipelines. +--- + +# Resource: Pipeline - DLT Pipeline Configuration + +## Instructions + +When helping with pipeline resources: + +1. **Understand pipeline needs** + - Python or SQL based? + - Streaming or batch? + - Schema and catalog requirements + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/resources (pipelines section) + +3. **Find examples** + - lakeflow_pipelines_python/ - Python DLT + - lakeflow_pipelines_sql/ - SQL DLT + - knowledge_base/pipeline_with_schema/ - With Unity Catalog + - default_python/resources/default_python_etl.pipeline.yml + +4. 
**Provide configuration**
+   - Pipeline resource YAML
+   - Library paths
+   - Schema/catalog setup
+   - Compute configuration
+
+## Key Patterns
+
+### Basic Python Pipeline
+```yaml
+resources:
+  pipelines:
+    my_pipeline:
+      name: "ETL Pipeline"
+      libraries:
+        - notebook:
+            path: ./src/notebooks/etl.py
+      target: ${var.catalog}.${var.schema}
+      channel: CURRENT
+```
+
+### Serverless Pipeline
+```yaml
+resources:
+  pipelines:
+    serverless_pipeline:
+      name: "Serverless DLT"
+      libraries:
+        - notebook:
+            path: ./src/notebooks/transform.py
+      target: ${var.catalog}.${var.schema}
+      catalog: ${var.catalog}
+      channel: CURRENT  # Serverless by default
+```
+
+### Pipeline with Cluster
+```yaml
+resources:
+  pipelines:
+    cluster_pipeline:
+      name: "Cluster-based Pipeline"
+      libraries:
+        - notebook:
+            path: ./src/notebooks/process.py
+      target: ${var.catalog}.${var.schema}
+      clusters:
+        - label: default
+          node_type_id: ${var.cluster_node_type}
+          num_workers: ${var.cluster_workers}
+```
+
+### SQL Pipeline
+```yaml
+resources:
+  pipelines:
+    sql_pipeline:
+      name: "SQL DLT Pipeline"
+      libraries:
+        - file:
+            path: ./src/dlt/transforms.sql
+      target: ${var.catalog}.${var.schema}
+```
+
+### Continuous Pipeline
+```yaml
+resources:
+  pipelines:
+    streaming_pipeline:
+      name: "Continuous Streaming"
+      libraries:
+        - notebook:
+            path: ./src/streaming/ingest.py
+      target: ${var.catalog}.${var.schema}
+      continuous: true
+```
+
+## Related Skills
+
+- `configure-job` - To trigger pipelines with pipeline_task
+- `configure-schema` - For schema configuration
+- `manage-variables` - For catalog/schema variables
+
+## Examples
+
+```
+User: "Create a DLT pipeline for Python transformations"
+
+Steps:
+1. Read lakeflow_pipelines_python/ example
+2. Generate pipeline with notebook library
+3. Configure target schema
+4. Explain DLT expectations and table definitions
+```
diff --git a/.claude/skills/configure-schema.md b/.claude/skills/configure-schema.md
new file mode 100644
index 0000000..f30d040
--- /dev/null
+++ b/.claude/skills/configure-schema.md
@@ -0,0 +1,98 @@
+---
+name: configure-schema
+description: Expert assistance for Schema and Catalog resources in Unity Catalog. Use when users want to create schemas, configure catalogs, set up database instances, or manage Unity Catalog resources.
+---
+
+# Resource: Schema - Unity Catalog Schema Configuration
+
+## Instructions
+
+1. **Understand catalog/schema needs**
+   - Unity Catalog hierarchy
+   - Grants needed?
+   - Database instance (OLTP) needed?
+
+2. **Fetch documentation**
+   - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/resources (schemas section)
+
+3. **Find examples**
+   - knowledge_base/database_with_catalog/
+   - knowledge_base/write_from_job_to_volume/
+
+4. 
**Provide configuration** + - Schema resource + - Grants if needed + - Database catalog if OLTP + +## Key Patterns + +### Basic Schema +```yaml +resources: + schemas: + my_schema: + name: ${var.catalog}.${var.schema} + comment: "Main schema for data pipeline" +``` + +### Schema with Grants +```yaml +resources: + schemas: + shared_schema: + name: ${var.catalog}.${var.schema} + grants: + - principal: account users + privileges: + - SELECT + - MODIFY +``` + +### Database Instance (OLTP) +```yaml +resources: + database_instances: + app_database: + name: "app-db" + instance_size: SMALL + description: "Application database" + + database_catalogs: + app_catalog: + name: "app_catalog" + database_instance: + database_instance_source: + database_instance_id: ${resources.database_instances.app_database.id} +``` + +## Unity Catalog Hierarchy + +``` +Catalog +└── Schema + ├── Tables + ├── Views + ├── Functions + └── Volumes +``` + +## Grant Privileges + +- `SELECT` - Read data +- `MODIFY` - Write data +- `CREATE` - Create objects +- `EXECUTE` - Run functions +- `USE CATALOG` - Access catalog +- `USE SCHEMA` - Access schema + +## Examples + +``` +User: "Create a schema with read access for all users" + +Steps: +1. Read knowledge_base schema examples +2. Create schema resource +3. Add grants with SELECT privilege +4. Explain catalog.schema naming +``` diff --git a/.claude/skills/configure-volume.md b/.claude/skills/configure-volume.md new file mode 100644 index 0000000..e491683 --- /dev/null +++ b/.claude/skills/configure-volume.md @@ -0,0 +1,97 @@ +--- +name: configure-volume +description: Expert assistance for Volume resources in Unity Catalog. Use when users want to create volumes for file storage, configure volume access, or integrate volumes with jobs and notebooks. +--- + +# Resource: Volume - Unity Catalog Volume Configuration + +## Instructions + +1. **Understand volume needs** + - Managed or external? + - Storage location requirements? + - Access patterns? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/resources (volumes section) + +3. **Find example** + - knowledge_base/write_from_job_to_volume/ + +4. **Provide configuration** + - Volume resource + - Path references + - Grants if needed + +## Key Patterns + +### Managed Volume +```yaml +resources: + volumes: + my_volume: + name: my_volume + catalog_name: ${var.catalog} + schema_name: ${var.schema} + volume_type: MANAGED + comment: "Managed volume for data files" +``` + +### External Volume +```yaml +resources: + volumes: + external_volume: + name: external_data + catalog_name: ${var.catalog} + schema_name: ${var.schema} + volume_type: EXTERNAL + storage_location: "s3://bucket/path/" +``` + +### Volume with Grants +```yaml +resources: + volumes: + shared_volume: + name: shared_files + catalog_name: ${var.catalog} + schema_name: ${var.schema} + volume_type: MANAGED + grants: + - principal: account users + privileges: + - READ_VOLUME + - WRITE_VOLUME +``` + +## Accessing Volumes + +In code, reference volumes: +```python +# Path format: /Volumes/{catalog}/{schema}/{volume}/ +volume_path = f"/Volumes/{catalog}/{schema}/{volume_name}/data.csv" + +# Read file +df = spark.read.csv(volume_path) + +# Write file +df.write.csv(volume_path) +``` + +## Volume Privileges + +- `READ_VOLUME` - Read files +- `WRITE_VOLUME` - Write files + +## Examples + +``` +User: "Create a volume for storing processed data files" + +Steps: +1. Read knowledge_base/write_from_job_to_volume/ +2. Create managed volume resource +3. 
Show the path format for accessing files
+4. Explain grants if sharing needed
+```
diff --git a/.claude/skills/create-bundle.md b/.claude/skills/create-bundle.md
new file mode 100644
index 0000000..1dd84e6
--- /dev/null
+++ b/.claude/skills/create-bundle.md
@@ -0,0 +1,198 @@
+---
+name: create-bundle
+description: Guide users through creating new Databricks Asset Bundles from scratch. Use when the user wants to create a new bundle, initialize a DAB project, or needs help setting up bundle structure and configuration.
+---
+
+# Bundle Create - Databricks Asset Bundle Creation
+
+## Instructions
+
+When helping users create a Databricks Asset Bundle:
+
+1. **Fetch latest documentation**
+   - Use WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/
+   - This ensures up-to-date bundle configuration guidance
+
+2. **Understand project requirements**
+   - Ask what type of project (Python ETL, SQL, DLT pipelines, MLOps, Apps)
+   - Determine if they need YAML-based or Python-based resources
+   - Identify environment needs (dev, staging, prod)
+
+3. **Find relevant examples**
+   - First try `Glob("**/databricks.yml")` to find local examples in the current repo
+   - If local examples exist, use Read to examine them
+   - If no local examples, use WebFetch to get from the official Databricks GitHub:
+     - Simple: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_minimal/databricks.yml
+     - Python: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_python/databricks.yml
+     - Python resources: https://raw.githubusercontent.com/databricks/bundle-examples/main/pydabs/databricks.yml
+     - SQL: https://raw.githubusercontent.com/databricks/bundle-examples/main/default_sql/databricks.yml
+     - DLT Python: https://raw.githubusercontent.com/databricks/bundle-examples/main/lakeflow_pipelines_python/databricks.yml
+     - MLOps: https://raw.githubusercontent.com/databricks/bundle-examples/main/mlops_stacks/databricks.yml
+   - Always include inline templates as fallback
+
+4. **Provide complete configuration**
+   - Generate databricks.yml with:
+     - bundle name and UUID
+     - include paths for resources
+     - variables section (if needed)
+     - dev target with `mode: development`
+     - prod target with `mode: production` and permissions
+   - Explain directory structure to create
+   - Guide on pyproject.toml if Python project
+
+5. **Guide on next steps**
+   - Run `databricks bundle validate` to check configuration
+   - Create first resources (jobs, pipelines, etc.)
+   - Deploy to dev: `databricks bundle deploy -t dev`
+
+## Key Patterns
+
+### Minimal Bundle
+```yaml
+bundle:
+  name: my_bundle
+  uuid:
+
+include:
+  - resources/*.yml
+
+targets:
+  dev:
+    mode: development
+    default: true
+    workspace:
+      host: https://workspace.databricks.com
+```
+
+### Standard Python Project
+```yaml
+bundle:
+  name: my_project
+  uuid:
+
+include:
+  - resources/*.yml
+
+artifacts:
+  python_artifact:
+    type: whl
+    build: uv build --wheel
+
+variables:
+  catalog:
+    description: "Unity Catalog catalog"
+  schema:
+    description: "Schema for tables"
+
+targets:
+  dev:
+    mode: development
+    default: true
+    workspace:
+      host: https://workspace.databricks.com
+    variables:
+      catalog: dev
+      schema: ${workspace.current_user.short_name}
+
+  prod:
+    mode: production
+    workspace:
+      host: https://workspace.databricks.com
+    variables:
+      catalog: prod
+      schema: prod
+    permissions:
+      - user_name: user@company.com
+        level: CAN_MANAGE
+```
+
+### Python-Based Resources (pydabs)
+```yaml
+bundle:
+  name: my_bundle
+
+python:
+  venv_path: .venv
+  resources:
+    - "resources:load_resources"
+```
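+
+In the pydabs pattern, `"resources:load_resources"` points at a Python function that returns the bundle's resources. A minimal sketch of what that function might look like is shown below, modeled on the pydabs example (pydabs/resources/__init__.py); the exact import paths and method names from the `databricks-bundles` package are assumptions and should be verified against that example before use:
+
+```python
+from databricks.bundles.core import Bundle, Resources
+from databricks.bundles.jobs import Job
+
+
+def load_resources(bundle: Bundle) -> Resources:
+    """Define bundle resources in Python instead of YAML."""
+    resources = Resources()
+
+    # A job described as a dict that mirrors the YAML job schema
+    sample_job = Job.from_dict(
+        {
+            "name": "sample_job",
+            "tasks": [
+                {
+                    "task_key": "main",
+                    "notebook_task": {"notebook_path": "src/notebook.ipynb"},
+                }
+            ],
+        }
+    )
+
+    resources.add_resource("sample_job", sample_job)
+    return resources
+```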
+
+The targets section works the same as in YAML-based bundles:
+
+```yaml
+targets:
+  dev:
+    mode: development
+    default: true
+```
+
+### Directory Structure
+```
+my_project/
+├── databricks.yml        # Main configuration
+├── resources/            # Resource definitions
+│   ├── *.job.yml
+│   └── *.pipeline.yml
+├── src/                  # Source code
+│   ├── my_project/       # Python package
+│   └── notebooks/        # Notebooks
+├── tests/                # Tests
+├── pyproject.toml        # Python dependencies
+└── README.md
+```
+
+## Decision Guide
+
+**Project Type Selection:**
+| Type | Use Example | Key Feature |
+|------|-------------|-------------|
+| Simple | default_minimal | Minimal config |
+| Python ETL | default_python | Jobs + pipelines + wheel |
+| SQL | default_sql | SQL queries & dashboards |
+| DLT Python | lakeflow_pipelines_python | Delta Live Tables |
+| DLT SQL | lakeflow_pipelines_sql | SQL pipelines |
+| MLOps | mlops_stacks | Full ML lifecycle |
+| Python Resources | pydabs | Code-defined resources |
+
+**YAML vs Python Resources:**
+- YAML: Standard approach, declarative, use for most projects
+- Python (pydabs): When you need dynamic resource generation or complex logic
+
+## Related Skills
+
+Suggest these skills after bundle creation:
+- `configure-job` - To add job resources
+- `configure-pipeline` - To add pipeline resources
+- `configure-environments` - For target configuration
+- `manage-variables` - For variable setup
+- `manage-dependencies` - For Python wheel configuration
+- `validate-bundle` - To validate configuration
+
+## Examples
+
+### Example 1: Creating a Simple Python ETL Bundle
+```
+User: "I want to create a new bundle for Python ETL jobs"
+
+Steps:
+1. Fetch https://docs.databricks.com/aws/en/dev-tools/bundles/
+2. Read default_python/databricks.yml
+3. Generate configuration based on pattern
+4. Explain directory structure
+5. Guide on creating first job resource
+```
+
+### Example 2: Creating a DLT Pipeline Bundle
+```
+User: "Help me set up a bundle for Delta Live Tables pipelines"
+
+Steps:
+1. Fetch bundle documentation
+2. Read lakeflow_pipelines_python/databricks.yml
+3. Show configuration for DLT
+4. Explain pipeline resource structure
+5. 
Guide on catalog/schema variables +``` + +## CLI Commands + +- `databricks bundle init` - Initialize from template +- `databricks bundle validate` - Validate configuration +- `databricks bundle deploy` - Deploy bundle +- `databricks bundle deploy -t prod` - Deploy to production diff --git a/.claude/skills/manage-dependencies.md b/.claude/skills/manage-dependencies.md new file mode 100644 index 0000000..5a13471 --- /dev/null +++ b/.claude/skills/manage-dependencies.md @@ -0,0 +1,298 @@ +--- +name: manage-dependencies +description: Expert assistance with artifacts, wheels, libraries, and dependencies. Use when users need help building Python packages, configuring library dependencies, or managing private packages. +--- + +# Artifacts & Dependencies - Package Management Expert + +## Instructions + +1. **Understand packaging needs** + - Python wheels? + - JAR files? + - Multiple libraries? + - Private repositories? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/library-dependencies + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/artifact-private + +3. **Find examples** + - default_python/ - Standard wheel build + - knowledge_base/job_with_multiple_wheels/ - Multiple packages + - knowledge_base/private_wheel_packages/ - Private repos + - knowledge_base/python_wheel_poetry/ - Poetry setup + - knowledge_base/spark_jar_task/ - JAR artifacts + +4. **Provide configuration** + - artifacts section + - pyproject.toml setup + - Library references in tasks + +## Key Patterns + +### Standard Python Wheel +```yaml +bundle: + name: my_project + +artifacts: + python_artifact: + type: whl + build: uv build --wheel + +include: + - resources/*.yml +``` + +### pyproject.toml Setup +```toml +[project] +name = "my_project" +version = "0.1.0" +dependencies = [ + "databricks-sdk>=0.1.0,<1.0.0", + "pandas>=2.0.0,<3.0.0", +] + +[dependency-groups] +dev = [ + "pytest>=7.0", + "databricks-connect>=15.4,<15.5", +] + +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" +``` + +### Job Using Wheel +```yaml +resources: + jobs: + my_job: + name: "ETL Job" + tasks: + - task_key: process + python_wheel_task: + package_name: my_project + entry_point: main + libraries: + - whl: ./dist/*.whl +``` + +### Multiple Wheels +```yaml +resources: + jobs: + multi_lib_job: + name: "Job with Multiple Libraries" + tasks: + - task_key: process + python_wheel_task: + package_name: my_project + entry_point: main + libraries: + - whl: ./dist/my_project-*.whl + - whl: ./dist/shared_lib-*.whl + - pypi: + package: "requests>=2.28.0" +``` + +### Private Wheel Packages +```yaml +artifacts: + my_private_wheel: + type: whl + build: uv build --wheel + path: ./private_package + +# In task +libraries: + - whl: ./dist/private_package-*.whl + - whl: https://private-repo.com/packages/library.whl +``` + +### Poetry Build +```yaml +artifacts: + poetry_artifact: + type: whl + build: poetry build + +# poetry.toml +[tool.poetry] +name = "my-project" +version = "0.1.0" + +[tool.poetry.dependencies] +python = "^3.10" +databricks-sdk = "^0.1.0" +``` + +### Spark JAR Artifact +```yaml +artifacts: + spark_jar: + type: jar + build: mvn clean package + +resources: + jobs: + scala_job: + name: "Scala Spark Job" + tasks: + - task_key: process + spark_jar_task: + main_class_name: com.company.Main + libraries: + - jar: ./target/my-app.jar +``` + +### Environment Dependencies +```yaml +resources: + jobs: + env_job: + name: "Job with Environment" + environments: + - environment_key: 
default + spec: + client: "1" + dependencies: + - "pandas==2.0.0" + - "numpy==1.24.0" + tasks: + - task_key: analyze + environment_key: default + python_wheel_task: + package_name: my_project + entry_point: analyze + libraries: + - whl: ./dist/*.whl +``` + +## Build Tools + +### uv (Recommended) +```yaml +artifacts: + python_artifact: + type: whl + build: uv build --wheel +``` + +**Benefits:** +- Fast (10-100x faster than pip) +- Reliable dependency resolution +- Modern Python packaging +- Built-in virtual environment management + +### Poetry +```yaml +artifacts: + python_artifact: + type: whl + build: poetry build +``` + +**Benefits:** +- Mature dependency management +- Lock files for reproducibility +- Good for existing Poetry projects + +## Dependency Best Practices + +1. **Pin major versions** + ```toml + dependencies = [ + "databricks-sdk>=0.1.0,<1.0.0", # Good + "pandas>=2.0.0,<3.0.0", # Good + # Not: "pandas" # Bad - unpinned + ] + ``` + +2. **Separate dev dependencies** + ```toml + [dependency-groups] + dev = [ + "pytest", + "black", + "mypy", + ] + ``` + +3. **Test builds locally** + ```bash + uv build --wheel + ls dist/ # Verify wheel created + ``` + +4. **Use uv for speed** + - Faster builds + - Better dependency resolution + - Simpler configuration + +5. **Keep dependencies minimal** + - Only include what you need + - Smaller packages = faster deployment + +## Library Types + +### Wheel (Python) +```yaml +libraries: + - whl: ./dist/*.whl + - whl: /Volumes/catalog/schema/volume/package.whl +``` + +### PyPI Package +```yaml +libraries: + - pypi: + package: "pandas>=2.0.0" +``` + +### JAR (Java/Scala) +```yaml +libraries: + - jar: ./target/my-app.jar +``` + +### Maven +```yaml +libraries: + - maven: + coordinates: "com.company:artifact:1.0.0" +``` + +## Common Issues + +- **Build fails**: Check pyproject.toml syntax, run build locally +- **Package not found**: Verify artifact paths, check dist/ directory +- **Dependency conflicts**: Pin versions explicitly, test locally +- **Import errors**: Ensure package_name matches setup, check entry_point +- **Private repo access**: Configure credentials, verify network access + +## Examples + +``` +User: "Set up Python project with wheel building" + +Steps: +1. Read default_python/ example +2. Create pyproject.toml with dependencies +3. Add artifacts section with uv build +4. Show job configuration with python_wheel_task +5. Explain build → dist/*.whl → deploy flow +``` + +``` +User: "Job needs multiple Python packages" + +Steps: +1. Read knowledge_base/job_with_multiple_wheels/ +2. Show libraries section with multiple whl entries +3. Explain how to build multiple packages +4. Configure artifacts for each package +``` diff --git a/.claude/skills/manage-variables.md b/.claude/skills/manage-variables.md new file mode 100644 index 0000000..9ae5f49 --- /dev/null +++ b/.claude/skills/manage-variables.md @@ -0,0 +1,225 @@ +--- +name: manage-variables +description: Expert assistance with variables, interpolation, and resource references in bundles. Use when users need help with variable syntax, referencing resources, or understanding interpolation patterns. +--- + +# Variables & References - Variable and Interpolation Expert + +## Instructions + +1. **Understand variable needs** + - What values vary per environment? + - What resources need to reference each other? + - What workspace context needed? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/variables + +3. 
**Find examples** + - Grep for variable patterns across examples + - Look for ${var.}, ${resources.}, ${workspace.} usage + +4. **Provide guidance** + - Variable declarations + - Target-specific values + - Interpolation syntax + - Resource references + +## Variable Declaration + +```yaml +variables: + catalog: + description: "Unity Catalog catalog name" + default: "default_catalog" # Optional default + + schema: + description: "Schema for tables and views" + # No default - must be provided in targets + + warehouse_id: + description: "SQL Warehouse ID" + +targets: + dev: + variables: + catalog: dev_catalog + schema: ${workspace.current_user.short_name} + warehouse_id: "abc123" + + prod: + variables: + catalog: prod_catalog + schema: prod + warehouse_id: "xyz789" +``` + +## Interpolation Syntax + +### User-Defined Variables +```yaml +${var.catalog} +${var.schema} +${var.warehouse_id} +``` + +### Bundle Metadata +```yaml +${bundle.name} # Bundle name +${bundle.target} # Current target (dev, prod, etc.) +${bundle.uuid} # Bundle UUID +``` + +### Workspace Context +```yaml +${workspace.current_user.short_name} # username (before @) +${workspace.current_user.userName} # full email +${workspace.file_path} # Bundle file path in workspace +``` + +### Resource References +```yaml +${resources.jobs.my_job.id} # Job ID +${resources.pipelines.my_pipeline.id} # Pipeline ID +${resources.schemas.my_schema.id} # Schema ID +${resources.volumes.my_volume.id} # Volume ID +${resources.models.my_model.id} # Model ID +${resources.database_instances.my_db.id} # Database instance ID +``` + +## Common Patterns + +### Schema with User Isolation (Dev) +```yaml +variables: + catalog: + description: "Catalog name" + schema: + description: "Schema name" + +targets: + dev: + mode: development + variables: + catalog: dev + schema: ${workspace.current_user.short_name} # Each user gets own schema +``` + +### Job Referencing Pipeline +```yaml +resources: + pipelines: + my_pipeline: + name: "ETL Pipeline" + # ... pipeline config + + jobs: + pipeline_job: + name: "Run Pipeline" + tasks: + - task_key: refresh + pipeline_task: + pipeline_id: ${resources.pipelines.my_pipeline.id} +``` + +### Cross-Resource Dependencies +```yaml +resources: + schemas: + my_schema: + name: ${var.catalog}.${var.schema} + + volumes: + my_volume: + name: data_volume + catalog_name: ${var.catalog} + schema_name: ${var.schema} + depends_on: + - ${resources.schemas.my_schema.id} +``` + +### Path Construction +```yaml +resources: + jobs: + my_job: + name: "${bundle.target}_etl_job" # dev_etl_job, prod_etl_job + tasks: + - task_key: process + spark_python_task: + python_file: ${workspace.file_path}/src/process.py + parameters: + - "--catalog" + - "${var.catalog}" + - "--schema" + - "${var.schema}" +``` + +## Variable Best Practices + +1. **Use variables for environment-specific values** + - Catalog/schema names + - Cluster sizes + - Warehouse IDs + - Any value that changes per target + +2. **Add descriptions to all variables** + - Helps team understand purpose + - Self-documenting configuration + +3. **Use defaults sparingly** + - Better to explicitly set in targets + - Makes environment config visible + +4. **Use workspace.current_user for dev isolation** + - `${workspace.current_user.short_name}` for schemas + - Enables multi-developer workflows + +5. 
**Use resource references for dependencies** + - Ensures correct deployment order + - Makes dependencies explicit + - Prevents hardcoding IDs + +## Common Mistakes + +```yaml +# WRONG - Missing ${} +catalog: var.catalog + +# WRONG - Wrong syntax +catalog: {var.catalog} + +# WRONG - Typo in property +schema: ${workspace.user.short_name} + +# WRONG - Wrong resource type (plural) +pipeline_id: ${resources.pipeline.my_pipeline.id} # Should be "pipelines" + +# CORRECT +catalog: ${var.catalog} +schema: ${workspace.current_user.short_name} +pipeline_id: ${resources.pipelines.my_pipeline.id} +``` + +## Examples + +``` +User: "How do I make catalog name different per environment?" + +Steps: +1. Show variable declaration +2. Configure catalog variable +3. Set different values in dev vs prod targets +4. Show usage: ${var.catalog} +5. Explain interpolation at deployment time +``` + +``` +User: "Job needs to reference a pipeline" + +Steps: +1. Show resource reference syntax +2. Use ${resources.pipelines.name.id} +3. Explain dependency handling +4. Show complete example with both resources +``` diff --git a/.claude/skills/optimize-bundle.md b/.claude/skills/optimize-bundle.md new file mode 100644 index 0000000..d08c4e9 --- /dev/null +++ b/.claude/skills/optimize-bundle.md @@ -0,0 +1,342 @@ +--- +name: optimize-bundle +description: Provide expert guidance on bundle design patterns, architectural decisions, and best practices for structuring Databricks Asset Bundles. Use when users ask about best ways to structure bundles, deployment strategies, or optimization. +--- + +# Bundle Best Practices - Design & Architecture Guidance + +## Instructions + +When providing bundle best practices guidance: + +1. **Understand the context** + - Learn about project complexity, team size, environments + - Understand current pain points or goals + +2. **Fetch best practices documentation** + - Use WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/faqs + - Use WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/deployment-modes + - Use WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/sharing + +3. **Find relevant patterns** + - Use Glob to find examples demonstrating the pattern + - knowledge_base/ has focused examples of specific patterns + - mlops_stacks/ and contrib/databricks_ingestion_monitoring/ show production patterns + +4. **Provide actionable recommendations** + - Show specific code examples from repository + - Explain trade-offs of different approaches + - Give incremental improvement steps + +## Key Best Practices + +### 1. Deployment Mode Strategy + +**Development mode** (for dev/testing): +```yaml +targets: + dev: + mode: development # Auto-prefixes resources with [dev username] + default: true + variables: + schema: ${workspace.current_user.short_name} # User-specific +``` + +**Benefits:** Multiple developers can deploy simultaneously, schedules paused, isolated resources + +**Production mode** (for staging/prod): +```yaml +targets: + prod: + mode: production # Clean resource names, schedules active + permissions: + - user_name: deployer@company.com + level: CAN_MANAGE + run_as: + service_principal_name: sp-prod-bundle +``` + +**Benefits:** No prefixing, explicit permissions, service principal execution + +### 2. 
Variable Management + +Use variables for environment-specific values: +```yaml +variables: + catalog: + description: "Unity Catalog catalog" + cluster_size: + description: "Cluster node type" + +targets: + dev: + variables: + catalog: dev_catalog + cluster_size: i3.xlarge + prod: + variables: + catalog: prod_catalog + cluster_size: i3.4xlarge +``` + +### 3. Resource Organization + +**Small projects (< 5 resources):** +``` +project/ +├── databricks.yml +├── resources/ +│ ├── job.job.yml +│ └── pipeline.pipeline.yml +``` + +**Medium projects:** +``` +project/ +├── databricks.yml +├── resources/ +│ ├── jobs/ +│ ├── pipelines/ +│ └── schemas/ +``` + +**Large projects:** +``` +project/ +├── databricks.yml +├── resources/ +│ ├── infrastructure/ +│ ├── ingestion/ +│ ├── transformation/ +│ └── serving/ +``` + +### 4. Naming Conventions + +- **Bundles:** lowercase_with_underscores +- **Files:** resource_name.type.yml (e.g., etl_job.job.yml) +- **Variables:** Clear purpose (catalog, schema, warehouse_id) +- Let deployment mode handle prefixing + +### 5. Permission Strategy + +**Development:** Automatic via mode: development + +**Production:** +```yaml +targets: + prod: + mode: production + permissions: + - group_name: data_engineers # Use groups + level: CAN_RUN + - user_name: deployer@company.com + level: CAN_MANAGE + run_as: + service_principal_name: sp-prod # Service principal execution +``` + +### 6. Sharing Code Across Bundles + +**Reference:** knowledge_base/share_files_across_bundles/ + +```yaml +# In bundle databricks.yml +include: + - ../shared/config/common_variables.yml + +sync: + paths: + - ../shared/lib # Share Python libraries +``` + +### 7. Serverless-First Strategy + +Prefer serverless for new workloads: +```yaml +resources: + jobs: + my_job: + name: "Serverless Job" + tasks: + - task_key: process + python_wheel_task: + package_name: my_project + entry_point: main + libraries: + - whl: ./dist/*.whl +``` + +**Benefits:** Lower cost, faster startup, no cluster management, auto-scaling + +**Reference:** knowledge_base/serverless_job/ + +### 8. Secret Management + +Never hardcode secrets: +```yaml +# WRONG +variables: + api_key: "sk_live_123" # Never! + +# RIGHT - Use Databricks secrets +tasks: + - task_key: process + spark_conf: + spark.api_key: "{{secrets/my_scope/api_key}}" +``` + +**Reference:** knowledge_base/job_read_secret/ + +### 9. Artifact Management + +```yaml +artifacts: + python_artifact: + type: whl + build: uv build --wheel # Fast, modern package manager + +# pyproject.toml +[project] +dependencies = [ + "databricks-sdk>=0.1.0,<1.0.0", # Pin major versions +] +``` + +### 10. 
Target-Specific Resources + +**Reference:** knowledge_base/target_includes/ + +```yaml +targets: + staging: + resources: + include: + - resources/staging_*.yml # Staging-only resources + prod: + resources: + include: + - resources/prod_*.yml # Production-only resources +``` + +## Common Anti-Patterns to Avoid + +### ❌ Hardcoded Values +```yaml +# BAD +new_cluster: + node_type_id: "i3.xlarge" # Hardcoded + +# GOOD +new_cluster: + node_type_id: ${var.cluster_node_type} +``` + +### ❌ No UUID +```yaml +# BAD +bundle: + name: my_bundle + +# GOOD +bundle: + name: my_bundle + uuid: "550e8400-e29b-41d4-a716-446655440000" +``` + +### ❌ Production Without Permissions +```yaml +# BAD +targets: + prod: + mode: production + +# GOOD +targets: + prod: + mode: production + permissions: + - user_name: deployer@company.com + level: CAN_MANAGE +``` + +### ❌ Monolithic Files +Break large configurations into focused, modular files + +## Architectural Patterns + +### Multi-Bundle Architecture +**When:** Large organization, multiple teams, different deployment schedules +``` +org/ +├── infrastructure-bundle/ +├── ingestion-bundle/ +├── transformation-bundle/ +└── serving-bundle/ +``` + +### Monorepo Bundle +**When:** Single team, tightly coupled workflows +``` +data-platform/ +├── databricks.yml +├── resources/ +│ ├── infrastructure/ +│ ├── ingestion/ +│ └── transformation/ +``` + +### Environment Promotion +Deploy through: `dev` → `staging` → `prod` + +Use same bundle, different targets with appropriate modes and permissions + +## Quality Checklist + +Guide users to verify: +- [ ] UUID present and unique +- [ ] Variables for environment-specific values +- [ ] Development mode for dev +- [ ] Production mode with permissions for prod +- [ ] Service principal for prod execution +- [ ] Secrets via Databricks secrets +- [ ] Resources organized logically +- [ ] Naming conventions followed +- [ ] Serverless considered +- [ ] Dependencies minimal and pinned + +## Related Skills + +- `bundle-create` - Initial setup with best practices +- `bundle-validate` - Ensure compliance +- `deployment-modes` - Deep dive on dev vs prod +- `variables-references` - Variable management +- `permissions-security` - Security patterns +- All resource skills - Resource-specific practices + +## Examples + +### Example 1: Improving Bundle Structure +``` +User: "My bundle is getting messy with 20+ resources" + +Guidance: +1. Show knowledge_base examples with good organization +2. Recommend grouping by function (infrastructure/, ingestion/, etc.) +3. Update include paths to match new structure +4. Use variables to reduce duplication +``` + +### Example 2: Preparing for Production +``` +User: "Ready to deploy to production, what should I check?" + +Checklist: +1. mode: production in prod target +2. Explicit permissions configured +3. Service principal for run_as +4. Secrets properly managed +5. Variables set for prod environment +6. Schedules configured appropriately +7. Error notifications set up +``` diff --git a/.claude/skills/secure-bundle.md b/.claude/skills/secure-bundle.md new file mode 100644 index 0000000..0f03c7f --- /dev/null +++ b/.claude/skills/secure-bundle.md @@ -0,0 +1,287 @@ +--- +name: secure-bundle +description: Expert guidance on permissions, grants, and security configurations in bundles. Use when users need help with access control, service principals, secrets, or Unity Catalog grants. +--- + +# Permissions & Security - Security Configuration Expert + +## Instructions + +1. 
**Understand security needs** + - Who needs access? (users, groups, service principals) + - What level of access? (view, run, manage) + - Unity Catalog grants needed? + - Secrets to manage? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/permissions + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/run-as + +3. **Find examples** + - knowledge_base/job_read_secret/ - Secret management + - Look for permissions and grants in examples + +4. **Provide configuration** + - Permissions for bundle resources + - Grants for Unity Catalog resources + - Service principal setup + - Secret access patterns + +## Resource Permissions + +### Bundle-Level Permissions +```yaml +targets: + prod: + mode: production + permissions: + - user_name: deployer@company.com + level: CAN_MANAGE + - group_name: data_engineers + level: CAN_RUN + - group_name: analysts + level: CAN_VIEW + - service_principal_name: sp-automation + level: CAN_MANAGE +``` + +### Resource-Level Permissions +```yaml +resources: + jobs: + sensitive_job: + name: "Sensitive Data Processing" + permissions: + - group_name: data_engineers + level: CAN_MANAGE + - group_name: analysts + level: CAN_VIEW + # ... job config +``` + +### Permission Levels + +| Level | Can View | Can Run | Can Modify | Use When | +|-------|----------|---------|------------|----------| +| `CAN_VIEW` | ✓ | ✗ | ✗ | Read-only access | +| `CAN_RUN` | ✓ | ✓ | ✗ | Execute but not modify | +| `CAN_MANAGE` | ✓ | ✓ | ✓ | Full control | +| `IS_OWNER` | ✓ | ✓ | ✓ | Ownership | + +## Unity Catalog Grants + +### Schema Grants +```yaml +resources: + schemas: + shared_schema: + name: ${var.catalog}.${var.schema} + grants: + - principal: account users + privileges: + - SELECT + - MODIFY + - principal: data_engineers + privileges: + - CREATE +``` + +### Model Grants +```yaml +resources: + registered_models: + my_model: + name: ${var.catalog}.${var.schema}.my_model + catalog_name: ${var.catalog} + schema_name: ${var.schema} + grants: + - principal: account users + privileges: + - EXECUTE +``` + +### Volume Grants +```yaml +resources: + volumes: + shared_volume: + name: shared_data + catalog_name: ${var.catalog} + schema_name: ${var.schema} + volume_type: MANAGED + grants: + - principal: data_engineers + privileges: + - READ_VOLUME + - WRITE_VOLUME + - principal: analysts + privileges: + - READ_VOLUME +``` + +## Service Principal Execution + +### Run As Service Principal +```yaml +targets: + prod: + mode: production + run_as: + service_principal_name: sp-prod-etl + permissions: + - user_name: deployer@company.com + level: CAN_MANAGE +``` + +**Benefits:** +- More secure than user accounts +- Better for automation +- Consistent execution identity +- Easier credential management + +### Setting Up Service Principal +1. Create service principal in Databricks +2. Grant necessary permissions +3. Configure in bundle +4. 
Service principal needs: + - Workspace access + - Unity Catalog permissions + - Secret scope access (if using secrets) + +## Secret Management + +### Using Secrets in Jobs +```yaml +resources: + jobs: + secret_job: + name: "Job with Secrets" + tasks: + - task_key: process + spark_python_task: + python_file: ./src/process.py + spark_conf: + spark.api_key: "{{secrets/my_scope/api_key}}" + spark.api_secret: "{{secrets/my_scope/api_secret}}" +``` + +### Accessing Secrets in Python +```python +from databricks.sdk import WorkspaceClient + +w = WorkspaceClient() +api_key = w.dbutils.secrets.get(scope="my_scope", key="api_key") +``` + +### Secret Best Practices +1. **Never hardcode secrets** in bundle files +2. **Use Databricks secrets** for sensitive data +3. **Scope access properly** - grant only to needed principals +4. **Rotate secrets regularly** +5. **Use service principals** for secret access in prod + +## Common Patterns + +### Multi-Tier Access +```yaml +targets: + prod: + mode: production + permissions: + # Deployers - can manage + - user_name: deployer@company.com + level: CAN_MANAGE + + # Engineers - can run and modify + - group_name: data_engineers + level: CAN_MANAGE + + # Operators - can run only + - group_name: data_operators + level: CAN_RUN + + # Analysts - can view only + - group_name: analysts + level: CAN_VIEW +``` + +### Development Permissions +```yaml +targets: + dev: + mode: development # Automatic per-user permissions + default: true + # No explicit permissions needed - each developer gets their own resources +``` + +### Least Privilege Principle +```yaml +resources: + jobs: + reporting_job: + name: "Daily Report" + permissions: + - group_name: report_viewers + level: CAN_VIEW # Only view, not run + # ... + + jobs: + data_pipeline: + name: "Data Pipeline" + permissions: + - group_name: data_engineers + level: CAN_MANAGE # Full control for engineers + - group_name: sre_team + level: CAN_RUN # SRE can run but not modify +``` + +## Unity Catalog Privileges + +**Schema/Catalog:** +- `USE CATALOG` - Access catalog +- `USE SCHEMA` - Access schema +- `CREATE` - Create objects +- `SELECT` - Read data +- `MODIFY` - Write data + +**Models:** +- `EXECUTE` - Run model inference + +**Volumes:** +- `READ_VOLUME` - Read files +- `WRITE_VOLUME` - Write files + +## Security Checklist + +- [ ] Production uses service principals for execution +- [ ] Explicit permissions in production target +- [ ] No hardcoded secrets in configuration +- [ ] Secrets accessed via Databricks secrets +- [ ] Least privilege access (CAN_VIEW < CAN_RUN < CAN_MANAGE) +- [ ] Groups used instead of individual users where possible +- [ ] Unity Catalog grants configured for shared resources +- [ ] Service principals have minimum necessary permissions + +## Examples + +``` +User: "Set up permissions for prod with different access levels" + +Steps: +1. Read knowledge_base examples +2. Configure prod target with mode: production +3. Add permissions with appropriate levels +4. Set up service principal for run_as +5. Explain permission levels +``` + +``` +User: "Job needs to access API key from secrets" + +Steps: +1. Read knowledge_base/job_read_secret/ +2. Show {{secrets/scope/key}} syntax +3. Configure in spark_conf or parameters +4. Explain secret scope access +5. 
Show Python code to access secrets +``` diff --git a/.claude/skills/use-python-resources.md b/.claude/skills/use-python-resources.md new file mode 100644 index 0000000..9205cc8 --- /dev/null +++ b/.claude/skills/use-python-resources.md @@ -0,0 +1,140 @@ +--- +name: use-python-resources +description: Expert assistance for Python-based resource definitions using databricks-bundles library (pydabs pattern). Use when users want to define resources in Python code instead of YAML for dynamic configuration. +--- + +# Python Resources - Python-Based Bundle Configuration + +## Instructions + +1. **Understand use case** + - Why Python over YAML? (dynamic generation, logic, etc.) + - What resources need to be defined? + +2. **Fetch documentation** + - WebFetch: https://docs.databricks.com/aws/en/dev-tools/bundles/python/ + +3. **Find example** + - pydabs/ - Complete Python resources example + +4. **Provide configuration** + - databricks.yml with python.resources + - resources/__init__.py structure + - Python resource definitions + +## Key Patterns + +### databricks.yml Configuration +```yaml +bundle: + name: my_python_bundle + +python: + venv_path: .venv + resources: + - "resources:load_resources" + +targets: + dev: + mode: development + default: true +``` + +### resources/__init__.py +```python +from databricks.bundles.resources import Resources +from .sample_job import sample_job +from .sample_pipeline import sample_pipeline + +def load_resources() -> Resources: + return Resources( + jobs={ + "sample_job": sample_job, + }, + pipelines={ + "sample_pipeline": sample_pipeline, + } + ) +``` + +### resources/sample_job.py +```python +from databricks.bundles.jobs import Job + +sample_job = Job.from_dict({ + "name": "Sample Job", + "tasks": [ + { + "task_key": "process", + "python_wheel_task": { + "package_name": "my_project", + "entry_point": "main" + }, + "libraries": [{"whl": "./dist/*.whl"}] + } + ], + "schedule": { + "quartz_cron_expression": "0 0 * * * ?", + "timezone_id": "UTC" + } +}) +``` + +### Mixing YAML and Python +```yaml +bundle: + name: hybrid_bundle + +include: + - resources/*.yml # YAML resources + +python: + venv_path: .venv + resources: + - "resources:load_resources" # Python resources +``` + +## When to Use Python Resources + +**Use Python when:** +- Need dynamic resource generation based on logic +- Conditional resource creation +- Complex configuration with loops/conditions +- Team prefers code over config +- Building resource generation tools + +**Use YAML when:** +- Standard static configurations +- Simpler, more declarative approach +- Team prefers config files +- Most cases (YAML is the default) + +## Setup Requirements + +```toml +# pyproject.toml +[dependency-groups] +dev = [ + "databricks-bundles>=0.279.0", +] +``` + +Install and create venv: +```bash +uv venv +source .venv/bin/activate +uv pip install -e ".[dev]" +``` + +## Examples + +``` +User: "I need to generate 10 similar jobs programmatically" + +Steps: +1. Read pydabs/ example structure +2. Set up python.resources in databricks.yml +3. Create resources/__init__.py +4. Show loop to generate jobs in Python +5. Explain load_resources() pattern +``` diff --git a/.claude/skills/validate-bundle.md b/.claude/skills/validate-bundle.md new file mode 100644 index 0000000..9d392f1 --- /dev/null +++ b/.claude/skills/validate-bundle.md @@ -0,0 +1,181 @@ +--- +name: validate-bundle +description: Validate bundle configurations and troubleshoot errors. 
Use when users encounter validation errors, deployment failures, or want to check bundle health before deploying. +--- + +# Bundle Validate - Configuration Validation & Troubleshooting + +## Instructions + +When helping users validate or troubleshoot bundles: + +1. **Understand the issue** + - Ask user to share error message or describe the problem + - Determine what they were trying to do when error occurred + +2. **Fetch troubleshooting documentation** + - Use WebFetch to get: https://docs.databricks.com/aws/en/dev-tools/bundles/faqs + - This has common issues and solutions + +3. **Read their configuration** + - Use Read to examine their databricks.yml + - Read any resource files mentioned in errors + - Look for syntax issues, typos, missing fields + +4. **Search for working patterns** + - Use Grep to find correct syntax in repository examples + - Compare their config against working examples + +5. **Diagnose and fix** + - Identify root cause (syntax, references, paths, permissions, etc.) + - Provide specific fix with corrected code + - Explain what was wrong and why + +6. **Validate solution** + - Guide user to run `databricks bundle validate` + - Test incrementally if complex issue + +## Common Error Patterns + +### Variable Reference Errors +``` +Error: variable "catalog" is not defined + +Fix: Add to variables section: +variables: + catalog: + description: "Catalog to use" + +And set value in target: +targets: + dev: + variables: + catalog: dev_catalog +``` + +### Resource Not Found +``` +Error: resource "jobs/my_job" is not defined + +Fixes: +1. Check include path covers resource location +2. Verify file name: .job.yml +3. Check resource defined correctly: + resources: + jobs: + my_job: # This name must match reference +``` + +### Invalid Interpolation Syntax +``` +Correct: ${var.catalog} +Wrong: {var.catalog} +Wrong: $var.catalog +Wrong: var.catalog +``` + +### Circular Dependencies +``` +Error: circular dependency detected + +Fix: Map dependencies, break cycle +- Job A depends on Job B +- Job B depends on Job A +→ Remove one dependency +``` + +### Permission Denied +``` +Error: permission denied + +Fix for prod: +targets: + prod: + mode: production + permissions: + - user_name: deployer@company.com + level: CAN_MANAGE +``` + +### Missing Workspace Host +``` +Error: workspace host required + +Fix: +targets: + dev: + workspace: + host: https://workspace.databricks.com +``` + +## Validation Checklist + +Guide users through: +- [ ] YAML syntax valid (no tabs, proper indentation) +- [ ] bundle.name and UUID present +- [ ] All targets have workspace.host +- [ ] Variables declared before use +- [ ] Variable syntax correct: ${var.name} +- [ ] Resource references correct: ${resources.jobs.name.id} +- [ ] Include paths relative to databricks.yml +- [ ] One target has default: true +- [ ] Dev target uses mode: development +- [ ] Prod target uses mode: production with permissions +- [ ] No circular dependencies +- [ ] Permission levels valid (CAN_MANAGE, CAN_VIEW, CAN_RUN) + +## Interpolation Reference + +**Correct syntax:** +- `${var.catalog}` - User variable +- `${bundle.name}` - Bundle name +- `${bundle.target}` - Current target +- `${workspace.current_user.short_name}` - Username +- `${resources.jobs.my_job.id}` - Job reference +- `${resources.pipelines.my_pipeline.id}` - Pipeline reference + +## Related Skills + +- `bundle-create` - If structure is fundamentally wrong +- `variables-references` - For variable issues +- `deployment-modes` - For target configuration +- 
`permissions-security` - For permission errors +- Resource skills - For specific resource configuration issues + +## Examples + +### Example 1: Variable Not Found +``` +User: "Getting error: variable 'schema' not defined" + +Steps: +1. Read their databricks.yml +2. Check variables section +3. Add missing variable: + variables: + schema: + description: "Schema name" +4. Set in target: + targets: + dev: + variables: + schema: ${workspace.current_user.short_name} +``` + +### Example 2: Resource Reference Typo +``` +User: "Pipeline task failing - resource not found" + +Steps: +1. Read job file with pipeline_task +2. Find the typo: ${resources.pipelines.my_pipline.id} +3. Correct: ${resources.pipelines.my_pipeline.id} +4. Verify pipeline resource exists +``` + +## CLI Commands + +- `databricks bundle validate` - Validate configuration +- `databricks bundle validate -t dev` - Validate specific target +- `databricks bundle deploy --dry-run` - Preview deployment +- `databricks bundle summary` - Show deployed resources
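
These commands can be chained into a quick pre-deployment check. A minimal sketch using only the CLI commands listed above, assuming a bundle with `dev` and `prod` targets (adjust target names to your configuration):

```bash
# Validate the default target, then the prod target explicitly.
databricks bundle validate
databricks bundle validate -t prod

# Preview the prod deployment without applying it.
databricks bundle deploy --dry-run -t prod

# After a real deployment, list the resources the bundle actually created.
databricks bundle summary
```

Run the validate step after every configuration change; it catches variable, interpolation, and include-path errors before anything is deployed.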