diff --git a/docs/data-tests/dbt/package-model-configuration.mdx b/docs/data-tests/dbt/package-model-configuration.mdx new file mode 100644 index 000000000..d910dc694 --- /dev/null +++ b/docs/data-tests/dbt/package-model-configuration.mdx @@ -0,0 +1,430 @@ +# Elementary dbt Package Model Configuration + +This document provides a comprehensive list of all configuration options available for the Elementary dbt package via models/dbt elements other than the `vars` section in your `dbt_project.yml`. + +## Overview + +The Elementary dbt package supports extensive configuration through various dbt elements beyond the `vars` section. These configuration methods provide fine-grained control over Elementary's behavior at different levels of your dbt project. + +## 1. Model-Level Configuration + +### Model Config Block +Models can be configured using the `config()` block in SQL files: + +```sql +{{ + config( + materialized='incremental', + transient=False, + post_hook='{{ elementary.upload_dbt_models() }}', + unique_key='unique_id', + on_schema_change='sync_all_columns', + full_refresh=elementary.get_config_var('elementary_full_refresh'), + table_type=elementary.get_default_table_type(), + incremental_strategy=elementary.get_default_incremental_strategy() + ) +}} +``` + +### Model Schema Configuration +In `schema.yml` files, models can be configured with: + +```yaml +models: + - name: my_model + config: + tags: ["production", "monitoring"] + materialized: table + elementary: + timestamp_column: updated_at + backfill_days: 7 + anomaly_direction: "spike" + anomaly_sensitivity: 4 + where_expression: "status = 'active'" + time_bucket: + period: hour + count: 4 + min_training_set_size: 10 + days_back: 30 + seasonality: "day_of_week" + meta: + owner: ["@data_team", "analytics@company.com"] + subscribers: ["@alerts"] + description: "Model description" + description: "Detailed model description" +``` + +## 2. 
Column-Level Configuration + +### Column Config in Schema Files +```yaml +models: + - name: my_model + columns: + - name: sensitive_column + config: + disable_test_samples: true # Prevents sampling for this column + tests: + - elementary.column_anomalies: + column_anomalies: + - null_count + - min_length +``` + +### Column Meta Configuration +```yaml +models: + - name: my_model + columns: + - name: user_id + meta: + owner: "@privacy_team" + tags: ["pii", "sensitive"] + description: "Unique user identifier" +``` + +## 3. Source Configuration + +### Source Freshness Configuration +```yaml +sources: + - name: raw_data + schema: staging + tables: + - name: users + meta: + owner: ["@data_team"] + elementary: + timestamp_column: updated_at + freshness: + error_after: + count: 1 + period: hour + warn_after: + count: 30 + period: minute + loaded_at_field: updated_at + description: "Raw user data from source system" +``` + +### Source Meta Configuration +```yaml +sources: + - name: raw_data + schema: staging + tables: + - name: users + meta: + owner: "data@company.com" + subscribers: ["alerts@company.com"] + tags: ["critical", "pii"] +``` + +## 4. Exposure Configuration + +### Exposure Meta and Owner Configuration +```yaml +exposures: + - name: customer_dashboard + label: Customer Analytics + type: dashboard + maturity: high + url: https://bi.tool/dashboards/1 + description: "Customer analytics dashboard" + depends_on: + - ref('customers') + - ref('orders') + owner: + name: "Data Team" + email: "data@company.com" + meta: + platform: "Tableau" + workbook: "Customer Analytics" + path: "Dashboards/Customers" + referenced_columns: + - column_name: customer_id + data_type: numeric + node: ref('customers') + tags: + - marketing + - critical +``` + +## 5. 
Test Configuration + +### Test-Level Configuration +```yaml +models: + - name: my_model + tests: + - elementary.volume_anomalies: + alias: "custom_volume_test" + timestamp_column: "created_at" + time_bucket: + period: hour + count: 1 + seasonality: "hour_of_day" + sensitivity: 2 + anomaly_direction: "spike" + where: "status = 'active'" + tags: ["critical", "monitoring"] + config: + severity: warn + meta: + description: "Custom volume anomaly detection" + owner: "@data_team" +``` + +### Test Configuration Hierarchy +Tests can be configured at multiple levels with the following precedence (highest to lowest): +1. **Test level** - Direct test configuration +2. **Model level** - Model's `elementary` config +3. **Project level** - `vars` in `dbt_project.yml` + +## 6. Project-Level Model Configuration + +### dbt_project.yml Model Configuration +```yaml +models: + elementary: + +schema: elementary + +enabled: "{{ var('elementary_enabled', True) }}" + + my_project: + staging: + +materialized: view + +schema: staging + marts: + +materialized: table + +schema: marts +``` + +## 7. Selectors Configuration + +### selectors.yml for Model Selection +```yaml +selectors: + - name: elementary_models + definition: + method: tag + value: elementary + + - name: critical_models + definition: + method: tag + value: critical + + - name: staging_models + definition: + method: path + value: models/staging +``` + +## 8. 
Key Configuration Options by Element + +### Model Configuration Options +- `timestamp_column`: Column to use for time-based analysis +- `backfill_days`: Days to backfill metrics +- `anomaly_direction`: Direction of anomalies to flag: 'spike', 'drop', or 'both' +- `anomaly_sensitivity`: Width of the expected range in standard deviations (lower values flag more anomalies) +- `where_expression`: Filter condition for analysis +- `time_bucket`: Time aggregation settings +- `seasonality`: Seasonality pattern to apply +- `min_training_set_size`: Minimum data points for training + +### Column Configuration Options +- `disable_test_samples`: Prevent sampling for sensitive columns +- `meta.owner`: Column owner for alerts +- `meta.tags`: Column tags for categorization + +### Source Configuration Options +- `freshness`: Data freshness thresholds +- `loaded_at_field`: Field indicating when data was loaded +- `meta.owner`: Source owner for alerts +- `meta.tags`: Source tags for categorization + +### Exposure Configuration Options +- `owner`: Exposure owner information +- `meta`: Custom metadata for the exposure +- `tags`: Exposure tags for categorization +- `depends_on`: Dependencies for lineage tracking + +## 9. Configuration Precedence + +Configuration is resolved in this order (highest to lowest priority): + +1. **Test-level configuration** (in test definition) +2. **Model-level configuration** (in model's `elementary` config) +3. **Project-level configuration** (in `vars` section) +4. **Package defaults** + +## 10. 
Advanced Configuration Examples + +### Complex Model Configuration +```yaml +models: + - name: complex_analytics_model + config: + materialized: incremental + unique_key: id + elementary: + timestamp_column: event_timestamp + backfill_days: 14 + anomaly_direction: "both" + anomaly_sensitivity: 3 + where_expression: "is_active = true and event_type in ('purchase', 'view')" + time_bucket: + period: day + count: 1 + seasonality: "day_of_week" + min_training_set_size: 21 + days_back: 60 + meta: + owner: ["@analytics_team", "product@company.com"] + subscribers: ["@alerts", "oncall@company.com"] + tags: ["critical", "revenue", "analytics"] + description: "Complex analytics model with comprehensive monitoring" + columns: + - name: user_id + config: + disable_test_samples: true + meta: + owner: "@privacy_team" + tags: ["pii"] + description: "User identifier (PII)" + - name: revenue_amount + meta: + owner: "@finance_team" + tags: ["financial", "critical"] + description: "Revenue amount in USD" + tests: + - elementary.column_anomalies: + column_anomalies: + - min + - max + - average + - standard_deviation + anomaly_direction: "spike" + sensitivity: 2 + tags: ["revenue_monitoring"] +``` + +### Comprehensive Source Configuration +```yaml +sources: + - name: external_systems + schema: raw + tables: + - name: customer_data + meta: + owner: ["@data_engineering", "vendor@external.com"] + elementary: + timestamp_column: last_updated + min_training_set_size: 30 + anomaly_sensitivity: 4 + freshness: + error_after: + count: 2 + period: hour + warn_after: + count: 1 + period: hour + filter: "status = 'active'" + loaded_at_field: last_updated + description: "Customer data from external CRM system" + columns: + - name: customer_id + meta: + owner: "@privacy_team" + tags: ["pii", "primary_key"] + description: "Unique customer identifier" + - name: email + config: + disable_test_samples: true + meta: + owner: "@privacy_team" + tags: ["pii", "sensitive"] + description: "Customer email 
address" +``` + +### Advanced Exposure Configuration +```yaml +exposures: + - name: executive_dashboard + label: Executive KPI Dashboard + type: dashboard + maturity: high + url: https://bi.company.com/executive-dashboard + description: > + Executive dashboard showing key business metrics + including revenue, customer acquisition, and churn rates. + depends_on: + - ref('daily_revenue') + - ref('customer_metrics') + - ref('churn_analysis') + owner: + name: "Executive Team" + email: "exec@company.com" + meta: + platform: "Looker" + workbook: "Executive Metrics" + path: "Dashboards/Executive" + refresh_schedule: "hourly" + data_sources: + - "daily_revenue" + - "customer_metrics" + - "churn_analysis" + referenced_columns: + - column_name: revenue_amount + data_type: numeric + node: ref('daily_revenue') + - column_name: customer_count + data_type: numeric + node: ref('customer_metrics') + tags: + - executive + - critical + - kpi +``` + +## 11. Best Practices + +### Model Configuration Best Practices +1. **Use descriptive names** for models, columns, and tests +2. **Set appropriate owners** for all models and columns +3. **Use tags consistently** for categorization and filtering +4. **Configure timestamp columns** for time-based analysis +5. **Set appropriate sensitivity levels** based on business impact + +### Security Best Practices +1. **Disable sampling** for PII columns using `disable_test_samples: true` +2. **Use PII tags** for sensitive data identification +3. **Set appropriate owners** for sensitive data +4. **Configure alerts** for critical models and sources + +### Performance Best Practices +1. **Use appropriate materializations** (view vs table vs incremental) +2. **Configure time buckets** based on data volume and update frequency +3. **Set reasonable training periods** for anomaly detection +4. **Use filters** to focus on relevant data subsets + +## 12. Troubleshooting + +### Common Configuration Issues +1. 
**Missing timestamp columns** - Set `timestamp_column` on every model or source that needs time-based analysis +2. **Incorrect owner format** - Use one owner format consistently (a single string or a list of strings) +3. **Tag conflicts** - Avoid conflicting tag definitions +4. **Configuration precedence** - Test-level settings override model-level settings, which override `vars` + +### Debugging Configuration +1. **Check dbt logs** for configuration errors +2. **Verify schema files** for syntax errors +3. **Test configurations** in a development environment +4. **Use Elementary CLI** for configuration validation + +## References + +- [Elementary Documentation](https://docs.elementary-data.com/) +- [dbt Configuration Documentation](https://docs.getdbt.com/reference/dbt-jinja-functions/var) +- [Elementary GitHub Repository](https://github.com/elementary-data/elementary) +- [Elementary Configuration Variables](./package-vars-config.mdx) \ No newline at end of file diff --git a/docs/data-tests/dbt/package-vars-config.mdx b/docs/data-tests/dbt/package-vars-config.mdx new file mode 100644 index 000000000..dbdbb0405 --- /dev/null +++ b/docs/data-tests/dbt/package-vars-config.mdx @@ -0,0 +1,222 @@ +# Elementary dbt Package Configuration Variables + +This document lists the configuration variables available for the Elementary dbt package via `vars` in your `dbt_project.yml`. + +## Overview + +The Elementary dbt package can be configured through the `vars` section of your `dbt_project.yml` file. These variables control anomaly detection, alerts, data uploads, testing, and performance. 
+ +## Core Configuration Variables + +| Variable | Default Value | Description | +|----------|---------------|-------------| +| `days_back` | `14` | Number of days to look back for anomaly detection training data | +| `anomaly_sensitivity` | `3` | Width of the expected range, in standard deviations from the training-set average (lower = more sensitive, more anomalies flagged) | +| `backfill_days` | `2` | Number of days to backfill metrics | +| `tests_schema_name` | `''` | Custom schema name for Elementary tests | +| `debug_logs` | `false` | Enable debug logging | +| `project_name` | `none` | Custom project name | +| `elementary_full_refresh` | `false` | Force full refresh of Elementary models | +| `min_training_set_size` | `7` | Minimum number of data points required for anomaly detection training | +| `anomaly_direction` | `'both'` | Anomaly detection direction: 'spike', 'drop', or 'both' | +| `anomaly_exclude_metrics` | `none` | Metrics to exclude from anomaly detection | +| `fail_on_zero` | `false` | Whether to fail tests when the metric value is zero | + +## Alert Configuration Variables + +| Variable | Default Value | Description | +|----------|---------------|-------------| +| `disable_warn_alerts` | `false` | Disable warning alerts | +| `disable_model_alerts` | `false` | Disable model run alerts | +| `disable_test_alerts` | `false` | Disable test result alerts | +| `disable_source_freshness_alerts` | `false` | Disable source freshness alerts | +| `disable_skipped_model_alerts` | `true` | Disable alerts for skipped models | +| `disable_skipped_test_alerts` | `true` | Disable alerts for skipped tests | + +## Data Upload Configuration Variables + +| Variable | Default Value | Description | +|----------|---------------|-------------| +| `disable_run_results` | `false` | Disable uploading run results | +| `disable_freshness_results` | `false` | Disable uploading freshness results | +| `disable_tests_results` | `false` | Disable uploading test results | +| `disable_dbt_artifacts_autoupload` | `false` | Disable automatic dbt 
artifacts upload | +| `disable_dbt_invocation_autoupload` | `false` | Disable automatic dbt invocation upload | +| `columns_upload_strategy` | `'enriched_only'` | Strategy for uploading columns: 'enriched_only', 'all', or 'none' | +| `upload_artifacts_method` | `'diff'` | Method for uploading artifacts: 'diff' or 'full' | +| `cache_artifacts` | `true` | Cache artifacts for performance | +| `dbt_artifacts_chunk_size` | `5000` | Chunk size for uploading dbt artifacts | +| `include_other_warehouse_specific_columns` | `false` | Include warehouse-specific columns | + +## Test Configuration Variables + +| Variable | Default Value | Description | +|----------|---------------|-------------| +| `test_sample_row_count` | `5` | Number of sample rows to collect for failed tests | +| `tests_use_temp_tables` | `false` | Use temporary tables for tests | +| `calculate_failed_count` | `true` | Calculate failed row count in tests | +| `store_result_rows_in_own_table` | `true` | Store test result rows in separate table | +| `clean_elementary_temp_tables` | `true` | Clean up temporary tables after tests | + +## Performance and System Variables + +| Variable | Default Value | Description | +|----------|---------------|-------------| +| `query_max_size` | `1000000` | Maximum query size (250000 for BigQuery/ClickHouse/Athena/Trino) | +| `max_int` | `2147483647` | Maximum integer value for metrics | +| `long_string_size` | `65535` | Maximum size for long string columns | +| `collect_model_sql` | `true` | Collect model SQL for artifacts | +| `force_metrics_backfill` | `false` | Force backfill of metrics | + +## PII and Sampling Configuration Variables + +| Variable | Default Value | Description | +|----------|---------------|-------------| +| `disable_samples_on_pii_tags` | `false` | Disable sampling for PII-tagged tables | +| `pii_tags` | `['pii']` | Tags that identify PII data | + +## Advanced Configuration Variables + +| Variable | Default Value | Description | 
+|----------|---------------|-------------| +| `edr_cli_run` | `false` | Whether running via Elementary CLI | +| `custom_run_started_at` | `none` | Custom run start timestamp | +| `mute_dbt_upgrade_recommendation` | `false` | Mute dbt upgrade recommendations | +| `mute_ensure_materialization_override` | `false` | Mute materialization override warnings | + +## Complex Configuration Objects + +### ignore_small_changes + +| Variable | Default Value | Description | +|----------|---------------|-------------| +| `ignore_small_changes` | `{'spike_failure_percent_threshold': none, 'drop_failure_percent_threshold': none}` | Thresholds for ignoring small changes in anomaly detection | + +### edr_monitors + +The `edr_monitors` variable controls which monitors are enabled for different data types. Default configuration: + +```yaml +edr_monitors: + table: ['row_count', 'freshness'] + column_any_type: ['null_count', 'null_percent'] + column_string: ['min_length', 'max_length', 'average_length', 'missing_count', 'missing_percent'] + column_numeric: ['min', 'max', 'zero_count', 'zero_percent', 'average', 'standard_deviation', 'variance'] + column_boolean: ['count_true', 'count_false'] +``` + +#### Available Monitors by Type + +**Table Monitors:** +- `row_count` - Monitor row count over time +- `freshness` - Monitor data freshness +- `event_freshness` - Monitor event-based freshness + +**Column Any Type Monitors:** +- `null_count` - Count of null values +- `null_percent` - Percentage of null values +- `not_null_percent` - Percentage of non-null values + +**Column String Monitors:** +- `min_length` - Minimum string length +- `max_length` - Maximum string length +- `average_length` - Average string length +- `missing_count` - Count of missing values +- `missing_percent` - Percentage of missing values +- `not_missing_percent` - Percentage of non-missing values + +**Column Numeric Monitors:** +- `min` - Minimum value +- `max` - Maximum value +- `zero_count` - Count of zero values +- 
`zero_percent` - Percentage of zero values +- `not_zero_percent` - Percentage of non-zero values +- `average` - Average value +- `standard_deviation` - Standard deviation +- `variance` - Variance +- `sum` - Sum of values + +**Column Boolean Monitors:** +- `count_true` - Count of true values +- `count_false` - Count of false values + +## Usage Examples + +### Basic Configuration + +```yaml +# dbt_project.yml +vars: + debug_logs: true + anomaly_sensitivity: 2 + days_back: 30 + disable_model_alerts: false + disable_test_alerts: false +``` + +### Advanced Configuration + +```yaml +# dbt_project.yml +vars: + # Core settings + debug_logs: true + anomaly_sensitivity: 2 + days_back: 30 + min_training_set_size: 10 + + # Alert settings + disable_model_alerts: false + disable_test_alerts: false + disable_warn_alerts: true + + # PII settings + pii_tags: ['pii', 'sensitive', 'personal'] + disable_samples_on_pii_tags: true + + # Performance settings + query_max_size: 500000 + dbt_artifacts_chunk_size: 10000 + + # Custom monitors + edr_monitors: + table: ['row_count', 'freshness'] + column_any_type: ['null_count', 'null_percent'] + column_string: ['min_length', 'max_length', 'average_length'] + column_numeric: ['min', 'max', 'average', 'standard_deviation'] + column_boolean: ['count_true', 'count_false'] + + # Ignore small changes + ignore_small_changes: + spike_failure_percent_threshold: 5 + drop_failure_percent_threshold: 10 +``` + +### Environment-Specific Configuration + +```yaml +# dbt_project.yml +vars: + debug_logs: "{{ env_var('DBT_EDR_DEBUG', False) }}" + project_name: "{{ env_var('DBT_PROJECT_NAME', 'my_project') }}" + anomaly_sensitivity: "{{ env_var('DBT_ANOMALY_SENSITIVITY', 3) }}" +``` + +## Warehouse-Specific Defaults + +Some variables have different default values depending on your data warehouse: + +- **BigQuery, ClickHouse, Athena, Trino**: `query_max_size` defaults to `250000` instead of `1000000` + +## Notes + +- All boolean variables can be set as strings 
(`"true"`/`"false"`) or booleans (`true`/`false`) +- The `edr_monitors` configuration lets you choose which monitors are active for each data type +- Environment variables can be combined with these vars (via `env_var`) for dynamic configuration +- Some variables are used mainly by the package itself and rarely need customization + +## References + +- [Elementary Documentation](https://docs.elementary-data.com/) +- [dbt Variables Documentation](https://docs.getdbt.com/reference/dbt-jinja-functions/var) +- [Elementary GitHub Repository](https://github.com/elementary-data/elementary) \ No newline at end of file