diff --git a/docs/Dockerfile b/docs/Dockerfile
index 1ee2e6c29..1ad7cc928 100644
--- a/docs/Dockerfile
+++ b/docs/Dockerfile
@@ -1,8 +1,7 @@
-FROM node:19
+FROM node:20.3.0
 WORKDIR /app
 RUN npm i -g mintlify
-RUN mintlify install
 EXPOSE 3000
 CMD ["mintlify", "dev"]
diff --git a/docs/data-tests/ai-data-tests/ai_data_validations.mdx b/docs/data-tests/ai-data-tests/ai_data_validations.mdx
new file mode 100644
index 000000000..7f9908fc5
--- /dev/null
+++ b/docs/data-tests/ai-data-tests/ai_data_validations.mdx
@@ -0,0 +1,125 @@
+---
+title: "AI Data Validations"
+---
+
+**Beta Feature**: AI data validation tests are currently in beta. The functionality and interface may change in future releases.
+
+# AI Data Validation with Elementary
+
+## What is AI Data Validation?
+
+Elementary's `elementary.ai_data_validation` test allows you to validate any data column using AI and large language models (LLMs). This test is more flexible than traditional tests because it can be applied to any column type and uses natural language to define validation rules.
+
+With `ai_data_validation`, you simply describe what you expect from your data in plain English, and Elementary checks whether your data meets those expectations. This is particularly useful for complex validation rules that would be difficult to express with traditional SQL or dbt tests.
+
+## How It Works
+
+Elementary leverages the AI and LLM capabilities built directly into your data warehouse. When you run a validation test:
+
+1. Your data stays within your data warehouse
+2. The warehouse's built-in AI and LLM functions analyze the data
+3. Elementary reports whether each value meets your expectations based on the prompt
+
+## Required Setup for Each Data Warehouse
+
+Before you can use Elementary's AI data validations, you need to set up AI and LLM capabilities in your data warehouse:
+
+### Snowflake
+- **Prerequisite**: Enable Snowflake Cortex AI LLM functions
+- **Recommended Model**: `claude-3-5-sonnet`
+- [View Snowflake's Setup Guide](/data-tests/ai-data-tests/snowflake)
+
+### Databricks
+- **Prerequisite**: Ensure Databricks AI Functions are available
+- **Recommended Model**: `databricks-meta-llama-3-3-70b-instruct`
+- [View Databricks' Setup Guide](/data-tests/ai-data-tests/databricks)
+
+### BigQuery
+- **Prerequisite**: Configure BigQuery to use Vertex AI models
+- **Recommended Model**: `gemini-1.5-pro`
+- [View BigQuery's Setup Guide](/data-tests/ai-data-tests/bigquery)
+
+### Redshift
+- Support coming soon
+
+### Data Lakes
+- Currently supported through Snowflake, Databricks, or BigQuery external object tables
+- [View Data Lakes Information](/data-tests/ai-data-tests/data-lakes)
+
+## Using the AI Data Validation Test
+
+The test requires one main parameter:
+- `expectation_prompt`: Describe what you expect from the data in plain English
+
+Optionally, you can also specify:
+- `llm_model_name`: The AI model to use (see the recommendations above for each warehouse)
+
+This test works with any column type, as the data will be converted to a string format for validation. This enables natural language data validations for dates, numbers, and other structured data types.
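+
+To make the mechanism concrete before the configuration examples below, here is a rough sketch of the kind of per-value query that warehouse AI functions make possible. It uses Snowflake Cortex syntax and the `crm.contract_date` example shown below; it is illustrative only, and Elementary generates the actual SQL for you:
+
+```sql
+-- Illustrative sketch: ask the warehouse-hosted LLM to judge each value.
+-- SNOWFLAKE.CORTEX.COMPLETE(model, prompt) is Snowflake's Cortex completion function.
+SELECT
+  contract_date,
+  SNOWFLAKE.CORTEX.COMPLETE(
+    'claude-3-5-sonnet',
+    'Answer yes or no: there should be no contract date in the future. Value: ' || TO_VARCHAR(contract_date)
+  ) AS validation_result
+FROM crm;
+```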
+
+```yml Models
+version: 2
+
+models:
+  - name: < model name >
+    columns:
+      - name: < column name >
+        tests:
+          - elementary.ai_data_validation:
+              expectation_prompt: "Description of what the data should satisfy"
+              llm_model_name: "model_name" # Optional
+```
+
+```yml Example - Date Validation
+version: 2
+
+models:
+  - name: crm
+    description: "A table containing contract details."
+    columns:
+      - name: contract_date
+        description: "The date when the contract was signed."
+        tests:
+          - elementary.ai_data_validation:
+              expectation_prompt: "There should be no contract date in the future"
+```
+
+```yml Example - Numeric Validation
+version: 2
+
+models:
+  - name: sales
+    description: "A table containing sales data."
+    columns:
+      - name: discount_percentage
+        description: "The discount percentage applied to the sale."
+        tests:
+          - elementary.ai_data_validation:
+              expectation_prompt: "The discount percentage should be between 0 and 50, and should only be a whole number."
+              llm_model_name: "claude-3-5-sonnet"
+              config:
+                severity: warn
+```
+
+```yml Example - Complex Validation
+version: 2
+
+models:
+  - name: customer_accounts
+    description: "A table containing customer account information."
+    columns:
+      - name: account_status
+        description: "The current status of the customer account."
+        tests:
+          - elementary.ai_data_validation:
+              expectation_prompt: "The account status should be one of: 'active', 'inactive', 'suspended', or 'pending'. If the account is 'suspended', there should be a reason code in the suspension_reason column."
+              llm_model_name: "gemini-1.5-pro"
+```
+
diff --git a/docs/data-tests/ai-data-tests/bigquery.mdx b/docs/data-tests/ai-data-tests/bigquery.mdx
new file mode 100644
index 000000000..37b522033
--- /dev/null
+++ b/docs/data-tests/ai-data-tests/bigquery.mdx
@@ -0,0 +1,106 @@
+---
+title: "BigQuery Vertex AI"
+description: "Learn how to configure BigQuery to use Vertex AI models for unstructured data validation tests"
+---
+
+# BigQuery Setup for Unstructured Data Tests
+
+Elementary's unstructured data validation tests leverage BigQuery ML and Vertex AI models to perform advanced AI-powered validations. This guide walks you through the setup process.
+
+## Prerequisites
+
+Before you begin, ensure you have:
+- A Google Cloud account with appropriate permissions
+- Access to BigQuery and Vertex AI services
+- A BigQuery dataset in which to create the model used by Elementary's data validation tests. This should be the dataset where your unstructured data is stored and where you want to apply validations.
+
+## Step 1: Enable the Vertex AI API
+
+1. Navigate to the Google Cloud Console
+2. Go to **APIs & Services** > **API Library**
+3. Search for "Vertex AI API"
+4. Click on the API and select **Enable**
+
+## Step 2: Create a Remote Connection to Vertex AI
+
+Elementary's unstructured data validation tests use BigQuery ML to access pre-trained Vertex AI models. To establish this connection:
+
+1. Navigate to the Google Cloud Console > **BigQuery**
+2. In the Explorer panel, click the **+** button
+3. Select **Connections to external data sources**
+4. Change the connection type to **Vertex AI remote models, remote functions and BigLake (Cloud Resource)**
+5. Select the appropriate region:
+   - If your model and dataset are in the same region, select that specific region
+   - Otherwise, select multi-region
+
+After creating the connection:
+1. In the BigQuery Explorer, navigate to **External Connections**
+2. Find your newly created connection and click on it
+3. Copy the **Service Account ID** for the next step
+
+## Step 3: Grant Vertex AI Access Permissions
+
+Now you need to give the connection's service account permission to access Vertex AI:
+
+1. In the Google Cloud Console, go to **IAM & Admin**
+2. Click **+ Grant Access**
+3. Under "New principals", paste the service account ID you copied
+4. Assign the **Vertex AI User** role
+5. Click **Save**
+
+## Step 4: Create an LLM Model Interface in BigQuery
+
+1. In the BigQuery Explorer, navigate to **External Connections**
+2. Find the connection you created in the previous step and click on it
+3. Copy the **Connection ID** (format: `projects/<project-id>/locations/<location>/connections/<connection-id>`)
+4. [Select a model endpoint](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#gemini-api-multimodal-models). You can use `gemini-1.5-pro-002` as a default endpoint.
+5. Run the following SQL query to create a model in your dataset:
+
+```sql
+CREATE OR REPLACE MODEL
+  `<project-id>.<dataset>.<model-name>`
+REMOTE WITH CONNECTION
+  `<connection-id>`
+OPTIONS (
+  endpoint = '<endpoint>'
+);
+```
+
+### Example
+
+```sql
+CREATE OR REPLACE MODEL
+  `my-project.my-dataset.gemini-1.5-pro`
+REMOTE WITH CONNECTION
+  `projects/my-project/locations/us/connections/my-remote-connection-model-name`
+OPTIONS (
+  endpoint = 'gemini-1.5-pro-002'
+);
+```
+
+> **Note:** During development, we used `gemini-1.5-pro` and recommend it as the default model for unstructured data tests in BigQuery.
+
+### Additional Resources
+
+- [Available models and endpoints](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#gemini-api-multimodal-models)
+- [Documentation on creating remote models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model)
+
+## Step 5: Running an Unstructured Data Test
+
+Once your model is set up, you can reference it in your Elementary tests:
+
+```yaml
+models:
+  - name: table_with_unstructured_data
+    description: "A table containing unstructured text data."
+    columns:
+      - name: text_data
+        description: "Unstructured text data stored as a string."
+        tests:
+          - elementary.unstructured_data_validation:
+              expectation_prompt: "The text data should represent an example of unstructured data."
+              llm_model_name: "gemini-1.5-pro"
+```
diff --git a/docs/data-tests/ai-data-tests/data-lakes.mdx b/docs/data-tests/ai-data-tests/data-lakes.mdx
new file mode 100644
index 000000000..7d7035c5f
--- /dev/null
+++ b/docs/data-tests/ai-data-tests/data-lakes.mdx
@@ -0,0 +1,7 @@
+---
+title: "Data lakes"
+---
+
+Currently, you can apply Elementary's unstructured data validation tests on data lakes through Snowflake, Databricks, or BigQuery external object tables.
+
+Native, direct support for data lakes is coming soon. Please reach out if you would like to discuss this integration and use case.
\ No newline at end of file
diff --git a/docs/data-tests/ai-data-tests/databricks.mdx b/docs/data-tests/ai-data-tests/databricks.mdx
new file mode 100644
index 000000000..41211db58
--- /dev/null
+++ b/docs/data-tests/ai-data-tests/databricks.mdx
@@ -0,0 +1,35 @@
+---
+title: "Databricks AI Functions"
+---
+
+# Setting Up Databricks AI Functions
+
+For Databricks users, Elementary's unstructured data validation tests run on top of Databricks AI Functions.
+This guide covers the prerequisites for using Databricks AI Functions.
+
+## What are Databricks AI Functions?
+
+Databricks AI Functions are built-in SQL functions that let you apply AI capabilities directly to your data using SQL. They enable you to leverage large language models and other AI capabilities without complex setup or external dependencies, making them ideal for data validation tests.
+
+## Availability and Prerequisites
+
+To use Databricks AI Functions, your environment must meet the following requirements:
+
+### Runtime Requirements
+- **Recommended**: Databricks Runtime 15.3 or above for optimal performance
+
+### Environment Requirements
+- Your workspace must be in a supported Model Serving region.
+- For Pro SQL warehouses, AWS PrivateLink must be enabled.
+- Databricks SQL supports AI Functions, but Databricks SQL Classic does not.
+
+### Models
+Databricks AI Functions can run on foundation models hosted in Databricks, external foundation models (such as OpenAI's models), and custom models.
+Currently, Elementary's unstructured data validations support only foundation models hosted in Databricks. Support for external and custom models is coming soon.
+> **Note**: While developing the tests, we worked with `databricks-meta-llama-3-3-70b-instruct`, so we recommend it as the default model when running unstructured data validation tests in Databricks.
+
+## Region Considerations
+
+When using AI Functions, be aware that some models are limited to specific regions (US and EU). Make sure your Databricks workspace is in a region supported by Databricks AI Functions.
diff --git a/docs/data-tests/ai-data-tests/redshift.mdx b/docs/data-tests/ai-data-tests/redshift.mdx
new file mode 100644
index 000000000..25392a4b2
--- /dev/null
+++ b/docs/data-tests/ai-data-tests/redshift.mdx
@@ -0,0 +1,7 @@
+---
+title: "Redshift"
+---
+
+Elementary's unstructured data validation tests do not currently support Redshift.
+
+On Redshift, setting up LLM functions is more complex and requires deploying a Lambda function to call external LLM models. Documentation and support for this integration are coming soon. Please reach out if you'd like to discuss this use case and integration options.
\ No newline at end of file
diff --git a/docs/data-tests/ai-data-tests/snowflake.mdx b/docs/data-tests/ai-data-tests/snowflake.mdx
new file mode 100644
index 000000000..c93b1b669
--- /dev/null
+++ b/docs/data-tests/ai-data-tests/snowflake.mdx
@@ -0,0 +1,70 @@
+---
+title: "Snowflake Cortex AI"
+---
+
+# Snowflake Cortex AI LLM Functions
+
+This guide explains how to enable Snowflake Cortex AI LLM functions, a prerequisite for running Elementary unstructured data validation tests on Snowflake.
+
+## What is Snowflake Cortex?
+
+Snowflake Cortex is a fully managed service that brings cutting-edge AI and ML capabilities directly into your Snowflake environment. It lets you leverage the power of large language models (LLMs) without complex setup or external dependencies.
+The LLMs are fully hosted and managed by Snowflake: using them requires no setup, and your data stays within Snowflake.
+
+## Cross-Region Model Usage
+
+> **Important**: Whenever possible, use models hosted in the same region as your data to avoid errors and optimize performance.
+
+To see where each model is available, we recommend checking this [models list](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#availability).
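+
+As a quick sanity check, you can run a minimal Cortex call from a Snowflake worksheet to confirm that a model is reachable from your account and region. This is a sketch only; `SNOWFLAKE.CORTEX.COMPLETE` is Cortex's standard completion function, and the model name here is simply the one recommended below:
+
+```sql
+-- Returns a short completion if the model is available to your account and region.
+SELECT SNOWFLAKE.CORTEX.COMPLETE('claude-3-5-sonnet', 'Reply with the word OK.');
+```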
+
+If you encounter a "model not found" error, it may be because the model you're trying to use is not available in your current region. In that case, you can enable cross-region model access with the following command (requires ACCOUNTADMIN privileges):
+
+```sql
+-- Enable access to models in any region
+ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION';
+```
+
+This setting allows your account to use models from any region, which can be helpful when the model you need is not available in your current region. However, be aware that cross-region access may impact performance and could have additional cost implications.
+
+## Supported LLM Models
+
+Snowflake Cortex provides access to various industry-leading LLM models with different capabilities and context lengths. Here are the key models available:
+
+### Native Snowflake Models
+
+* **Snowflake Arctic**: An open enterprise-grade model developed by Snowflake, optimized for business use cases.
+
+### External Models (Hosted within Snowflake)
+
+* **Claude Models (Anthropic)**: High-capability models for complex reasoning tasks.
+* **Mistral Models**: Including mistral-large, mixtral-8x7b, and mistral-7b for various use cases.
+* **Llama Models (Meta)**: Including llama3.2-1b, llama3.2-3b, llama3.1-8b, and llama2-70b-chat.
+* **Gemma Models (Google)**: Including gemma-7b for code and text completion tasks.
+
+> **Note**: While developing the tests, we worked with `claude-3-5-sonnet`, so we recommend it as the default model when running unstructured data tests in Snowflake.
+
+## Permissions
+
+> **Note**: By default, all users in your Snowflake account already have access to Cortex AI LLM functions through the PUBLIC role. In most cases, you don't need to do anything to enable access.
+
+The `CORTEX_USER` database role in the SNOWFLAKE database includes all the privileges needed to call Snowflake Cortex LLM functions. This role is automatically granted to the PUBLIC role, which all users have by default.
+
+The following commands are **only needed if** your administrator has revoked the default access from the PUBLIC role or if you need to set up specific access controls. If you can already use Cortex functions, you can skip this section.
+
+```sql
+-- Run as ACCOUNTADMIN
+USE ROLE ACCOUNTADMIN;
+
+-- Create a dedicated role for Cortex users
+CREATE ROLE cortex_user_role;
+
+-- Grant the database role to the custom role
+GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE cortex_user_role;
+
+-- Grant the role to specific users
+GRANT ROLE cortex_user_role TO USER <username>;
+
+-- Optionally, grant warehouse access to the role
+GRANT USAGE ON WAREHOUSE <warehouse_name> TO ROLE cortex_user_role;
+```
\ No newline at end of file
diff --git a/docs/data-tests/ai-data-tests/unstructured_data_validations.mdx b/docs/data-tests/ai-data-tests/unstructured_data_validations.mdx
new file mode 100644
index 000000000..2d0bd39ca
--- /dev/null
+++ b/docs/data-tests/ai-data-tests/unstructured_data_validations.mdx
@@ -0,0 +1,247 @@
+---
+title: "Unstructured Data Validations"
+---
+
+**Beta Feature**: Unstructured data validation tests are currently in beta. The functionality and interface may change in future releases.
+
+# Validating Unstructured Data with Elementary
+
+## What is Unstructured Data Validation?
+
+Elementary's `elementary.unstructured_data_validation` test allows you to validate unstructured data using AI and large language models (LLMs).
+Instead of writing complex code, you simply describe what you expect from your data in plain English, and Elementary checks whether your data meets those expectations.
+
+For example, you can verify that customer feedback comments are in English, that product descriptions contain required information, or that support tickets follow a specific format or sentiment.
+
+## How It Works
+
+Elementary leverages the AI and LLM capabilities built directly into your data warehouse. When you run a validation test:
+
+1. Your unstructured data stays within your data warehouse
+2. The warehouse's built-in AI and LLM functions analyze the data
+3. Elementary reports whether each text value meets your expectations
+
+## Required Setup for Each Data Warehouse
+
+Before you can use Elementary's unstructured data validations, you need to set up AI and LLM capabilities in your data warehouse:
+
+### Snowflake
+- **Prerequisite**: Enable Snowflake Cortex AI LLM functions
+- **Recommended Model**: `claude-3-5-sonnet`
+- [View Snowflake's Setup Guide](/data-tests/ai-data-tests/snowflake)
+
+### Databricks
+- **Prerequisite**: Ensure Databricks AI Functions are available
+- **Recommended Model**: `databricks-meta-llama-3-3-70b-instruct`
+- [View Databricks' Setup Guide](/data-tests/ai-data-tests/databricks)
+
+### BigQuery
+- **Prerequisite**: Configure BigQuery to use Vertex AI models
+- **Recommended Model**: `gemini-1.5-pro`
+- [View BigQuery's Setup Guide](/data-tests/ai-data-tests/bigquery)
+
+### Redshift
+- Support coming soon
+
+### Data Lakes
+- Currently supported through Snowflake, Databricks, or BigQuery external object tables
+- [View Data Lakes Information](/data-tests/ai-data-tests/data-lakes)
+
+## Using the Validation Test
+
+The test requires two main parameters:
+- `expectation_prompt`: Describe what you expect from the text in plain English
+- `llm_model_name`: Specify which AI model to use (see the recommendations above for each warehouse)
+
+This test works with any column containing unstructured text data such as descriptions, comments, or other free-form text fields. It can also be applied to structured columns that can be converted to strings, enabling natural language data validations.
+
+```yml Models
+version: 2
+
+models:
+  - name: < model name >
+    columns:
+      - name: < column name >
+        tests:
+          - elementary.unstructured_data_validation:
+              expectation_prompt: "Description of what the text should contain or represent"
+              llm_model_name: "model_name"
+```
+
+```yml Example
+version: 2
+
+models:
+  - name: table_with_unstructured_data
+    description: "A table containing unstructured text data."
+    columns:
+      - name: text_data
+        description: "Unstructured text data stored as a string."
+        tests:
+          - elementary.unstructured_data_validation:
+              expectation_prompt: "The text data should represent an example of unstructured data."
+              llm_model_name: "test_model"
+```
+
+```yml Example - Validating Customer Feedback
+version: 2
+
+models:
+  - name: customer_feedback
+    description: "A table containing customer feedback comments."
+    columns:
+      - name: feedback_text
+        description: "Customer feedback in free text format."
+        tests:
+          - elementary.unstructured_data_validation:
+              expectation_prompt: "The text should be a customer feedback comment in English, and it should describe only a bug or a feature request."
+              llm_model_name: "claude-3-5-sonnet"
+              config:
+                severity: warn
+```
+
+## Usage Examples
+
+Here are some powerful ways you can apply unstructured data validations:
+
+### Validating Structure
+
+```yml
+models:
+  - name: medicine_prescriptions
+    description: "A table containing medicine prescriptions."
+    columns:
+      - name: doctor_notes
+        description: "A column containing the doctor's notes on the prescription"
+        tests:
+          - elementary.unstructured_data_validation:
+              expectation_prompt: "The prescription has to include a limited time period and recommendations to the patient"
+              llm_model_name: "claude-3-5-sonnet"
+```
+
+Test fails if: A doctor's note does not specify a time period or lacks recommendations for the patient.
+
+### Validating Sentiment
+
+```yml
+models:
+  - name: customer_feedback
+    description: "A table containing customer feedback."
+    columns:
+      - name: negative_feedbacks
+        description: "A column containing negative feedback about our product."
+        tests:
+          - elementary.unstructured_data_validation:
+              expectation_prompt: "The customer feedback's sentiment has to be negative"
+              llm_model_name: "claude-3-5-sonnet"
+```
+
+Test fails if: Any feedback in `negative_feedbacks` is not actually negative.
+
+### Validating Similarities (Coming Soon)
+
+```yml
+models:
+  - name: summarized_pdfs
+    description: "A table containing a summary of our ingested PDFs."
+    columns:
+      - name: pdf_summary
+        description: "A column containing the main PDF's content summary."
+        tests:
+          - elementary.validate_similarity:
+              to: ref('pdf_source_table')
+              column: pdf_content
+              match_by: pdf_name
+```
+
+Test fails if: A PDF summary does not accurately represent the original PDF's content. The validation uses the PDF name as the key to match the `pdf_summary` column in `summarized_pdfs` to the `pdf_content` column in `pdf_source_table`.
+
+```yml
+models:
+  - name: jobs
+    columns:
+      - name: job_title
+        tests:
+          - elementary.validate_similarity:
+              column: job_description
+```
+
+Test fails if: The job title does not align with the job description.
+
+### Accepted Categories (Coming Soon)
+
+```yml
+models:
+  - name: support_tickets
+    description: "A table containing customer support tickets."
+    columns:
+      - name: issue_description
+        description: "A column containing customer-reported issues."
+        tests:
+          - elementary.accepted_categories:
+              categories: ['billing', 'technical_support', 'account_access', 'other']
+```
+
+Test fails if: A support ticket does not fall within the predefined categories.
+
+### Accepted Entities (Coming Soon)
+
+```yml
+models:
+  - name: news_articles
+    description: "A table containing news articles."
+    columns:
+      - name: article_text
+        description: "A column containing full article text."
+        tests:
+          - elementary.extract_and_validate_entities:
+              entities:
+                organization:
+                  required: true
+                  accepted_values: ['Google', 'Amazon', 'Microsoft', 'Apple']
+                location:
+                  required: false
+                  accepted_values: {{ run_query('select zip_code from locations') }}
+```
+
+Test fails if:
+- The required entity (e.g., `organization`) is missing.
+- Extracted entities do not match the expected values.
+
+### Compare Numeric Values (Coming Soon)
+
+```yml
+models:
+  - name: board_meeting_summaries
+    description: "A table containing board meeting summary texts."
+    columns:
+      - name: meeting_notes
+        description: "A column containing the full summary of the board meeting."
+        tests:
+          - elementary.extract_and_validate_numbers:
+              entities:
+                revenue:
+                  compare_with: ref('crm_financials')
+                  column: sum(revenue)
+                  required: true
+                net_profit:
+                  compare_with: ref('crm_financials')
+                  column: sum(net_profit)
+                customer_count:
+                  compare_with: ref('crm_customers')
+                  column: count(customers)
+                  required: true
+```
+
+Test fails if:
+- Required entities are missing
+- The numerical entities do not match the structured CRM data
\ No newline at end of file
diff --git a/docs/mint.json b/docs/mint.json
index e9bea78ab..dd4586dbf 100644
--- a/docs/mint.json
+++ b/docs/mint.json
@@ -369,6 +369,18 @@
         "data-tests/schema-tests/exposure-tests"
       ]
     },
+    {
+      "group": "AI Data Tests (Beta)",
+      "pages": [
+        "data-tests/ai-data-tests/ai_data_validations",
+        "data-tests/ai-data-tests/unstructured_data_validations",
+        "data-tests/ai-data-tests/snowflake",
+        "data-tests/ai-data-tests/databricks",
+        "data-tests/ai-data-tests/bigquery",
+        "data-tests/ai-data-tests/redshift",
+        "data-tests/ai-data-tests/data-lakes"
+      ]
+    },
     {
       "group": "Other Tests",
       "pages": [