
Commit ffa1c0f

Merge pull request #1851 from elementary-data/unstructured_data_tests_docs_v2
Unstructured data tests docs v2
2 parents 6ab611e + 0a00136 commit ffa1c0f

File tree

9 files changed, +610 -2 lines changed


docs/Dockerfile

Lines changed: 1 addition & 2 deletions
@@ -1,8 +1,7 @@
-FROM node:19
+FROM node:20.3.0
 
 WORKDIR /app
 RUN npm i -g mintlify
-RUN mintlify install
 
 EXPOSE 3000
 CMD ["mintlify", "dev"]
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
---
title: "AI Data Validations"
---

<Note type="warning">
**Beta Feature**: The AI data validation test is currently in beta. The functionality and interface may change in future releases.
</Note>

# AI Data Validation with Elementary

## What is AI Data Validation?

Elementary's `elementary.ai_data_validation` test allows you to validate any data column using AI and large language models (LLMs). This test is more flexible than traditional tests, as it can be applied to any column type and uses natural language to define validation rules.

With `ai_data_validation`, you simply describe what you expect from your data in plain English, and Elementary checks whether your data meets those expectations. This is particularly useful for complex validation rules that would be difficult to express with traditional SQL or dbt tests.

## How It Works

Elementary leverages the AI and LLM capabilities built directly into your data warehouse. When you run a validation test:

1. Your data stays within your data warehouse.
2. The warehouse's built-in AI and LLM functions analyze the data.
3. Elementary reports whether each value meets your expectations based on the prompt.
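To make this concrete, here is a rough, illustrative sketch of what such a per-value check can look like when the warehouse is Snowflake (reusing the `crm.contract_date` example from the snippets below). It is not the exact SQL Elementary generates; it only shows the shape of a warehouse-native LLM call.

```sql
-- Illustrative only: a per-value LLM check using Snowflake Cortex.
-- Elementary's generated SQL differs; table, column, and prompt wording are examples.
SELECT
  contract_date,
  SNOWFLAKE.CORTEX.COMPLETE(
    'claude-3-5-sonnet',
    'Expectation: "There should be no contract date in the future". '
      || 'Value: ' || contract_date::STRING
      || '. Answer only yes or no: does the value satisfy the expectation?'
  ) AS llm_verdict
FROM crm;
```

Values the model judges as not meeting the expectation are what the test would surface as failures.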
## Required Setup for Each Data Warehouse

Before you can use Elementary's AI data validations, you need to set up AI and LLM capabilities in your data warehouse:

### Snowflake
- **Prerequisite**: Enable Snowflake Cortex AI LLM functions
- **Recommended Model**: `claude-3-5-sonnet`
- [View Snowflake's Setup Guide](/data-tests/unstructured-data-tests/snowflake)

### Databricks
- **Prerequisite**: Ensure Databricks AI Functions are available
- **Recommended Model**: `databricks-meta-llama-3-3-70b-instruct`
- [View Databricks Setup Guide](/data-tests/unstructured-data-tests/databricks)

### BigQuery
- **Prerequisite**: Configure BigQuery to use Vertex AI models
- **Recommended Model**: `gemini-1.5-pro`
- [View BigQuery's Setup Guide](/data-tests/unstructured-data-tests/bigquery)

### Redshift
- Support coming soon

### Data Lakes
- Currently supported through Snowflake, Databricks, or BigQuery external object tables
- [View Data Lakes Information](/data-tests/unstructured-data-tests/data-lakes)

## Using the AI Data Validation Test

The test requires one main parameter:
- `expectation_prompt`: Describe what you expect from the data in plain English

Optionally, you can also specify:
- `llm_model_name`: Specify which AI model to use (see the recommendations above for each warehouse)

<Info>
This test works with any column type, as the data is converted to a string format for validation. This enables natural language data validations for dates, numbers, and other structured data types.
</Info>
<RequestExample>

```yml Models
version: 2

models:
  - name: < model name >
    columns:
      - name: < column name >
        tests:
          - elementary.ai_data_validation:
              expectation_prompt: "Description of what the data should satisfy"
              llm_model_name: "model_name" # Optional
```

```yml Example - Date Validation
version: 2

models:
  - name: crm
    description: "A table containing contract details."
    columns:
      - name: contract_date
        description: "The date when the contract was signed."
        tests:
          - elementary.ai_data_validation:
              expectation_prompt: "There should be no contract date in the future"
```

```yml Example - Numeric Validation
version: 2

models:
  - name: sales
    description: "A table containing sales data."
    columns:
      - name: discount_percentage
        description: "The discount percentage applied to the sale."
        tests:
          - elementary.ai_data_validation:
              expectation_prompt: "The discount percentage should be between 0 and 50, and should only be a whole number."
              llm_model_name: "claude-3-5-sonnet"
              config:
                severity: warn
```

```yml Example - Complex Validation
version: 2

models:
  - name: customer_accounts
    description: "A table containing customer account information."
    columns:
      - name: account_status
        description: "The current status of the customer account."
        tests:
          - elementary.ai_data_validation:
              expectation_prompt: "The account status should be one of: 'active', 'inactive', 'suspended', or 'pending'. If the account is 'suspended', there should be a reason code in the suspension_reason column."
              llm_model_name: "gemini-1.5-pro"
```

</RequestExample>
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
---
title: "BigQuery Vertex AI"
description: "Learn how to configure BigQuery to use Vertex AI models for unstructured data validation tests"
---

# BigQuery Setup for Unstructured Data Tests

Elementary's unstructured data validation tests leverage BigQuery ML and Vertex AI models to perform advanced AI-powered validations. This guide will walk you through the setup process.

## Prerequisites

Before you begin, ensure you have:
- A Google Cloud account with appropriate permissions
- Access to BigQuery and Vertex AI services
- A BigQuery dataset where you'll create the model used by Elementary's data validation tests. This should be the dataset where your unstructured data is stored and where you want to apply validations.

## Step 1: Enable the Vertex AI API

1. Navigate to the Google Cloud Console
2. Go to **APIs & Services** > **API Library**
3. Search for "Vertex AI API"
4. Click on the API and select **Enable**

## Step 2: Create a Remote Connection to Vertex AI

Elementary's unstructured data validation tests use BigQuery ML to access pre-trained Vertex AI models. To establish this connection:

1. Navigate to the Google Cloud Console > **BigQuery**
2. In the Explorer panel, click the **+** button
3. Select **Connections to external data sources**
4. Change the connection type to **Vertex AI remote models, remote functions and BigLake (Cloud Resource)**
5. Select the appropriate region:
   - If your model and dataset are in the same region, select that specific region
   - Otherwise, select multi-region

After creating the connection:
1. In the BigQuery Explorer, navigate to **External Connections**
2. Find and click on your newly created connection
3. Copy the **Service Account ID** for the next step

## Step 3: Grant Vertex AI Access Permissions

Now you need to give the connection's service account permission to access Vertex AI:

1. In the Google Cloud Console, go to **IAM & Admin**
2. Click **+ Grant Access**
3. Under "New principals", paste the service account ID you copied
4. Assign the **Vertex AI User** role
5. Click **Save**

## Step 4: Create an LLM Model Interface in BigQuery

1. In the BigQuery Explorer, navigate to **External Connections**
2. Find your newly created connection from the previous step and click on it
3. Copy the **Connection ID** (format: `projects/<project-name>/locations/<region>/connections/<connection-name>`)
4. [Select a model endpoint](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#gemini-api-multimodal-models). You can use `gemini-1.5-pro-002` as a default endpoint.
5. Run the following SQL query to create a model in your dataset:

```sql
CREATE OR REPLACE MODEL
  `<your-project>.<your-dataset>.<name-your-model>`
REMOTE WITH CONNECTION
  `<paste-here-your-connection-id>`
OPTIONS (
  endpoint = '<model-endpoint>'
);
```

### Example

```sql
CREATE OR REPLACE MODEL
  `my-project.my-dataset.gemini-1.5-pro`
REMOTE WITH CONNECTION
  `projects/my-project/locations/us/connections/my-remote-connection-model-name`
OPTIONS (
  endpoint = 'gemini-1.5-pro-002'
);
```

> **Note:** During development, we used `gemini-1.5-pro` and recommend it as the default model for unstructured data tests in BigQuery.
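Optionally, before wiring the model into a test, you can sanity-check that the remote model responds. The query below is an illustrative example using BigQuery's `ML.GENERATE_TEXT` function with the placeholder names from the template above; adjust the options to your needs.

```sql
-- Illustrative sanity check: ask the remote model for a short reply.
-- Uses the placeholder names from the CREATE MODEL template above.
SELECT *
FROM ML.GENERATE_TEXT(
  MODEL `<your-project>.<your-dataset>.<name-your-model>`,
  (SELECT 'Reply with the single word OK.' AS prompt),
  STRUCT(16 AS max_output_tokens, TRUE AS flatten_json_output)
);
```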
### Additional Resources

- [Available models and endpoints](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#gemini-api-multimodal-models)
- [Documentation on creating remote models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model)

## Step 5: Running an Unstructured Data Test

Once your model is set up, you can reference it in your Elementary tests:

```yaml
models:
  - name: table_with_unstructured_data
    description: "A table containing unstructured text data."
    columns:
      - name: text_data
        description: "Unstructured text data stored as a string."
        tests:
          - elementary.validate_unstructured_data:
              expectation_prompt: "The text data should represent an example of unstructured data."
              llm_model_name: "gemini-1.5-pro"
```
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
title: "Data lakes"
---

Currently, you can apply Elementary's unstructured data validation tests on data lakes using Snowflake, Databricks, or BigQuery external object tables.

Native and direct support for data lakes is coming soon. Please reach out if you would like to discuss this integration and use case.
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
---
title: "Databricks AI Functions"
---

# Setting Up Databricks AI Functions

Elementary's unstructured data validation tests run on top of Databricks AI Functions for Databricks users.
This guide provides details on the prerequisites for using Databricks AI Functions.

## What are Databricks AI Functions?

Databricks AI Functions are built-in SQL functions that allow you to apply AI capabilities directly to your data using SQL. These functions enable you to leverage large language models and other AI capabilities without complex setup or external dependencies, making them ideal for data validation tests.
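For reference, the snippet below is a minimal sketch of what calling a Databricks-hosted foundation model with the built-in `ai_query` SQL function looks like; the prompt is only an example.

```sql
-- Minimal sketch: call a Databricks-hosted foundation model via the built-in
-- ai_query SQL function. The prompt here is illustrative.
SELECT ai_query(
  'databricks-meta-llama-3-3-70b-instruct',
  'Reply with the single word OK.'
) AS llm_response;
```

Elementary's unstructured data validation tests run on top of calls like this, so if this query works in your workspace, the prerequisites below are likely already met.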
## Availability and Prerequisites

To use Databricks AI Functions, your environment must meet the following requirements:

### Runtime Requirements
- **Recommended**: Databricks Runtime 15.3 or above for optimal performance

### Environment Requirements
- Your workspace must be in a supported Model Serving region.
- For Pro SQL warehouses, AWS PrivateLink must be enabled.
- Databricks SQL supports AI Functions, but Databricks SQL Classic does not.

### Models
Databricks AI Functions can run on foundation models hosted in Databricks, external foundation models (such as OpenAI's models), and custom models.
Currently, Elementary's unstructured data validations support only foundation models hosted in Databricks. Support for external and custom models is coming soon.
> **Note**: While developing the tests we worked with `databricks-meta-llama-3-3-70b-instruct`, so we recommend using this model as a default when running unstructured data validation tests in Databricks.
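As a quick reference, configuring the Elementary test on Databricks follows the same pattern as on the other warehouses. The sketch below mirrors the BigQuery example; the table and column names are illustrative.

```yaml
models:
  - name: table_with_unstructured_data
    description: "A table containing unstructured text data."
    columns:
      - name: text_data
        description: "Unstructured text data stored as a string."
        tests:
          - elementary.validate_unstructured_data:
              expectation_prompt: "The text data should represent an example of unstructured data."
              llm_model_name: "databricks-meta-llama-3-3-70b-instruct"
```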
## Region Considerations

When using AI Functions, be aware that some models are limited to specific regions (US and EU). Make sure your Databricks workspace is in a supported region for the Databricks AI Functions.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
title: "Redshift"
---

Elementary's unstructured data validation tests do not currently support Redshift.

On Redshift, setting up LLM functions is more complex and requires deploying a Lambda function to call external LLM models. Documentation and support for this integration are coming soon. Please reach out if you'd like to discuss this use case and integration options.
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: "Snowflake Cortex AI"
---

# Snowflake Cortex AI LLM Functions

This guide provides instructions on how to enable Snowflake Cortex AI LLM functions, which is a prerequisite for running Elementary's unstructured data validation tests on Snowflake.

## What is Snowflake Cortex?

Snowflake Cortex is a fully managed service that brings cutting-edge AI and ML solutions directly into your Snowflake environment. It allows you to leverage the power of large language models (LLMs) without any complex setup or external dependencies.
Snowflake provides LLMs that are fully hosted and managed by Snowflake; using them requires no setup, and your data stays within Snowflake.

## Cross-Region Model Usage

> **Important**: It is always better to use models in the same region as your dataset to avoid errors and optimize performance.

To learn where each model is available, we recommend checking this [models list](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#availability).
If you encounter a "model not found" error, it may be because the model you're trying to use is not available in your current region. In such cases, you can enable cross-region model access with the following command (requires ACCOUNTADMIN privileges):

```sql
-- Enable access to models in any region
ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION';
```

This setting allows your account to use models from any region, which can be helpful when the model you need is not available in your current region. However, be aware that cross-region access may impact performance and could have additional cost implications.

## Supported LLM Models

Snowflake Cortex provides access to various industry-leading LLM models with different capabilities and context lengths. Here are the key models available:

### Native Snowflake Models

* **Snowflake Arctic**: An open enterprise-grade model developed by Snowflake, optimized for business use cases.

### External Models (Hosted within Snowflake)

* **Claude Models (Anthropic)**: High-capability models for complex reasoning tasks.
* **Mistral Models**: Including mistral-large, mixtral-8x7b, and mistral-7b for various use cases.
* **Llama Models (Meta)**: Including llama3.2-1b, llama3.2-3b, llama3.1-8b, and llama2-70b-chat.
* **Gemma Models (Google)**: Including gemma-7b for code and text completion tasks.

> **Note**: While developing the tests we worked with `claude-3-5-sonnet`, so we recommend using this model as a default when running unstructured data tests in Snowflake.

## Permissions

> **Note**: By default, all users in your Snowflake account already have access to Cortex AI LLM functions through the PUBLIC role. In most cases, you don't need to do anything to enable access.

The `CORTEX_USER` database role in the SNOWFLAKE database includes all the privileges needed to call Snowflake Cortex LLM functions. This role is automatically granted to the PUBLIC role, which all users have by default.

The following commands are **only needed if** your administrator has revoked the default access from the PUBLIC role or if you need to set up specific access controls. If you can already use Cortex functions, you can skip this section.

```sql
-- Run as ACCOUNTADMIN
USE ROLE ACCOUNTADMIN;

-- Create a dedicated role for Cortex users
CREATE ROLE cortex_user_role;

-- Grant the database role to the custom role
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE cortex_user_role;

-- Grant the role to specific users
GRANT ROLE cortex_user_role TO USER <username>;

-- Optionally, grant warehouse access to the role
GRANT USAGE ON WAREHOUSE <warehouse_name> TO ROLE cortex_user_role;
```
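Whether you rely on the default PUBLIC access or a dedicated role, a quick way to confirm that Cortex LLM functions are usable from your current role and warehouse is to run a one-line completion. The query below is an illustrative check, not part of Elementary's tests.

```sql
-- Illustrative smoke test: should return a short completion if Cortex access works.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
  'claude-3-5-sonnet',
  'Reply with the single word OK.'
) AS cortex_response;
```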
