
Commit ffa1c0f

Merge pull request #1851 from elementary-data/unstructured_data_tests_docs_v2
Unstructured data tests docs v2
2 parents 6ab611e + 0a00136 commit ffa1c0f

File tree

9 files changed, +610 -2 lines changed


docs/Dockerfile

Lines changed: 1 addition & 2 deletions
@@ -1,8 +1,7 @@
-FROM node:19
+FROM node:20.3.0
 
 WORKDIR /app
 RUN npm i -g mintlify
-RUN mintlify install
 
 EXPOSE 3000
 CMD ["mintlify", "dev"]
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
---
title: "AI Data Validations"
---

<Note type="warning">
**Beta Feature**: The AI data validation test is currently in beta. The functionality and interface may change in future releases.
</Note>

# AI Data Validation with Elementary

## What is AI Data Validation?

Elementary's `elementary.ai_data_validation` test allows you to validate any data column using AI and large language models (LLMs). This test is more flexible than traditional tests, as it can be applied to any column type and uses natural language to define validation rules.

With `ai_data_validation`, you simply describe what you expect from your data in plain English, and Elementary checks whether your data meets those expectations. This is particularly useful for complex validation rules that would be difficult to express with traditional SQL or dbt tests.

## How It Works

Elementary leverages the AI and LLM capabilities built directly into your data warehouse. When you run a validation test:

1. Your data stays within your data warehouse.
2. The warehouse's built-in AI and LLM functions analyze the data.
3. Elementary reports whether each value meets your expectations based on the prompt.
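To make this concrete, here is a rough, illustrative sketch of what such a per-value check can look like when the warehouse is Snowflake (reusing the `crm.contract_date` example from the snippets below). It is not the exact SQL Elementary generates; it only shows the shape of a warehouse-native LLM call.

```sql
-- Illustrative only: a per-value LLM check using Snowflake Cortex.
-- Elementary's generated SQL differs; table, column, and prompt wording are examples.
SELECT
  contract_date,
  SNOWFLAKE.CORTEX.COMPLETE(
    'claude-3-5-sonnet',
    'Expectation: "There should be no contract date in the future". '
      || 'Value: ' || contract_date::STRING
      || '. Answer only yes or no: does the value satisfy the expectation?'
  ) AS llm_verdict
FROM crm;
```

Values the model judges as not meeting the expectation are what the test would surface as failures.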
## Required Setup for Each Data Warehouse

Before you can use Elementary's AI data validations, you need to set up AI and LLM capabilities in your data warehouse:

### Snowflake
- **Prerequisite**: Enable Snowflake Cortex AI LLM functions
- **Recommended Model**: `claude-3-5-sonnet`
- [View Snowflake's Setup Guide](/data-tests/unstructured-data-tests/snowflake)

### Databricks
- **Prerequisite**: Ensure Databricks AI Functions are available
- **Recommended Model**: `databricks-meta-llama-3-3-70b-instruct`
- [View Databricks Setup Guide](/data-tests/unstructured-data-tests/databricks)

### BigQuery
- **Prerequisite**: Configure BigQuery to use Vertex AI models
- **Recommended Model**: `gemini-1.5-pro`
- [View BigQuery's Setup Guide](/data-tests/unstructured-data-tests/bigquery)

### Redshift
- Support coming soon

### Data Lakes
- Currently supported through Snowflake, Databricks, or BigQuery external object tables
- [View Data Lakes Information](/data-tests/unstructured-data-tests/data-lakes)

## Using the AI Data Validation Test

The test requires one main parameter:
- `expectation_prompt`: Describe what you expect from the data in plain English

Optionally, you can also specify:
- `llm_model_name`: Specify which AI model to use (see the recommendations above for each warehouse)

<Info>
This test works with any column type, as the data is converted to a string format for validation. This enables natural language data validations for dates, numbers, and other structured data types.
</Info>
<RequestExample>

```yml Models
version: 2

models:
  - name: < model name >
    columns:
      - name: < column name >
        tests:
          - elementary.ai_data_validation:
              expectation_prompt: "Description of what the data should satisfy"
              llm_model_name: "model_name" # Optional
```

```yml Example - Date Validation
version: 2

models:
  - name: crm
    description: "A table containing contract details."
    columns:
      - name: contract_date
        description: "The date when the contract was signed."
        tests:
          - elementary.ai_data_validation:
              expectation_prompt: "There should be no contract date in the future"
```

```yml Example - Numeric Validation
version: 2

models:
  - name: sales
    description: "A table containing sales data."
    columns:
      - name: discount_percentage
        description: "The discount percentage applied to the sale."
        tests:
          - elementary.ai_data_validation:
              expectation_prompt: "The discount percentage should be between 0 and 50, and should only be a whole number."
              llm_model_name: "claude-3-5-sonnet"
              config:
                severity: warn
```

```yml Example - Complex Validation
version: 2

models:
  - name: customer_accounts
    description: "A table containing customer account information."
    columns:
      - name: account_status
        description: "The current status of the customer account."
        tests:
          - elementary.ai_data_validation:
              expectation_prompt: "The account status should be one of: 'active', 'inactive', 'suspended', or 'pending'. If the account is 'suspended', there should be a reason code in the suspension_reason column."
              llm_model_name: "gemini-1.5-pro"
```

</RequestExample>
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
---
title: "BigQuery Vertex AI"
description: "Learn how to configure BigQuery to use Vertex AI models for unstructured data validation tests"
---

# BigQuery Setup for Unstructured Data Tests

Elementary's unstructured data validation tests leverage BigQuery ML and Vertex AI models to perform advanced AI-powered validations. This guide will walk you through the setup process.

## Prerequisites

Before you begin, ensure you have:
- A Google Cloud account with appropriate permissions
- Access to BigQuery and Vertex AI services
- A BigQuery dataset where you'll create the model used by Elementary's data validation tests. This should be the dataset where your unstructured data is stored and where you want to apply validations.

## Step 1: Enable the Vertex AI API

1. Navigate to the Google Cloud Console
2. Go to **APIs & Services** > **API Library**
3. Search for "Vertex AI API"
4. Click on the API and select **Enable**

## Step 2: Create a Remote Connection to Vertex AI

Elementary's unstructured data validation tests use BigQuery ML to access pre-trained Vertex AI models. To establish this connection:

1. Navigate to the Google Cloud Console > **BigQuery**
2. In the Explorer panel, click the **+** button
3. Select **Connections to external data sources**
4. Change the connection type to **Vertex AI remote models, remote functions and BigLake (Cloud Resource)**
5. Select the appropriate region:
   - If your model and dataset are in the same region, select that specific region
   - Otherwise, select multi-region

After creating the connection:
1. In the BigQuery Explorer, navigate to **External Connections**
2. Find and click on your newly created connection
3. Copy the **Service Account ID** for the next step

## Step 3: Grant Vertex AI Access Permissions

Now you need to give the connection's service account permission to access Vertex AI:

1. In the Google Cloud Console, go to **IAM & Admin**
2. Click **+ Grant Access**
3. Under "New principals", paste the service account ID you copied
4. Assign the **Vertex AI User** role
5. Click **Save**

## Step 4: Create an LLM Model Interface in BigQuery

1. In the BigQuery Explorer, navigate to **External Connections**
2. Find your newly created connection from the previous step and click on it
3. Copy the **Connection ID** (format: `projects/<project-name>/locations/<region>/connections/<connection-name>`)
4. [Select a model endpoint](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#gemini-api-multimodal-models). You can use `gemini-1.5-pro-002` as a default endpoint.
5. Run the following SQL query to create a model in your dataset:

```sql
CREATE OR REPLACE MODEL
  `<your-project>.<your-dataset>.<name-your-model>`
REMOTE WITH CONNECTION
  `<paste-here-your-connection-id>`
OPTIONS (
  endpoint = '<model-endpoint>'
);
```

### Example

```sql
CREATE OR REPLACE MODEL
  `my-project.my-dataset.gemini-1.5-pro`
REMOTE WITH CONNECTION
  `projects/my-project/locations/us/connections/my-remote-connection-model-name`
OPTIONS (
  endpoint = 'gemini-1.5-pro-002'
);
```

> **Note:** During development, we used `gemini-1.5-pro` and recommend it as the default model for unstructured data tests in BigQuery.
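Optionally, before wiring the model into a test, you can sanity-check that the remote model responds. The query below is an illustrative example using BigQuery's `ML.GENERATE_TEXT` function with the placeholder names from the template above; adjust the options to your needs.

```sql
-- Illustrative sanity check: ask the remote model for a short reply.
-- Uses the placeholder names from the CREATE MODEL template above.
SELECT *
FROM ML.GENERATE_TEXT(
  MODEL `<your-project>.<your-dataset>.<name-your-model>`,
  (SELECT 'Reply with the single word OK.' AS prompt),
  STRUCT(16 AS max_output_tokens, TRUE AS flatten_json_output)
);
```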
### Additional Resources

- [Available models and endpoints](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#gemini-api-multimodal-models)
- [Documentation on creating remote models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model)

## Step 5: Running an Unstructured Data Test

Once your model is set up, you can reference it in your Elementary tests:

```yaml
models:
  - name: table_with_unstructured_data
    description: "A table containing unstructured text data."
    columns:
      - name: text_data
        description: "Unstructured text data stored as a string."
        tests:
          - elementary.validate_unstructured_data:
              expectation_prompt: "The text data should represent an example of unstructured data."
              llm_model_name: "gemini-1.5-pro"
```
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
title: "Data lakes"
---

Currently, you can apply Elementary's unstructured data validation tests on data lakes using Snowflake, Databricks, or BigQuery external object tables.

Native and direct support for data lakes is coming soon. Please reach out if you would like to discuss this integration and use case.
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
---
title: "Databricks AI Functions"
---

# Setting Up Databricks AI Functions

Elementary's unstructured data validation tests run on top of Databricks AI Functions for Databricks users.
This guide provides details on the prerequisites for using Databricks AI Functions.

## What are Databricks AI Functions?

Databricks AI Functions are built-in SQL functions that allow you to apply AI capabilities directly to your data using SQL. These functions enable you to leverage large language models and other AI capabilities without complex setup or external dependencies, making them ideal for data validation tests.
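For reference, the snippet below is a minimal sketch of what calling a Databricks-hosted foundation model with the built-in `ai_query` SQL function looks like; the prompt is only an example.

```sql
-- Minimal sketch: call a Databricks-hosted foundation model via the built-in
-- ai_query SQL function. The prompt here is illustrative.
SELECT ai_query(
  'databricks-meta-llama-3-3-70b-instruct',
  'Reply with the single word OK.'
) AS llm_response;
```

Elementary's unstructured data validation tests run on top of calls like this, so if this query works in your workspace, the prerequisites below are likely already met.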
## Availability and Prerequisites

To use Databricks AI Functions, your environment must meet the following requirements:

### Runtime Requirements
- **Recommended**: Databricks Runtime 15.3 or above for optimal performance

### Environment Requirements
- Your workspace must be in a supported Model Serving region.
- For Pro SQL warehouses, AWS PrivateLink must be enabled.
- Databricks SQL supports AI Functions, but Databricks SQL Classic does not.

### Models
Databricks AI Functions can run on foundation models hosted in Databricks, external foundation models (such as OpenAI's models), and custom models.
Currently, Elementary's unstructured data validations support only foundation models hosted in Databricks. Support for external and custom models is coming soon.
> **Note**: While developing the tests we worked with `databricks-meta-llama-3-3-70b-instruct`, so we recommend using this model as a default when running unstructured data validation tests in Databricks.
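As a quick reference, configuring the Elementary test on Databricks follows the same pattern as on the other warehouses. The sketch below mirrors the BigQuery example; the table and column names are illustrative.

```yaml
models:
  - name: table_with_unstructured_data
    description: "A table containing unstructured text data."
    columns:
      - name: text_data
        description: "Unstructured text data stored as a string."
        tests:
          - elementary.validate_unstructured_data:
              expectation_prompt: "The text data should represent an example of unstructured data."
              llm_model_name: "databricks-meta-llama-3-3-70b-instruct"
```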
## Region Considerations

When using AI Functions, be aware that some models are limited to specific regions (US and EU). Make sure your Databricks workspace is in a supported region for the Databricks AI Functions.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
title: "Redshift"
---

Elementary's unstructured data validation tests do not currently support Redshift.

On Redshift, setting up LLM functions is more complex and requires deploying a Lambda function to call external LLM models. Documentation and support for this integration are coming soon. Please reach out if you'd like to discuss this use case and integration options.
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: "Snowflake Cortex AI"
---

# Snowflake Cortex AI LLM Functions

This guide provides instructions on how to enable Snowflake Cortex AI LLM functions, which is a prerequisite for running Elementary's unstructured data validation tests on Snowflake.

## What is Snowflake Cortex?

Snowflake Cortex is a fully managed service that brings cutting-edge AI and ML solutions directly into your Snowflake environment. It allows you to leverage the power of large language models (LLMs) without any complex setup or external dependencies.
Snowflake provides LLMs that are fully hosted and managed by Snowflake; using them requires no setup, and your data stays within Snowflake.

## Cross-Region Model Usage

> **Important**: It is always better to use models in the same region as your dataset to avoid errors and optimize performance.

To learn where each model is available, we recommend checking this [models list](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#availability).
If you encounter a "model not found" error, it may be because the model you're trying to use is not available in your current region. In such cases, you can enable cross-region model access with the following command (requires ACCOUNTADMIN privileges):

```sql
-- Enable access to models in any region
ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION';
```

This setting allows your account to use models from any region, which can be helpful when the model you need is not available in your current region. However, be aware that cross-region access may impact performance and could have additional cost implications.

## Supported LLM Models

Snowflake Cortex provides access to various industry-leading LLM models with different capabilities and context lengths. Here are the key models available:

### Native Snowflake Models

* **Snowflake Arctic**: An open enterprise-grade model developed by Snowflake, optimized for business use cases.

### External Models (Hosted within Snowflake)

* **Claude Models (Anthropic)**: High-capability models for complex reasoning tasks.
* **Mistral Models**: Including mistral-large, mixtral-8x7b, and mistral-7b for various use cases.
* **Llama Models (Meta)**: Including llama3.2-1b, llama3.2-3b, llama3.1-8b, and llama2-70b-chat.
* **Gemma Models (Google)**: Including gemma-7b for code and text completion tasks.

> **Note**: While developing the tests we worked with `claude-3-5-sonnet`, so we recommend using this model as a default when running unstructured data tests in Snowflake.

## Permissions

> **Note**: By default, all users in your Snowflake account already have access to Cortex AI LLM functions through the PUBLIC role. In most cases, you don't need to do anything to enable access.

The `CORTEX_USER` database role in the SNOWFLAKE database includes all the privileges needed to call Snowflake Cortex LLM functions. This role is automatically granted to the PUBLIC role, which all users have by default.

The following commands are **only needed if** your administrator has revoked the default access from the PUBLIC role or if you need to set up specific access controls. If you can already use Cortex functions, you can skip this section.

```sql
-- Run as ACCOUNTADMIN
USE ROLE ACCOUNTADMIN;

-- Create a dedicated role for Cortex users
CREATE ROLE cortex_user_role;

-- Grant the database role to the custom role
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE cortex_user_role;

-- Grant the role to specific users
GRANT ROLE cortex_user_role TO USER <username>;

-- Optionally, grant warehouse access to the role
GRANT USAGE ON WAREHOUSE <warehouse_name> TO ROLE cortex_user_role;
```
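Whether you rely on the default PUBLIC access or a dedicated role, a quick way to confirm that Cortex LLM functions are usable from your current role and warehouse is to run a one-line completion. The query below is an illustrative check, not part of Elementary's tests.

```sql
-- Illustrative smoke test: should return a short completion if Cortex access works.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
  'claude-3-5-sonnet',
  'Reply with the single word OK.'
) AS cortex_response;
```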
