| 1 | +--- |
| 2 | +title: "BigQuery Vertex AI" |
| 3 | +description: "Learn how to configure BigQuery to use Vertex AI models for unstructured data validation tests" |
| 4 | +--- |
| 5 | + |
| 6 | +# BigQuery Setup for Unstructured Data Tests |
| 7 | + |
| 8 | +Elementary's unstructured data validation tests leverage BigQuery ML and Vertex AI models to perform advanced AI-powered validations. This guide will walk you through the setup process. |
| 9 | + |
| 10 | +## Prerequisites |
| 11 | + |
| 12 | +Before you begin, ensure you have: |
| 13 | +- A Google Cloud account with appropriate permissions |
| 14 | +- Access to BigQuery and Vertex AI services |
| 15 | +- A BigQuery dataset in which to create the model used by Elementary's data validation tests. This should be the dataset that stores the unstructured data you want to validate. |
| 16 | + |
| 17 | +## Step 1: Enable the Vertex AI API |
| 18 | + |
| 19 | +1. Navigate to the Google Cloud Console |
| 20 | +2. Go to **APIs & Services** > **API Library** |
| 21 | +3. Search for "Vertex AI API" |
| 22 | +4. Click on the API and select **Enable** |
| 23 | + |
| 24 | +## Step 2: Create a Remote Connection to Vertex AI |
| 25 | + |
| 26 | +Elementary's unstructured data validation tests use BigQuery ML to access pre-trained Vertex AI models. To establish this connection: |
| 27 | + |
| 28 | +1. Navigate to the Google Cloud Console > **BigQuery** |
| 29 | +2. In the Explorer panel, click the **+** button |
| 30 | +3. Select **Connections to external data sources** |
| 31 | +4. Change the connection type to **Vertex AI remote models, remote functions and BigLake (Cloud Resource)** |
| 32 | +5. Select the appropriate region: |
| 33 | + - If your model and dataset are in the same region, select that specific region |
| 34 | + - Otherwise, select multi-region |
| 35 | + |
| 36 | +After creating the connection: |
| 37 | +1. In the BigQuery Explorer, navigate to **External Connections** |
| 38 | +2. Find and click on your newly created connection |
| 39 | +3. Copy the **Service Account ID** for the next step |
| 40 | + |
| 41 | +## Step 3: Grant Vertex AI Access Permissions |
| 42 | + |
| 43 | +Next, grant the connection's service account permission to access Vertex AI: |
| 44 | + |
| 45 | +1. In the Google Cloud Console, go to **IAM & Admin** |
| 46 | +2. Click **+ Grant Access** |
| 47 | +3. Under "New principals", paste the service account ID you copied |
| 48 | +4. Assign the **Vertex AI User** role |
| 49 | +5. Click **Save** |
| 50 | + |
| 51 | +## Step 4: Create an LLM Model Interface in BigQuery |
| 52 | + |
| 53 | +1. In the BigQuery Explorer, navigate to **External Connections** |
| 54 | +2. Find the connection you created in Step 2 and click on it |
| 55 | +3. Copy the **Connection ID** (format: `projects/<project-name>/locations/<region>/connections/<connection-name>`) |
| 56 | +4. [Select a model endpoint](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#gemini-api-multimodal-models). You can use `gemini-1.5-pro-002` as a default endpoint. |
| 57 | +5. Run the following SQL query to create a model in your dataset: |
| 58 | + |
| 59 | +```sql |
| 60 | +CREATE OR REPLACE MODEL |
| 61 | + `<your-project>.<your-dataset>.<name-your-model>` |
| 62 | +REMOTE WITH CONNECTION |
| 63 | + `<paste-here-your-connection-id>` |
| 64 | +OPTIONS ( |
| 65 | + endpoint = '<model-endpoint>' |
| 66 | +); |
| 67 | +``` |
| 68 | + |
| 69 | +### Example |
| 70 | + |
| 71 | +```sql |
| 72 | +CREATE OR REPLACE MODEL |
| 73 | + `my-project.my-dataset.gemini-1.5-pro` |
| 74 | +REMOTE WITH CONNECTION |
| 75 | + `projects/my-project/locations/us/connections/my-remote-connection-model-name` |
| 76 | +OPTIONS ( |
| 77 | + endpoint = 'gemini-1.5-pro-002' |
| 78 | +); |
| 79 | +``` |
| 80 | + |
| 81 | +> **Note:** During development, we used `gemini-1.5-pro` and recommend it as the default model for unstructured data tests in BigQuery. |
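|  | + |
|  | +To check that the model created in Step 4 responds before referencing it in a test, a query along the following lines may help. This is a minimal sketch, not part of Elementary itself: it assumes the placeholder model path from the template above and a hypothetical one-line prompt. `ML.GENERATE_TEXT` sends each `prompt` row to the remote model, and `flatten_json_output` returns the reply as plain text in `ml_generate_text_llm_result`. |
|  | + |
|  | +```sql |
|  | +-- Smoke test: send one hypothetical prompt to the remote model. |
|  | +SELECT |
|  | +  ml_generate_text_llm_result,  -- generated text (available because flatten_json_output is TRUE) |
|  | +  ml_generate_text_status       -- empty when the call succeeded |
|  | +FROM |
|  | +  ML.GENERATE_TEXT( |
|  | +    MODEL `<your-project>.<your-dataset>.<name-your-model>`, |
|  | +    (SELECT 'Reply with the single word OK.' AS prompt), |
|  | +    STRUCT(TRUE AS flatten_json_output) |
|  | +  ); |
|  | +``` |
|  | + |
|  | +If the query returns a row with an empty `ml_generate_text_status`, the connection, the IAM role, and the endpoint are likely configured correctly. A permission error here usually points back to Step 3. |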
| 82 | + |
| 83 | +### Additional Resources |
| 84 | + |
| 85 | +- [Available models and endpoints](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model#gemini-api-multimodal-models) |
| 86 | +- [Documentation on creating remote models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-remote-model) |
| 87 | + |
| 88 | +## Step 5: Run an Unstructured Data Test |
| 89 | + |
| 90 | +Once your model is set up, you can reference it in your Elementary tests: |
| 91 | + |
| 92 | +```yaml |
| 93 | +models: |
| 94 | + - name: table_with_unstructured_data |
| 95 | + description: "A table containing unstructured text data." |
| 96 | + columns: |
| 97 | + - name: text_data |
| 98 | + description: "Unstructured text data stored as a string." |
| 99 | + tests: |
| 100 | + - elementary.validate_unstructured_data: |
| 101 | + expectation_prompt: "The text data should represent an example of unstructured data." |
| 102 | + llm_model_name: "gemini-1.5-pro" |
| 103 | +``` |
| 104 | + |
| 105 | + |
| 106 | + |