Commit 6a4b0b0

[doc] drive example (#1319)
1 parent 0fe6645 commit 6a4b0b0

13 files changed

+194
-0
lines changed

@@ -0,0 +1,194 @@
---
title: Real-time data transformation from Google Drive
description: Build real-time data transformation from Google Drive with CocoIndex.
sidebar_class_name: hidden
slug: /examples/google_drive
canonicalUrl: '/examples/google_drive'
sidebar_custom_props:
  image: /img/integrations/google_drive/cover.png
  tags: [vector-index, google-drive, realtime, etl]
image: /img/integrations/google_drive/cover.png
---
12+
import { DocumentationButton, GitHubButton } from '../../../src/components/GitHubButton';
13+
14+
<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/gdrive_text_embedding" margin="0 0 24px 0" />
15+
16+
![Text Embedding from Google Drive](/img/integrations/google_drive/cover.png)
17+
18+
This guide shows how to build a real-time data pipeline with CocoIndex to transform and index files from Google Drive. It walks through setting up Google credentials, configuring CocoIndex, and building a vector index for semantic search.
19+
20+
21+
## Prerequisites
22+
### Install Postgres
23+
If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation).
24+
25+
### Enable Google Drive access by service account
26+
CocoIndex has native built-in support for Google Drive as a source.
27+
28+
<DocumentationButton url="https://cocoindex.io/docs/sources/googledrive" text="GoogleDrive Source" margin="0 0 16px 0" />
29+
30+
### 1. Register or log in to Google Cloud
31+
First, you need to create a Google Cloud account if you don't have one already. Go to the [Google Cloud Console](https://console.cloud.google.com/) and sign up or sign in.
32+
33+
### 2. Select or create a GCP project
34+
35+
Once you've logged into Google Cloud Console, you need to select an existing project or create a new one. Click on the project selector dropdown at the top of the page:
36+
37+
![Select or Create a GCP Project](/img/integrations/google_drive/select_project.png)
38+
39+
40+
41+
### 3. Create a Service Account
42+
1. In Google Cloud Console, search for "Service Accounts" to open the IAM & Admin / Service Accounts page.
43+
![Service Account Search](/img/integrations/google_drive/service_account_search.png)
44+
45+
2. Click on "CREATE SERVICE ACCOUNT" at the top of the page:
46+
47+
![Create Service Account](/img/integrations/google_drive/create_service_account.png)
48+
49+
3. Fill in the service account name, e.g. `cocoindex-test`.
50+
51+
![Create Service Account Form](/img/integrations/google_drive/create_service_account_form.png)
52+
53+
Make a note of the service account's email address; you will need it in a later step.
54+
55+
4. Click on "CREATE" to create the service account.
56+
You will see the service account created successfully.
57+
![Service Account Listing](/img/integrations/google_drive/service_account_listing.png)
58+
59+
### 4. Create and download the key for the service account
60+
1. Click on "Actions" and select "Manage Keys".
61+
![Manage Keys](/img/integrations/google_drive/manage_keys.png)
62+
63+
2. Select "Add Key" and select "Create new key".
64+
![Create New Key](/img/integrations/google_drive/create_new_key.png)
65+
66+
Choose "JSON" as the key type and click "Create".
67+
![Create JSON Key](/img/integrations/google_drive/create_new_key_form.png)
68+
69+
3. The key file will be downloaded to your computer. Depending on your browser settings, the download may start automatically, or a dialog may ask where to save the file. Keep this file secure, as it provides access to your Google Drive resources. It looks like this:
```json
{
  "type": "service_account",
  "project_id": "cocoindexdriveexample",
  "private_key_id": "key_id",
  "private_key": "PRIVATE_KEY",
  "client_email": "cocoindex-test@cocoindexdriveexample.iam.gserviceaccount.com",
  "client_id": "id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/cocoindex-test%40cocoindexdriveexample.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}
```
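Before wiring the key into the pipeline, it can help to confirm the downloaded file is intact. The following is a minimal sketch, not part of CocoIndex: `check_service_account_key` and `REQUIRED_FIELDS` are hypothetical names, and the set of fields checked is an assumption based on the sample file above.

```python
import json
from pathlib import Path

# Assumption: these fields, taken from the sample key file above,
# are treated as the minimum a usable key file should contain.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path: str) -> str:
    """Return the service account email if the key file looks well-formed."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"key file is missing fields: {', '.join(missing)}")
    if data["type"] != "service_account":
        raise ValueError(f"unexpected key type: {data['type']!r}")
    return data["client_email"]
```

The returned email is the address the Drive folder is shared with in a later step.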
85+
86+
87+
### 5. Enable Google Drive API
88+
Search for "Google Drive API" in Google Cloud Console and enable it.
89+
90+
### 6. Prepare and share a folder
91+
1. Create a new folder or use an existing folder in your Google Drive.
92+
- For this project, we create a folder in our own Google Drive and share it with the service account email address created in [Step 3](#3-create-a-service-account), for example `cocoindex-test@cocoindexdriveexample.iam.gserviceaccount.com`.
93+
- My example Google Drive folder is [here](https://drive.google.com/drive/folders/1Yerp-CTs1TQUH52oy7eRqR1WHzRYhtJW?dmr=1&ec=wgc-drive-globalnav-goto).
94+
- The files are also available in the [example repo](https://github.com/cocoindex-io/cocoindex/tree/main/examples/gdrive_text_embedding/data).
95+
2. Share the folder with the service account. Enter the service account email address (e.g., `cocoindex-test@cocoindexdriveexample.iam.gserviceaccount.com`) and give it "Viewer" access.
96+
97+
![Create a new folder in Google Drive](/img/integrations/google_drive/drive_folder.png)
98+
99+
3. Note the folder ID from the URL when you open the folder. The URL will look like:
```
https://drive.google.com/drive/folders/1AbCdEfGhIjKlMnOpQrStUvWxYz
```
103+
104+
The folder ID is the part after `folders/` (in this example: `1AbCdEfGhIjKlMnOpQrStUvWxYz`).
105+
You'll need this folder ID when connecting to the Google Drive API.
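The rule above (take the part after `folders/`) can be sketched in a few lines of Python. This is only an illustration: `extract_folder_id` is a hypothetical helper, not something CocoIndex or Google provides, and it assumes the URL path contains a `folders` segment as in the example.

```python
from urllib.parse import urlparse


def extract_folder_id(url: str) -> str:
    """Return the path segment that follows `folders` in a Drive folder URL."""
    # urlparse separates the query string, so trailing `?...` parameters are ignored.
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "folders" not in segments:
        raise ValueError(f"not a Google Drive folder URL: {url}")
    index = segments.index("folders")
    if index + 1 >= len(segments):
        raise ValueError(f"no folder ID after 'folders/' in: {url}")
    return segments[index + 1]


print(extract_folder_id("https://drive.google.com/drive/folders/1AbCdEfGhIjKlMnOpQrStUvWxYz"))
# prints 1AbCdEfGhIjKlMnOpQrStUvWxYz
```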
106+
107+
108+
## Project setup
109+
110+
1. Create a `pyproject.toml` file in the root directory.
111+
```toml
[project]
name = "gdrive-text-embedding"
version = "0.1.0"
description = "Simple example for cocoindex: build embedding index based on Google Drive files."
requires-python = ">=3.11"
dependencies = ["cocoindex>=0.2.4", "python-dotenv>=1.0.1"]
```
120+
121+
2. Set up `.env`
122+
Create a `.env` file in the root directory with the content below. You can copy it from the [`.env.example`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/gdrive_text_embedding/.env.example) file.
124+
```
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

# Google Drive service account credential path.
#! PLEASE FILL IN
GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/path/to/service_account_credential.json

# Google Drive root folder IDs, comma separated.
#! PLEASE FILL IN YOUR GOOGLE DRIVE FOLDER ID
GOOGLE_DRIVE_ROOT_FOLDER_IDS=1AbCdEfGhIjKlMnOpQrStUvWxYz
```
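Because `GOOGLE_DRIVE_ROOT_FOLDER_IDS` is a comma-separated value, stray spaces or a trailing comma can produce malformed IDs. The sketch below shows one way to read the two Google Drive settings defensively. The helper names `parse_folder_ids` and `read_drive_settings` are hypothetical and not part of CocoIndex; only the variable names come from the `.env` file above.

```python
import os
from collections.abc import Mapping


def parse_folder_ids(raw: str) -> list[str]:
    """Split a comma-separated value, dropping whitespace and empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def read_drive_settings(env: Mapping[str, str] = os.environ) -> tuple[str, list[str]]:
    """Return (credential_path, folder_ids), failing early if either is missing."""
    credential_path = env.get("GOOGLE_SERVICE_ACCOUNT_CREDENTIAL", "")
    folder_ids = parse_folder_ids(env.get("GOOGLE_DRIVE_ROOT_FOLDER_IDS", ""))
    if not credential_path:
        raise ValueError("GOOGLE_SERVICE_ACCOUNT_CREDENTIAL is not set")
    if not folder_ids:
        raise ValueError("GOOGLE_DRIVE_ROOT_FOLDER_IDS is empty")
    return credential_path, folder_ids
```

Failing early with a clear message is a design choice here: a missing setting is easier to diagnose at startup than as an authentication error later on.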
137+
138+
## Define CocoIndex Flow
139+
140+
Let's define the CocoIndex flow to build text embeddings from Google Drive.
141+
142+
First, let's load the files from Google Drive as a source. CocoIndex provides a `GoogleDrive` source as a native built-in [source](https://cocoindex.io/docs/sources). You just need to provide the service account credential path and the root folder IDs.
143+
144+
<DocumentationButton url="https://cocoindex.io/docs/sources/googledrive" text="GoogleDrive Source" margin="0 0 16px 0" />
145+
146+
### 1. Load the files from Google Drive
```python
import os

import cocoindex


@cocoindex.flow_def(name="GoogleDriveTextEmbedding")
def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    """
    Define an example flow that embeds text into a vector database.
    """
    # Both values are the ones set in the `.env` file during project setup.
    credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
    root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")

    # Register Google Drive as the source behind the `documents` table.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.GoogleDrive(
            service_account_credential_path=credential_path,
            root_folder_ids=root_folder_ids))

    # Collector that the rest of the flow fills with embeddings.
    doc_embeddings = data_scope.add_collector()
```
163+
164+
`flow_builder.add_source` creates a table with the following sub fields (see the [documentation](https://cocoindex.io/docs/sources)):
165+
- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md`
166+
- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
167+
168+
169+
### Rest of the flow
170+
For the rest of the flow, we can follow the tutorial
171+
[Simple Vector Index](https://cocoindex.io/docs/examples/simple_vector_index).
172+
The entire project is available [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/gdrive_text_embedding).
173+
174+
175+
### Query and test your index
176+
🎉 Now you are all set!
177+
178+
#### Run the following command to set up and update the index.
```sh
cocoindex update --setup main
```
182+
183+
You'll see the index update status in the terminal. For example:
```sh
documents: 3 added, 0 removed, 0 updated
```
187+
188+
#### CocoInsight
189+
190+
CocoInsight is a web interface for understanding your data pipeline and interacting with the index. CocoInsight has zero data retention for your pipeline data.
191+
```sh
cocoindex server -ci main
```