| 1 | +--- |
| 2 | +title: Real-time data transformation from Google Drive |
| 3 | +description: Build real-time data transformation from Google Drive with CocoIndex. |
| 4 | +sidebar_class_name: hidden |
| 5 | +slug: /examples/google_drive |
| 6 | +canonicalUrl: '/examples/google_drive' |
| 7 | +sidebar_custom_props: |
| 8 | + image: /img/integrations/google_drive/cover.png |
| 9 | + tags: [vector-index, google-drive, realtime, etl] |
| 10 | +image: /img/integrations/google_drive/cover.png |
| 11 | +--- |
| 12 | +import { DocumentationButton, GitHubButton } from '../../../src/components/GitHubButton'; |
| 13 | + |
| 14 | +<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/gdrive_text_embedding" margin="0 0 24px 0" /> |
| 15 | + |
| 16 | + |
| 17 | + |
| 18 | +This guide shows how to build a real-time data pipeline with CocoIndex that transforms and indexes files from Google Drive. It walks through setting up Google credentials, configuring CocoIndex, and building a vector index for semantic search. |
| 19 | + |
| 20 | + |
| 21 | +## Prerequisites |
| 22 | +### Install Postgres |
| 23 | +If you don't have Postgres installed, please refer to the [installation guide](https://cocoindex.io/docs/getting_started/installation). |
| 24 | + |
| 25 | +### Enable Google Drive access by service account |
| 26 | +CocoIndex has built-in support for Google Drive as a source. |
| 27 | + |
| 28 | +<DocumentationButton url="https://cocoindex.io/docs/sources/googledrive" text="GoogleDrive Source" margin="0 0 16px 0" /> |
| 29 | + |
| 30 | +### 1. Register or log in to Google Cloud |
| 31 | +First, you need to create a Google Cloud account if you don't have one already. Go to the [Google Cloud Console](https://console.cloud.google.com/) and sign up or sign in. |
| 32 | + |
| 33 | +### 2. Select or create a GCP project |
| 34 | + |
| 35 | +Once you've logged into Google Cloud Console, you need to select an existing project or create a new one. Click on the project selector dropdown at the top of the page: |
| 36 | + |
| 37 | + |
| 38 | + |
| 39 | + |
| 40 | + |
| 41 | +### 3. Create a Service Account |
| 42 | +1. In Google Cloud Console, search for "Service Accounts" to open the IAM & Admin / Service Accounts page. |
| 43 | +  |
| 44 | + |
| 45 | +2. Click on "CREATE SERVICE ACCOUNT" at the top of the page: |
| 46 | + |
| 47 | +  |
| 48 | + |
| 49 | +3. Fill in the service account name, e.g. `cocoindex-test`. |
| 50 | + |
| 51 | +  |
| 52 | + |
| 53 | + Make a note of the service account's email address; you will need it in a later step. |
| 54 | + |
| 55 | +4. Click on "CREATE" to create the service account. |
| 56 | + You will see the service account created successfully. |
| 57 | +  |
| 58 | + |
| 59 | +### 4. Create and download the key for the service account |
| 60 | +1. Click on "Actions" and select "Manage Keys". |
| 61 | +  |
| 62 | + |
| 63 | +2. Select "Add Key" and select "Create new key". |
| 64 | +  |
| 65 | + |
| 66 | + Choose "JSON" as the key type and click "Create". |
| 67 | +  |
| 68 | + |
| 69 | +3. The key file will be downloaded to your computer. Depending on your browser settings, the download may start automatically or a dialog may ask where to save the file. Keep this file secure, as it provides access to your Google Drive resources. It looks like this: |
| 70 | + ```json |
| 71 | + { |
| 72 | + "type": "service_account", |
| 73 | + "project_id": "cocoindexdriveexample", |
| 74 | + "private_key_id": "key_id", |
| 75 | + "private_key": "PRIVATE_KEY", |
| 76 | + "client_email": "cocoindex-test@cocoindexdriveexample.iam.gserviceaccount.com", |
| 77 | + "client_id": "id", |
| 78 | + "auth_uri": "https://accounts.google.com/o/oauth2/auth", |
| 79 | + "token_uri": "https://oauth2.googleapis.com/token", |
| 80 | + "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", |
| 81 | + "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/cocoindex-test%40cocoindexdriveexample.iam.gserviceaccount.com", |
| 82 | + "universe_domain": "googleapis.com" |
| 83 | + } |
| 84 | + ``` |
| 85 | + |
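Before wiring the key into a pipeline, it can help to confirm that the downloaded file is the JSON key shown above. The following is a minimal sketch, not part of CocoIndex: the helper name `check_service_account_key` and the choice of which fields to treat as required are assumptions made for illustration, based only on the sample key file.

```python
import json
from pathlib import Path

# Fields taken from the sample key file above; treating them as required
# is an assumption of this sketch, not a rule stated by Google or CocoIndex.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email", "token_uri")


def check_service_account_key(path: str) -> str:
    """Return the service account email if the key file looks well-formed."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Collect every required field that is absent or empty.
    missing = [name for name in REQUIRED_FIELDS if not data.get(name)]
    if missing:
        raise ValueError(f"key file is missing fields: {', '.join(missing)}")
    if data["type"] != "service_account":
        raise ValueError(f"unexpected key type: {data['type']!r}")
    return data["client_email"]
```

The returned email is the address the Drive folder needs to be shared with, so printing it once is a quick way to avoid copying the wrong account.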
| 86 | + |
| 87 | +### 5. Enable Google Drive API |
| 88 | +Search for "Google Drive API" in Google Cloud Console and enable it. |
| 89 | + |
| 90 | +### 6. Prepare and share a folder |
| 91 | +1. Create a new folder or use an existing folder in your Google Drive. |
| 92 | + - For this project, we create a folder in our own Google Drive and share it with the email address of the service account created in [Step 3](#3-create-a-service-account), for example `cocoindex-test@cocoindexdriveexample.iam.gserviceaccount.com`. |
| 93 | + - The example Google Drive folder is [here](https://drive.google.com/drive/folders/1Yerp-CTs1TQUH52oy7eRqR1WHzRYhtJW?dmr=1&ec=wgc-drive-globalnav-goto). |
| 94 | + - The files are also available in the [example repo](https://github.com/cocoindex-io/cocoindex/tree/main/examples/gdrive_text_embedding/data). |
| 95 | +2. Share the folder with the service account. Enter the service account email address (e.g., `cocoindex-test@cocoindexdriveexample.iam.gserviceaccount.com`) and give it "Viewer" access. |
| 96 | + |
| 97 | +  |
| 98 | + |
| 99 | +3. Note the folder ID from the URL when you open the folder. The URL will look like: |
| 100 | + ``` |
| 101 | + https://drive.google.com/drive/folders/1AbCdEfGhIjKlMnOpQrStUvWxYz |
| 102 | + ``` |
| 103 | + |
| 104 | + The folder ID is the part after `folders/` (in this example: `1AbCdEfGhIjKlMnOpQrStUvWxYz`). |
| 105 | + You'll need this folder ID when connecting to the Google Drive API. |
| 106 | + |
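The folder ID rule above (take the part after `folders/`) can be sketched in a few lines of Python. The function name `folder_id_from_url` is hypothetical, and the sketch assumes the URL has the `.../folders/<id>` shape shown in this guide; any query string, such as the one on the example folder link, is ignored.

```python
from urllib.parse import urlparse


def folder_id_from_url(url: str) -> str:
    """Return the path segment that follows `folders/` in a Drive folder URL."""
    # urlparse separates the path from the query string, so parameters
    # like `?dmr=1` never end up in the returned ID.
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    if "folders" not in segments:
        raise ValueError(f"not a Drive folder URL: {url!r}")
    index = segments.index("folders")
    if index + 1 >= len(segments):
        raise ValueError(f"no folder ID after 'folders/': {url!r}")
    return segments[index + 1]


print(folder_id_from_url(
    "https://drive.google.com/drive/folders/1AbCdEfGhIjKlMnOpQrStUvWxYz"
))
```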
| 107 | + |
| 108 | +## Project setup |
| 109 | + |
| 110 | +1. Create a `pyproject.toml` file in the root directory. |
| 111 | + |
| 112 | + ```toml |
| 113 | + [project] |
| 114 | + name = "gdrive-text-embedding" |
| 115 | + version = "0.1.0" |
| 116 | + description = "Simple example for cocoindex: build embedding index based on Google Drive files." |
| 117 | + requires-python = ">=3.11" |
| 118 | + dependencies = ["cocoindex>=0.2.4", "python-dotenv>=1.0.1"] |
| 119 | + ``` |
| 120 | + |
| 121 | +2. Set up `.env` |
| 122 | + Create a `.env` file in the root directory. |
| 123 | + You can copy the [`.env.example`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/gdrive_text_embedding/.env.example) file and fill in the values: |
| 124 | + |
| 125 | + ``` |
| 126 | + # Postgres database address for cocoindex |
| 127 | + COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex |
| 128 | + |
| 129 | + # Google Drive service account credential path. |
| 130 | + #! PLEASE FILL IN |
| 131 | + GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/path/to/service_account_credential.json |
| 132 | + |
| 133 | + # Google Drive root folder IDs, comma separated. |
| 134 | + #! PLEASE FILL IN YOUR GOOGLE DRIVE FOLDER ID |
| 135 | + GOOGLE_DRIVE_ROOT_FOLDER_IDS=1AbCdEfGhIjKlMnOpQrStUvWxYz |
| 136 | + ``` |
| 137 | + |
| 138 | +## Define CocoIndex Flow |
| 139 | + |
| 140 | +Let's define the CocoIndex flow to build text embeddings from Google Drive. |
| 141 | + |
| 142 | +First, let's load the files from Google Drive as a source. CocoIndex provides `GoogleDrive` as a native built-in [source](https://cocoindex.io/docs/sources). You just need to provide the service account credential path and the root folder IDs. The snippet below assumes `import os` and `import cocoindex` at the top of the file. |
| 143 | + |
| 144 | +<DocumentationButton url="https://cocoindex.io/docs/sources/googledrive" text="GoogleDrive Source" margin="0 0 16px 0" /> |
| 145 | + |
| 146 | +### 1. Load the files from Google Drive |
| 147 | +```python |
| 148 | +@cocoindex.flow_def(name="GoogleDriveTextEmbedding") |
| 149 | +def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope): |
| 150 | + """ |
| 151 | + Define an example flow that embeds text into a vector database. |
| 152 | + """ |
| 153 | + credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"] |
| 154 | + root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",") |
| 155 | + |
| 156 | + data_scope["documents"] = flow_builder.add_source( |
| 157 | + cocoindex.sources.GoogleDrive( |
| 158 | + service_account_credential_path=credential_path, |
| 159 | + root_folder_ids=root_folder_ids)) |
| 160 | + |
| 161 | + doc_embeddings = data_scope.add_collector() |
| 162 | +``` |
| 163 | + |
| 164 | +`flow_builder.add_source` creates a table with the following sub-fields (see the [documentation](https://cocoindex.io/docs/sources)): |
| 165 | +- `filename` (key, type: `str`): the filename of the file, e.g. `dir1/file1.md` |
| 166 | +- `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file |
| 167 | + |
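The flow above reads both settings straight from the environment and splits the folder IDs on commas, so a stray space or trailing comma in `.env` would be passed through as part of an ID. A slightly more defensive reading could look like the sketch below. The helper names `parse_folder_ids` and `load_gdrive_settings` are hypothetical and not part of CocoIndex; only the two environment variable names come from this guide.

```python
import os
from collections.abc import Mapping


def parse_folder_ids(raw: str) -> list[str]:
    """Split a comma-separated value, trimming whitespace and dropping empty entries."""
    ids = [part.strip() for part in raw.split(",")]
    return [folder_id for folder_id in ids if folder_id]


def load_gdrive_settings(env: Mapping[str, str] = os.environ) -> tuple[str, list[str]]:
    """Return (credential_path, root_folder_ids), failing early if either is missing."""
    credential_path = env.get("GOOGLE_SERVICE_ACCOUNT_CREDENTIAL", "").strip()
    folder_ids = parse_folder_ids(env.get("GOOGLE_DRIVE_ROOT_FOLDER_IDS", ""))
    if not credential_path:
        raise ValueError("GOOGLE_SERVICE_ACCOUNT_CREDENTIAL is not set")
    if not folder_ids:
        raise ValueError("GOOGLE_DRIVE_ROOT_FOLDER_IDS is empty")
    return credential_path, folder_ids
```

Inside the flow definition, the two `os.environ[...]` lines could then be replaced with `credential_path, root_folder_ids = load_gdrive_settings()`. This assumes the `.env` file has already been loaded into the process environment, for example with `load_dotenv()` from `python-dotenv`, which is listed in the project dependencies.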
| 168 | + |
| 169 | +### Rest of the flow |
| 170 | +For the rest of the flow, we can follow the tutorial |
| 171 | +[Simple Vector Index](https://cocoindex.io/docs/examples/simple_vector_index). |
| 172 | +The entire project is available [here](https://github.com/cocoindex-io/cocoindex/tree/main/examples/gdrive_text_embedding). |
| 173 | + |
| 174 | + |
| 175 | +### Query and test your index |
| 176 | +🎉 Now you are all set! |
| 177 | + |
| 178 | +#### Run the following command to set up and update the index. |
| 179 | + ```sh |
| 180 | + cocoindex update --setup main |
| 181 | + ``` |
| 182 | + |
| 183 | + You'll see the index update status in the terminal, for example: |
| 184 | + ```sh |
| 185 | + documents: 3 added, 0 removed, 0 updated |
| 186 | + ``` |
| 187 | + |
| 188 | +#### CocoInsight |
| 189 | + |
| 190 | + CocoInsight is a web interface for understanding your data pipeline and interacting with the index. CocoInsight has zero data retention for your pipeline data. Run the following command to start it: |
| 191 | + |
| 192 | + ```sh |
| 193 | + cocoindex server -ci main |
| 194 | + ``` |