|
| 1 | +--- |
| 2 | +sidebar_position: 3 |
| 3 | +--- |
| 4 | + |
| 5 | +# Enterprise Data Loaders |
| 6 | + |
| 7 | +This data loader facilitate the seamless integration of data into MongoDB's vector store, leveraging specific embedding models. They are designed to support the construction of gen AI application. |
| 8 | + |
| 9 | + |
| 10 | +## Loaders with their usage |
| 11 | +Currently the support of multiple enterprise data sources are extend throught the pyairbyte service that can be hosted following steps in [pyairbyte-svc github repo](https://github.com/ashwin-gangadhar-mdb/pyairbyte-svc). |
| 12 | +The usage steps defines what values are required to be set by the user in the `config.yaml` file to work with the loader. |
| 13 | +Multiple data sources can be configured at the same time to ingest data by decalring their respective configurations. |
| 14 | +The Enterprise data loader is extension to the exisiting data loader and works along side the [Data Loaders](data-loader.md) |
| 15 | + |
| 16 | +### Loader Example format to test Data Load |
| 17 | + |
| 18 | +Used to load and ingest content directly from Google drive spaces by specifying the necessary credentials and configuration. |
| 19 | + |
| 20 | +**Usage:** |
| 21 | +```js |
| 22 | +ingest: |
| 23 | + - source: 'enterprise' |
| 24 | + connectorName: 'source-faker' |
| 25 | + connectorConfigPath: '/path/to/your/ent_connector_config.json' |
| 26 | + filterStream: 'products' |
| 27 | + pickKeysForEmbedding: |
| 28 | + - 'make' |
| 29 | + - 'model' |
| 30 | + - 'price' |
| 31 | + - 'id' |
| 32 | + - 'year' |
| 33 | +``` |
| 34 | +Sample Data format in MongoDB once the ingest command is run from the console |
| 35 | + |
| 36 | +--- |
| 37 | +### 1. Gdrive Loader |
| 38 | + |
| 39 | + #### Prerequisites: |
| 40 | + - A GCP project |
| 41 | + - Enable the Google Drive API in your GCP project |
| 42 | + - Service Account Key with access to the Spreadsheet you want to replicate |
| 43 | + |
| 44 | + #### Set up the service account key (Airbyte Open Source) |
| 45 | + |
| 46 | + ##### Create a service account |
| 47 | + |
| 48 | + - Open the [Service Accounts page](https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts) in your Google Cloud console. |
| 49 | + - Select an existing project, or create a new project. |
| 50 | + - At the top of the page, click **+ Create service account**. |
| 51 | + - Enter a name and description for the service account, then click **Create and Continue**. |
| 52 | + - Under **Service account permissions**, select the roles to grant to the service account, then click **Continue**. We recommend the **Viewer** role. |
| 53 | + |
| 54 | + ##### Generate a key |
| 55 | + - Go to the [API Console/Credentials](https://console.cloud.google.com/apis/credentials) page and click on the email address of the service account you just created. |
| 56 | + - In the **Keys** tab, click **+ Add key**, then click **Create new key**. |
| 57 | + - Select **JSON** as the Key type. This will generate and download the JSON key file that you'll use for authentication. Click **Continue**. |
| 58 | + |
| 59 | + ##### Enable the Google Drive API |
| 60 | + - Go to the [API Console/Library](https://console.cloud.google.com/apis/library) page. |
| 61 | + - Make sure you have selected the correct project from the top. |
| 62 | + - Find and select the **Google Drive API**. |
| 63 | + - Click **ENABLE**. |
| 64 | + |
| 65 | + If your folder is viewable by anyone with its link, no further action is needed. If not, give your Service account access to your folder. |
| 66 | + |
| 67 | + ##### Setting up the config file |
| 68 | + |
| 69 | + If you are successfully able to complete all the above steps you will get a `svc-acc-key.json` file that has the following format: |
| 70 | + |
| 71 | +```js |
| 72 | +{ |
| 73 | + "type": "service_account", |
| 74 | + "project_id": "YOUR_PROJECT_ID", |
| 75 | + "private_key_id": "YOUR_PRIVATE_KEY", |
| 76 | + ... |
| 77 | +} |
| 78 | +``` |
| 79 | + |
| 80 | + Create a JSON file `ent_connector_config.json` in your `builder/partnerproduct/src` folder. Add the following fields in it as follows: |
| 81 | + |
| 82 | +```js |
| 83 | +{ |
| 84 | + "folder_url": "https://drive.google.com/drive/folders/your-folder-id", |
| 85 | + "credentials": { |
| 86 | + "auth_type": "Service", |
| 87 | + "service_account_info": "your-service-account-json-as-shown-above" |
| 88 | + }, |
| 89 | + "streams": [ |
| 90 | + { |
| 91 | + "name": "your-stream-name", |
| 92 | + "globs": ["**/*.csv"], |
| 93 | + "format": {"filetype": "csv"}, |
| 94 | + "validation_policy": "Emit Record", |
| 95 | + "days_to_sync_if_history_is_full": 3 |
| 96 | + } |
| 97 | + ] |
| 98 | +} |
| 99 | +``` |
| 100 | + |
| 101 | + Create a Loader config in your `config.yaml`. Key names mentioned under `pickKeysForEmbedding` will merge all the required fields into the document and concatenate them as a single field. See the example of Sample Data format for reference. |
| 102 | + |
| 103 | +```yaml |
| 104 | +ingest: |
| 105 | + - source: 'enterprise' |
| 106 | + connectorName: 'source-google-drive' |
| 107 | + connectorConfigPath: '/path/to/your/ent_connector_config.json' |
| 108 | + filterStream: 'your-stream-name' |
| 109 | + pickKeysForEmbedding: |
| 110 | + - your-field-name1 |
| 111 | + - your-field-name2 |
| 112 | + - ... |
| 113 | +``` |
| 114 | +
|
| 115 | +--- |
| 116 | +
|
| 117 | +
|
| 118 | +### 2. Salesforce Loader |
| 119 | +
|
| 120 | + Prerequisites: |
| 121 | + - [Salesforce Account](https://login.salesforce.com/) with Enterprise access or API quota purchased |
| 122 | + - (Optional, Recommended) Dedicated Salesforce [user](https://help.salesforce.com/s/articleView?id=adding_new_users.htm&type=5&language=en_US) |
| 123 | + - (For Airbyte Open Source) Salesforce [OAuth](https://help.salesforce.com/s/articleView?id=sf.remoteaccess_oauth_tokens_scopes.htm&type=5) credentials |
| 124 | + -You will need to obtain the following OAuth credentials to authenticate: |
| 125 | + 1. Client ID |
| 126 | + 2. Client Secret |
| 127 | + 3. Refresh Token |
| 128 | +
|
| 129 | +
|
| 130 | + #### Fetch the required credential for Salesforce Account |
| 131 | + To obtain these credentials, follow [this walkthrough](https://medium.com/@bpmmendis94/obtain-access-refresh-tokens-from-salesforce-rest-api-a324fe4ccd9b) with the following modifications: |
| 132 | +
|
| 133 | + - If your Salesforce URL is not in the `X.salesforce.com` format, use your Salesforce domain name. For example, if your Salesforce URL is `awesomecompany.force.com` then use that instead of `awesomecompany.salesforce.com`. |
| 134 | + - When running a curl command, run it with the `-L` option to follow any redirects. |
| 135 | + - If you created a read-only user, use the user credentials when logging in to generate OAuth tokens. |
| 136 | + |
| 137 | + |
| 138 | + ##### Setting up the config file |
| 139 | + |
| 140 | + Once you are able to fetch the above mentioned fields you can configure the `ent_connector_config.json` in the following format |
| 141 | +```js |
| 142 | +{ "client_id": "your-client-id", "client_secret": "your-client-secret", "refresh_token": "your-refresh-token", "is_sandbox": False, "start_date": "2023-01-01T00:00:00Z" } |
| 143 | +``` |
| 144 | + Create a Loader config in your `config.yaml`. Key names mentioned under `pickKeysForEmbedding` will merge all the required fields into the document and concatenate them as a single field. See the example of Sample Data format for reference. |
| 145 | + |
| 146 | +```yaml |
| 147 | +ingest: |
| 148 | + - source: 'enterprise' |
| 149 | + connectorName: 'source-salesforce' |
| 150 | + connectorConfigPath: '/path/to/your/ent_connector_config.json' |
| 151 | + filterStream: 'your-stream-name' |
| 152 | + pickKeysForEmbedding: |
| 153 | + - your-field-name1 |
| 154 | + - your-field-name2 |
| 155 | + - ... |
| 156 | +``` |
| 157 | + |
| 158 | + #### Supported usage with connector |
| 159 | + Airbyte allows exporting all available Salesforce objects dynamically based on: |
| 160 | + |
| 161 | + - If the authenticated Salesforce user has the Role and Permissions to read and fetch objects. This would be set as part of the Permission Set you assign to the Airbyte user. |
| 162 | + - If the Salesforce object has the queryable property set to true. Airbyte can only fetch objects which are queryable. If you don’t see an object available via Airbyte, and it is queryable, check if it is API-accessible to the Salesforce user you authenticated with. |
| 163 | + |
| 164 | +--- |
| 165 | + |
| 166 | +### 3. Slack Loader |
| 167 | + |
| 168 | +#### Prerequisites: |
| 169 | + - Administrator access to an active Slack Workspace |
| 170 | + - Slack App OAuth (preferred) or API Key |
| 171 | + |
| 172 | +#### Set up Slack |
| 173 | + |
| 174 | +The following instructions guide you through creating a Slack app. Airbyte can only replicate messages from channels that the app has been added to. |
| 175 | + |
| 176 | +:::info |
| 177 | +If you are using a legacy Slack API Key, you can skip this section. |
| 178 | +::: |
| 179 | + |
| 180 | +To create a Slack App, read this [tutorial](https://api.slack.com/tutorials/tracks/getting-a-token) on how to create an app, or follow these instructions. |
| 181 | + |
| 182 | +- Go to your [Apps](https://api.slack.com/apps) |
| 183 | +- Click **Create New App**. Select **From Scratch**. |
| 184 | +- Choose a name for your app and select the name of your Slack workspace. Click **Create App**. |
| 185 | +- In the navigation menu, select **OAuth & Permissions**. |
| 186 | +- Navigate to **Scopes**. In **Bot Token Scopes**, select the following scopes: |
| 187 | + |
| 188 | +``` |
| 189 | + channels:history |
| 190 | + channels:join |
| 191 | + channels:read |
| 192 | + files:read |
| 193 | + groups:read |
| 194 | + links:read |
| 195 | + reactions:read |
| 196 | + remote_files:read |
| 197 | + team:read |
| 198 | + usergroups:read |
| 199 | + users:read |
| 200 | + users.profile:read |
| 201 | +``` |
| 202 | +- At the top of the "OAuth & Permissions" page, click **Install to Workspace**. This will generate a Bot User OAuth Token. Copy this for later if you are using the API token for authentication. |
| 203 | +- Go to your Slack instance. For any public channel, go to **Info**, **More**, and select **Add Apps**. |
| 204 | +- Search for your newly created app. (If you are using the desktop version of Slack, you may need to restart Slack for it to pick up the new App). Add the App to all channels you want to sync data from. |
| 205 | + |
| 206 | +:::note |
| 207 | +If you are using an API key to authenticate to Slack, a refresh token is not required, as acccess tokens never expire. You can learn more about refresh tokens [here](https://api.slack.com/authentication/rotation). |
| 208 | +::: |
| 209 | + |
| 210 | +##### Setting up the config file |
| 211 | +Once you are able to fetch the above mentioned fields you can configure the `ent_connector_config.json` in the following format |
| 212 | +```js |
| 213 | +{ "token": "your-bot-user-oauth-token", "start_date": "2023-01-01T00:00:00Z", "join_channels": True, "include_private_channels": False } |
| 214 | +``` |
| 215 | +Create a Loader config in your `config.yaml`. Key names mentioned under `pickKeysForEmbedding` will merge all the required fields into the document and concatenate them as a single field. See the example of Sample Data format for reference. |
| 216 | + |
| 217 | +```yaml |
| 218 | +ingest: |
| 219 | + - source: 'enterprise' |
| 220 | + connectorName: 'source-slack' |
| 221 | + connectorConfigPath: '/path/to/your/ent_connector_config.json' |
| 222 | + filterStream: 'your-stream-name' |
| 223 | + pickKeysForEmbedding: |
| 224 | + - your-field-name1 |
| 225 | + - your-field-name2 |
| 226 | + - ... |
| 227 | +``` |
| 228 | + |
| 229 | +##### Supported Streams |
| 230 | + |
| 231 | +For most of the streams, the Slack source connector uses the [Conversations API](https://api.slack.com/docs/conversations-api) under the hood. |
| 232 | + |
| 233 | +- [Channels \(Conversations\)](https://api.slack.com/methods/conversations.list) |
| 234 | +- [Channel Members \(Conversation Members\)](https://api.slack.com/methods/conversations.members) |
| 235 | +- [Messages \(Conversation History\)](https://api.slack.com/methods/conversations.history) It will only replicate messages from non-archive, public and private channels that the Slack App is a member of. |
| 236 | +- [Users](https://api.slack.com/methods/users.list) |
| 237 | +- [Threads \(Conversation Replies\)](https://api.slack.com/methods/conversations.repli |
0 commit comments