Commit 875d33c

Enterprise data loader (#28)
* added enterprise data loader
* added credal models API
* update documentation
1 parent afbbf45 commit 875d33c

File tree

5 files changed: +409 -2 lines

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
---
sidebar_position: 3
---

# Enterprise Data Loaders

These data loaders facilitate the seamless integration of data into MongoDB's vector store, leveraging specific embedding models. They are designed to support the construction of generative AI applications.

## Loaders and their usage

Support for multiple enterprise data sources is currently provided through the PyAirbyte service, which can be hosted by following the steps in the [pyairbyte-svc GitHub repo](https://github.com/ashwin-gangadhar-mdb/pyairbyte-svc).

The usage steps for each loader define the values the user must set in the `config.yaml` file. Multiple data sources can be configured at the same time by declaring their respective configurations. The enterprise data loader is an extension of the existing loaders and works alongside the [Data Loaders](data-loader.md).
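For instance, two enterprise sources can be declared side by side under `ingest`. The sketch below is illustrative only; the connector config paths, stream names, and field names are placeholders, not values from this repo:

```yaml
ingest:
  - source: 'enterprise'
    connectorName: 'source-google-drive'
    connectorConfigPath: '/path/to/your/gdrive_connector_config.json'
    filterStream: 'your-drive-stream'
    pickKeysForEmbedding:
      - 'title'
      - 'body'
  - source: 'enterprise'
    connectorName: 'source-slack'
    connectorConfigPath: '/path/to/your/slack_connector_config.json'
    filterStream: 'your-slack-stream'
    pickKeysForEmbedding:
      - 'text'
```

Each entry is ingested independently, so a failure in one source's configuration does not have to block the others.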
### Example loader configuration to test a data load

Loaders ingest content directly from enterprise sources (such as Google Drive) once the necessary credentials and configuration are specified. The example below uses the `source-faker` connector to verify that a data load works end to end.

**Usage:**

```yaml
ingest:
  - source: 'enterprise'
    connectorName: 'source-faker'
    connectorConfigPath: '/path/to/your/ent_connector_config.json'
    filterStream: 'products'
    pickKeysForEmbedding:
      - 'make'
      - 'model'
      - 'price'
      - 'id'
      - 'year'
```

Sample data format in MongoDB once the ingest command is run from the console:

![Sample Data](img/entDataLoaderSampleOUT.png)

---
### 1. Google Drive Loader

#### Prerequisites:
- A GCP project
- The Google Drive API enabled in your GCP project
- A service account key with access to the Google Drive folder you want to replicate

#### Set up the service account key (Airbyte Open Source)

##### Create a service account

- Open the [Service Accounts page](https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts) in your Google Cloud console.
- Select an existing project, or create a new project.
- At the top of the page, click **+ Create service account**.
- Enter a name and description for the service account, then click **Create and Continue**.
- Under **Service account permissions**, select the roles to grant to the service account, then click **Continue**. We recommend the **Viewer** role.

##### Generate a key

- Go to the [API Console/Credentials](https://console.cloud.google.com/apis/credentials) page and click the email address of the service account you just created.
- In the **Keys** tab, click **+ Add key**, then click **Create new key**.
- Select **JSON** as the key type. This generates and downloads the JSON key file you'll use for authentication. Click **Continue**.

##### Enable the Google Drive API

- Go to the [API Console/Library](https://console.cloud.google.com/apis/library) page.
- Make sure the correct project is selected at the top.
- Find and select the **Google Drive API**.
- Click **ENABLE**.

If your folder is viewable by anyone with its link, no further action is needed. If not, give your service account access to the folder.
##### Setting up the config file

If you complete all the steps above successfully, you will have a `svc-acc-key.json` file with the following format:

```json
{
  "type": "service_account",
  "project_id": "YOUR_PROJECT_ID",
  "private_key_id": "YOUR_PRIVATE_KEY",
  ...
}
```

Create a JSON file named `ent_connector_config.json` in your `builder/partnerproduct/src` folder and add the following fields:

```json
{
  "folder_url": "https://drive.google.com/drive/folders/your-folder-id",
  "credentials": {
    "auth_type": "Service",
    "service_account_info": "your-service-account-json-as-shown-above"
  },
  "streams": [
    {
      "name": "your-stream-name",
      "globs": ["**/*.csv"],
      "format": {"filetype": "csv"},
      "validation_policy": "Emit Record",
      "days_to_sync_if_history_is_full": 3
    }
  ]
}
```
Create a loader config in your `config.yaml`. The key names listed under `pickKeysForEmbedding` are merged from the document and concatenated into a single field. See the sample data format shown earlier for reference.

```yaml
ingest:
  - source: 'enterprise'
    connectorName: 'source-google-drive'
    connectorConfigPath: '/path/to/your/ent_connector_config.json'
    filterStream: 'your-stream-name'
    pickKeysForEmbedding:
      - your-field-name1
      - your-field-name2
      - ...
```
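Conceptually, the `pickKeysForEmbedding` merge can be sketched as follows. This is a hedged illustration only; the function name and output format are our assumptions, not the loader's actual implementation:

```typescript
// Hypothetical sketch: concatenate the picked keys of a record into one
// text field suitable for embedding. Names and format are assumed.
function mergeKeysForEmbedding(record: Record<string, unknown>, keys: string[]): string {
  return keys
    .filter((key) => record[key] !== undefined) // skip keys absent from the record
    .map((key) => `${key}: ${String(record[key])}`)
    .join(' ');
}

const product = { make: 'Tesla', model: 'Model 3', price: 40000, id: 1, year: 2022 };
console.log(mergeKeysForEmbedding(product, ['make', 'model', 'price']));
// → make: Tesla model: Model 3 price: 40000
```

The merged string is what gets embedded, so pick keys that carry the semantic content you want retrievable.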
---
### 2. Salesforce Loader

#### Prerequisites:
- A [Salesforce account](https://login.salesforce.com/) with Enterprise access or purchased API quota
- (Optional, recommended) A dedicated Salesforce [user](https://help.salesforce.com/s/articleView?id=adding_new_users.htm&type=5&language=en_US)
- (For Airbyte Open Source) Salesforce [OAuth](https://help.salesforce.com/s/articleView?id=sf.remoteaccess_oauth_tokens_scopes.htm&type=5) credentials

You will need the following OAuth credentials to authenticate:

1. Client ID
2. Client Secret
3. Refresh Token

#### Fetch the required credentials for your Salesforce account

To obtain these credentials, follow [this walkthrough](https://medium.com/@bpmmendis94/obtain-access-refresh-tokens-from-salesforce-rest-api-a324fe4ccd9b) with the following modifications:

- If your Salesforce URL is not in the `X.salesforce.com` format, use your Salesforce domain name. For example, if your Salesforce URL is `awesomecompany.force.com`, use that instead of `awesomecompany.salesforce.com`.
- When running a curl command, run it with the `-L` option to follow any redirects.
- If you created a read-only user, use that user's credentials when logging in to generate OAuth tokens.
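As a hedged sketch, the token exchange in that walkthrough generally follows Salesforce's standard OAuth 2.0 refresh-token flow; the placeholders below must be replaced with your own values, and your own domain substituted if it is not under `salesforce.com`:

```shell
# Sketch of the standard Salesforce OAuth 2.0 refresh-token exchange.
# Placeholder values shown; run with -L to follow redirects.
curl -L -X POST "https://login.salesforce.com/services/oauth2/token" \
  -d "grant_type=refresh_token" \
  -d "client_id=your-client-id" \
  -d "client_secret=your-client-secret" \
  -d "refresh_token=your-refresh-token"
```

The JSON response includes the `access_token`; the `refresh_token` itself is the long-lived credential you place in the connector config below.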
##### Setting up the config file

Once you have fetched the fields mentioned above, configure `ent_connector_config.json` in the following format:

```json
{
  "client_id": "your-client-id",
  "client_secret": "your-client-secret",
  "refresh_token": "your-refresh-token",
  "is_sandbox": false,
  "start_date": "2023-01-01T00:00:00Z"
}
```

Create a loader config in your `config.yaml`. The key names listed under `pickKeysForEmbedding` are merged from the document and concatenated into a single field. See the sample data format shown earlier for reference.

```yaml
ingest:
  - source: 'enterprise'
    connectorName: 'source-salesforce'
    connectorConfigPath: '/path/to/your/ent_connector_config.json'
    filterStream: 'your-stream-name'
    pickKeysForEmbedding:
      - your-field-name1
      - your-field-name2
      - ...
```
#### Supported usage with the connector

Airbyte can export all available Salesforce objects dynamically, depending on:

- Whether the authenticated Salesforce user has the role and permissions to read and fetch objects. These are set as part of the Permission Set you assign to the Airbyte user.
- Whether the Salesforce object has the `queryable` property set to `true`. Airbyte can only fetch queryable objects. If an object is queryable but you don't see it available via Airbyte, check that it is API-accessible to the Salesforce user you authenticated with.

---
### 3. Slack Loader

#### Prerequisites:
- Administrator access to an active Slack workspace
- Slack app OAuth (preferred) or an API key

#### Set up Slack

The following instructions guide you through creating a Slack app. Airbyte can only replicate messages from channels that the app has been added to.

:::info
If you are using a legacy Slack API key, you can skip this section.
:::

To create a Slack app, read this [tutorial](https://api.slack.com/tutorials/tracks/getting-a-token), or follow these instructions:

- Go to your [Apps](https://api.slack.com/apps).
- Click **Create New App** and select **From Scratch**.
- Choose a name for your app and select the name of your Slack workspace. Click **Create App**.
- In the navigation menu, select **OAuth & Permissions**.
- Navigate to **Scopes**. In **Bot Token Scopes**, select the following scopes:

```
channels:history
channels:join
channels:read
files:read
groups:read
links:read
reactions:read
remote_files:read
team:read
usergroups:read
users:read
users.profile:read
```

- At the top of the **OAuth & Permissions** page, click **Install to Workspace**. This generates a Bot User OAuth Token; copy it for later if you are using the API token for authentication.
- Go to your Slack instance. For any public channel, go to **Info**, then **More**, and select **Add Apps**.
- Search for your newly created app. (If you are using the desktop version of Slack, you may need to restart Slack for it to pick up the new app.) Add the app to all channels you want to sync data from.

:::note
If you are using an API key to authenticate to Slack, a refresh token is not required, as access tokens never expire. You can learn more about refresh tokens [here](https://api.slack.com/authentication/rotation).
:::
##### Setting up the config file

Once you have fetched the fields mentioned above, configure `ent_connector_config.json` in the following format:

```json
{
  "token": "your-bot-user-oauth-token",
  "start_date": "2023-01-01T00:00:00Z",
  "join_channels": true,
  "include_private_channels": false
}
```

Create a loader config in your `config.yaml`. The key names listed under `pickKeysForEmbedding` are merged from the document and concatenated into a single field. See the sample data format shown earlier for reference.

```yaml
ingest:
  - source: 'enterprise'
    connectorName: 'source-slack'
    connectorConfigPath: '/path/to/your/ent_connector_config.json'
    filterStream: 'your-stream-name'
    pickKeysForEmbedding:
      - your-field-name1
      - your-field-name2
      - ...
```
##### Supported Streams

For most streams, the Slack source connector uses the [Conversations API](https://api.slack.com/docs/conversations-api) under the hood.

- [Channels \(Conversations\)](https://api.slack.com/methods/conversations.list)
- [Channel Members \(Conversation Members\)](https://api.slack.com/methods/conversations.members)
- [Messages \(Conversation History\)](https://api.slack.com/methods/conversations.history). Replicates messages only from non-archived, public and private channels that the Slack app is a member of.
- [Users](https://api.slack.com/methods/users.list)
- [Threads \(Conversation Replies\)](https://api.slack.com/methods/conversations.repli

src/loaders/enterprise-loader.ts

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
import axios from 'axios';
import md5 from 'md5';
import createDebugMessages from 'debug';

import { BaseLoader } from '../interfaces/base-loader.js';
import { UnfilteredLoaderChunk } from '../global/types.js';
import { cleanString } from '../util/strings.js';
import { JsonLoader } from './json-loader.js';

export class EnterpriseLoader extends BaseLoader<{ type: 'EnterpriseLoader' }> {
    private readonly debug = createDebugMessages('maap:loader:EnterpriseLoader');
    private readonly enterpriseUrl: string;
    private readonly connectorConfig: object;
    private readonly pickKeysForEmbedding: string[];
    private readonly connectorName: string;
    private readonly filterStream: string;
    private readonly requestBody: object;

    constructor({
        connectorName,
        connectorConfig,
        filterStream,
        pickKeysForEmbedding,
    }: {
        connectorName: string;
        connectorConfig: object;
        filterStream: string;
        pickKeysForEmbedding: string[];
    }) {
        super(`EnterpriseLoader_${md5(cleanString(JSON.stringify(connectorConfig)))}`);
        this.connectorName = connectorName;
        this.enterpriseUrl = process.env.ENTERPRISE_URL ?? 'http://localhost:5000/ab/source';
        this.connectorConfig = connectorConfig;
        this.pickKeysForEmbedding = pickKeysForEmbedding;
        this.filterStream = filterStream;

        this.requestBody = {
            source: this.connectorName,
            config: this.connectorConfig,
            streams: this.filterStream,
        };
    }

    override async *getUnfilteredChunks(): AsyncGenerator<UnfilteredLoaderChunk<{ type: 'EnterpriseLoader' }>, void, void> {
        try {
            const response = await axios.post(this.enterpriseUrl, this.requestBody, {
                headers: {
                    'Accept-Encoding': 'gzip, deflate',
                    'Keep-Alive': '10000',
                    'Content-Type': 'application/json',
                },
            });
            const data = response.data.data;
            this.debug(`Enterprise '${this.enterpriseUrl}' returned ${data.length} entries`);

            for (const item of data) {
                const jsonLoader = new JsonLoader({ object: item, pickKeysForEmbedding: this.pickKeysForEmbedding });
                for await (const chunk of jsonLoader.getUnfilteredChunks()) {
                    yield {
                        ...chunk,
                        metadata: {
                            ...chunk.metadata,
                            type: <'EnterpriseLoader'>'EnterpriseLoader',
                            source: response.data.source + ' ' + response.data.stream,
                        },
                    };
                }
            }
        } catch (e) {
            console.log('Could not fetch data from URL', this.enterpriseUrl, e);
            this.debug('Could not fetch data from URL', this.enterpriseUrl, e);
        }
    }
}
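A hedged sketch of the request/response contract this loader assumes from the PyAirbyte service; the concrete field values below are illustrative (borrowed from the `source-faker` example in the docs), not values the service is guaranteed to return:

```typescript
// Illustrative request body POSTed to ENTERPRISE_URL
// (default http://localhost:5000/ab/source).
const requestBody = {
  source: 'source-faker',   // connectorName from config.yaml
  config: { count: 10 },    // parsed ent_connector_config.json (assumed example)
  streams: 'products',      // filterStream
};

// The loader reads response.data.data, response.data.source and
// response.data.stream, so a response is assumed to look roughly like:
const exampleResponse = {
  source: 'source-faker',
  stream: 'products',
  data: [{ make: 'Tesla', model: 'Model 3', price: 40000, id: 1, year: 2022 }],
};

console.log(`${exampleResponse.source} returned ${exampleResponse.data.length} entries`);
// → source-faker returned 1 entries
```

Each element of `data` is then handed to `JsonLoader`, which applies `pickKeysForEmbedding` before chunking.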

src/models/credal-model.ts

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
import createDebugMessages from 'debug';
import { ChatOpenAI } from '@langchain/openai';
import { HumanMessage, AIMessage, SystemMessage } from '@langchain/core/messages';
import { StringOutputParser } from '@langchain/core/output_parsers';

import { BaseModel } from '../interfaces/base-model.js';
import { Chunk, ConversationHistory } from '../global/types.js';

export class CredalModel extends BaseModel {
    private readonly debug = createDebugMessages('maap:model:CredalLLM');
    private readonly modelName: string;
    private readonly maxTokens: number;
    private model: ChatOpenAI;
    private modelSource: string;

    constructor(params?: { modelSource?: string; temperature?: number; modelName?: string; maxTokens?: number }) {
        super(params?.temperature ?? 0.1);
        this.modelSource = params?.modelSource ?? 'openai';
        this.modelName = params?.modelName ?? 'gpt-4';
        this.maxTokens = params?.maxTokens ?? 2048;
    }

    override async init(): Promise<void> {
        // Route to the provider-specific path on the Credal proxy.
        let baseUrl = process.env.CREDAL_BASE_URL;
        if (this.modelSource === 'openai') {
            baseUrl = process.env.CREDAL_BASE_URL + '/openai';
        } else if (this.modelSource === 'anthropic') {
            baseUrl = process.env.CREDAL_BASE_URL + '/anthropic';
        }
        this.model = new ChatOpenAI({
            temperature: this.temperature,
            maxTokens: this.maxTokens,
            model: this.modelName,
            apiKey: process.env.CREDAL_API_KEY,
            // Use the resolved provider-specific URL. (The original passed
            // CREDAL_BASE_URL directly, which ignored the routing above.)
            configuration: { baseURL: baseUrl },
        });
    }

    override async runQuery(
        system: string,
        userQuery: string,
        supportingContext: Chunk[],
        pastConversations: ConversationHistory[],
    ): Promise<string> {
        const pastMessages: (AIMessage | SystemMessage | HumanMessage)[] = this.generatePastMessages(system, supportingContext, pastConversations, userQuery);
        const result = await this.model.invoke(pastMessages);
        this.debug('Credal response -', result);
        return result.content.toString();
    }

    protected runStreamQuery(system: string, userQuery: string, supportingContext: Chunk[], pastConversations: ConversationHistory[]): Promise<any> {
        const pastMessages: (AIMessage | SystemMessage | HumanMessage)[] = this.generatePastMessages(system, supportingContext, pastConversations, userQuery);
        const parser = new StringOutputParser();
        return this.model.pipe(parser).stream(pastMessages);
    }

    private generatePastMessages(system: string, supportingContext: Chunk[], pastConversations: ConversationHistory[], userQuery: string) {
        const pastMessages: (AIMessage | SystemMessage | HumanMessage)[] = [new SystemMessage(system)];
        pastMessages.push(
            new SystemMessage(`Supporting context: ${supportingContext.map((s) => s.pageContent).join('; ')}`),
        );

        pastMessages.push.apply(
            pastMessages,
            pastConversations.map((c) => {
                if (c.sender === 'AI') return new AIMessage({ content: c.message });
                else if (c.sender === 'SYSTEM') return new SystemMessage({ content: c.message });
                else return new HumanMessage({ content: c.message });
            }),
        );
        pastMessages.push(new HumanMessage(`${userQuery}?`));

        this.debug('Executing Credal model with prompt -', userQuery);
        return pastMessages;
    }

    public getModel() {
        if (!this.model) {
            this.init();
        }
        return this.model;
    }
}
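The provider routing that `init()` performs on the Credal proxy URL can be isolated as a small pure function; a minimal sketch (the function name is ours, and the environment handling is simplified):

```typescript
// Minimal sketch of CredalModel's base-URL routing: openai and anthropic
// get provider-specific paths, anything else falls back to the root URL.
function resolveCredalBaseUrl(base: string, modelSource: string): string {
  if (modelSource === 'openai') return `${base}/openai`;
  if (modelSource === 'anthropic') return `${base}/anthropic`;
  return base;
}

console.log(resolveCredalBaseUrl('https://proxy.example.com', 'anthropic'));
// → https://proxy.example.com/anthropic
```

Because both providers are reached through the OpenAI-compatible `ChatOpenAI` client, only the base URL changes per `modelSource`.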
