
Commit aa6af6a

[HF Data Loader] Agent builder dataset loading improvements (#236974)
Closes: elastic/search-team#10957
Closes: #237126

## Summary

Improving the HuggingFace data loader script for Agent Builder evaluation with the following:

- Tweaked bulk API calls, bringing a 30x performance improvement when indexing datasets with pre-computed embeddings.
  - Multiple experiments led to a simple approach: ensure `_inference_fields` get picked up from the cached documents with embeddings, then increase `flushBytes` on the Bulk API (since there's no risk of overloading the ELSER endpoint(s)).
  - First-time embedding generation now takes ~20 mins.
  - Once generated (and cached), indexing of all Agent Builder datasets takes less than 1 minute.
- Fixed the failures for dataset loading when there are 10k+ records in the dataset.
- Removed the default limit of 1000 documents per dataset. Limits now only apply when specified; by default the entire dataset is indexed.
- Added wildcard imports that allow developers to load multiple datasets inside a HuggingFace repository directory using wildcards (`*`). For example, `onechat/knowledge-base/*` will load all datasets currently used for evaluation (from [here](https://huggingface.co/datasets/elastic/OneChatAgent/tree/main/knowledge-base)).

### Testing

- Smoke tests of the loader with/without limits.
- Loaded all of the Agent Builder datasets at once using a wildcard import, with the following command:

  ```
  HUGGING_FACE_ACCESS_TOKEN=<token> node --require ./src/setup_node_env/index.js x-pack/platform/packages/shared/kbn-ai-tools-cli/scripts/hf_dataset_loader.ts --datasets "onechat/knowledge-base/*" --clear --debug
  ```

  - Confirmed that all 14 datasets are loaded.
  - Confirmed that datasets with 10k+ records load correctly.
- Tested the performance improvement from the bulk API tweaks, using the computationally heaviest dataset (`wix_knowledge_base`).
  - Before:

    ```bash
    info Retrieved 6222 documents with embeddings
    debg Indexing 6222 documents with embeddings
    debg Indexing 6222 into wix_knowledge_base
    debg Indexing completed in 550.61s (11.30 docs/sec)
    info Indexed dataset
    ```

  - After:

    ```bash
    debg Indexing 6222 documents with embeddings
    debg Indexing 6222 into wix_knowledge_base
    debg Indexing completed in 15.82s (393.37 docs/sec)
    info Indexed dataset
    ```
1 parent 500b1e5 commit aa6af6a

File tree

8 files changed: +128 -27 lines changed


x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md

Lines changed: 13 additions & 1 deletion
@@ -27,7 +27,7 @@ node --require ./src/setup_node_env/index.js \
 | Flag | Type | Description |
 | -------------- | --------- | ----------------------------------------------------------------------------------------------------- |
 | `--datasets` | `string` | Comma-separated list of dataset **names** to load. Omit the flag to load **all** predefined datasets. |
-| `--limit` | `number` | Max docs per dataset (handy while testing). Defaults to 1k. |
+| `--limit` | `number` | Max docs per dataset (handy while testing). When omitted, all rows will be loaded. |
 | `--clear` | `boolean` | Delete the target index **before** indexing. Defaults to `false`. |
 | `--kibana-url` | `string` | Kibana URL to connect to (bypasses auto-discovery when provided). |
 
@@ -48,6 +48,9 @@ The loader also supports **OneChat datasets** from the `elastic/OneChatAgent` re
 Use the format `onechat/<directory>/<dataset>` to load OneChat datasets:
 
 ```bash
+# Load all OneChat datasets
+--datasets onechat/knowledge-base/*
+
 # Load a single OneChat dataset
 --datasets onechat/knowledge-base/wix_knowledge_base
 
@@ -78,3 +81,12 @@ Run the loader without `--datasets` to see all available OneChat and regular Hug
 ## Disabling local cache
 
 Set the environment variable `DISABLE_KBN_CLI_CACHE=1` to force fresh downloads instead of using the on-disk cache.
+
+## Clearing the cache
+
+Remove the downloaded files and cached documents by deleting the cache directories:
+
+```bash
+rm -rf data/hugging_face_dataset_rows
+rm -rf data/hugging_face_dataset_embeddings
+```

x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/datasets/config.ts

Lines changed: 33 additions & 3 deletions
@@ -7,7 +7,12 @@
 
 import type { Logger } from '@kbn/core/server';
 import type { HuggingFaceDatasetSpec } from '../types';
-import { createOneChatDatasetSpec, isOneChatDataset } from './onechat';
+import {
+  createOneChatDatasetSpec,
+  isOneChatDataset,
+  isOneChatWildcard,
+  listOneChatDatasets,
+} from './onechat';
 
 const BEIR_NAMES = [
   'trec-covid',
@@ -89,15 +94,40 @@ export const PREDEFINED_HUGGING_FACE_DATASETS: HuggingFaceDatasetSpec[] = [
 ];
 
 /**
- * Get dataset specifications, including dynamically generated OneChat datasets
+ * Expands wildcard dataset patterns into concrete dataset names
+ */
+async function expandDatasetNames(
+  datasetNames: string[],
+  accessToken: string,
+  logger: Logger
+): Promise<string[]> {
+  const expansions = await Promise.all(
+    datasetNames.map(async (datasetName) => {
+      if (isOneChatWildcard(datasetName)) {
+        const directory = datasetName.split('/')[1];
+        const datasetsForDirectory = await listOneChatDatasets(directory, accessToken, logger);
+        return datasetsForDirectory;
+      }
+      return [datasetName];
+    })
+  );
+
+  return expansions.flat();
+}
+
+/**
+ * Gets dataset specifications, including dynamically generated OneChat datasets
  */
 export async function getDatasetSpecs(
   accessToken: string,
   logger: Logger,
   datasetNames: string[]
 ): Promise<HuggingFaceDatasetSpec[]> {
+  // First, expand any wildcards into concrete dataset names
+  const expandedNames = await expandDatasetNames(datasetNames, accessToken, logger);
+
   const specs: HuggingFaceDatasetSpec[] = [];
-  for (const name of datasetNames) {
+  for (const name of expandedNames) {
     if (isOneChatDataset(name)) {
       const spec = await createOneChatDatasetSpec(name, accessToken, logger);
       specs.push(spec);

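For orientation, a minimal usage sketch of the expanded API: the wildcard entry is expanded via `listOneChatDatasets`, while concrete names pass through unchanged. The call site and import path are illustrative, not part of this change.

```ts
import type { Logger } from '@kbn/core/server';
import { getDatasetSpecs } from './config'; // hypothetical relative import

// Illustrative only: resolve every dataset under onechat/knowledge-base at once.
async function resolveSpecs(accessToken: string, logger: Logger) {
  const specs = await getDatasetSpecs(accessToken, logger, ['onechat/knowledge-base/*']);
  logger.info(`Resolved ${specs.length} dataset specs`);
  return specs;
}
```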
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/datasets/onechat.ts

Lines changed: 4 additions & 0 deletions
@@ -131,6 +131,10 @@ export function isOneChatDataset(datasetName: string): boolean {
   return datasetName.startsWith('onechat/') && datasetName.split('/').length === 3;
 }
 
+export function isOneChatWildcard(datasetName: string): boolean {
+  return isOneChatDataset(datasetName) && datasetName.endsWith('/*');
+}
+
 /**
  * Lists all available OneChat datasets for a specific directory
  */

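A quick illustration of the new predicate's behaviour; the expected results follow directly from the `startsWith`/`split`/`endsWith` checks above.

```ts
import { isOneChatWildcard } from './onechat';

isOneChatWildcard('onechat/knowledge-base/*');                  // true, wildcard over a directory
isOneChatWildcard('onechat/knowledge-base/wix_knowledge_base'); // false, concrete dataset name
isOneChatWildcard('knowledge-base/*');                          // false, missing the onechat/ prefix
```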
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/indexing/get_embeddings.ts

Lines changed: 45 additions & 15 deletions
@@ -32,21 +32,51 @@ export async function getEmbeddings({
     logger,
   });
 
-  const docsWithEmbeddings = await esClient
-    .search<Record<string, any>>({
-      index: indexName,
-      size: documents.length,
-      fields: ['_inference_fields'],
-    })
-    .then((response) =>
-      response.hits.hits.map((hit) => {
-        const source = hit._source!;
-        Object.entries(source._inference_fields ?? {}).forEach(([fieldName, config]) => {
-          delete (config as Record<string, any>).inference.model_settings.service;
-        });
-        return { ...source, _id: hit._id };
-      })
-    );
+  const docsWithEmbeddings: Array<Record<string, unknown>> = [];
+  const scrollDuration = '1m';
+  const scrollSize = 1000;
+
+  // Use scroll API to handle large datasets with 10k+ documents.
+  let response = await esClient.search<Record<string, any>>({
+    index: indexName,
+    scroll: scrollDuration,
+    size: scrollSize,
+    fields: ['_inference_fields'],
+    query: {
+      match_all: {},
+    },
+  });
+
+  const pushToDocsWithEmbeddings = (hit: Record<string, any>) => {
+    const source = hit._source!;
+    docsWithEmbeddings.push({ ...source, _id: hit._id });
+  };
+
+  // Process initial batch
+  for (const hit of response.hits.hits) {
+    pushToDocsWithEmbeddings(hit);
+  }
+
+  // Continue scrolling through all results
+  while (response.hits.hits.length > 0) {
+    response = await esClient.scroll({
+      scroll_id: response._scroll_id!,
+      scroll: scrollDuration,
+    });
+
+    for (const hit of response.hits.hits) {
+      pushToDocsWithEmbeddings(hit);
+    }
+  }
+
+  // Clear the scroll context
+  if (response._scroll_id) {
+    await esClient.clearScroll({ scroll_id: [response._scroll_id] }).catch((err) => {
+      logger.warn(`Failed to clear scroll context: ${err.message}`);
+    });
+  }
+
+  logger.info(`Retrieved ${docsWithEmbeddings.length} documents with embeddings`);
 
   await esClient.indices.delete({ index: indexName });
 

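For background on why the previous single `search` call broke on large datasets: a request's `from + size` is capped by `index.max_result_window` (10,000 by default), so `size: documents.length` fails once a dataset exceeds that. A sketch of the constraint, assuming default index settings:

```ts
// Assuming default index settings (max_result_window = 10000), a request like
// this is rejected once a dataset has more than 10k cached documents:
await esClient.search({
  index: indexName,
  size: 12000, // > max_result_window, so the search request fails
  fields: ['_inference_fields'],
});

// The scroll-based loop above pages through hits 1000 at a time instead,
// so retrieval is no longer bounded by the result window.
```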
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/indexing/index_documents.ts

Lines changed: 18 additions & 0 deletions
@@ -8,6 +8,7 @@
 import type { ElasticsearchClient, Logger } from '@kbn/core/server';
 import { Readable } from 'stream';
 import { inspect } from 'util';
+import type { BulkHelperOptions } from '@elastic/elasticsearch/lib/helpers';
 import type { HuggingFaceDatasetSpec } from '../types';
 import { ensureDatasetIndexExists } from './ensure_dataset_index_exists';
 
@@ -16,11 +17,16 @@ export async function indexDocuments({
   documents,
   dataset,
   logger,
+  bulkHelperOverrides,
 }: {
   esClient: ElasticsearchClient;
   documents: Array<Record<string, unknown>>;
   dataset: HuggingFaceDatasetSpec;
   logger: Logger;
+  bulkHelperOverrides?: Omit<
+    BulkHelperOptions<Record<string, unknown>>,
+    'datasource' | 'onDocument'
+  >;
 }): Promise<void> {
   const indexName = dataset.index;
 
@@ -31,6 +37,8 @@
 
   logger.debug(`Indexing ${documents.length} into ${indexName}`);
 
+  const startTime = Date.now();
+
   await esClient.helpers.bulk<Record<string, unknown>>({
     datasource: Readable.from(documents),
     index: indexName,
@@ -45,5 +53,15 @@
       logger.warn(`Dropped document: ${doc.status} (${inspect(doc.error, { depth: 5 })})`);
     },
     refresh: 'wait_for',
+    ...bulkHelperOverrides,
   });
+
+  const endTime = Date.now();
+  const elapsedTimeMs = endTime - startTime;
+  const elapsedTimeSec = elapsedTimeMs / 1000;
+  const docsPerSecond = documents.length / elapsedTimeSec;
+
+  logger.debug(
+    `Indexing completed in ${elapsedTimeSec.toFixed(2)}s (${docsPerSecond.toFixed(2)} docs/sec)`
+  );
 }

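A hedged sketch of how a caller might use the new escape hatch. The option names are assumed to come from `BulkHelperOptions`; only `flushBytes` is actually set by this PR, and the other value shown is purely illustrative.

```ts
await indexDocuments({
  esClient,
  documents,
  dataset,
  logger,
  // Any bulk helper option except datasource/onDocument can be overridden.
  bulkHelperOverrides: {
    flushBytes: 1024 * 1024 * 5, // larger flushes, as used in load_hugging_face_datasets.ts below
    flushInterval: 10000,        // assumption: a bulk helper option, not set anywhere in this PR
  },
});
```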
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/load_hugging_face_datasets.ts

Lines changed: 5 additions & 1 deletion
@@ -37,7 +37,7 @@ export async function loadHuggingFaceDatasets({
   logger,
   accessToken,
   datasets = PREDEFINED_HUGGING_FACE_DATASETS,
-  limit = 1000,
+  limit,
   clear = false,
 }: {
   esClient: ElasticsearchClient;
@@ -97,6 +97,10 @@
       documents: docsWithEmbeddings,
       dataset,
       logger,
+      bulkHelperOverrides: {
+        // With embeddings already generated, larger flush size will not overload ELSER inference and improves performance
+        flushBytes: 1024 * 1024 * 5,
+      },
     });
 
     logger.info(`Indexed dataset`);

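Since `limit` no longer defaults to 1000, callers that want the old cap have to opt in explicitly. A minimal sketch, assuming the surrounding arguments (`esClient`, `logger`, `accessToken`, `datasets`) are already in scope:

```ts
// Omitting `limit` now indexes every row of each dataset.
await loadHuggingFaceDatasets({ esClient, logger, accessToken, datasets, clear: true });

// The previous behaviour (at most 1000 documents per dataset) is now opt-in:
await loadHuggingFaceDatasets({ esClient, logger, accessToken, datasets, limit: 1000, clear: true });
```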
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/processing/fetch_rows_from_dataset.ts

Lines changed: 5 additions & 5 deletions
@@ -32,7 +32,7 @@ async function readFromCsv(
   decompressed: Readable,
   dataset: HuggingFaceDatasetSpec,
   logger: Logger,
-  limit: number
+  limit?: number
 ): Promise<Array<Record<string, unknown>>> {
   const docs: Array<Record<string, unknown>> = [];
 
@@ -52,7 +52,7 @@
       const document = convertToDocument(row, dataset);
       docs.push(document);
 
-      if (docs.length >= limit) {
+      if (limit !== undefined && docs.length >= limit) {
        logger.debug(`Reached limit of ${limit} documents`);
        csvStream.destroy();
        resolveWithCleanup(docs);
@@ -94,7 +94,7 @@ async function readFromJson(
   decompressed: Readable,
   dataset: HuggingFaceDatasetSpec,
   logger: Logger,
-  limit: number
+  limit?: number
 ): Promise<Array<Record<string, unknown>>> {
   const docs: Array<Record<string, unknown>> = [];
   const rl = readline.createInterface({ input: decompressed, crlfDelay: Infinity });
@@ -106,7 +106,7 @@
     const document = convertToDocument(raw, dataset);
     docs.push(document);
 
-    if (docs.length >= limit) {
+    if (limit !== undefined && docs.length >= limit) {
       logger.debug(`Reached limit of ${limit} documents`);
       break;
     }
@@ -118,7 +118,7 @@
 export async function fetchRowsFromDataset({
   dataset,
   logger,
-  limit = 1000,
+  limit,
   accessToken,
 }: {
   dataset: HuggingFaceDatasetSpec;

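The same optional-`limit` guard appears in both readers; a hypothetical standalone helper (not part of this PR) makes the semantics explicit:

```ts
// Hypothetical helper mirroring the guard in readFromCsv/readFromJson:
// an undefined limit means "read the entire dataset".
function reachedLimit(count: number, limit?: number): boolean {
  return limit !== undefined && count >= limit;
}

reachedLimit(1500);       // false, no limit, keep reading rows
reachedLimit(1500, 1000); // true, cap reached, stop early
```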
x-pack/platform/packages/shared/onechat/kbn-evals-suite-onechat/README.md

Lines changed: 5 additions & 2 deletions
@@ -58,19 +58,22 @@ node scripts/scout.js start-server --stateful
 
 ### Load OneChat Datasets
 
+**Note**: You need to be a member of the Elastic organization on HuggingFace to access OneChat datasets. Sign up with your `@elastic.co` email address.
+
 Load the required OneChat datasets into Elasticsearch using the HuggingFace dataset loader:
 
 ```bash
 # Load Wix knowledge base and users datasets
 HUGGING_FACE_ACCESS_TOKEN=<your-token> \
 node --require ./src/setup_node_env/index.js \
 x-pack/platform/packages/shared/kbn-ai-tools-cli/scripts/hf_dataset_loader.ts \
-  --datasets onechat/knowledge-base/wix_knowledge_base,onechat/knowledge-base/users \
+  --datasets onechat/knowledge-base/* \
   --clear
   --kibana-url http://elastic:changeme@localhost:5620
 ```
 
-**Note**: You need to be a member of the Elastic organization on HuggingFace to access OneChat datasets. Sign up with your `@elastic.co` email address.
+**Note**: First download of the datasets may take a while, because of the embedding generation for `semantic_text` fields in some of the datasets.
+Once done, documents with embeddings will be cached and re-used on subsequent data loads.
 
 For more information about HuggingFace dataset loading, refer to the [HuggingFace Dataset Loader documentation](../../kbn-ai-tools-cli/src/hf_dataset_loader/README.md).
 
