
Commit aa6af6a

[HF Data Loader] Agent builder dataset loading improvements (#236974)
Closes: elastic/search-team#10957
Closes: #237126

## Summary

Improving the HuggingFace data loader script for Agent Builder evaluation with the following:

- Tweaked bulk API calls, bringing a 30x performance improvement when indexing datasets with pre-computed embeddings.
  - Multiple experiments led to a simple approach: ensure `_inference_fields` get picked up from the cached documents with embeddings, then increase `flushBytes` on the Bulk API (since there's no risk of overloading the ELSER endpoint(s)).
  - First-time embedding generation now takes ~20 mins.
  - Once generated (and cached), indexing of all Agent Builder datasets takes less than 1 minute.
- Fixed the failures for dataset loading when there are 10k+ records in the dataset.
- Removed the default limit of 1000 documents per dataset. Limits now only apply when specified; by default the entire dataset is indexed.
- Added wildcard imports that allow developers to load multiple datasets inside a HuggingFace repository directory using wildcards (`*`). For example, `onechat/knowledge-base/*` will load all datasets currently used for evaluation (from [here](https://huggingface.co/datasets/elastic/OneChatAgent/tree/main/knowledge-base)).

### Testing

- Smoke tests of the loader with/without limits.
- Loaded all of the Agent Builder datasets at once using a wildcard import, with the following command:

  ```
  HUGGING_FACE_ACCESS_TOKEN=<token> node --require ./src/setup_node_env/index.js x-pack/platform/packages/shared/kbn-ai-tools-cli/scripts/hf_dataset_loader.ts --datasets "onechat/knowledge-base/*" --clear --debug
  ```

  - Confirmed that all 14 datasets are loaded.
  - Confirmed that datasets with 10k+ records load correctly.
- Tested the performance improvement from the bulk API tweaks, using the computationally heaviest dataset (`wix_knowledge_base`).
  - Before:

    ```bash
    info Retrieved 6222 documents with embeddings
    debg Indexing 6222 documents with embeddings
    debg Indexing 6222 into wix_knowledge_base
    debg Indexing completed in 550.61s (11.30 docs/sec)
    info Indexed dataset
    ```

  - After:

    ```bash
    debg Indexing 6222 documents with embeddings
    debg Indexing 6222 into wix_knowledge_base
    debg Indexing completed in 15.82s (393.37 docs/sec)
    info Indexed dataset
    ```
1 parent 500b1e5 commit aa6af6a

File tree

8 files changed: +128 -27 lines changed


x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/README.md

Lines changed: 13 additions & 1 deletion
@@ -27,7 +27,7 @@ node --require ./src/setup_node_env/index.js \
 | Flag | Type | Description |
 | -------------- | --------- | ----------------------------------------------------------------------------------------------------- |
 | `--datasets` | `string` | Comma-separated list of dataset **names** to load. Omit the flag to load **all** predefined datasets. |
-| `--limit` | `number` | Max docs per dataset (handy while testing). Defaults to 1k. |
+| `--limit` | `number` | Max docs per dataset (handy while testing). When omitted, all rows will be loaded. |
 | `--clear` | `boolean` | Delete the target index **before** indexing. Defaults to `false`. |
 | `--kibana-url` | `string` | Kibana URL to connect to (bypasses auto-discovery when provided). |
 
@@ -48,6 +48,9 @@ The loader also supports **OneChat datasets** from the `elastic/OneChatAgent` re
 Use the format `onechat/<directory>/<dataset>` to load OneChat datasets:
 
 ```bash
+# Load all OneChat datasets
+--datasets onechat/knowledge-base/*
+
 # Load a single OneChat dataset
 --datasets onechat/knowledge-base/wix_knowledge_base
 
@@ -78,3 +81,12 @@ Run the loader without `--datasets` to see all available OneChat and regular Hug
 ## Disabling local cache
 
 Set the environment variable `DISABLE_KBN_CLI_CACHE=1` to force fresh downloads instead of using the on-disk cache.
+
+## Clearing the cache
+
+Remove the downloaded files and cached documents by deleting the cache directories:
+
+```bash
+rm -rf data/hugging_face_dataset_rows
+rm -rf data/hugging_face_dataset_embeddings
+```

x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/datasets/config.ts

Lines changed: 33 additions & 3 deletions
@@ -7,7 +7,12 @@
 
 import type { Logger } from '@kbn/core/server';
 import type { HuggingFaceDatasetSpec } from '../types';
-import { createOneChatDatasetSpec, isOneChatDataset } from './onechat';
+import {
+  createOneChatDatasetSpec,
+  isOneChatDataset,
+  isOneChatWildcard,
+  listOneChatDatasets,
+} from './onechat';
 
 const BEIR_NAMES = [
   'trec-covid',
@@ -89,15 +94,40 @@ export const PREDEFINED_HUGGING_FACE_DATASETS: HuggingFaceDatasetSpec[] = [
 ];
 
 /**
- * Get dataset specifications, including dynamically generated OneChat datasets
+ * Expands wildcard dataset patterns into concrete dataset names
+ */
+async function expandDatasetNames(
+  datasetNames: string[],
+  accessToken: string,
+  logger: Logger
+): Promise<string[]> {
+  const expansions = await Promise.all(
+    datasetNames.map(async (datasetName) => {
+      if (isOneChatWildcard(datasetName)) {
+        const directory = datasetName.split('/')[1];
+        const datasetsForDirectory = await listOneChatDatasets(directory, accessToken, logger);
+        return datasetsForDirectory;
+      }
+      return [datasetName];
+    })
+  );
+
+  return expansions.flat();
+}
+
+/**
+ * Gets dataset specifications, including dynamically generated OneChat datasets
  */
 export async function getDatasetSpecs(
   accessToken: string,
   logger: Logger,
   datasetNames: string[]
 ): Promise<HuggingFaceDatasetSpec[]> {
+  // First, expand any wildcards into concrete dataset names
+  const expandedNames = await expandDatasetNames(datasetNames, accessToken, logger);
+
   const specs: HuggingFaceDatasetSpec[] = [];
-  for (const name of datasetNames) {
+  for (const name of expandedNames) {
     if (isOneChatDataset(name)) {
       const spec = await createOneChatDatasetSpec(name, accessToken, logger);
       specs.push(spec);

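For orientation, a minimal usage sketch of the expanded API: the wildcard entry is expanded via `listOneChatDatasets`, while concrete names pass through unchanged. The call site and import path are illustrative, not part of this change.

```ts
import type { Logger } from '@kbn/core/server';
import { getDatasetSpecs } from './config'; // hypothetical relative import

// Illustrative only: resolve every dataset under onechat/knowledge-base at once.
async function resolveSpecs(accessToken: string, logger: Logger) {
  const specs = await getDatasetSpecs(accessToken, logger, ['onechat/knowledge-base/*']);
  logger.info(`Resolved ${specs.length} dataset specs`);
  return specs;
}
```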
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/datasets/onechat.ts

Lines changed: 4 additions & 0 deletions
@@ -131,6 +131,10 @@ export function isOneChatDataset(datasetName: string): boolean {
   return datasetName.startsWith('onechat/') && datasetName.split('/').length === 3;
 }
 
+export function isOneChatWildcard(datasetName: string): boolean {
+  return isOneChatDataset(datasetName) && datasetName.endsWith('/*');
+}
+
 /**
  * Lists all available OneChat datasets for a specific directory
  */

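A quick illustration of the new predicate's behaviour; the expected results follow directly from the `startsWith`/`split`/`endsWith` checks above.

```ts
import { isOneChatWildcard } from './onechat';

isOneChatWildcard('onechat/knowledge-base/*');                  // true, wildcard over a directory
isOneChatWildcard('onechat/knowledge-base/wix_knowledge_base'); // false, concrete dataset name
isOneChatWildcard('knowledge-base/*');                          // false, missing the onechat/ prefix
```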
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/indexing/get_embeddings.ts

Lines changed: 45 additions & 15 deletions
@@ -32,21 +32,51 @@ export async function getEmbeddings({
     logger,
   });
 
-  const docsWithEmbeddings = await esClient
-    .search<Record<string, any>>({
-      index: indexName,
-      size: documents.length,
-      fields: ['_inference_fields'],
-    })
-    .then((response) =>
-      response.hits.hits.map((hit) => {
-        const source = hit._source!;
-        Object.entries(source._inference_fields ?? {}).forEach(([fieldName, config]) => {
-          delete (config as Record<string, any>).inference.model_settings.service;
-        });
-        return { ...source, _id: hit._id };
-      })
-    );
+  const docsWithEmbeddings: Array<Record<string, unknown>> = [];
+  const scrollDuration = '1m';
+  const scrollSize = 1000;
+
+  // Use scroll API to handle large datasets with 10k+ documents.
+  let response = await esClient.search<Record<string, any>>({
+    index: indexName,
+    scroll: scrollDuration,
+    size: scrollSize,
+    fields: ['_inference_fields'],
+    query: {
+      match_all: {},
+    },
+  });
+
+  const pushToDocsWithEmbeddings = (hit: Record<string, any>) => {
+    const source = hit._source!;
+    docsWithEmbeddings.push({ ...source, _id: hit._id });
+  };
+
+  // Process initial batch
+  for (const hit of response.hits.hits) {
+    pushToDocsWithEmbeddings(hit);
+  }
+
+  // Continue scrolling through all results
+  while (response.hits.hits.length > 0) {
+    response = await esClient.scroll({
+      scroll_id: response._scroll_id!,
+      scroll: scrollDuration,
+    });
+
+    for (const hit of response.hits.hits) {
+      pushToDocsWithEmbeddings(hit);
+    }
+  }
+
+  // Clear the scroll context
+  if (response._scroll_id) {
+    await esClient.clearScroll({ scroll_id: [response._scroll_id] }).catch((err) => {
+      logger.warn(`Failed to clear scroll context: ${err.message}`);
+    });
+  }
+
+  logger.info(`Retrieved ${docsWithEmbeddings.length} documents with embeddings`);
 
   await esClient.indices.delete({ index: indexName });
 

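For background on why the previous single `search` call broke on large datasets: a request's `from + size` is capped by `index.max_result_window` (10,000 by default), so `size: documents.length` fails once a dataset exceeds that. A sketch of the constraint, assuming default index settings:

```ts
// Assuming default index settings (max_result_window = 10000), a request like
// this is rejected once a dataset has more than 10k cached documents:
await esClient.search({
  index: indexName,
  size: 12000, // > max_result_window, so the search request fails
  fields: ['_inference_fields'],
});

// The scroll-based loop above pages through hits 1000 at a time instead,
// so retrieval is no longer bounded by the result window.
```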
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/indexing/index_documents.ts

Lines changed: 18 additions & 0 deletions
@@ -8,6 +8,7 @@
 import type { ElasticsearchClient, Logger } from '@kbn/core/server';
 import { Readable } from 'stream';
 import { inspect } from 'util';
+import type { BulkHelperOptions } from '@elastic/elasticsearch/lib/helpers';
 import type { HuggingFaceDatasetSpec } from '../types';
 import { ensureDatasetIndexExists } from './ensure_dataset_index_exists';
 
@@ -16,11 +17,16 @@ export async function indexDocuments({
   documents,
   dataset,
   logger,
+  bulkHelperOverrides,
 }: {
   esClient: ElasticsearchClient;
   documents: Array<Record<string, unknown>>;
   dataset: HuggingFaceDatasetSpec;
   logger: Logger;
+  bulkHelperOverrides?: Omit<
+    BulkHelperOptions<Record<string, unknown>>,
+    'datasource' | 'onDocument'
+  >;
 }): Promise<void> {
   const indexName = dataset.index;
 
@@ -31,6 +37,8 @@
 
   logger.debug(`Indexing ${documents.length} into ${indexName}`);
 
+  const startTime = Date.now();
+
   await esClient.helpers.bulk<Record<string, unknown>>({
     datasource: Readable.from(documents),
     index: indexName,
@@ -45,5 +53,15 @@
       logger.warn(`Dropped document: ${doc.status} (${inspect(doc.error, { depth: 5 })})`);
     },
     refresh: 'wait_for',
+    ...bulkHelperOverrides,
   });
+
+  const endTime = Date.now();
+  const elapsedTimeMs = endTime - startTime;
+  const elapsedTimeSec = elapsedTimeMs / 1000;
+  const docsPerSecond = documents.length / elapsedTimeSec;
+
+  logger.debug(
+    `Indexing completed in ${elapsedTimeSec.toFixed(2)}s (${docsPerSecond.toFixed(2)} docs/sec)`
+  );
 }

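A hedged sketch of how a caller might use the new escape hatch. The option names are assumed to come from `BulkHelperOptions`; only `flushBytes` is actually set by this PR, and the other value shown is purely illustrative.

```ts
await indexDocuments({
  esClient,
  documents,
  dataset,
  logger,
  // Any bulk helper option except datasource/onDocument can be overridden.
  bulkHelperOverrides: {
    flushBytes: 1024 * 1024 * 5, // larger flushes, as used in load_hugging_face_datasets.ts below
    flushInterval: 10000,        // assumption: a bulk helper option, not set anywhere in this PR
  },
});
```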
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/load_hugging_face_datasets.ts

Lines changed: 5 additions & 1 deletion
@@ -37,7 +37,7 @@ export async function loadHuggingFaceDatasets({
   logger,
   accessToken,
   datasets = PREDEFINED_HUGGING_FACE_DATASETS,
-  limit = 1000,
+  limit,
   clear = false,
 }: {
   esClient: ElasticsearchClient;
@@ -97,6 +97,10 @@
       documents: docsWithEmbeddings,
       dataset,
       logger,
+      bulkHelperOverrides: {
+        // With embeddings already generated, larger flush size will not overload ELSER inference and improves performance
+        flushBytes: 1024 * 1024 * 5,
+      },
     });
 
     logger.info(`Indexed dataset`);

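Since `limit` no longer defaults to 1000, callers that want the old cap have to opt in explicitly. A minimal sketch, assuming the surrounding arguments (`esClient`, `logger`, `accessToken`, `datasets`) are already in scope:

```ts
// Omitting `limit` now indexes every row of each dataset.
await loadHuggingFaceDatasets({ esClient, logger, accessToken, datasets, clear: true });

// The previous behaviour (at most 1000 documents per dataset) is now opt-in:
await loadHuggingFaceDatasets({ esClient, logger, accessToken, datasets, limit: 1000, clear: true });
```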
x-pack/platform/packages/shared/kbn-ai-tools-cli/src/hf_dataset_loader/processing/fetch_rows_from_dataset.ts

Lines changed: 5 additions & 5 deletions
@@ -32,7 +32,7 @@ async function readFromCsv(
   decompressed: Readable,
   dataset: HuggingFaceDatasetSpec,
   logger: Logger,
-  limit: number
+  limit?: number
 ): Promise<Array<Record<string, unknown>>> {
   const docs: Array<Record<string, unknown>> = [];
 
@@ -52,7 +52,7 @@
       const document = convertToDocument(row, dataset);
       docs.push(document);
 
-      if (docs.length >= limit) {
+      if (limit !== undefined && docs.length >= limit) {
        logger.debug(`Reached limit of ${limit} documents`);
        csvStream.destroy();
        resolveWithCleanup(docs);
@@ -94,7 +94,7 @@ async function readFromJson(
   decompressed: Readable,
   dataset: HuggingFaceDatasetSpec,
   logger: Logger,
-  limit: number
+  limit?: number
 ): Promise<Array<Record<string, unknown>>> {
   const docs: Array<Record<string, unknown>> = [];
   const rl = readline.createInterface({ input: decompressed, crlfDelay: Infinity });
@@ -106,7 +106,7 @@
     const document = convertToDocument(raw, dataset);
     docs.push(document);
 
-    if (docs.length >= limit) {
+    if (limit !== undefined && docs.length >= limit) {
       logger.debug(`Reached limit of ${limit} documents`);
       break;
     }
@@ -118,7 +118,7 @@
 export async function fetchRowsFromDataset({
   dataset,
   logger,
-  limit = 1000,
+  limit,
   accessToken,
 }: {
   dataset: HuggingFaceDatasetSpec;

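The same optional-`limit` guard appears in both readers; a hypothetical standalone helper (not part of this PR) makes the semantics explicit:

```ts
// Hypothetical helper mirroring the guard in readFromCsv/readFromJson:
// an undefined limit means "read the entire dataset".
function reachedLimit(count: number, limit?: number): boolean {
  return limit !== undefined && count >= limit;
}

reachedLimit(1500);       // false, no limit, keep reading rows
reachedLimit(1500, 1000); // true, cap reached, stop early
```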
x-pack/platform/packages/shared/onechat/kbn-evals-suite-onechat/README.md

Lines changed: 5 additions & 2 deletions
@@ -58,19 +58,22 @@ node scripts/scout.js start-server --stateful
 
 ### Load OneChat Datasets
 
+**Note**: You need to be a member of the Elastic organization on HuggingFace to access OneChat datasets. Sign up with your `@elastic.co` email address.
+
 Load the required OneChat datasets into Elasticsearch using the HuggingFace dataset loader:
 
 ```bash
 # Load Wix knowledge base and users datasets
 HUGGING_FACE_ACCESS_TOKEN=<your-token> \
 node --require ./src/setup_node_env/index.js \
 x-pack/platform/packages/shared/kbn-ai-tools-cli/scripts/hf_dataset_loader.ts \
-  --datasets onechat/knowledge-base/wix_knowledge_base,onechat/knowledge-base/users \
+  --datasets onechat/knowledge-base/* \
   --clear
   --kibana-url http://elastic:changeme@localhost:5620
 ```
 
-**Note**: You need to be a member of the Elastic organization on HuggingFace to access OneChat datasets. Sign up with your `@elastic.co` email address.
+**Note**: First download of the datasets may take a while, because of the embedding generation for `semantic_text` fields in some of the datasets.
+Once done, documents with embeddings will be cached and re-used on subsequent data loads.
 
 For more information about HuggingFace dataset loading, refer to the [HuggingFace Dataset Loader documentation](../../kbn-ai-tools-cli/src/hf_dataset_loader/README.md).
 
