Commit b0e5474

Open source: local and SaaS embedding options and settings (#675)
1 parent 43176cc commit b0e5474

2 files changed: 60 additions and 0 deletions

open-source/how-to/embedding.mdx

Lines changed: 44 additions & 0 deletions
@@ -43,6 +43,13 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo

 1. Choose an embedding provider that you want to use from among the following allowed providers, and note the provider's ID:

+<Note>
+The following list assumes that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), you are limited to
+the provider and model names that are supported by the Unstructured API.
+[See the list of supported provider names](/api-reference/workflow/workflows#embedder-node).
+</Note>
+
 - The provider ID `bedrock` for [Amazon Bedrock](https://aws.amazon.com/bedrock/). [Learn more](https://python.langchain.com/v0.2/docs/integrations/text_embedding/bedrock/).
 - `huggingface` for [Hugging Face](https://huggingface.co/). [Learn more](https://python.langchain.com/v0.2/docs/integrations/text_embedding/huggingfacehub/).
 - `mixedbread-ai` for [Mixedbread](https://www.mixedbread.ai/). [Learn more](https://www.mixedbread.ai/docs/embeddings/overview).
@@ -65,6 +72,13 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo

 3. For the following embedding providers, you can choose the model that you want to use. If you do choose a model, note the model's name:

+<Note>
+The following list assumes that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), you are limited to
+the model names that are supported by the Unstructured API.
+[See the list of supported model names](/api-reference/workflow/workflows#embedder-node).
+</Note>
+
 - `bedrock`. [Choose a model](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html). No default model is provided. [Learn more about the supported models](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).
 - `huggingface`. [Choose a model](https://huggingface.co/models?other=embeddings), or use the default model [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
 - `mixedbread-ai`. [Choose a model](https://www.mixedbread.ai/docs/embeddings/models), or use the default model [mixedbread-ai/mxbai-embed-large-v1](https://www.mixedbread.ai/docs/embeddings/mxbai-embed-large-v1).
@@ -76,6 +90,12 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo

 4. Note the special settings to connect to the provider:

+<Note>
+The following special settings assume that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), do not include any of these special
+settings. Unstructured uses its own internal special settings when using the specified provider to generate the embeddings.
+</Note>
+
 - For `bedrock`, you'll need an AWS access key value, the corresponding AWS secret access key value, and the corresponding AWS Region identifier. [Get an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html).
 - For `huggingface`, if you use a gated model (a model with special conditions that you must accept before you can use it, or a privately published model), you'll need an HF inference API key value, beginning with `hf_`. [Get an HF inference API key](https://huggingface.co/docs/api-inference/en/quicktour#get-your-api-token). To learn whether your model requires an HF inference API key, see your model provider's documentation.
 - For `mixedbread-ai`, you'll need a Mixedbread API key value. [Get a Mixedbread API key](https://www.mixedbread.ai/dashboard?next=api-keys).
@@ -89,6 +109,18 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo

 <AccordionGroup>
 <Accordion title="Ingest CLI">
+<Note>
+The following options assume that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), do not include any of the following options:
+
+- `--embedding-api-key`
+- `--embedding-aws-access-key-id`
+- `--embedding-aws-secret-access-key`
+- `--embedding-aws-region`
+
+Unstructured uses its own internal settings for these options when using the specified provider to generate the embeddings.
+</Note>
+
 For the [source connector](/open-source/ingestion/source-connectors/overview) command:

 - Set the command's `--embedding-provider` to the provider's ID, for example `huggingface`.
@@ -101,6 +133,18 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo
 - Set `--embedding-aws-region` to the corresponding AWS Region identifier.
 </Accordion>
 <Accordion title="Ingest Python library">
+<Note>
+The following parameters assume that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), do not include any of the following parameters:
+
+- `embedding_api_key`
+- `embedding_aws_access_key_id`
+- `embedding_aws_secret_access_key`
+- `embedding_aws_region`
+
+Unstructured uses its own internal settings for these parameters when using the specified provider to generate the embeddings.
+</Note>
+
 For the [source connector's](/open-source/ingestion/source-connectors/overview) `EmbedderConfig` object:

 - Set the `embedding_provider` parameter to the provider's ID, for example `huggingface`.
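
The rule that the added notes describe (send credentials only when calling the embedding provider directly, and omit them when Unstructured's SaaS does the processing) can be sketched in plain Python. This is an illustrative sketch, not part of the commit and not the Ingest library's own code: the helper names `embedder_settings` and `to_cli_flags` and the `via_saas` switch are hypothetical, and the flag `--embedding-model-name` is an assumption derived from the `embedding_model_name` option by the same underscore-to-hyphen pattern as the flags listed in the diff.

```python
from typing import Dict, List, Optional

# Settings that the notes say to omit when Unstructured's SaaS does the processing.
DIRECT_ONLY = (
    "embedding_api_key",
    "embedding_aws_access_key_id",
    "embedding_aws_secret_access_key",
    "embedding_aws_region",
)


def embedder_settings(
    provider: str,
    model_name: Optional[str] = None,
    *,
    via_saas: bool = False,  # hypothetical switch, not a library parameter
    **direct_settings: str,
) -> Dict[str, str]:
    """Hypothetical helper: collect embedding settings for one run."""
    unknown = sorted(set(direct_settings) - set(DIRECT_ONLY))
    if unknown:
        raise ValueError(f"unexpected settings: {unknown}")
    settings = {"embedding_provider": provider}
    if model_name is not None:
        settings["embedding_model_name"] = model_name
    if not via_saas:
        # Direct provider calls carry the caller's own credentials.
        settings.update(direct_settings)
    # With via_saas=True the credentials are dropped on purpose.
    return settings


def to_cli_flags(settings: Dict[str, str]) -> List[str]:
    """Hypothetical helper: turn setting names into CLI-style flags."""
    flags: List[str] = []
    for name, value in settings.items():
        # Assumes each flag is the setting name with hyphens instead of underscores.
        flags.extend(["--" + name.replace("_", "-"), value])
    return flags


direct = embedder_settings("huggingface", embedding_api_key="hf_placeholder")
saas = embedder_settings("huggingface", via_saas=True, embedding_api_key="hf_placeholder")
print(to_cli_flags(direct))
print(to_cli_flags(saas))
```

Assuming `EmbedderConfig` accepts the parameter names listed in the diff as keyword arguments, the dictionary returned by `embedder_settings` could be unpacked into it; that call is left out here because it depends on the installed library version.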

snippets/ingest-configuration-shared/embedding-configuration.mdx

Lines changed: 16 additions & 0 deletions
@@ -1,5 +1,21 @@
 A common embedding configuration is a critical component that allows for dynamic selection of embedders and their associated parameters to create vectors from data. This configuration provides the flexibility to choose from various embedding models and fine-tune parameters to optimize the quality and characteristics of the resulting vectors. It enables users to tailor the embedding process to the specific needs of their data and downstream applications, ensuring that the generated vectors effectively capture semantic relationships and contextual information within the dataset.

+The core [Unstructured](https://github.com/Unstructured-IO/unstructured) open source library does not enable the generation of embeddings by default. (However you can generate embeddings as a separate step manually.
+[Learn how](/open-source/core-functionality/embedding).)
+
+You can configure the Unstructured CLI and Unstructured Ingest Python library to generate embeddings by specifying the
+`embedding_provider`, `embedding_api_key`, and `embedding_model_name` options (and, for Amazon Bedrock, additional options), as follows.
+You must provide your own API key for the specified embedding provider. To get this API key, you must first create an account with that
+provider and set up billing directly with them. You are responsible for all costs associated with using that provider.
+
+Calls to the Unstructured CLI or Unstructured Ingest Python library that are routed to Unstructured's software-as-a-service (SaaS)
+for processing (for example, by specifying an Unstructured API key and an Unstructured SaaS URL) require an Unstructured account for billing purposes. Unstructured's costs for generating
+embeddings are already included in its account pricing plans. To
+generate embeddings, you must specify the `embedding_provider` and `embedding_model_name` configuration options as follows. Unstructured uses its own internal API key
+when using the specified provider to generate the embeddings. These
+`embedding_provider` and `embedding_model_name` options are limited only to the provider and model names that are supported by the Unstructured API.
+[See the list of supported provider and model names](/api-reference/workflow/workflows#embedder-node). [Learn how to specify these options](/open-source/how-to/embedding).
+
 ## Configs

 * `embedding_provider`: The embedding provider to use while doing embedding. Available values include `bedrock`, `azure-openai`, `huggingface`, `mixedbread-ai`, `octoai`, `openai`, `togetherai`, `vertexai`, and `voyageai`.
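
As an illustration of the `embedding_provider` entry above, the following minimal sketch checks a provider ID against the values that the entry lists before a run is configured. It is not part of the commit: the names `DOCUMENTED_PROVIDERS` and `check_provider` are hypothetical, and because the entry says the available values "include" these IDs, the set may not be exhaustive. Calls routed to Unstructured's SaaS are limited to a separate list that this diff only links to, so it is not reproduced here.

```python
# Provider IDs copied from the `embedding_provider` entry in the diff above.
# The entry says the values "include" these, so treat the set as possibly incomplete.
DOCUMENTED_PROVIDERS = frozenset(
    {
        "bedrock",
        "azure-openai",
        "huggingface",
        "mixedbread-ai",
        "octoai",
        "openai",
        "togetherai",
        "vertexai",
        "voyageai",
    }
)


def check_provider(provider: str) -> str:
    """Hypothetical helper: return the ID unchanged if it is a documented value."""
    if provider not in DOCUMENTED_PROVIDERS:
        raise ValueError(
            f"{provider!r} is not one of the documented provider IDs: "
            f"{sorted(DOCUMENTED_PROVIDERS)}"
        )
    return provider


print(check_provider("huggingface"))
```

Failing early on a misspelled ID (for example `hugging-face`) keeps the error close to the configuration step instead of surfacing later in the run.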
