open-source/how-to/embedding.mdx
44 lines changed: 44 additions & 0 deletions
@@ -43,6 +43,13 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo

 1. Choose an embedding provider that you want to use from among the following allowed providers, and note the provider's ID:

+<Note>
+The following list assumes that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), you are limited to
+the provider and model names that are supported by the Unstructured API.
+[See the list of supported provider names](/api-reference/workflow/workflows#embedder-node).
+</Note>
+
 - `bedrock` for [Amazon Bedrock](https://aws.amazon.com/bedrock/). [Learn more](https://python.langchain.com/v0.2/docs/integrations/text_embedding/bedrock/).
 - `huggingface` for [Hugging Face](https://huggingface.co/). [Learn more](https://python.langchain.com/v0.2/docs/integrations/text_embedding/huggingfacehub/).
 - `mixedbread-ai` for [Mixedbread](https://www.mixedbread.ai/). [Learn more](https://www.mixedbread.ai/docs/embeddings/overview).
@@ -65,6 +72,13 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo

 3. For the following embedding providers, you can choose the model that you want to use. If you do choose a model, note the model's name:

+<Note>
+The following list assumes that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), you are limited to
+the model names that are supported by the Unstructured API.
+[See the list of supported model names](/api-reference/workflow/workflows#embedder-node).
+</Note>
+
 - `bedrock`. [Choose a model](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html). No default model is provided. [Learn more about the supported models](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).
 - `huggingface`. [Choose a model](https://huggingface.co/models?other=embeddings), or use the default model [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
 - `mixedbread-ai`. [Choose a model](https://www.mixedbread.ai/docs/embeddings/models), or use the default model [mixedbread-ai/mxbai-embed-large-v1](https://www.mixedbread.ai/docs/embeddings/mxbai-embed-large-v1).
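The model choice in step 3 can be sketched as a small lookup. This is only an illustration of the rule stated in the visible list (two providers have a default model, `bedrock` has none); the names `DEFAULT_MODELS`, `PROVIDERS_WITHOUT_DEFAULT`, and `resolve_model_name` are hypothetical and are not part of any Unstructured library.

```python
# Hypothetical helper, not part of Unstructured's API.
# Defaults are copied from the visible part of the list in step 3.
DEFAULT_MODELS = {
    "huggingface": "sentence-transformers/all-MiniLM-L6-v2",
    "mixedbread-ai": "mixedbread-ai/mxbai-embed-large-v1",
}

# Providers for which the list says no default model is provided.
PROVIDERS_WITHOUT_DEFAULT = {"bedrock"}


def resolve_model_name(provider, model_name=None):
    """Return the model name to note for a provider.

    An explicit choice always wins. Otherwise fall back to the documented
    default, or fail for providers that have no default.
    """
    if model_name:
        return model_name
    if provider in PROVIDERS_WITHOUT_DEFAULT:
        raise ValueError(f"{provider!r} has no default model; choose one explicitly")
    # None means this sketch does not know a default for the provider.
    return DEFAULT_MODELS.get(provider)
```

Providers outside the visible part of the list return `None` here, which should be read as "not covered by this sketch" rather than "no default exists".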
@@ -76,6 +90,12 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo

 4. Note the special settings to connect to the provider:

+<Note>
+The following special settings assume that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), do not include any of these special
+settings. Unstructured uses its own internal special settings when using the specified provider to generate the embeddings.
+</Note>
+
 - For `bedrock`, you'll need an AWS access key value, the corresponding AWS secret access key value, and the corresponding AWS Region identifier. [Get an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html).
 - For `huggingface`, if you use a gated model (a model with special conditions that you must accept before you can use it, or a privately published model), you'll need an HF inference API key value, beginning with `hf_`. [Get an HF inference API key](https://huggingface.co/docs/api-inference/en/quicktour#get-your-api-token). To learn whether your model requires an HF inference API key, see your model provider's documentation.
 - For `mixedbread-ai`, you'll need a Mixedbread API key value. [Get a Mixedbread API key](https://www.mixedbread.ai/dashboard?next=api-keys).
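As a rough illustration of step 4, the per-provider requirements in the visible list can be written as a table and checked before a run. The setting names reuse the Ingest Python library parameter names that appear elsewhere in this file (`embedding_api_key`, `embedding_aws_access_key_id`, and so on). The helper `missing_settings` and its `gated_model` flag are hypothetical, and only the three providers shown in this hunk are covered.

```python
# Hypothetical pre-flight check, not part of Unstructured's API.
# Covers only the three providers visible in this part of the list.
REQUIRED_SETTINGS = {
    "bedrock": [
        "embedding_aws_access_key_id",
        "embedding_aws_secret_access_key",
        "embedding_aws_region",
    ],
    # An HF inference API key is needed only for gated models (see below).
    "huggingface": [],
    "mixedbread-ai": ["embedding_api_key"],
}


def missing_settings(provider, settings, gated_model=False):
    """Return the names of required settings that are absent or empty.

    Applies only when calling the embedding provider directly; when routing
    through Unstructured's SaaS, none of these settings should be supplied.
    """
    required = list(REQUIRED_SETTINGS.get(provider, []))
    if provider == "huggingface" and gated_model:
        required.append("embedding_api_key")
    return [name for name in required if not settings.get(name)]
```

For example, `missing_settings("bedrock", {"embedding_aws_region": "us-east-1"})` reports the two AWS key settings as missing.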
@@ -89,6 +109,18 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo

 <AccordionGroup>
 <Accordion title="Ingest CLI">
+<Note>
+The following options assume that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), do not include any of the following options:
+
+- `--embedding-api-key`
+- `--embedding-aws-access-key-id`
+- `--embedding-aws-secret-access-key`
+- `--embedding-aws-region`
+
+Unstructured uses its own internal settings for these options when using the specified provider to generate the embeddings.
+</Note>
+
 For the [source connector](/open-source/ingestion/source-connectors/overview) command:

 - Set the command's `--embedding-provider` to the provider's ID, for example `huggingface`.
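The rule in the note above (pass credential options only when calling the provider directly) can be sketched as a function that assembles the embedding-related part of a command line. Only the five option names that appear in this file are used. The function `embedding_cli_args` and its `via_unstructured_saas` argument are hypothetical, and the rest of the source connector command is deliberately left out.

```python
# Hypothetical helper for assembling embedding options; the option names
# come from this file, everything else is illustrative.
CREDENTIAL_FLAGS = (
    "--embedding-api-key",
    "--embedding-aws-access-key-id",
    "--embedding-aws-secret-access-key",
    "--embedding-aws-region",
)


def embedding_cli_args(provider, credentials=None, via_unstructured_saas=False):
    """Build the embedding-related arguments for a source connector command.

    `credentials` maps a credential flag to its value. When the call is
    routed through Unstructured's SaaS, all credential flags are dropped.
    """
    args = ["--embedding-provider", provider]
    if via_unstructured_saas:
        return args
    for flag in CREDENTIAL_FLAGS:
        value = (credentials or {}).get(flag)
        if value:
            args += [flag, value]
    return args
```

The returned list is meant to be appended to an existing argument list (for example, one passed to `subprocess.run`); any option for the model name is omitted here because its CLI spelling is not shown in this part of the file.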
@@ -101,6 +133,18 @@ To use the Ingest CLI or Ingest Python library to generate embeddings, do the fo
 - Set `--embedding-aws-region` to the corresponding AWS Region identifier.
 </Accordion>
 <Accordion title="Ingest Python library">
+<Note>
+The following parameters assume that you are calling the embedding provider directly. If you are calling Unstructured's software-as-a-service (SaaS)
+for processing instead (for example, by specifying an Unstructured API key and an Unstructured SaaS URL), do not include any of the following parameters:
+
+- `embedding_api_key`
+- `embedding_aws_access_key_id`
+- `embedding_aws_secret_access_key`
+- `embedding_aws_region`
+
+Unstructured uses its own internal settings for these parameters when using the specified provider to generate the embeddings.
+</Note>
+
 For the [source connector's](/open-source/ingestion/source-connectors/overview) `EmbedderConfig` object:

 - Set the `embedding_provider` parameter to the provider's ID, for example `huggingface`.
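For the Python library, the same rule amounts to filtering the keyword arguments before they reach `EmbedderConfig`. The sketch below works on a plain dictionary and does not import the library, so it makes no claim about the `EmbedderConfig` constructor beyond the parameter names listed in the note. The helper `embedder_kwargs` is hypothetical.

```python
# Hypothetical helper; the parameter names are the ones listed in the note.
CREDENTIAL_PARAMETERS = frozenset(
    {
        "embedding_api_key",
        "embedding_aws_access_key_id",
        "embedding_aws_secret_access_key",
        "embedding_aws_region",
    }
)


def embedder_kwargs(params, via_unstructured_saas):
    """Return a copy of `params` that is safe to pass on.

    When routing through Unstructured's SaaS, the credential parameters are
    removed; otherwise the parameters are returned unchanged.
    """
    if not via_unstructured_saas:
        return dict(params)
    return {
        name: value
        for name, value in params.items()
        if name not in CREDENTIAL_PARAMETERS
    }
```

The result could then be unpacked into the configuration object, for example `EmbedderConfig(**embedder_kwargs(params, via_unstructured_saas=True))`; the import path for `EmbedderConfig` depends on the installed library version and is not assumed here.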
snippets/ingest-configuration-shared/embedding-configuration.mdx
16 lines changed: 16 additions & 0 deletions
@@ -1,5 +1,21 @@
 A common embedding configuration is a critical component that allows for dynamic selection of embedders and their associated parameters to create vectors from data. This configuration provides the flexibility to choose from various embedding models and fine-tune parameters to optimize the quality and characteristics of the resulting vectors. It enables users to tailor the embedding process to the specific needs of their data and downstream applications, ensuring that the generated vectors effectively capture semantic relationships and contextual information within the dataset.

+The core [Unstructured](https://github.com/Unstructured-IO/unstructured) open source library does not enable the generation of embeddings by default. However, you can generate embeddings manually as a separate step.
+
+You can configure the Unstructured CLI and Unstructured Ingest Python library to generate embeddings by specifying the
+`embedding_provider`, `embedding_api_key`, and `embedding_model_name` options (and, for Amazon Bedrock, additional options), as follows.
+You must provide your own API key for the specified embedding provider. To get this API key, you must first create an account with that
+provider and set up billing directly with them. You are responsible for all costs associated with using that provider.
+
+Calls to the Unstructured CLI or Unstructured Ingest Python library that are routed to Unstructured's software-as-a-service (SaaS)
+for processing (for example, by specifying an Unstructured API key and an Unstructured SaaS URL) require an Unstructured account for billing purposes. Unstructured's costs for generating
+embeddings are already included in its account pricing plans. To
+generate embeddings, you must specify the `embedding_provider` and `embedding_model_name` configuration options as follows. Unstructured uses its own internal API key
+when using the specified provider to generate the embeddings. These
+`embedding_provider` and `embedding_model_name` options are limited only to the provider and model names that are supported by the Unstructured API.
+[See the list of supported provider and model names](/api-reference/workflow/workflows#embedder-node). [Learn how to specify these options](/open-source/how-to/embedding).
+
 ## Configs

 * `embedding_provider`: The embedding provider to use while doing embedding. Available values include `bedrock`, `azure-openai`, `huggingface`, `mixedbread-ai`, `octoai`, `openai`, `togetherai`, `vertexai`, and `voyageai`.
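A quick way to catch a mistyped `embedding_provider` value is to compare it against the values listed above before starting a run. This is a sketch only: the list says available values "include" these nine, so it may not be exhaustive, and the function name `check_provider` is hypothetical.

```python
import difflib

# Values copied from the `embedding_provider` entry above. The entry says
# "include", so treat this tuple as possibly incomplete.
LISTED_PROVIDERS = (
    "bedrock",
    "azure-openai",
    "huggingface",
    "mixedbread-ai",
    "octoai",
    "openai",
    "togetherai",
    "vertexai",
    "voyageai",
)


def check_provider(provider):
    """Return `provider` if it is listed, else raise with a suggestion."""
    if provider in LISTED_PROVIDERS:
        return provider
    # Suggest the closest listed value, if any is reasonably similar.
    matches = difflib.get_close_matches(provider, LISTED_PROVIDERS, n=1)
    hint = f" Did you mean {matches[0]!r}?" if matches else ""
    raise ValueError(f"Unknown embedding provider {provider!r}.{hint}")
```

The comparison is exact and case-sensitive on purpose, because this file does not say whether provider IDs are normalized.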