
Commit d467922

feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing:
- Tested with the pinecone destination connector on [this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693) job run.

---------

Co-authored-by: Matt Robinson <[email protected]>
1 parent 887e6c9 commit d467922

File tree

20 files changed: +24484 −4 lines

CHANGELOG.md

Lines changed: 3 additions & 2 deletions
@@ -1,11 +1,12 @@
-## 0.13.0-dev13
+## 0.13.0-dev14
 
-### Enhancements
+### Enhancements
 
 * **Add `.metadata.is_continuation` to text-split chunks.** `.metadata.is_continuation=True` is added to second-and-later chunks formed by text-splitting an oversized `Table` element but not to their counterpart `Text` element splits. This indicator is added for `CompositeElement` to allow text-split continuation chunks to be identified by downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
 * **Add `compound_structure_acc` metric to table eval.** Add a new property to `unstructured.metrics.table_eval.TableEvaluation`: `composite_structure_acc`, which is computed from the element-level row and column index and content accuracy scores.
 * **Add `.metadata.orig_elements` to chunks.** `.metadata.orig_elements: list[Element]` is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful, for example, to recover metadata fields that cannot be consolidated to a single value for a chunk, like `page_number`, `coordinates`, and `image_base64`.
 * **Add `--include_orig_elements` option to Ingest CLI.** By default, when chunking, the original elements used to form each chunk are added to `chunk.metadata.orig_elements` for each chunk. The `include_orig_elements` parameter allows the user to turn off this behavior to produce a smaller payload when they don't need this metadata.
+* **Add Google VertexAI embedder.** Adds VertexAI embeddings to support embedding via Google Vertex AI.
 
 ### Features
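As context for the chunking entries in this changelog hunk, here is a minimal sketch (not part of this diff) of how the new chunk metadata can be inspected. It assumes the default behavior the changelog describes, where `orig_elements` is populated unless turned off via `include_orig_elements`:

```python
# Hypothetical illustration, not part of this commit: inspect the chunk
# metadata fields described in the changelog entries above.
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import Text

# An oversized element that will be text-split into several chunks.
elements = [Text("Some long sentence. " * 200)]
chunks = chunk_by_title(elements, max_characters=500)

for chunk in chunks:
    # Per the changelog, second-and-later splits carry is_continuation=True,
    # and orig_elements holds the elements each chunk was formed from.
    print(chunk.metadata.is_continuation, len(chunk.metadata.orig_elements or []))
```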

docs/source/core/embedding.rst

Lines changed: 53 additions & 0 deletions
@@ -171,6 +171,59 @@ To obtain an api key, visit: https://octo.ai/docs/getting-started/how-to-create-
     query = "This is the query"
     query_embedding = embedding_encoder.embed_query(query=query)
 
+    [print(e.embeddings, e) for e in elements]
+    print(query_embedding, query)
+    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
+
+``VertexAIEmbeddingEncoder``
+----------------------------
+
+The ``VertexAIEmbeddingEncoder`` class connects to GCP Vertex AI to obtain embeddings for pieces of text.
+
+``embed_documents`` will receive a list of Elements, and return an updated list which
+includes the ``embeddings`` attribute for each Element.
+
+``embed_query`` will receive a query as a string, and return a list of floats which is the
+embedding vector for the given query string.
+
+``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
+embedding vector obtained via this class.
+
+``is_unit_vector`` is a metadata property that denotes if embedding vectors obtained via
+this class are unit vectors.
+
+The following code block shows an example of how to use ``VertexAIEmbeddingEncoder``. You will
+see the updated elements list (with the ``embeddings`` attribute included for each element),
+the embedding vector for the query string, and some metadata properties about the embedding model.
+
+To use Vertex AI PaLM you will need to do one of the following:
+
+- pass the full JSON content of your GCP Vertex AI application credentials to the
+  ``VertexAIEmbeddingConfig`` as the ``api_key`` parameter (this will create a file in the ``/tmp``
+  directory with the content of the JSON, and set the ``GOOGLE_APPLICATION_CREDENTIALS``
+  environment variable to the **path** of the created file);
+- store the path to a manually created service account JSON file in the
+  ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable (for more information, see
+  https://python.langchain.com/docs/integrations/text_embedding/google_vertex_ai_palm);
+- or have the credentials configured for your environment (gcloud, workload identity, etc.).
+
+.. code:: python
+
+    import os
+
+    from unstructured.documents.elements import Text
+    from unstructured.embed.vertexai import VertexAIEmbeddingConfig, VertexAIEmbeddingEncoder
+
+    embedding_encoder = VertexAIEmbeddingEncoder(
+        config=VertexAIEmbeddingConfig(api_key=os.environ["VERTEXAI_GCP_APP_CREDS_JSON_CONTENT"])
+    )
+    elements = embedding_encoder.embed_documents(
+        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
+    )
+
+    query = "This is the query"
+    query_embedding = embedding_encoder.embed_query(query=query)
+
     [print(e.embeddings, e) for e in elements]
     print(query_embedding, query)
     print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
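A companion sketch (not part of this diff) for the second and third credential options described in the new docs. It assumes ``api_key`` may be omitted from ``VertexAIEmbeddingConfig`` when ambient credentials are available, as those options imply; the service-account path is a placeholder:

```python
import os

from unstructured.documents.elements import Text
from unstructured.embed.vertexai import VertexAIEmbeddingConfig, VertexAIEmbeddingEncoder

# Placeholder path: point this at your own service-account JSON file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

# Assumption: with ambient credentials configured, no api_key is passed.
embedding_encoder = VertexAIEmbeddingEncoder(config=VertexAIEmbeddingConfig())
elements = embedding_encoder.embed_documents(elements=[Text("This is sentence 1")])
print(elements[0].embeddings)
```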

examples/embed/example_vertexai.py

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+import os
+
+from unstructured.documents.elements import Text
+from unstructured.embed.vertexai import VertexAIEmbeddingConfig, VertexAIEmbeddingEncoder
+
+# To use Vertex AI PaLM you will need to do one of the following:
+# - pass the full JSON content of your GCP Vertex AI application credentials to the
+#   VertexAIEmbeddingConfig as the api_key parameter (this will create a file in the /tmp
+#   directory with the content of the JSON, and set the GOOGLE_APPLICATION_CREDENTIALS
+#   environment variable to the **path** of the created file);
+# - store the path to a manually created service account JSON file in the
+#   GOOGLE_APPLICATION_CREDENTIALS environment variable (for more information, see
+#   https://python.langchain.com/docs/integrations/text_embedding/google_vertex_ai_palm);
+# - or have the credentials configured for your environment (gcloud, workload identity, etc.).
+
+embedding_encoder = VertexAIEmbeddingEncoder(
+    config=VertexAIEmbeddingConfig(api_key=os.environ["VERTEXAI_GCP_APP_CREDS_JSON_CONTENT"])
+)
+
+elements = embedding_encoder.embed_documents(
+    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
+)
+
+query = "This is the query"
+query_embedding = embedding_encoder.embed_query(query=query)
+
+[print(e.embeddings, e) for e in elements]
+print(query_embedding, query)
+print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
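As a quick sanity check after running the example, a small sketch (not part of this diff) that cross-checks the encoder's ``is_unit_vector()`` claim against the actual L2 norm of the query vector:

```python
import math

# query_embedding is the list of floats produced by embed_query above.
norm = math.sqrt(sum(x * x for x in query_embedding))
print(
    f"query vector length={len(query_embedding)}, L2 norm={norm:.6f}, "
    f"encoder reports unit vectors: {embedding_encoder.is_unit_vector()}"
)
```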
requirements/ingest/embed-octoai.in

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+-c ../constraints.in
+-c ../base.txt
+openai
+tiktoken
requirements/ingest/embed-octoai.txt

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+#
+# This file is autogenerated by pip-compile with Python 3.9
+# by the following command:
+#
+#    pip-compile --output-file=ingest/embed-octoai.txt ingest/embed-octoai.in
+#
+anyio==3.7.1
+    # via
+    #   -c ingest/../constraints.in
+    #   httpx
+    #   openai
+certifi==2024.2.2
+    # via
+    #   -c ingest/../base.txt
+    #   -c ingest/../constraints.in
+    #   httpcore
+    #   httpx
+    #   requests
+charset-normalizer==3.3.2
+    # via
+    #   -c ingest/../base.txt
+    #   requests
+distro==1.9.0
+    # via openai
+exceptiongroup==1.2.0
+    # via anyio
+h11==0.14.0
+    # via httpcore
+httpcore==1.0.4
+    # via httpx
+httpx==0.27.0
+    # via openai
+idna==3.6
+    # via
+    #   -c ingest/../base.txt
+    #   anyio
+    #   httpx
+    #   requests
+openai==1.14.3
+    # via -r ingest/embed-octoai.in
+pydantic==1.10.14
+    # via
+    #   -c ingest/../constraints.in
+    #   openai
+regex==2023.12.25
+    # via
+    #   -c ingest/../base.txt
+    #   tiktoken
+requests==2.31.0
+    # via
+    #   -c ingest/../base.txt
+    #   tiktoken
+sniffio==1.3.1
+    # via
+    #   anyio
+    #   httpx
+    #   openai
+tiktoken==0.6.0
+    # via -r ingest/embed-octoai.in
+tqdm==4.66.2
+    # via
+    #   -c ingest/../base.txt
+    #   openai
+typing-extensions==4.10.0
+    # via
+    #   -c ingest/../base.txt
+    #   openai
+    #   pydantic
+urllib3==2.2.1
+    # via
+    #   -c ingest/../base.txt
+    #   requests
requirements/ingest/embed-vertexai.in

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+-c ../constraints.in
+-c ../base.txt
+langchain
+langchain-community
+langchain-google-vertexai
