Commit 6b400b4

MthwRobinson, fzowl, and Liuhong99 authored
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged in to a feature branch to fix dependency and linting issues. Application code changes from the original PR were already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings
Voyage AI’s embedding models and rerankers are state-of-the-art in retrieval accuracy.

---------
Co-authored-by: fzowl <[email protected]>
Co-authored-by: Liuhong99 <[email protected]>
1 parent 32df4ee commit 6b400b4

41 files changed (+20601 -56 lines changed)

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,9 +1,10 @@
-## 0.14.3-dev4
+## 0.14.3-dev5

 ### Enhancements

 * **Move `category` field from Text class to Element class.**
 * **`partition_docx()` now supports pluggable picture sub-partitioners.** A subpartitioner that accepts a DOCX `Paragraph` and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
+* **Add VoyageAI embedder** Adds VoyageAI embeddings to support embedding via Voyage AI.

 ### Features

examples/embed/example_voyageai.py

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+import os
+
+from unstructured.documents.elements import Text
+from unstructured.embed.voyageai import VoyageAIEmbeddingConfig, VoyageAIEmbeddingEncoder
+
+# To use Voyage AI you will need to pass
+# Voyage AI API Key (obtained from https://dash.voyageai.com/)
+# as the ``api_key`` parameter.
+#
+# The ``model_name`` parameter is mandatory, please check the available models
+# at https://docs.voyageai.com/docs/embeddings
+
+embedding_encoder = VoyageAIEmbeddingEncoder(
+    config=VoyageAIEmbeddingConfig(api_key=os.environ["VOYAGE_API_KEY"], model_name="voyage-law-2")
+)
+elements = embedding_encoder.embed_documents(
+    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
+)
+
+query = "This is the query"
+query_embedding = embedding_encoder.embed_query(query=query)
+
+[print(e, e.embeddings) for e in elements]
+print(query, query_embedding)
+print(embedding_encoder.is_unit_vector, embedding_encoder.num_of_dimensions)
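
The example file stops at printing the raw embeddings. A common next step is to rank the embedded elements against the query embedding by cosine similarity. The sketch below is not part of this commit: it assumes embeddings are plain lists of floats, the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the vectors are made-up stand-ins rather than Voyage AI output, so it runs without an API key.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, texts, embeddings):
    """Return (score, text) pairs sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), text)
        for text, emb in zip(texts, embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up 3-dimensional vectors standing in for real embeddings.
query_embedding = [1.0, 0.0, 0.0]
texts = ["This is sentence 1", "This is sentence 2"]
embeddings = [[0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]

ranked = rank_by_similarity(query_embedding, texts, embeddings)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

When every vector already has unit length, which is what the example's `is_unit_vector` property is meant to report, the two norms are both 1 and the cosine reduces to a plain dot product.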

requirements/base.txt

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ tabulate==0.9.0
     # via -r ./base.in
 tqdm==4.66.4
     # via nltk
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -r ./base.in
     #   emoji

requirements/deps/constraints.txt

Lines changed: 4 additions & 1 deletion
@@ -57,7 +57,10 @@ unstructured-client<=0.18.0

 fsspec==2024.5.0

-# python 3.12 support
+# python 3.12 support
 numpy>=1.26.0
 wrapt>=1.14.0

+
+# NOTE(robinson): for compatiblity with voyage embeddings
+langsmith==0.1.62

requirements/dev.txt

Lines changed: 3 additions & 3 deletions
@@ -151,7 +151,7 @@ jsonschema-specifications==2023.12.1
     #   jsonschema
 jupyter==1.0.0
     # via -r ./dev.in
-jupyter-client==8.6.1
+jupyter-client==8.6.2
     # via
     #   ipykernel
     #   jupyter-console
@@ -185,7 +185,7 @@ jupyter-server==2.14.0
     #   notebook-shim
 jupyter-server-terminals==0.5.3
     # via jupyter-server
-jupyterlab==4.2.0
+jupyterlab==4.2.1
     # via notebook
 jupyterlab-pygments==0.3.0
     # via nbconvert
@@ -392,7 +392,7 @@ traitlets==5.14.3
     #   qtconsole
 types-python-dateutil==2.9.0.20240316
     # via arrow
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   -c ./test.txt

requirements/extra-docx.txt

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ python-docx==1.1.2
     # via
     #   -c ././deps/constraints.txt
     #   -r ./extra-docx.in
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   python-docx

requirements/extra-odt.txt

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ python-docx==1.1.2
     # via
     #   -c ././deps/constraints.txt
     #   -r ./extra-odt.in
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   python-docx

requirements/extra-paddleocr.txt

Lines changed: 3 additions & 3 deletions
@@ -8,7 +8,7 @@ attrdict==2.0.1
     # via unstructured-paddleocr
 babel==2.15.0
     # via flask-babel
-bce-python-sdk==0.9.10
+bce-python-sdk==0.9.11
     # via visualdl
 blinker==1.8.2
     # via flask
@@ -45,7 +45,7 @@ flask==3.0.3
     #   visualdl
 flask-babel==4.0.0
     # via visualdl
-fonttools==4.51.0
+fonttools==4.52.1
     # via matplotlib
 future==1.0.0
     # via bce-python-sdk
@@ -200,7 +200,7 @@ six==1.16.0
     #   imgaug
     #   python-dateutil
     #   visualdl
-tifffile==2024.5.10
+tifffile==2024.5.22
     # via scikit-image
 tqdm==4.66.4
     # via

requirements/extra-pdf-image.txt

Lines changed: 3 additions & 3 deletions
@@ -39,7 +39,7 @@ filelock==3.14.0
     #   transformers
 flatbuffers==24.3.25
     # via onnxruntime
-fonttools==4.51.0
+fonttools==4.52.1
     # via matplotlib
 fsspec==2024.5.0
     # via
@@ -118,7 +118,7 @@ numpy==1.26.4
     #   transformers
 omegaconf==2.3.0
     # via effdet
-onnx==1.16.0
+onnx==1.16.1
     # via
     #   -r ./extra-pdf-image.in
     #   unstructured-inference
@@ -278,7 +278,7 @@ tqdm==4.66.4
     #   transformers
 transformers==4.41.1
     # via unstructured-inference
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   huggingface-hub

requirements/huggingface.txt

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ tqdm==4.66.4
     #   transformers
 transformers==4.41.1
     # via -r ./huggingface.in
-typing-extensions==4.11.0
+typing-extensions==4.12.0
     # via
     #   -c ./base.txt
     #   huggingface-hub
