Commit 4ff6a5b

Roman/bugfix support bedrock embeddings (#2650)
### Description

This PR resolves the following open issue: [bug/bedrock-encoder-not-supported-in-ingest](https://github.com/Unstructured-IO/unstructured/issues/2319).

To do so, the following changes were made:

* All AWS configs were added as input parameters to the CLI.
* These are mapped to the Bedrock embedder when an embedder is generated via `get_embedder`.
* An ingest test was added that calls the AWS Bedrock service.
* Requirements for boto were bumped because the Bedrock runtime, which is required to reach the Bedrock service, was first introduced in version `1.34.63`, which is ahead of the version of boto previously pinned.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
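The mapping described above, from CLI-level AWS parameters to the embedder's configuration, can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `AwsEmbeddingOptions`, `StubBedrockEmbedder`, and `build_embedder` are hypothetical names, and the three AWS fields and the `"bedrock"` provider string are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AwsEmbeddingOptions:
    """Hypothetical AWS settings as they might arrive from CLI flags."""

    provider: str
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None
    region_name: Optional[str] = None


class StubBedrockEmbedder:
    """Stand-in for a Bedrock embedder; it only records the config it receives."""

    def __init__(self, config: Dict[str, str]) -> None:
        self.config = config


def build_embedder(options: AwsEmbeddingOptions) -> StubBedrockEmbedder:
    """Map CLI-level options onto the embedder, dropping values that were not set."""
    if options.provider != "bedrock":  # provider name is an assumption
        raise ValueError(f"unsupported provider: {options.provider}")
    candidates = {
        "aws_access_key_id": options.aws_access_key_id,
        "aws_secret_access_key": options.aws_secret_access_key,
        "region_name": options.region_name,
    }
    config = {key: value for key, value in candidates.items() if value is not None}
    return StubBedrockEmbedder(config)


embedder = build_embedder(
    AwsEmbeddingOptions(
        provider="bedrock",
        aws_access_key_id="example-key-id",
        aws_secret_access_key="example-secret",
        region_name="us-east-1",
    )
)
print(sorted(embedder.config))
```

Dropping unset values lets the embedder fall back to its own defaults, which is one plausible design; the actual `get_embedder` may handle missing values differently.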
1 parent 9177aa2 commit 4ff6a5b

11 files changed: +20363 -55 lines changed


.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ repos:
 - id: mixed-line-ending

 - repo: https://github.com/psf/black
-rev: 22.10.0
+rev: 24.2.0
 hooks:
 - id: black
 args: ["--line-length=100"]
@@ -28,7 +28,7 @@ repos:
 ["--fix"]

 - repo: https://github.com/pycqa/flake8
-rev: 4.0.1
+rev: 7.0.0
 hooks:
 - id: flake8
 language_version: python3

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@
 * **Clarify IAM Role Requirement for GCS Platform Connectors**. The GCS Source Connector requires Storage Object Viewer and GCS Destination Connector requires Storage Object Creator IAM roles.
 * **Fix OneDrive dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See previous fix for SharePoint
 * **Adds tracking for AstraDB** Adds tracking info so AstraDB can see what source called their api.
+* **Support AWS Bedrock Embeddings in ingest CLI** The configs required to instantiate the bedrock embedding class are now exposed in the api and the version of boto being used meets the minimum requirement to introduce the bedrock runtime required to hit the service.
+>>>>>>> 6a63c941c (bump changelog)

 ## 0.12.6

Makefile

Lines changed: 2 additions & 2 deletions
@@ -383,7 +383,7 @@ check-shfmt:

 .PHONY: check-black
 check-black:
-black . --check
+black . --check --line-length=100

 .PHONY: check-flake8
 check-flake8:
@@ -429,7 +429,7 @@ tidy-shell:
 tidy-python:
 ruff . --fix-only || true
 autoflake --in-place .
-black .
+black --line-length=100 .

 ## version-sync: update __version__.py with most recent version from CHANGELOG.md
 .PHONY: version-sync

requirements/constraints.in

Lines changed: 0 additions & 4 deletions
@@ -3,10 +3,6 @@
 # extras. Putting a dependency here will only affect dependency sets that contain them -- in other
 # words, if something does not require a constraint, it will not be installed.
 ####################################################################################################
-# NOTE(alan): Pinning to avoid conflicts with downstream ingest-s3
-urllib3<1.27, >=1.25.4
-boto3<1.28.18
-botocore<1.31.18
 # consistency with local-inference-pin
 protobuf<4.24
 # NOTE(robinson) - Required pins for security scans
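
The pins removed from `requirements/constraints.in` capped `boto3` below `1.28.18`, so the `1.34.63` version named in the description could not resolve while they were in place. A minimal sketch of that check follows, assuming plain dotted numeric versions; `parse_version` and `allowed_by_removed_pin` are hypothetical helpers written for this illustration, and real tooling would more likely rely on `packaging.version.Version`.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '1.34.63' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Upper bound taken from the removed constraint `boto3<1.28.18`.
REMOVED_UPPER_BOUND = parse_version("1.28.18")


def allowed_by_removed_pin(version: str) -> bool:
    """Return True if `version` would have satisfied the removed boto3 pin."""
    return parse_version(version) < REMOVED_UPPER_BOUND


old_ok = allowed_by_removed_pin("1.28.17")  # version previously compiled
new_ok = allowed_by_removed_pin("1.34.63")  # version compiled after this commit
print(old_ok, new_ok)
```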

requirements/ingest/embed-aws-bedrock.txt

Lines changed: 30 additions & 32 deletions
@@ -2,41 +2,38 @@
 # This file is autogenerated by pip-compile with Python 3.9
 # by the following command:
 #
-# pip-compile --output-file=ingest/embed-aws-bedrock.txt ingest/embed-aws-bedrock.in
+# pip-compile embed-aws-bedrock.in
 #
 aiohttp==3.9.3
 # via langchain-community
 aiosignal==1.3.1
 # via aiohttp
 anyio==3.7.1
 # via
-# -c ingest/../constraints.in
+# -c ../constraints.in
 # langchain-core
 async-timeout==4.0.3
 # via aiohttp
 attrs==23.2.0
 # via aiohttp
-boto3==1.28.17
+boto3==1.34.63
+# via -r embed-aws-bedrock.in
+botocore==1.34.63
 # via
-# -c ingest/../constraints.in
-# -r ingest/embed-aws-bedrock.in
-botocore==1.31.17
-# via
-# -c ingest/../constraints.in
 # boto3
 # s3transfer
 certifi==2024.2.2
 # via
-# -c ingest/../base.txt
-# -c ingest/../constraints.in
+# -c ../base.txt
+# -c ../constraints.in
 # requests
 charset-normalizer==3.3.2
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # requests
 dataclasses-json==0.6.4
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # langchain-community
 exceptiongroup==1.2.0
 # via anyio
@@ -46,7 +43,7 @@ frozenlist==1.4.1
 # aiosignal
 idna==3.6
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # anyio
 # requests
 # yarl
@@ -58,82 +55,83 @@ jsonpatch==1.33
 # via langchain-core
 jsonpointer==2.4
 # via jsonpatch
-langchain-community==0.0.20
-# via -r ingest/embed-aws-bedrock.in
-langchain-core==0.1.23
+langchain-community==0.0.28
+# via -r embed-aws-bedrock.in
+langchain-core==0.1.32
 # via langchain-community
-langsmith==0.0.87
+langsmith==0.1.26
 # via
 # langchain-community
 # langchain-core
 marshmallow==3.20.2
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # dataclasses-json
 multidict==6.0.5
 # via
 # aiohttp
 # yarl
 mypy-extensions==1.0.0
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # typing-inspect
 numpy==1.26.4
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # langchain-community
+orjson==3.9.15
+# via langsmith
 packaging==23.2
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # langchain-core
 # marshmallow
 pydantic==1.10.14
 # via
-# -c ingest/../constraints.in
+# -c ../constraints.in
 # langchain-core
 # langsmith
 python-dateutil==2.8.2
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # botocore
 pyyaml==6.0.1
 # via
 # langchain-community
 # langchain-core
 requests==2.31.0
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # langchain-community
 # langchain-core
 # langsmith
-s3transfer==0.6.2
+s3transfer==0.10.1
 # via boto3
 six==1.16.0
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # python-dateutil
-sniffio==1.3.0
+sniffio==1.3.1
 # via anyio
-sqlalchemy==2.0.27
+sqlalchemy==2.0.28
 # via langchain-community
 tenacity==8.2.3
 # via
 # langchain-community
 # langchain-core
 typing-extensions==4.9.0
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # pydantic
 # sqlalchemy
 # typing-inspect
 typing-inspect==0.9.0
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # dataclasses-json
 urllib3==1.26.18
 # via
-# -c ingest/../base.txt
-# -c ingest/../constraints.in
+# -c ../base.txt
 # botocore
 # requests
 yarl==1.9.4
requirements/ingest/s3.txt

Lines changed: 12 additions & 15 deletions
@@ -2,9 +2,9 @@
 # This file is autogenerated by pip-compile with Python 3.9
 # by the following command:
 #
-# pip-compile --output-file=ingest/s3.txt ingest/s3.in
+# pip-compile s3.in
 #
-aiobotocore==2.7.0
+aiobotocore==2.12.1
 # via s3fs
 aiohttp==3.9.3
 # via
@@ -18,21 +18,19 @@ async-timeout==4.0.3
 # via aiohttp
 attrs==23.2.0
 # via aiohttp
-botocore==1.31.17
-# via
-# -c ingest/../constraints.in
-# aiobotocore
+botocore==1.34.51
+# via aiobotocore
 frozenlist==1.4.1
 # via
 # aiohttp
 # aiosignal
 fsspec==2024.2.0
 # via
-# -r ingest/s3.in
+# -r s3.in
 # s3fs
 idna==3.6
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # yarl
 jmespath==1.0.1
 # via botocore
@@ -42,26 +40,25 @@ multidict==6.0.5
 # yarl
 python-dateutil==2.8.2
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # botocore
 s3fs==2024.2.0
-# via -r ingest/s3.in
+# via -r s3.in
 six==1.16.0
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # python-dateutil
 typing-extensions==4.9.0
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # aioitertools
 urllib3==1.26.18
 # via
-# -c ingest/../base.txt
-# -c ingest/../constraints.in
+# -c ../base.txt
 # botocore
 wrapt==1.16.0
 # via
-# -c ingest/../base.txt
+# -c ../base.txt
 # aiobotocore
 yarl==1.9.4
 # via aiohttp
