Commit 9090435

feat/remove unstructured as a required dependency to run ingest (#23)
* Refactor to support unstructured as an optional (extra) dependency
* Still generate base.txt for CI
* Downgrade version of unstructured back to 0.15.1
* Fix example script paths
* Make python version a string
* Add explicit dependency for click when using CLI
* Add cli dep to unit tests
* Add dataclasses_json to base deps
* Remove unstructured as a required dependency from v1
* Add additional dependency
* Update dependencies
* Fix setup.py file
* Isolate backoff dep
* Make requests a connector-specific dep
* Make httpx a connector-specific dep
* Make httpx a connector-specific dep
* Add tqdm to base deps
* Fix assign_and_map_hash_ids logic
* tidy
* bugfix assign_and_map_hash_ids
* Add numpy constraint <2
* Create DataSourceMetadata extension to pass for data_source_metadata
* Apply assign_and_map_hash_ids after chunking
* Add click to base deps
* Fix assign_and_map_hash_ids and add unit test around it
* Remove reference to old install-cli make target
* remove reference to old cli extra
* Add some more requires_dependencies annotations
* Fix embedder bug
* unit test both chunking strategies
1 parent 955b7de commit 9090435
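
The commit message mentions `requires_dependencies` annotations as the way optional packages are isolated. The decorator's implementation is not shown on this page, so the following is only a minimal sketch of how such a guard could work. The name `requires_dependencies_sketch`, its signature, and the error text are assumptions made for illustration, not the project's API.

```python
import importlib.util
from functools import wraps
from typing import Callable, Optional, Sequence


def requires_dependencies_sketch(
    dependencies: Sequence[str], extras: Optional[str] = None
) -> Callable:
    """Hypothetical guard: fail at call time if an optional package is missing."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be located.
            missing = [d for d in dependencies if importlib.util.find_spec(d) is None]
            if missing:
                hint = f" (install the '{extras}' extra)" if extras else ""
                raise ImportError(f"Missing dependencies: {', '.join(missing)}{hint}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(["json"])
def uses_stdlib() -> str:
    import json  # imported lazily, only when the function actually runs

    return json.dumps({"ok": True})


@requires_dependencies_sketch(["surely_not_installed_pkg_xyz"], extras="example")
def uses_missing() -> None:
    pass
```

Because the check runs inside the wrapper, importing the module that defines `uses_missing` succeeds even when the optional package is absent; the `ImportError` only surfaces when the guarded function is called.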

194 files changed: +4668 -3016 lines changed

.github/workflows/e2e.yml

Lines changed: 24 additions & 0 deletions
@@ -107,6 +107,30 @@ jobs:
           docker compose version
           ./test_e2e/test-src.sh
 
+  test_src_api:
+    runs-on: ubuntu-latest-m
+    needs: [ setup ]
+    steps:
+      # actions/checkout MUST come before auth
+      - uses: 'actions/checkout@v4'
+      - name: Set up Python 3.10
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+      - name: Get full Python version
+        id: full-python-version
+        run: echo version=$(python -c "import sys; print('-'.join(str(v) for v in sys.version_info))") >> $GITHUB_OUTPUT
+      - name: Install limited dependencies
+        run: |
+          make install-client
+          make install-base
+      - name: Run test against remote API
+        env:
+          UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
+        run: |
+          ./test_e2e/src/against-api.sh
+
+
 
   test_dest:
     environment: ci
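
The `Get full Python version` step in the workflow diff above builds its output by joining every field of `sys.version_info` with hyphens. As a small illustration, the same expression can be run locally; the helper name `full_python_version` is hypothetical and exists only for this sketch.

```python
import sys


def full_python_version() -> str:
    # Same expression as the workflow step: join all five version_info fields
    # (major, minor, micro, releaselevel, serial) with hyphens.
    return "-".join(str(v) for v in sys.version_info)


print(full_python_version())  # e.g. 3-10-12-final-0 on a CPython 3.10.12 release build
```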

.github/workflows/unit_tests.yml

Lines changed: 3 additions & 1 deletion
@@ -94,9 +94,11 @@ jobs:
         uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
-      - name: Validate --help
+      - name: Install local deps
         run: |
           pip install .
+      - name: Validate --help
+        run: |
           unstructured-ingest --help
 
   test_ingest_unit:

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -205,3 +205,5 @@ metricsdiff.txt
 
 # analysis
 annotated/
+
+/tmp_ingest

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,10 +1,11 @@
-## 0.0.4-dev2
+## 0.0.4-dev3
 
 ### Enhancements
 
 * **Add Couchbase Destination Connector** Adds support for storing artifacts in Couchbase DB for Vector Search
 * **Leverage pydantic base models** All user-supplied configs are now derived from pydantic base models to leverage better type checking and add built in support for sensitive fields.
 * **Autogenerate click options from base models** Leverage the pydantic base models for all configs to autogenerate the cli options exposed when running ingest as a CLI.
+* **Drop required Unstructured dependency** Unstructured was moved to an extra dependency to only be imported when needed for functionality such as local partitioning/chunking.
 
 ## 0.0.3
 
Makefile

Lines changed: 5 additions & 0 deletions
@@ -14,6 +14,11 @@ pip-compile:
 install-lint:
 	pip install -r requirements/lint.txt
 
+
+.PHONY: install-client
+install-client:
+	pip install -r requirements/remote/client.txt
+
 .PHONY: install-test
 install-test:
 	pip install -r requirements/test.txt

requirements/common/base.in

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
 -c constraints.txt
 
 python-dateutil
-unstructured
 pandas
 pydantic
+dataclasses_json
+tqdm
+click
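
With `unstructured` dropped from `base.in`, the base install reduces to the six packages left in the file, while the `-c constraints.txt` line only pins versions and installs nothing itself. The commit page does not show how `setup.py` consumes these files, so the following is a sketch under assumptions: `parse_requirements` is a hypothetical helper that extracts requirement names from the text of a `.in` file, skipping pip option lines, comments, and blanks.

```python
from typing import List


def parse_requirements(text: str) -> List[str]:
    """Hypothetical helper: return the requirement lines of a pip `.in` file."""
    requirements = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comment lines, and pip options such as `-c` or `-r`.
        if not line or line.startswith(("#", "-")):
            continue
        requirements.append(line)
    return requirements


# Contents of requirements/common/base.in after this commit.
BASE_IN = """\
-c constraints.txt

python-dateutil
pandas
pydantic
dataclasses_json
tqdm
click
"""

print(parse_requirements(BASE_IN))
# ['python-dateutil', 'pandas', 'pydantic', 'dataclasses_json', 'tqdm', 'click']
```

A list like this could feed `install_requires`, with connector-specific files mapped to extras in the same way, though that wiring is an assumption rather than something shown in the diff.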

requirements/common/base.txt

Lines changed: 13 additions & 131 deletions
@@ -2,166 +2,48 @@
 # This file is autogenerated by pip-compile with Python 3.9
 # by the following command:
 #
-#    pip-compile ./base.in
+#    pip-compile ./requirements/common/base.in
 #
 annotated-types==0.7.0
     # via pydantic
-anyio==3.7.1
-    # via
-    #   -c ./constraints.txt
-    #   httpx
-backoff==2.2.1
-    # via unstructured
-beautifulsoup4==4.12.3
-    # via unstructured
-certifi==2024.7.4
-    # via
-    #   -c ./constraints.txt
-    #   httpcore
-    #   httpx
-    #   requests
-    #   unstructured-client
-chardet==5.2.0
-    # via unstructured
-charset-normalizer==3.3.2
-    # via
-    #   requests
-    #   unstructured-client
 click==8.1.7
-    # via nltk
+    # via -r ./requirements/common/base.in
 dataclasses-json==0.6.7
-    # via
-    #   unstructured
-    #   unstructured-client
-deepdiff==7.0.1
-    # via unstructured-client
-emoji==2.12.1
-    # via unstructured
-exceptiongroup==1.2.2
-    # via anyio
-filetype==1.2.0
-    # via unstructured
-h11==0.14.0
-    # via httpcore
-httpcore==1.0.5
-    # via httpx
-httpx==0.27.0
-    # via unstructured-client
-idna==3.7
-    # via
-    #   anyio
-    #   httpx
-    #   requests
-    #   unstructured-client
-joblib==1.4.2
-    # via nltk
-jsonpath-python==1.0.6
-    # via unstructured-client
-langdetect==1.0.9
-    # via unstructured
-lxml==5.3.0
-    # via unstructured
+    # via -r ./requirements/common/base.in
 marshmallow==3.21.3
-    # via
-    #   dataclasses-json
-    #   unstructured-client
+    # via dataclasses-json
 mypy-extensions==1.0.0
-    # via
-    #   typing-inspect
-    #   unstructured-client
-nest-asyncio==1.6.0
-    # via unstructured-client
-nltk==3.8.1
-    # via
-    #   -c ./constraints.txt
-    #   unstructured
+    # via typing-inspect
 numpy==1.26.4
     # via
+    #   -c ./requirements/common/constraints.txt
     #   pandas
-    #   unstructured
-ordered-set==4.1.0
-    # via deepdiff
 packaging==23.2
     # via
-    #   -c ./constraints.txt
+    #   -c ./requirements/common/constraints.txt
     #   marshmallow
-    #   unstructured-client
 pandas==2.2.2
-    # via -r ./base.in
-psutil==6.0.0
-    # via unstructured
+    # via -r ./requirements/common/base.in
 pydantic==2.8.2
-    # via -r ./base.in
+    # via -r ./requirements/common/base.in
 pydantic-core==2.20.1
     # via pydantic
-pypdf==4.3.1
-    # via unstructured-client
 python-dateutil==2.9.0.post0
     # via
-    #   -r ./base.in
+    #   -r ./requirements/common/base.in
     #   pandas
-    #   unstructured-client
-python-iso639==2024.4.27
-    # via unstructured
-python-magic==0.4.27
-    # via unstructured
 pytz==2024.1
     # via pandas
-rapidfuzz==3.9.6
-    # via unstructured
-regex==2024.7.24
-    # via nltk
-requests==2.32.3
-    # via
-    #   requests-toolbelt
-    #   unstructured
-    #   unstructured-client
-requests-toolbelt==1.0.0
-    # via unstructured-client
 six==1.16.0
-    # via
-    #   langdetect
-    #   python-dateutil
-    #   unstructured-client
-sniffio==1.3.1
-    # via
-    #   anyio
-    #   httpx
-soupsieve==2.5
-    # via beautifulsoup4
-tabulate==0.9.0
-    # via unstructured
+    # via python-dateutil
 tqdm==4.66.5
-    # via
-    #   nltk
-    #   unstructured
+    # via -r ./requirements/common/base.in
 typing-extensions==4.12.2
     # via
-    #   emoji
     #   pydantic
     #   pydantic-core
-    #   pypdf
     #   typing-inspect
-    #   unstructured
-    #   unstructured-client
 typing-inspect==0.9.0
-    # via
-    #   dataclasses-json
-    #   unstructured-client
+    # via dataclasses-json
 tzdata==2024.1
     # via pandas
-unstructured==0.15.1
-    # via -r ./base.in
-unstructured-client==0.25.4
-    # via
-    #   -c ./constraints.txt
-    #   unstructured
-urllib3==1.26.19
-    # via
-    #   -c ./constraints.txt
-    #   requests
-    #   unstructured-client
-wrapt==1.16.0
-    # via
-    #   -c ./constraints.txt
-    #   unstructured

requirements/common/constraints.txt

Lines changed: 3 additions & 0 deletions
@@ -63,3 +63,6 @@ langchain-community>=0.2.5
 importlib-metadata==7.1.0
 
 nltk==3.8.1
+
+unstructured==0.15.1
+numpy<2

Lines changed: 1 addition & 2 deletions
@@ -1,4 +1,3 @@
--c ../common/constraints.txt
--c ../common/base.txt
+-r ../common/base.in
 
 pyairtable

requirements/connectors/airtable.txt

Lines changed: 42 additions & 22 deletions
@@ -2,50 +2,70 @@
 # This file is autogenerated by pip-compile with Python 3.9
 # by the following command:
 #
-#    pip-compile ./connectors/airtable.in
+#    pip-compile ./requirements/connectors/airtable.in
 #
 annotated-types==0.7.0
-    # via
-    #   -c ./connectors/../common/base.txt
-    #   pydantic
+    # via pydantic
 certifi==2024.7.4
     # via
-    #   -c ./connectors/../common/base.txt
-    #   -c ./connectors/../common/constraints.txt
+    #   -c ./requirements/connectors/../common/constraints.txt
     #   requests
 charset-normalizer==3.3.2
-    # via
-    #   -c ./connectors/../common/base.txt
-    #   requests
+    # via requests
+click==8.1.7
+    # via -r ./requirements/connectors/../common/base.in
+dataclasses-json==0.6.7
+    # via -r ./requirements/connectors/../common/base.in
 idna==3.7
-    # via
-    #   -c ./connectors/../common/base.txt
-    #   requests
+    # via requests
 inflection==0.5.1
     # via pyairtable
+marshmallow==3.21.3
+    # via dataclasses-json
+mypy-extensions==1.0.0
+    # via typing-inspect
+numpy==1.26.4
+    # via
+    #   -c ./requirements/connectors/../common/constraints.txt
+    #   pandas
+packaging==23.2
+    # via
+    #   -c ./requirements/connectors/../common/constraints.txt
+    #   marshmallow
+pandas==2.2.2
+    # via -r ./requirements/connectors/../common/base.in
 pyairtable==2.3.3
-    # via -r ./connectors/airtable.in
+    # via -r ./requirements/connectors/airtable.in
 pydantic==2.8.2
     # via
-    #   -c ./connectors/../common/base.txt
+    #   -r ./requirements/connectors/../common/base.in
     #   pyairtable
 pydantic-core==2.20.1
+    # via pydantic
+python-dateutil==2.9.0.post0
     # via
-    #   -c ./connectors/../common/base.txt
-    #   pydantic
+    #   -r ./requirements/connectors/../common/base.in
+    #   pandas
+pytz==2024.1
+    # via pandas
 requests==2.32.3
-    # via
-    #   -c ./connectors/../common/base.txt
-    #   pyairtable
+    # via pyairtable
+six==1.16.0
+    # via python-dateutil
+tqdm==4.66.5
+    # via -r ./requirements/connectors/../common/base.in
 typing-extensions==4.12.2
     # via
-    #   -c ./connectors/../common/base.txt
     #   pyairtable
     #   pydantic
     #   pydantic-core
+    #   typing-inspect
+typing-inspect==0.9.0
+    # via dataclasses-json
+tzdata==2024.1
+    # via pandas
 urllib3==1.26.19
     # via
-    #   -c ./connectors/../common/base.txt
-    #   -c ./connectors/../common/constraints.txt
+    #   -c ./requirements/connectors/../common/constraints.txt
     #   pyairtable
     #   requests
