Commit 54a6db1

feat: Add Wikipedia ingest connector (#299)
The connector processes a Wikipedia page and outputs its HTML, its plain-text contents, and its summary. No API key is required. Also adds a test case verifying that three files are created (one for the HTML, one for the text, one for the summary).
1 parent a74d389 commit 54a6db1
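
To illustrate the "three files per page" behaviour described above, here is a minimal Python sketch. It is not the connector's code from this commit: the function names and the file-naming scheme are hypothetical, and only `wikipedia.page`, `.html()`, `.content`, and `.summary` come from the `wikipedia` package that this commit adds as a dependency.

```python
from pathlib import Path
from typing import Dict


def write_page_outputs(
    title: str, html: str, text: str, summary: str, out_dir: str
) -> Dict[str, Path]:
    """Write the three representations of one page and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme (an assumption, not taken from the commit).
    stem = title.replace(" ", "-")
    paths = {
        "html": out / f"{stem}.html",
        "text": out / f"{stem}.txt",
        "summary": out / f"{stem}-summary.txt",
    }
    paths["html"].write_text(html, encoding="utf8")
    paths["text"].write_text(text, encoding="utf8")
    paths["summary"].write_text(summary, encoding="utf8")
    return paths


def fetch_page(title: str) -> Dict[str, str]:
    """Fetch a page with the `wikipedia` package (needs network access)."""
    import wikipedia  # imported lazily so write_page_outputs works without it

    page = wikipedia.page(title)
    return {"html": page.html(), "text": page.content, "summary": page.summary}
```

A caller with network access could combine the two as `write_page_outputs("Open Source Software", out_dir="out", **fetch_page("Open Source Software"))`.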

12 files changed: 414 additions and 6 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions
@@ -109,6 +109,7 @@ jobs:
         make check-coverage
         make install-ingest-s3
         make install-ingest-github
+        make install-ingest-wikipedia
         ./test_unstructured_ingest/test-ingest.sh
 
   changelog:

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,9 +1,11 @@
-## 0.4.17-dev0
+## 0.4.17-dev1
 
 ### Enhancements
 
 ### Features
 
+* Added Wikipedia connector for ingest cli.
+
 ### Fixes
 
 * Fix `process_document` file cleaning on failure

Makefile

Lines changed: 8 additions & 3 deletions
@@ -62,6 +62,10 @@ install-ingest-github:
 install-ingest-reddit:
 	pip install -r requirements/ingest-reddit.txt
 
+.PHONY: install-ingest-wikipedia
+install-ingest-wikipedia:
+	pip install -r requirements/ingest-wikipedia.txt
+
 .PHONY: install-unstructured-inference
 install-unstructured-inference:
 	pip install -r requirements/local-inference.txt
@@ -90,9 +94,10 @@ pip-compile:
 	# NOTE(robinson) - doc/requirements.txt is where the GitHub action for building
 	# sphinx docs looks for additional requirements
 	cp requirements/build.txt docs/requirements.txt
-	pip-compile --upgrade --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py
-	pip-compile --upgrade --extra=reddit --output-file=requirements/ingest-reddit.txt requirements/base.txt setup.py
-	pip-compile --upgrade --extra=github --output-file=requirements/ingest-github.txt requirements/base.txt setup.py
+	pip-compile --upgrade --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py
+	pip-compile --upgrade --extra=reddit --output-file=requirements/ingest-reddit.txt requirements/base.txt setup.py
+	pip-compile --upgrade --extra=github --output-file=requirements/ingest-github.txt requirements/base.txt setup.py
+	pip-compile --upgrade --extra=wikipedia --output-file=requirements/ingest-wikipedia.txt requirements/base.txt setup.py
 
 ## install-project-local: install unstructured into your local python environment
 .PHONY: install-project-local

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+
+# Processes the Wikipedia page "Open Source Software"
+# through Unstructured's library in 2 processes.
+
+# Structured outputs are stored in wikipedia-ingest-output/
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+cd "$SCRIPT_DIR"/../../.. || exit 1
+
+PYTHONPATH=. ./unstructured/ingest/main.py \
+    --wikipedia-page-title "Open Source Software" \
+    --structured-output-dir wikipedia-ingest-output \
+    --num-processes 2 \
+    --verbose
+
+# Alternatively, you can call it using:
+# unstructured-ingest --wikipedia-page-title "..." ...

requirements/ingest-wikipedia.txt

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
+#
+# This file is autogenerated by pip-compile with Python 3.8
+# by the following command:
+#
+#    pip-compile --extra=wikipedia --output-file=requirements/ingest-wikipedia.txt requirements/base.txt setup.py
+#
+anyio==3.6.2
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+argilla==1.3.0
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+backoff==2.2.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+beautifulsoup4==4.11.2
+    # via wikipedia
+certifi==2022.12.7
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+    #   httpx
+    #   requests
+    #   unstructured (setup.py)
+charset-normalizer==3.0.1
+    # via
+    #   -r requirements/base.txt
+    #   requests
+click==8.1.3
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+colorama==0.4.6
+    # via
+    #   click
+    #   tqdm
+deprecated==1.2.13
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+et-xmlfile==1.1.0
+    # via
+    #   -r requirements/base.txt
+    #   openpyxl
+h11==0.14.0
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+httpcore==0.16.3
+    # via
+    #   -r requirements/base.txt
+    #   httpx
+httpx==0.23.3
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+idna==3.4
+    # via
+    #   -r requirements/base.txt
+    #   anyio
+    #   requests
+    #   rfc3986
+joblib==1.2.0
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+lxml==4.9.2
+    # via
+    #   -r requirements/base.txt
+    #   python-docx
+    #   python-pptx
+    #   unstructured (setup.py)
+monotonic==1.6
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+nltk==3.8.1
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+numpy==1.23.5
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   pandas
+openpyxl==3.1.1
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+packaging==23.0
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+pandas==1.5.3
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   unstructured (setup.py)
+pillow==9.4.0
+    # via
+    #   -r requirements/base.txt
+    #   python-pptx
+    #   unstructured (setup.py)
+pydantic==1.10.4
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+python-dateutil==2.8.2
+    # via
+    #   -r requirements/base.txt
+    #   pandas
+python-docx==0.8.11
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+python-magic==0.4.27
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+python-pptx==0.6.21
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+pytz==2022.7.1
+    # via
+    #   -r requirements/base.txt
+    #   pandas
+regex==2022.10.31
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+requests==2.28.2
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+    #   wikipedia
+rfc3986[idna2008]==1.5.0
+    # via
+    #   -r requirements/base.txt
+    #   httpx
+six==1.16.0
+    # via
+    #   -r requirements/base.txt
+    #   python-dateutil
+sniffio==1.3.0
+    # via
+    #   -r requirements/base.txt
+    #   anyio
+    #   httpcore
+    #   httpx
+soupsieve==2.4
+    # via beautifulsoup4
+tqdm==4.64.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   nltk
+typing-extensions==4.4.0
+    # via
+    #   -r requirements/base.txt
+    #   pydantic
+urllib3==1.26.14
+    # via
+    #   -r requirements/base.txt
+    #   requests
+wikipedia==1.4.0
+    # via unstructured (setup.py)
+wrapt==1.14.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   deprecated
+xlsxwriter==3.0.8
+    # via
+    #   -r requirements/base.txt
+    #   python-pptx

setup.py

Lines changed: 1 addition & 0 deletions
@@ -84,6 +84,7 @@
             "pygithub==1.57.0",
         ],
         "reddit": ["praw"],
+        "wikipedia": ["wikipedia"],
     },
     package_dir={"unstructured": "unstructured"},
     package_data={"unstructured": ["nlp/*.txt"]},
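
Because `wikipedia` is declared as an optional extra above, code that relies on it may run in environments where it is not installed. The sketch below shows one way to check for an optional module before use. Both helper names are hypothetical and do not appear in this commit; the `pip install` hint simply uses pip's standard extras syntax with the `wikipedia` extra defined in `setup.py`.

```python
import importlib.util


def extra_available(module_name: str) -> bool:
    """Return True if a top-level optional module can be imported here."""
    return importlib.util.find_spec(module_name) is not None


def require_extra(module_name: str, extra: str) -> None:
    """Raise a descriptive error when an optional dependency is missing."""
    if not extra_available(module_name):
        raise ImportError(
            f"{module_name} is not installed; "
            f"try: pip install 'unstructured[{extra}]'"
        )
```

Calling `require_extra("wikipedia", "wikipedia")` at the start of a connector would fail early with an actionable message rather than with a bare `ModuleNotFoundError` deep in the call stack.
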

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+cd "$SCRIPT_DIR"/.. || exit 1
+
+PYTHONPATH=. ./unstructured/ingest/main.py \
+    --wikipedia-page-title "Open Source Software" \
+    --structured-output-dir wikipedia-ingest-output \
+    --num-processes 2 \
+    --verbose
+
+if [ "$(find 'wikipedia-ingest-output' -type f -printf '.' | wc -c)" != 3 ]; then
+    echo
+    echo "3 files should have been created."
+    exit 1
+fi
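
The check above prints one dot per regular file and counts the dots. Note that `-printf` is a GNU `find` extension, so the script assumes GNU findutils. As a rough, portable equivalent, the same count can be sketched in Python. The function name is hypothetical, and unlike `find -type f` this version also counts symlinks that point to files.

```python
from pathlib import Path


def count_files(directory: str) -> int:
    """Count regular files anywhere under `directory`, recursively."""
    return sum(1 for p in Path(directory).rglob("*") if p.is_file())
```

A test would then assert `count_files("wikipedia-ingest-output") == 3` after the ingest run.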

test_unstructured_ingest/test-ingest.sh

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ cd "$SCRIPT_DIR"/.. || exit 1
 
 ./test_unstructured_ingest/test-ingest-s3.sh
 ./test_unstructured_ingest/test-ingest-github.sh
+./test_unstructured_ingest/test-ingest-wikipedia.sh

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.4.17-dev0"  # pragma: no cover
+__version__ = "0.4.17-dev1"  # pragma: no cover

unstructured/documents/xml.py

Lines changed: 1 addition & 1 deletion
@@ -84,6 +84,6 @@ def from_string(cls, text: str, parser: VALID_PARSERS = None, stylesheet: Option
 
     @classmethod
     def from_file(cls, filename, parser: VALID_PARSERS = None, stylesheet: Optional[str] = None):
-        with open(filename, "r+") as f:
+        with open(filename, "r+", encoding="utf8") as f:
             content = f.read()
         return cls.from_string(content, parser=parser, stylesheet=stylesheet)
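
Without an explicit `encoding`, Python's `open()` falls back to a platform-dependent default, which can mis-decode or reject non-ASCII text such as Wikipedia HTML. The sketch below shows the same idea in isolation. The helper name is hypothetical, and it uses mode `"r"` rather than the `"r+"` in the diff, since reading does not need write access.

```python
def read_text_utf8(filename: str) -> str:
    """Read a file as UTF-8 regardless of the platform's default encoding."""
    with open(filename, "r", encoding="utf8") as f:
        return f.read()
```

Any file written as UTF-8 then decodes identically on every platform, whatever the locale settings are.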
