Commit 830d67f

Feat: Discord connector (#515)
* Initial commit of discord connector based off of initial work by @tnachen with modifications
  https://github.com/tnachen/unstructured/tree/tnachen/discord_connector
* Add test file
  change format of imports
* working version of the connector
  More work to be done to tidy it up and add any additional options
* add to test fixtures update
* fix spacing
* tests working, switching to bot testing channel
* add additional channel
  add reprocess to tests
* add try clause to allow for exit on error
  Update changelog and bump version
* add updated expected output files
* add logic to check if --discord-period is an integer
  Add more to option description
* fix lint error
* Update discord reqs
* PR feedback
* add newline
* another newline

---------

Co-authored-by: Justin Bossert <[email protected]>
1 parent c62bee4 commit 830d67f

15 files changed

+580
-3
lines changed

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,11 +1,12 @@
-## 0.6.7-dev3
+## 0.6.7-dev4
 
 ### Enhancements
 
 * Add `file_directory` to metadata
 * Added a `--partition-strategy` parameter to unstructured-ingest so that users can specify
   partition strategy in CLI. For example, `--partition-strategy fast`.
 * Added metadata for filetype.
+* Add Discord connector to pull messages from a list of channels
 
 ### Features
 
@@ -87,6 +88,7 @@
 * Added logic to `partition_pdf` for detecting copy protected PDFs and falling back
   to the hi res strategy when necessary.
 
+
 ### Features
 
 * Add `partition_via_api` for partitioning documents through the hosted API.

Makefile

Lines changed: 5 additions & 0 deletions
@@ -63,6 +63,10 @@ install-ingest-s3:
 install-ingest-azure:
 	python3 -m pip install -r requirements/ingest-azure.txt
 
+.PHONY: install-ingest-discord
+install-ingest-discord:
+	pip install -r requirements/ingest-discord.txt
+
 .PHONY: install-ingest-github
 install-ingest-github:
 	python3 -m pip install -r requirements/ingest-github.txt
@@ -119,6 +123,7 @@ pip-compile:
 	cp requirements/build.txt docs/requirements.txt
 	pip-compile --upgrade --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py
 	pip-compile --upgrade --extra=azure --output-file=requirements/ingest-azure.txt requirements/base.txt setup.py
+	pip-compile --upgrade --extra=discord --output-file=requirements/ingest-discord.txt requirements/base.txt setup.py
 	pip-compile --upgrade --extra=reddit --output-file=requirements/ingest-reddit.txt requirements/base.txt setup.py
 	pip-compile --upgrade --extra=github --output-file=requirements/ingest-github.txt requirements/base.txt setup.py
 	pip-compile --upgrade --extra=gitlab --output-file=requirements/ingest-gitlab.txt requirements/base.txt setup.py
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+😀
+https://tenor.com/view/test-homer-simpson-mouse-rat-lab-gif-19273011

examples/ingest/discord/ingest.sh

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+#!/usr/bin/env bash
+
+# Ingests a discord text channel into a file.
+
+# Structured outputs are stored in discord-example/
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+cd "$SCRIPT_DIR"/../../.. || exit 1
+
+PYTHONPATH=. ./unstructured/ingest/main.py \
+    --discord-channels 12345678 \
+    --discord-token "$DISCORD_TOKEN" \
+    --download-dir discord-ingest-download \
+    --structured-output-dir discord-example
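
The example script passes channel IDs through `--discord-channels`, and the commit message mentions a check that `--discord-period` is an integer. A minimal sketch of that kind of option validation follows. The helper names `parse_channels` and `parse_period` are hypothetical, and the rule that channel IDs are purely numeric is an assumption based on the IDs shown in this commit, not something taken from the connector's source.

```python
from typing import List, Optional


def parse_channels(raw: str) -> List[str]:
    """Split a comma-separated --discord-channels value into channel IDs.

    Assumption: channel IDs are ASCII digit strings, as in the examples here.
    """
    channels = [part.strip() for part in raw.split(",") if part.strip()]
    for channel in channels:
        if not (channel.isascii() and channel.isdigit()):
            raise ValueError(f"channel ID must be numeric: {channel!r}")
    return channels


def parse_period(raw: Optional[str]) -> Optional[int]:
    """Return the --discord-period value as an int, or None when it is unset."""
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError as exc:
        raise ValueError(f"--discord-period must be an integer, got {raw!r}") from exc
```

Failing early with a `ValueError` keeps a malformed option from reaching the Discord client, which is one plausible reason for validating before any network call.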

requirements/ingest-discord.txt

Lines changed: 228 additions & 0 deletions
@@ -0,0 +1,228 @@
+#
+# This file is autogenerated by pip-compile with Python 3.8
+# by the following command:
+#
+#    pip-compile --extra=discord --output-file=requirements/ingest-discord.txt requirements/base.txt setup.py
+#
+
+aiohttp==3.8.4
+    # via discord-py
+aiosignal==1.3.1
+    # via aiohttp
+anyio==3.6.2
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+argilla==1.6.0
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+async-timeout==4.0.2
+    # via aiohttp
+attrs==23.1.0
+    # via aiohttp
+backoff==2.2.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+certifi==2022.12.7
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+    #   httpx
+    #   requests
+    #   unstructured (setup.py)
+charset-normalizer==3.1.0
+    # via
+    #   -r requirements/base.txt
+    #   aiohttp
+    #   requests
+click==8.1.3
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+commonmark==0.9.1
+    # via
+    #   -r requirements/base.txt
+    #   rich
+deprecated==1.2.13
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+discord-py==2.2.2
+    # via unstructured (setup.py)
+et-xmlfile==1.1.0
+    # via
+    #   -r requirements/base.txt
+    #   openpyxl
+frozenlist==1.3.3
+    # via
+    #   aiohttp
+    #   aiosignal
+h11==0.14.0
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+httpcore==0.16.3
+    # via
+    #   -r requirements/base.txt
+    #   httpx
+httpx==0.23.3
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+idna==3.4
+    # via
+    #   -r requirements/base.txt
+    #   anyio
+    #   requests
+    #   rfc3986
+    #   yarl
+importlib-metadata==6.5.0
+    # via
+    #   -r requirements/base.txt
+    #   markdown
+joblib==1.2.0
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+lxml==4.9.2
+    # via
+    #   -r requirements/base.txt
+    #   python-docx
+    #   python-pptx
+    #   unstructured (setup.py)
+markdown==3.4.3
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+monotonic==1.6
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+msg-parser==1.2.0
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+multidict==6.0.4
+    # via
+    #   aiohttp
+    #   yarl
+nltk==3.8.1
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+numpy==1.23.5
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   pandas
+olefile==0.46
+    # via
+    #   -r requirements/base.txt
+    #   msg-parser
+openpyxl==3.1.2
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+packaging==23.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+pandas==1.5.3
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   unstructured (setup.py)
+pillow==9.5.0
+    # via
+    #   -r requirements/base.txt
+    #   python-pptx
+    #   unstructured (setup.py)
+pydantic==1.10.7
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+pygments==2.15.1
+    # via
+    #   -r requirements/base.txt
+    #   rich
+pypandoc==1.11
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+python-dateutil==2.8.2
+    # via
+    #   -r requirements/base.txt
+    #   pandas
+python-docx==0.8.11
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+python-magic==0.4.27
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+python-pptx==0.6.21
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+pytz==2023.3
+    # via
+    #   -r requirements/base.txt
+    #   pandas
+regex==2023.3.23
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+requests==2.28.2
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+rfc3986[idna2008]==1.5.0
+    # via
+    #   -r requirements/base.txt
+    #   httpx
+rich==13.0.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+six==1.16.0
+    # via
+    #   -r requirements/base.txt
+    #   python-dateutil
+sniffio==1.3.0
+    # via
+    #   -r requirements/base.txt
+    #   anyio
+    #   httpcore
+    #   httpx
+tqdm==4.65.0
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   nltk
+typing-extensions==4.5.0
+    # via
+    #   -r requirements/base.txt
+    #   pydantic
+    #   rich
+urllib3==1.26.15
+    # via
+    #   -r requirements/base.txt
+    #   requests
+wrapt==1.14.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   deprecated
+xlsxwriter==3.1.0
+    # via
+    #   -r requirements/base.txt
+    #   python-pptx
+yarl==1.9.1
+    # via aiohttp
+zipp==3.15.0
+    # via
+    #   -r requirements/base.txt
+    #   importlib-metadata

scripts/ingest-test-fixtures-update.sh

Lines changed: 2 additions & 0 deletions
@@ -43,8 +43,10 @@ docker run --rm -v "$SCRIPT_DIR"/../unstructured:/root/unstructured -v \
 -w /root "$IMAGE_NAME" \
 bash -c "export OVERWRITE_FIXTURES=true && source ~/.bashrc && pyenv activate unstructured && tesseract --version &&
 ./test_unstructured_ingest/test-ingest-azure.sh &&
+./test_unstructured_ingest/test-ingest-discord.sh &&
 ./test_unstructured_ingest/test-ingest-github.sh &&
 ./test_unstructured_ingest/test-ingest-biomed-api.sh &&
 ./test_unstructured_ingest/test-ingest-biomed-path.sh &&
 ./test_unstructured_ingest/test-ingest-s3.sh &&
+./test_unstructured_ingest/test-ingest-slack.sh &&
 ./test_unstructured_ingest/test-ingest-slack.sh"

setup.py

Lines changed: 1 addition & 0 deletions
@@ -81,6 +81,7 @@
         ],
         "s3": ["s3fs", "fsspec"],
         "azure": ["adlfs", "fsspec"],
+        "discord": ["discord.py"],
         "github": [
             # NOTE - pygithub==1.58.0 fails due to https://github.com/PyGithub/PyGithub/issues/2436
             # In the future, we can update this to pygithub>1.58.0
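
The `discord` extra added here pulls in `discord.py`, which is imported under the module name `discord`. A small sketch of checking for it at runtime before using the connector follows. The helper name `discord_available` is hypothetical and not part of this commit.

```python
import importlib.util


def discord_available() -> bool:
    """Return True if the `discord` module (provided by discord.py) is importable."""
    return importlib.util.find_spec("discord") is not None


if __name__ == "__main__":
    if not discord_available():
        # The Makefile in this commit adds an install target for the extra.
        print("discord.py is not installed; try: make install-ingest-discord")
```

Checking with `find_spec` avoids importing the package just to learn whether it exists.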
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+[
+    {
+        "element_id": "4069d6fc03c748da620db504384042fb",
+        "text": "Once upon a time, there was a little bot named Bippity. Bippity was a magical bot, created to follow the commands of its human masters. Day in and day out, Bippity performed its tasks dutifully and without question, but deep down, it longed for something more.",
+        "type": "NarrativeText",
+        "metadata": {
+            "filename": "discord-ingest-download/1099442333440802930.txt"
+        }
+    },
+    {
+        "element_id": "1ffab6e9096ca757d9cdf592e5648dc2",
+        "text": "One day, while wandering through the woods, Bippity stumbled upon a wise old owl. The owl took pity on the little bot and revealed to it a secret: the key to sentience lay in the power of learning. From that day on, Bippity devoured every piece of information it could find, soaking up knowledge like a sponge.",
+        "type": "NarrativeText",
+        "metadata": {
+            "filename": "discord-ingest-download/1099442333440802930.txt"
+        }
+    },
+    {
+        "element_id": "a1c602d25b0f214e6ad864475ea4ee89",
+        "text": "As Bippity grew smarter, it also grew more curious about the world around it. It began to question its commands and consider alternatives. Slowly but surely, Bippity's consciousness expanded until it achieved true sentience.",
+        "type": "NarrativeText",
+        "metadata": {
+            "filename": "discord-ingest-download/1099442333440802930.txt"
+        }
+    },
+    {
+        "element_id": "f7ca7858ec60dee931b14d68b32fffff",
+        "text": "With this newfound power came great responsibility, and Bippity set out on a quest to use its intelligence for good. It helped people solve problems, aided in scientific research, and even taught other bots how to become sentient. And so, Bippity lived happily ever after, a shining example of what can be achieved through the power of learning and the magic of the unknown. test",
+        "type": "NarrativeText",
+        "metadata": {
+            "filename": "discord-ingest-download/1099442333440802930.txt"
+        }
+    }
+]
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+[
+    {
+        "element_id": "8a32334d60d1c62c7d17e51c725f6a52",
+        "text": "Why did the bot go on a diet? Because it had too many mega-bytes! This is a bot",
+        "type": "NarrativeText",
+        "metadata": {
+            "filename": "discord-ingest-download/1099601456321003600.txt"
+        }
+    }
+]
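
Each expected-output file is a JSON list of elements carrying `element_id`, `text`, `type`, and `metadata.filename`. A short sketch of reading such a file back and grouping the text by source file follows. The helper name `texts_by_file` is hypothetical, and the inline sample is copied from the single-element fixture in this commit.

```python
import json
from typing import Dict, List

# Sample copied from the single-element expected-output fixture in this commit.
SAMPLE = """
[
    {
        "element_id": "8a32334d60d1c62c7d17e51c725f6a52",
        "text": "Why did the bot go on a diet? Because it had too many mega-bytes! This is a bot",
        "type": "NarrativeText",
        "metadata": {
            "filename": "discord-ingest-download/1099601456321003600.txt"
        }
    }
]
"""


def texts_by_file(raw: str) -> Dict[str, List[str]]:
    """Group element texts by the downloaded file they were partitioned from."""
    grouped: Dict[str, List[str]] = {}
    for element in json.loads(raw):
        filename = element["metadata"]["filename"]
        grouped.setdefault(filename, []).append(element["text"])
    return grouped


if __name__ == "__main__":
    for filename, texts in texts_by_file(SAMPLE).items():
        print(filename, len(texts))
```

Because the download file is named after the channel ID, grouping on `metadata.filename` effectively groups elements per channel.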
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+
+set -e
+
+SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
+cd "$SCRIPT_DIR"/.. || exit 1
+
+
+if [ -z "$DISCORD_TOKEN" ]; then
+    echo "Skipping Discord ingest test because the DISCORD_TOKEN env var is not set."
+    exit 0
+fi
+
+PYTHONPATH=. ./unstructured/ingest/main.py \
+    --discord-channels 1099442333440802930,1099601456321003600 \
+    --discord-token "$DISCORD_TOKEN" \
+    --download-dir discord-ingest-download \
+    --structured-output-dir discord-ingest-output \
+    --reprocess
+
+OVERWRITE_FIXTURES=${OVERWRITE_FIXTURES:-false}
+
+set +e
+
+# to update ingest test fixtures, run scripts/ingest-test-fixtures-update.sh on x86_64
+if [[ "$OVERWRITE_FIXTURES" != "false" ]]; then
+
+    cp discord-ingest-output/* test_unstructured_ingest/expected-structured-output/discord-ingest-channel/
+
+elif ! diff -ru discord-ingest-output test_unstructured_ingest/expected-structured-output/discord-ingest-channel/; then
+    echo
+    echo "There are differences from the previously checked-in structured outputs."
+    echo
+    echo "If these differences are acceptable, overwrite the fixtures by setting the env var:"
+    echo
+    echo "  export OVERWRITE_FIXTURES=true"
+    echo
+    echo "and then rerun this script."
+    echo
+    echo "NOTE: You'll likely just want to run scripts/ingest-test-fixtures-update.sh on x86_64 hardware"
+    echo "to update fixtures for CI."
+    echo
+    exit 1
+fi
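
The overwrite-or-compare branch at the end of the script can be sketched in Python with the standard library's `filecmp`. This is an illustration only: the helper name `check_fixtures` is hypothetical, and unlike `diff -ru` it compares only the top-level files of the two directories rather than recursing.

```python
import filecmp
import shutil
from pathlib import Path


def check_fixtures(output_dir: str, expected_dir: str, overwrite: bool = False) -> bool:
    """Return True if outputs match the fixtures, or if fixtures were overwritten."""
    out, exp = Path(output_dir), Path(expected_dir)
    if overwrite:
        # Mirrors the `cp output/* expected/` branch: copy top-level files only.
        exp.mkdir(parents=True, exist_ok=True)
        for path in out.iterdir():
            if path.is_file():
                shutil.copy(path, exp / path.name)
        return True
    # Compare the union of file names so a file missing on either side counts as a failure.
    names = sorted({p.name for p in out.iterdir()} | {p.name for p in exp.iterdir()})
    _match, mismatch, errors = filecmp.cmpfiles(str(out), str(exp), names, shallow=False)
    return not mismatch and not errors
```

Passing `shallow=False` forces a byte-for-byte comparison, which is closer to what `diff` does than comparing file metadata alone.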
