Commit 34d563c

feat: Create spacy notebook example (#593)
* add new notebook for spacy
1 parent 7eac1f8 commit 34d563c

5 files changed

+270
-2
lines changed

example-docs/fake-memo.pdf

13.1 KB
Binary file not shown.

examples/spacy/README.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
# Loading `unstructured` outputs into Spacy
2+
3+
The following example shows how to load `unstructured` outputs into Spacy.
4+
This lets you apply NLP to find important data in the outputs that the `unstructured`
5+
library has extracted.
6+
Follow the instructions [here](https://spacy.io/usage)
7+
to install Spacy on your system.
8+
9+
Once you have installed Spacy, download the `en_core_web_sm` model that the notebook loads.
10+
You can download it with the command `python -m spacy download en_core_web_sm`,
11+
run from the same environment that you use for Jupyter.
12+
13+
## Running the example
14+
15+
1. Run `pip install -r requirements.txt` to install the Python dependencies.
16+
1. Run `jupyter-notebook` to start.
17+
1. Run the `load-into-spacy.ipynb` notebook.
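
Before starting Jupyter, it can help to confirm that the packages the notebook imports are visible in the current environment. The following is a minimal sketch, not part of this commit: `missing_packages` is a hypothetical helper, and the package list is assumed from the notebook's imports and its `spacy.load("en_core_web_sm")` call.

```python
import importlib.util

# Assumption: these are the top-level packages the notebook needs.
# "en_core_web_sm" is the model name the notebook passes to spacy.load.
REQUIRED = ["unstructured", "spacy", "en_core_web_sm"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If the model is reported missing, installing it with `python -m spacy download en_core_web_sm` should resolve the error that `spacy.load` would otherwise raise.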

examples/spacy/load-into-spacy.ipynb

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "2fac3543",
6+
"metadata": {},
7+
"source": [
8+
"# Loading Data into Spacy"
9+
]
10+
},
11+
{
12+
"cell_type": "markdown",
13+
"id": "30bc0a1b",
14+
"metadata": {},
15+
"source": [
16+
"This notebook shows how to start a Spacy project from Unstructured's Elements, which you can then use as the input to your own NLP projects.\n",
17+
"\n",
18+
"Make sure you have Spacy installed on your local computer before running this notebook. If not, you can find the instructions for installation [here](https://spacy.io/usage)."
19+
]
20+
},
21+
{
22+
"cell_type": "markdown",
23+
"id": "ac83c096",
24+
"metadata": {},
25+
"source": [
26+
"# Preprocess Documents with Unstructured"
27+
]
28+
},
29+
{
30+
"cell_type": "markdown",
31+
"id": "a29ef57d",
32+
"metadata": {},
33+
"source": [
34+
"First, we'll pre-process a document using the `unstructured` library. The example documents are available under the `example-docs` directory in the `unstructured` repo. At the end of this section, we'll wind up with a list of `Element` objects whose text we can pass to Spacy."
35+
]
36+
},
37+
{
38+
"cell_type": "code",
39+
"execution_count": 3,
40+
"id": "adb6b8f7",
41+
"metadata": {},
42+
"outputs": [],
43+
"source": [
44+
"import os\n",
45+
"\n",
46+
"from unstructured.partition.auto import partition"
47+
]
48+
},
49+
{
50+
"cell_type": "code",
51+
"execution_count": 8,
52+
"id": "8464299b",
53+
"metadata": {},
54+
"outputs": [],
55+
"source": [
56+
"# NOTE: Update this directory if you are running the notebook\n",
57+
"# from somewhere other than the examples/spacy folder in the\n",
58+
"# unstructured repo\n",
59+
"EXAMPLE_DOCS_FOLDER = \"../../example-docs/\""
60+
]
61+
},
62+
{
63+
"cell_type": "code",
64+
"execution_count": 9,
65+
"id": "2fd24424",
66+
"metadata": {},
67+
"outputs": [],
68+
"source": [
69+
"document_to_process = \"fake-memo.pdf\"\n",
70+
"filename = os.path.join(EXAMPLE_DOCS_FOLDER, document_to_process)\n",
71+
"elements = partition(filename=filename, strategy=\"fast\")"
72+
]
73+
},
74+
{
75+
"cell_type": "code",
76+
"execution_count": 10,
77+
"id": "0aa45e81",
78+
"metadata": {},
79+
"outputs": [
80+
{
81+
"data": {
82+
"text/plain": [
83+
"'May 5, 2023'"
84+
]
85+
},
86+
"execution_count": 10,
87+
"metadata": {},
88+
"output_type": "execute_result"
89+
}
90+
],
91+
"source": [
92+
"elements[0].text"
93+
]
94+
},
95+
{
96+
"cell_type": "code",
97+
"execution_count": 11,
98+
"id": "2429f8a5",
99+
"metadata": {},
100+
"outputs": [
101+
{
102+
"data": {
103+
"text/plain": [
104+
"{'filename': 'fake-memo.pdf',\n",
105+
" 'file_directory': '../../example-docs',\n",
106+
" 'filetype': 'application/pdf',\n",
107+
" 'page_number': 1}"
108+
]
109+
},
110+
"execution_count": 11,
111+
"metadata": {},
112+
"output_type": "execute_result"
113+
}
114+
],
115+
"source": [
116+
"elements[0].metadata.to_dict()"
117+
]
118+
},
119+
{
120+
"cell_type": "markdown",
121+
"id": "1fd556ff",
122+
"metadata": {},
123+
"source": [
124+
"# Extract Numbers Using Spacy\n"
125+
]
126+
},
127+
{
128+
"cell_type": "markdown",
129+
"id": "bdf2cefe",
130+
"metadata": {},
131+
"source": [
132+
"Now let's import `spacy` and create a function to extract noun phrases with numbers. First we'll use a simple example, then we'll use the text extracted by `unstructured`.\n",
133+
"\n",
134+
"The function first creates a spacy `Doc` from the text, then iterates over its tokens to find numbers that modify a noun. It then formats each match and appends it to a list."
135+
]
136+
},
137+
{
138+
"cell_type": "code",
139+
"execution_count": 1,
140+
"id": "bfd20f75",
141+
"metadata": {},
142+
"outputs": [
143+
{
144+
"name": "stdout",
145+
"output_type": "stream",
146+
"text": [
147+
"Number: 10, Noun: apples, Context: 10 apples\n",
148+
"Number: 5, Noun: oranges, Context: 5 oranges\n"
149+
]
150+
}
151+
],
152+
"source": [
153+
"import spacy\n",
154+
"\n",
155+
"nlp = spacy.load(\"en_core_web_sm\")\n",
156+
"\n",
157+
"def extract_numbers_with_context(text):\n",
158+
"    doc = nlp(text)\n",
159+
"    numbers = []\n",
160+
"\n",
161+
"    for token in doc:\n",
162+
"        if token.like_num and token.dep_ == 'nummod' and token.head.pos_ == 'NOUN':\n",
163+
"            number = token.text\n",
164+
"            noun = token.head.text\n",
165+
"            context = ' '.join([number, noun])\n",
166+
"            numbers.append((number, noun, context))\n",
167+
"\n",
168+
"    return numbers\n",
169+
"\n",
170+
"# Example usage\n",
171+
"text = \"I bought 10 apples and 5 oranges yesterday.\"\n",
172+
"numbers_with_context = extract_numbers_with_context(text)\n",
173+
"\n",
174+
"for number, noun, context in numbers_with_context:\n",
175+
"    print(f\"Number: {number}, Noun: {noun}, Context: {context}\")"
176+
]
177+
},
178+
{
179+
"cell_type": "markdown",
180+
"id": "7eae9735",
181+
"metadata": {},
182+
"source": [
183+
"### Using the Data Extracted with Unstructured's Library"
184+
]
185+
},
186+
{
187+
"cell_type": "code",
188+
"execution_count": 28,
189+
"id": "7c738f91",
190+
"metadata": {},
191+
"outputs": [],
192+
"source": [
193+
"numbers_with_context = extract_numbers_with_context(elements[2].text)"
194+
]
195+
},
196+
{
197+
"cell_type": "code",
198+
"execution_count": 29,
199+
"id": "3459555b",
200+
"metadata": {},
201+
"outputs": [
202+
{
203+
"name": "stdout",
204+
"output_type": "stream",
205+
"text": [
206+
"Number: 20,000, Noun: bottles, Context: 20,000 bottles\n",
207+
"Number: 10,000, Noun: blankets, Context: 10,000 blankets\n",
208+
"Number: 200, Noun: laptops, Context: 200 laptops\n",
209+
"Number: 3, Noun: trucks, Context: 3 trucks\n",
210+
"Number: 15, Noun: hours, Context: 15 hours\n"
211+
]
212+
}
213+
],
214+
"source": [
215+
"for number, noun, context in numbers_with_context:\n",
216+
"    print(f\"Number: {number}, Noun: {noun}, Context: {context}\")"
217+
]
218+
},
219+
{
220+
"cell_type": "code",
221+
"execution_count": null,
222+
"id": "dadd055a",
223+
"metadata": {},
224+
"outputs": [],
225+
"source": []
226+
}
227+
],
228+
"metadata": {
229+
"kernelspec": {
230+
"display_name": "Python 3 (ipykernel)",
231+
"language": "python",
232+
"name": "python3"
233+
},
234+
"language_info": {
235+
"codemirror_mode": {
236+
"name": "ipython",
237+
"version": 3
238+
},
239+
"file_extension": ".py",
240+
"mimetype": "text/x-python",
241+
"name": "python",
242+
"nbconvert_exporter": "python",
243+
"pygments_lexer": "ipython3",
244+
"version": "3.8.15"
245+
}
246+
},
247+
"nbformat": 4,
248+
"nbformat_minor": 5
249+
}
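
The notebook applies `extract_numbers_with_context` to a single element, `elements[2]`. A minimal sketch of running an extractor over every element is shown below. `collect_numbers` is a hypothetical helper, not part of the commit, and `regex_extractor` is a simplified stand-in for the spacy-based function so that the snippet runs without a spacy model; it matches a number followed by any word, which is looser than the `nummod` dependency check in the notebook.

```python
import re
from types import SimpleNamespace


def collect_numbers(elements, extractor):
    """Apply extractor to each element's text and tag results with the element index."""
    results = []
    for index, element in enumerate(elements):
        # Elements without text are treated as empty strings.
        text = getattr(element, "text", "") or ""
        for number, noun, context in extractor(text):
            results.append(
                {"element": index, "number": number, "noun": noun, "context": context}
            )
    return results


def regex_extractor(text):
    """Stand-in extractor: a number followed by whitespace and one alphabetic word."""
    pattern = re.compile(r"(\d[\d,]*)\s+([A-Za-z]+)")
    return [(m.group(1), m.group(2), m.group(0)) for m in pattern.finditer(text)]


# Stand-in objects with a .text attribute, mimicking the elements used in the notebook.
elements = [
    SimpleNamespace(text="May 5, 2023"),
    SimpleNamespace(text="We need 200 laptops and 3 trucks."),
]

rows = collect_numbers(elements, regex_extractor)
for row in rows:
    print(row)
```

In the notebook, the same helper could be called as `collect_numbers(elements, extract_numbers_with_context)`, since that function returns `(number, noun, context)` tuples in the same shape.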

examples/spacy/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
unstructured[local-inference]
2+
spacy

test_unstructured_ingest/test-ingest-against-api.sh

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
1616

1717
set +e
1818

19-
if [ "$(find 'api-ingest-output' -type f -printf '.' | wc -c)" != 4 ]; then
19+
if [ "$(find 'api-ingest-output' -type f -printf '.' | wc -c)" != 5 ]; then
2020
echo
21-
echo "4 files should have been created."
21+
echo "5 files should have been created."
2222
exit 1
2323
fi
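
The changed check counts the regular files under `api-ingest-output` by printing one character per file and counting the characters. The same idea can be sketched in Python as follows. `count_files` and `output_is_complete` are hypothetical names, and unlike `find -type f`, `Path.is_file()` also returns true for symlinks that point to files.

```python
from pathlib import Path


def count_files(root):
    """Count files at any depth under root."""
    return sum(1 for path in Path(root).rglob("*") if path.is_file())


def output_is_complete(root, expected=5):
    """Return True when the output directory holds exactly the expected number of files."""
    return count_files(root) == expected
```

The expected count moves from 4 to 5 in this commit, which matches the addition of `example-docs/fake-memo.pdf` as one more input document.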
