
Commit fb16847

feat: Staging brick for attention window chunking (#34)
* add huggingface dependencies and re pip-compile
* first pass on chunk by attention window
* test for chunking function
* completed tests for chunk_by_attention_window
* change default buffer size to 2
* wrapper function for staging
* added docs for transformers
* fix wording and typos
* updated change log and bumped the version
* added docs on huggingface dependencies
* fix typo
* re pip-compile
1 parent ec5be8e commit fb16847

14 files changed: +373 −20 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,6 @@
-## 0.2.1-dev6
+## 0.2.1-dev7

+* Added staging brick for separating text into attention window size chunks for `transformers`.
 * Added staging brick for LabelBox.
 * Added ability to upload LabelStudio predictions
 * Added utility function for JSONL reading and writing

Makefile

Lines changed: 8 additions & 1 deletion
@@ -20,13 +20,18 @@ install-base: install-base-pip-packages install-nltk-models
 install: install-base-pip-packages install-dev install-detectron2 install-nltk-models install-test

 .PHONY: install-ci
-install-ci: install-base-pip-packages install-pdf install-test install-nltk-models
+install-ci: install-base-pip-packages install-pdf install-test install-nltk-models install-huggingface

 .PHONY: install-base-pip-packages
 install-base-pip-packages:
 	python3 -m pip install pip==${PIP_VERSION}
 	pip install -r requirements/base.txt

+.PHONY: install-huggingface
+install-huggingface:
+	python3 -m pip install pip==${PIP_VERSION}
+	pip install -r requirements/huggingface.txt
+
 .PHONY: install-pdf
 install-pdf:
 	python3 -m pip install pip==${PIP_VERSION}
@@ -60,6 +65,8 @@ install-build:
 .PHONY: pip-compile
 pip-compile:
 	pip-compile -o requirements/base.txt
+	# Extra requirements for huggingface staging functions
+	pip-compile --extra huggingface -o requirements/huggingface.txt
 	# Extra requirements for parsing PDF files
 	pip-compile --extra pdf -o requirements/pdf.txt
 	# NOTE(robinson) - We want the dependencies for detectron2 in the requirements.txt, but not

docs/requirements.txt

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ pygments==2.13.0
     # via sphinx
 pyparsing==3.0.9
     # via packaging
-pytz==2022.2.1
+pytz==2022.4
     # via babel
 requests==2.28.1
     # via sphinx
@@ -58,5 +58,5 @@ sphinxcontrib-serializinghtml==1.1.5
     # via sphinx
 urllib3==1.26.12
     # via requests
-zipp==3.8.1
+zipp==3.9.0
     # via importlib-metadata

docs/source/bricks.rst

Lines changed: 85 additions & 0 deletions
@@ -338,6 +338,91 @@ Examples:
   isd = convert_to_isd(elements)


+``stage_for_transformers``
+--------------------------
+
+Prepares ``Text`` elements for processing in ``transformers`` pipelines
+by splitting the elements into chunks that fit into the model's attention window.
+
+Examples:
+
+.. code:: python
+
+  from transformers import AutoTokenizer, AutoModelForTokenClassification
+  from transformers import pipeline
+
+  from unstructured.documents.elements import NarrativeText
+  from unstructured.staging.huggingface import stage_for_transformers
+
+  model_name = "hf-internal-testing/tiny-bert-for-token-classification"
+  tokenizer = AutoTokenizer.from_pretrained(model_name)
+  model = AutoModelForTokenClassification.from_pretrained(model_name)
+
+  nlp = pipeline("ner", model=model, tokenizer=tokenizer)
+
+  text = """From frost advisories this morning to a strong cold front expected later this week, the chance of fall showing up is real.
+
+  There's a refreshing crispness to the air, and it looks to get only more pronounced as the week goes on.
+
+  Frost advisories were in place this morning across portions of the Appalachians and coastal Maine as temperatures dropped into the 30s.
+
+  Temperatures this morning were in the 40s as far south as the Florida Panhandle.
+
+  And Maine even had a few reports of their first snow of the season Sunday. More cities could see their first snow later this week.
+
+  Yes, hello fall!
+
+  As temperatures moderate during the next few days, much of the east will stay right around seasonal norms, but the next blast of cold air will be strong and come with the potential for hazardous conditions.
+
+  "A more active fall weather pattern is expected to evolve by the end of this week and continuing into the weekend as a couple of cold fronts move across the central and eastern states," the Weather Prediction Center said.
+
+  The potent cold front will come in from Canada with a punch of chilly air, heavy rain and strong wind.
+
+  The Weather Prediction Center has a slight risk of excessive rainfall for much of the Northeast and New England on Thursday, including places like New York City, Buffalo and Burlington, so we will have to look out for flash flooding in these areas.
+
+  "More impactful weather continues to look likely with confidence growing that our region will experience the first real fall-like system with gusty to strong winds and a period of moderate to heavy rain along and ahead of a cold front passage," the National Weather Service office in Burlington wrote.
+
+  The potential for very heavy rain could accompany the front, bringing up to two inches of rain for much of the area, and isolated locations could see even more.
+
+  "Ensembles [forecast models] show median rainfall totals by Wednesday night around a half inch, with a potential for some spots to see around one inch, our first substantial rainfall in at least a couple of weeks," the weather service office in Grand Rapids noted, adding, "It may also get cold enough for some snow to mix in Thursday night to Friday morning, especially in the higher terrain north of Grand Rapids toward Cadillac."
+
+  There is also a chance for very strong winds to accompany the system.
+
+  The weather service is forecasting winds of 30-40 mph ahead of the cold front, which could cause some tree limbs to fall and sporadic power outages.
+
+  Behind the front, temperatures will fall.
+
+  "East Coast, with highs about 5-15 degrees below average to close out the workweek and going into next weekend, with highs only in the 40s and 50s from the Great Lakes to the Northeast on most days," the Weather Prediction Center explained.
+
+  By the weekend, a second cold front will drop down from Canada and bring a reinforcing shot of chilly air across the eastern half of the country."""
+
+  chunks = stage_for_transformers([NarrativeText(text=text)], tokenizer)
+
+  results = [nlp(chunk) for chunk in chunks]
+
+
+The following optional keyword arguments can be specified in
+``stage_for_transformers``:
+
+* ``buffer``: Indicates the number of tokens to leave as a buffer for the attention window. This is to account for special tokens like ``[CLS]`` that can appear at the beginning or end of an input sequence.
+* ``max_input_size``: The size of the attention window for the model. If not specified, the default is the ``model_max_length`` attribute on the tokenizer object.
+* ``split_function``: The function used to split the text into chunks to consider for adding to the attention window. Splits on spaces by default.
+* ``chunk_separator``: The string used to concatenate adjacent chunks when reconstructing the text. Uses spaces by default.
+
+If you need to operate on text directly instead of ``unstructured`` ``Text``
+objects, use the ``chunk_by_attention_window`` helper function. Simply modify
+the example above to include the following:
+
+.. code:: python
+
+  from unstructured.staging.huggingface import chunk_by_attention_window
+
+  chunks = chunk_by_attention_window(text, tokenizer)
+
+  results = [nlp(chunk) for chunk in chunks]
+
+
 ``stage_for_label_studio``
 --------------------------

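The optional keyword arguments documented in the new section above can be combined in a single call. The sketch below is illustrative rather than part of the commit; it reuses the ``tokenizer``, ``nlp``, and ``text`` variables from the example above, and the paragraph-based ``split_function`` is just one possible choice:

  from unstructured.documents.elements import NarrativeText
  from unstructured.staging.huggingface import stage_for_transformers

  chunks = stage_for_transformers(
      [NarrativeText(text=text)],
      tokenizer,
      buffer=10,                                 # reserve extra room for special tokens like [CLS]
      max_input_size=512,                        # cap the window instead of using tokenizer.model_max_length
      split_function=lambda t: t.split("\n\n"),  # consider paragraph-sized pieces instead of single words
      chunk_separator="\n\n",                    # rejoin pieces with blank lines when rebuilding each chunk
  )

  results = [nlp(chunk) for chunk in chunks]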

docs/source/installing.rst

Lines changed: 19 additions & 0 deletions
@@ -52,3 +52,22 @@ Also ensure that you have ``poppler`` installed on your system. On a Mac, you ca
 .. code:: console

   $ brew install poppler
+
+
+========================
+Huggingface Dependencies
+========================
+
+The ``transformers`` library requires the Rust compiler to be present on your system in
+order to properly ``pip`` install. If a Rust compiler is not available on your system,
+you can run the following command to install it:
+
+.. code:: console
+
+  $ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+
+Additionally, some tokenizers in the ``transformers`` library require the ``sentencepiece``
+library. This is not included as an ``unstructured`` dependency because it only applies
+to some tokenizers. See the
+`sentencepiece install instructions <https://github.com/google/sentencepiece#installation>`_ for
+information on how to install ``sentencepiece`` if your tokenizer requires it.
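For tokenizers that do need it, ``sentencepiece`` is published on PyPI and can usually be installed directly; the command below is a general example and not part of this commit:

  $ pip install sentencepiece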

requirements/build.txt

Lines changed: 6 additions & 2 deletions
@@ -20,6 +20,8 @@ idna==3.4
     # via requests
 imagesize==1.4.1
     # via sphinx
+importlib-metadata==5.0.0
+    # via sphinx
 jinja2==3.1.2
     # via sphinx
 markupsafe==2.1.1
@@ -38,10 +40,10 @@ snowballstemmer==2.2.0
     # via sphinx
 sphinx==5.2.3
     # via
-    #   -r build.in
+    #   -r requirements/build.in
     #   sphinx-rtd-theme
 sphinx-rtd-theme==1.0.0
-    # via -r build.in
+    # via -r requirements/build.in
 sphinxcontrib-applehelp==1.0.2
     # via sphinx
 sphinxcontrib-devhelp==1.0.2
@@ -56,3 +58,5 @@ sphinxcontrib-serializinghtml==1.1.5
     # via sphinx
 urllib3==1.26.12
     # via requests
+zipp==3.9.0
+    # via importlib-metadata

requirements/dev.txt

Lines changed: 17 additions & 3 deletions
@@ -4,6 +4,10 @@
 #
 # pip-compile requirements/dev.in
 #
+appnope==0.1.3
+    # via
+    #   ipykernel
+    #   ipython
 argon2-cffi==21.3.0
     # via notebook
 argon2-cffi-bindings==21.2.0
@@ -36,6 +40,10 @@ executing==1.0.0
     # via stack-data
 fastjsonschema==2.16.2
     # via nbformat
+importlib-metadata==5.0.0
+    # via nbconvert
+importlib-resources==5.10.0
+    # via jsonschema
 ipykernel==6.15.3
     # via
     #   ipywidgets
@@ -45,7 +53,7 @@ ipykernel==6.15.3
     #   qtconsole
 ipython==8.5.0
     # via
-    #   -r dev.in
+    #   -r requirements/dev.in
     #   ipykernel
     #   ipywidgets
     #   jupyter-console
@@ -64,7 +72,7 @@ jinja2==3.1.2
 jsonschema==4.16.0
     # via nbformat
 jupyter==1.0.0
-    # via -r dev.in
+    # via -r requirements/dev.in
 jupyter-client==7.3.5
     # via
     #   ipykernel
@@ -133,7 +141,9 @@ pexpect==4.8.0
 pickleshare==0.7.5
     # via ipython
 pip-tools==6.9.0
-    # via -r dev.in
+    # via -r requirements/dev.in
+pkgutil-resolve-name==1.3.10
+    # via jsonschema
 prometheus-client==0.14.1
     # via notebook
 prompt-toolkit==3.0.31
@@ -220,6 +230,10 @@ wheel==0.37.1
     # via pip-tools
 widgetsnbextension==4.0.3
     # via ipywidgets
+zipp==3.9.0
+    # via
+    #   importlib-metadata
+    #   importlib-resources

 # The following packages are considered to be unsafe in a requirements file:
 # pip

requirements/huggingface.txt

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+#
+# This file is autogenerated by pip-compile with python 3.8
+# To update, run:
+#
+#    pip-compile --extra=huggingface --output-file=requirements/huggingface.txt
+#
+certifi==2022.9.24
+    # via requests
+charset-normalizer==2.1.1
+    # via requests
+click==8.1.3
+    # via nltk
+filelock==3.8.0
+    # via
+    #   huggingface-hub
+    #   transformers
+huggingface-hub==0.10.1
+    # via transformers
+idna==3.4
+    # via requests
+joblib==1.2.0
+    # via nltk
+lxml==4.9.1
+    # via unstructured (setup.py)
+nltk==3.7
+    # via unstructured (setup.py)
+numpy==1.23.4
+    # via transformers
+packaging==21.3
+    # via
+    #   huggingface-hub
+    #   transformers
+pyparsing==3.0.9
+    # via packaging
+pyyaml==6.0
+    # via
+    #   huggingface-hub
+    #   transformers
+regex==2022.9.13
+    # via
+    #   nltk
+    #   transformers
+requests==2.28.1
+    # via
+    #   huggingface-hub
+    #   transformers
+tokenizers==0.13.1
+    # via transformers
+tqdm==4.64.1
+    # via
+    #   huggingface-hub
+    #   nltk
+    #   transformers
+transformers==4.23.1
+    # via unstructured (setup.py)
+typing-extensions==4.4.0
+    # via huggingface-hub
+urllib3==1.26.12
+    # via requests

requirements/pdf.txt

Lines changed: 12 additions & 1 deletion
@@ -24,8 +24,12 @@ cycler==0.11.0
     # via matplotlib
 effdet==0.3.0
     # via layoutparser
+filelock==3.8.0
+    # via huggingface-hub
 fonttools==4.37.4
     # via matplotlib
+huggingface-hub==0.10.1
+    # via timm
 idna==3.4
     # via requests
 iopath==0.1.10
@@ -58,6 +62,7 @@ opencv-python==4.6.0.66
     # via layoutparser
 packaging==21.3
     # via
+    #   huggingface-hub
     #   matplotlib
     #   pytesseract
 pandas==1.5.0
@@ -96,12 +101,16 @@ pytz==2022.4
     # via pandas
 pyyaml==6.0
     # via
+    #   huggingface-hub
     #   layoutparser
     #   omegaconf
+    #   timm
 regex==2022.9.13
     # via nltk
 requests==2.28.1
-    # via torchvision
+    # via
+    #   huggingface-hub
+    #   torchvision
 scipy==1.9.2
     # via layoutparser
 six==1.16.0
@@ -121,10 +130,12 @@ torchvision==0.13.1
     #   timm
 tqdm==4.64.1
     # via
+    #   huggingface-hub
     #   iopath
     #   nltk
 typing-extensions==4.4.0
     # via
+    #   huggingface-hub
     #   iopath
     #   torch
     #   torchvision
