This repository was archived by the owner on Aug 25, 2024. It is now read-only.

Commit 16a2687

docs: tutorial: dataflow: nlp: Add example usage
Signed-off-by: John Andersen <[email protected]>
1 parent 9f48560 commit 16a2687

16 files changed

+954
-11
lines changed

.ci/deps.sh

Lines changed: 6 additions & 0 deletions
@@ -176,6 +176,12 @@ if [[ "x${PLUGIN}" == "xoperations/deploy" ]]; then
176176
python -m pip install -U -e "./feature/git"
177177
fi
178178

179+
if [[ "x${PLUGIN}" == "xoperations/nlp" ]]; then
180+
conda install -y -c conda-forge spacy
181+
python -m spacy download en_core_web_sm
182+
python -m pip install -U -e "./model/tensorflow"
183+
fi
184+
179185
if [ "x${PLUGIN}" = "xexamples/shouldi" ]; then
180186
python -m pip install -U -e "./feature/git"
181187
fi

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
66

77
## [Unreleased]
88
### Added
9+
- Tutorial for using NLP operations with models
910
- Operations plugin for NLP
1011
- Support for default value in a Definition
1112
- Transformers Question Answering model

docs/tutorials/dataflows/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ Here we have some examples to better understand the DFFML DataFlows.
99

1010
locking
1111
io
12+
nlp

docs/tutorials/dataflows/nlp.rst

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
1+
Using NLP Operations
2+
====================
3+
4+
This example shows how to use DFFML operations to clean text data and train a model using the DFFML CLI.
5+
6+
DFFML offers several :ref:`plugin_models`. For this example
7+
we will be using the TensorFlow DNNClassifier model
8+
(:ref:`plugin_model_dffml_model_tensorflow_tfdnnc`) which is in the ``dffml-model-tensorflow`` package.
9+
10+
We will use two operations: :ref:`plugin_operation_dffml_operations_nlp_remove_stopwords` and :ref:`plugin_operation_dffml_operations_nlp_get_embedding`.
11+
Internally, both of these operations use `spaCy <https://spacy.io/usage/spacy-101>`_ functions.
12+
13+
To install the DNNClassifier model and the operations mentioned above, run:
14+
15+
.. code-block:: console
16+
17+
$ pip install -U dffml-model-tensorflow dffml-operations-nlp
18+
19+
The `remove_stopwords` operation cleans the text by removing the most commonly used words, which add little or no information to the text, e.g. but, or, yet, it, is, am.
20+
These words are called `StopWords`.
21+
The `get_embedding` operation maps the tokens in the text to their corresponding word vectors. Here we will use embeddings from the `en_core_web_sm` spaCy model.
22+
You can use other models such as `en_core_web_md` or `en_core_web_lg` for better results, but these are larger and may take a while to download.
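
The stop-word removal described above can be sketched in plain Python. This is a minimal illustration, not the DFFML or spaCy implementation: the function name `drop_stop_words` and the tiny `STOP_WORDS` set are assumptions made for this example, and spaCy uses its own tokenizer and a much larger stop-word list.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"but", "or", "yet", "it", "is", "am", "a", "the"}


def drop_stop_words(text: str) -> str:
    """Return `text` without stop words, using simple whitespace tokenization."""
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(drop_stop_words("It is a sunny day but the wind is cold"))
# sunny day wind cold
```

Only the words that carry information about the sentence remain, which is the text that would then be passed on to the embedding step.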
23+
24+
Let's first download the `en_core_web_sm` model.
25+
26+
.. code-block:: console
27+
28+
$ python -m spacy download en_core_web_sm
29+
30+
Create training data:
31+
32+
.. literalinclude:: /../examples/nlp/train_data.sh
33+
34+
Now we will create a dataflow to describe how the text feature (`sentence`) will be processed.
35+
36+
.. literalinclude:: /../examples/nlp/create_dataflow.sh
37+
38+
The `get_embedding` operation takes a `pad_token` input (here `<PAD>`), which is appended to sentences shorter
39+
than `max_len` (here 10). A sentence longer than `max_len` is truncated to `max_len`.
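
The padding and truncation rule above can be sketched as follows. The helper `pad_or_truncate` is hypothetical and works on a plain list of tokens; how `get_embedding` tokenizes the sentence and applies the rule internally is not shown in this tutorial.

```python
from typing import List


def pad_or_truncate(
    tokens: List[str], pad_token: str = "<PAD>", max_len: int = 10
) -> List[str]:
    """Return exactly `max_len` tokens: cut long inputs, pad short ones at the end."""
    if len(tokens) >= max_len:
        return tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))


short = pad_or_truncate(["Cats", "play", "lot"])
print(len(short), short[-1])  # 10 <PAD>

long = pad_or_truncate([f"w{i}" for i in range(15)])
print(len(long), long[-1])  # 10 w9
```

Fixing the length this way is what lets every sentence produce an input of the same shape for the model.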
40+
41+
To visualize the dataflow, run:
42+
43+
.. literalinclude:: /../examples/nlp/dataflow_diagram.sh
44+
45+
Copying and pasting the output of the above command into the
46+
`mermaidjs live editor <https://mermaidjs.github.io/mermaid-live-editor>`_
47+
results in the following graph.
48+
49+
.. image:: /../examples/nlp/dataflow_diagram.svg
50+
51+
We can now use this dataflow to preprocess the data and make it ready to be fed into the model:
52+
53+
.. literalinclude:: /../examples/nlp/train.sh
54+
55+
As shown in the above command, a single input feature to the model (here `embedding`) has shape `(1, max_len, size_of_embedding)`.
56+
Here we have taken `max_len` as 10 and the embedding size of `en_core_web_sm` is 96. So the resulting size of one input feature
57+
is (1,10,96).
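
The shape arithmetic above can be checked with a few lines of Python. The helper names are hypothetical; the values 10 and 96 come from the text, and the formatted string mirrors the `embedding:float:[1,10,96]` feature argument passed to the `dffml` commands in this example.

```python
from math import prod
from typing import Tuple


def feature_shape(max_len: int, embedding_size: int) -> Tuple[int, int, int]:
    """Shape of one input feature: (1, max_len, embedding_size)."""
    return (1, max_len, embedding_size)


def feature_spec(name: str, dtype: str, shape: Tuple[int, ...]) -> str:
    """Format a name:dtype:[shape] string like the one used on the command line."""
    return f"{name}:{dtype}:[{','.join(str(dim) for dim in shape)}]"


shape = feature_shape(max_len=10, embedding_size=96)
print(shape)  # (1, 10, 96)
print(prod(shape))  # 960 values per sentence
print(feature_spec("embedding", "float", shape))  # embedding:float:[1,10,96]
```

If you switch to a spaCy model with a different embedding size or change `max_len`, the shape given to the model has to change accordingly.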
58+
59+
Assess accuracy:
60+
61+
.. literalinclude:: /../examples/nlp/accuracy.sh
62+
63+
The output is:
64+
65+
.. code-block:: console
66+
67+
0.5
68+
69+
Create test data:
70+
71+
.. literalinclude:: /../examples/nlp/test_data.sh
72+
73+
74+
Make a prediction on the test data:
75+
76+
.. literalinclude:: /../examples/nlp/predict.sh
77+
78+
The output is:
79+
80+
.. code-block:: console
81+
82+
Key: 0
83+
Record Features
84+
+------------------------------------------------------------------------------------------------------------------------------+
85+
| sentence | Cats play a lot |
86+
+------------------------------------------------------------------------------------------------------------------------------+
87+
| embedding | (0.32292864, 4.358501, 3.2268033, 1.87990 ... (length:10) |
88+
+------------------------------------------------------------------------------------------------------------------------------+
89+
90+
Prediction
91+
+------------------------------------------------------------------------------------------------------------------------------+
92+
| sentiment |
93+
+------------------------------------------------------------------------------------------------------------------------------+
94+
| Value: 1 | Confidence: 0.5122595429420471 |
95+
+------------------------------------------------------------------------------------------------------------------------------+

examples/nlp/accuracy.sh

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
dffml accuracy \
2+
-model tfdnnc \
3+
-model-batchsize 100 \
4+
-model-hidden 5 2 \
5+
-model-clstype int \
6+
-model-predict sentiment:int:1 \
7+
-model-classifications 0 1 \
8+
-model-directory tempdir \
9+
-model-features embedding:float:[1,10,96] \
10+
-sources text=df \
11+
-source-text-dataflow nlp_ops_dataflow.json \
12+
-source-text-features sentence:str:1 \
13+
-source-text-source csv \
14+
-source-text-source-filename train_data.csv \
15+
-log debug

examples/nlp/create_dataflow.sh

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
dffml dataflow create get_single remove_stopwords get_embedding \
2+
-inputs '["embedding"]'=get_single_spec "en_core_web_sm"=spacy_model_name_def "<PAD>"=pad_token_def 10=max_len_def \
3+
-flow \
4+
'[{"seed": ["sentence"]}]'=remove_stopwords.inputs.text \
5+
'[{"seed": ["spacy_model_name_def"]}]'=get_embedding.inputs.spacy_model \
6+
'[{"seed": ["pad_token_def"]}]'=get_embedding.inputs.pad_token \
7+
'[{"seed": ["max_len_def"]}]'=get_embedding.inputs.max_len \
8+
'[{"remove_stopwords": "result"}]'=get_embedding.inputs.text |
9+
tee nlp_ops_dataflow.json

examples/nlp/dataflow_diagram.sh

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
dffml dataflow diagram -stage processing -- nlp_ops_dataflow.json
