Using NLP Operations
====================

This example shows how to use DFFML operations to clean text data and train a model using the DFFML CLI.

DFFML offers several :ref:`plugin_models`. For this example
we will be using the TensorFlow DNNClassifier model
(:ref:`plugin_model_dffml_model_tensorflow_tfdnnc`), which is in the ``dffml-model-tensorflow`` package.

We will use two operations, :ref:`plugin_operation_dffml_operations_nlp_remove_stopwords` and :ref:`plugin_operation_dffml_operations_nlp_get_embedding`.
Internally, both of these operations use `spacy <https://spacy.io/usage/spacy-101>`_ functions.

To install the DNNClassifier model and the above mentioned operations, run:

.. code-block:: console

    $ pip install -U dffml-model-tensorflow dffml-operations-nlp
The ``remove_stopwords`` operation cleans the text by removing the most commonly used words, which give the text little or no information, e.g. but, or, yet, it, is, am, etc.
These words are called stopwords.
The ``get_embedding`` operation maps the tokens in the text to their corresponding word vectors. Here we will use embeddings from the ``en_core_web_sm`` spacy model.
You can use other models like ``en_core_web_md`` or ``en_core_web_lg`` for better results, but these are bigger in size and may take a while to download.

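
As a rough illustration, stopword removal can be sketched in plain Python. The
stopword list below is a tiny hypothetical sample; the actual operation relies on
spacy's much larger built-in list:

.. code-block:: python

    # Tiny hypothetical stopword list; spacy's real list is much larger.
    STOPWORDS = {"but", "or", "yet", "it", "is", "am", "a", "the"}

    def remove_stopwords(text: str) -> str:
        """Drop words that carry little or no information."""
        return " ".join(
            word for word in text.split() if word.lower() not in STOPWORDS
        )

    print(remove_stopwords("it is a cat but the dog barks"))  # cat dog barks
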
Let's first download the ``en_core_web_sm`` model:

.. code-block:: console

    $ python -m spacy download en_core_web_sm

Create the training data:

.. literalinclude:: /../examples/nlp/train_data.sh

Now we will create a dataflow to describe how the text feature (``sentence``) will be processed.

.. literalinclude:: /../examples/nlp/create_dataflow.sh

The ``get_embedding`` operation takes a ``pad_token`` input (here ``<PAD>``), which is appended to sentences shorter
than ``max_len`` (here 10). A sentence longer than ``max_len`` is truncated to exactly ``max_len`` tokens.

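
The padding and truncation behaviour can be sketched as follows (a minimal
illustration of the idea, not the operation's actual implementation):

.. code-block:: python

    def pad_or_truncate(tokens, max_len=10, pad_token="<PAD>"):
        """Truncate to max_len, then pad shorter token lists with pad_token."""
        tokens = tokens[:max_len]
        return tokens + [pad_token] * (max_len - len(tokens))

    print(pad_or_truncate(["cats", "play", "a", "lot"]))
    # ['cats', 'play', 'a', 'lot', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
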
To visualize the dataflow, run:

.. literalinclude:: /../examples/nlp/dataflow_diagram.sh

Copying and pasting the output of the above command into the
`mermaidjs live editor <https://mermaidjs.github.io/mermaid-live-editor>`_
produces the following graph.

.. image:: /../examples/nlp/dataflow_diagram.svg

We can now use this dataflow to preprocess the data and make it ready to be fed into the model:

.. literalinclude:: /../examples/nlp/train.sh

As shown in the above command, a single input feature to the model (here ``embedding``) has shape ``(1, max_len, size_of_embedding)``.
Here we have taken ``max_len`` as 10, and the embedding size of ``en_core_web_sm`` is 96, so the resulting shape of one input feature
is ``(1, 10, 96)``.

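
To make that shape concrete, here is a minimal sketch of one such input feature
built from nested lists, with zeros standing in for the real word-vector values:

.. code-block:: python

    max_len, embed_size = 10, 96  # en_core_web_sm vectors have 96 dimensions

    # One record's feature: shape (1, max_len, embed_size), zeros as placeholders
    embedding = [[[0.0] * embed_size for _ in range(max_len)]]

    shape = (len(embedding), len(embedding[0]), len(embedding[0][0]))
    print(shape)  # (1, 10, 96)
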
Assess the accuracy:

.. literalinclude:: /../examples/nlp/accuracy.sh

The output is:

.. code-block:: console

    0.5

Create the test data:

.. literalinclude:: /../examples/nlp/test_data.sh

Make predictions on the test data:

.. literalinclude:: /../examples/nlp/predict.sh

The output is:

.. code-block:: console

    Key: 0
                                                            Record Features
    +------------------------------------------------------------------------------------------------------------------------------+
    |  sentence   |                                            Cats play a lot                                                    |
    +------------------------------------------------------------------------------------------------------------------------------+
    |  embedding  |                      (0.32292864, 4.358501, 3.2268033, 1.87990 ... (length:10)                                 |
    +------------------------------------------------------------------------------------------------------------------------------+

                                                              Prediction
    +------------------------------------------------------------------------------------------------------------------------------+
    |                                                          sentiment                                                           |
    +------------------------------------------------------------------------------------------------------------------------------+
    |                  Value:  1                  |                    Confidence:   0.5122595429420471                            |
    +------------------------------------------------------------------------------------------------------------------------------+