
Commit c00faed

add figure
1 parent f89d70c commit c00faed

File tree

1 file changed: +9 -2 lines changed


notebooks/22_NLP_2_tokenization.ipynb

Lines changed: 9 additions & 2 deletions
@@ -99,7 +99,14 @@
     "source": [
     "## NLP Preprocessing Workflow\n",
     "\n",
-    "We usually work with text in various formats and sizes, for instance, from `.txt`, `.html`, or other structured or unstructured text file formats. \n",
+    "We usually work with text in various formats and sizes, for instance, from `.txt`, `.html`, or other structured or unstructured text file formats. For a later systematic data analysis or the training of machine-learning models, we first have to preprocess the text data consistently, typically done as sketched in {numref}`fig_nlp_processing_workflow`.\n",
+    "\n",
+    "```{figure} ../images/fig_nlp_processing_workflow.png\n",
+    ":name: fig_nlp_processing_workflow\n",
+    "\n",
+    "Typically, an NLP preprocessing workflow consists of several stages, including raw text cleaning, tokenization, token cleaning, and token normalization. This is often the basis for later analysis or modeling steps.\n",
+    "```\n",
+    "\n",
     "\n",
     "### Raw Text Cleaning\n",
     "\n",
@@ -900,7 +907,7 @@
     "name": "python",
     "nbconvert_exporter": "python",
     "pygments_lexer": "ipython3",
-    "version": "3.9.18"
+    "version": "3.12.9"
     }
    },
    "nbformat": 4,

0 commit comments
