
Commit ecc4597

Merge pull request #49967 from sherzyang/main
Update image to remove step.
2 parents 146cdbf + 742a33b commit ecc4597

File tree

2 files changed: +1 −2 lines changed


learn-pr/wwl-data-ai/fundamentals-generative-ai/includes/3-language-models.md

Lines changed: 1 addition & 2 deletions
```diff
@@ -10,11 +10,10 @@ As you may expect, machines have a hard time deciphering text as they mostly rel
 
 One important development to allow machines to more easily work with text has been tokenization. **Tokens** are strings with a known meaning, usually representing a word. **Tokenization** is turning words into tokens, which are then converted to numbers. A statistical approach to tokenization is by using a pipeline:
 
-:::image type="content" source="../media/tokenization-pipeline.gif" alt-text="Animation showing the pipeline of tokenization of a sentence.":::
+:::image type="content" source="../media/tokenization-pipeline.png" alt-text="Animation showing the pipeline of tokenization of a sentence.":::
 
 1. Start with the text you want to **tokenize**.
 1. **Split** the words in the text based on a rule. For example, split the words where there's a white space.
-1. **Stemming**. Merge similar words by removing the end of a word.
 1. **Stop word removal**. Remove noisy words that have little meaning like `the` and `a`. A dictionary of these words is provided to structurally remove them from the text.
 
 1. **Assign a number** to each unique token.
```
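The pipeline the lesson describes (split on a rule, remove stop words, assign a number to each unique token) can be sketched in a few lines of Python. This is a minimal illustration, not code from the lesson; the function name and the two-word stop-word list are placeholders chosen to match the `the`/`a` examples in the text.

```python
def tokenize(text, stop_words=frozenset({"the", "a"})):
    """Toy statistical tokenization pipeline (illustrative only)."""
    # Split the words based on a rule: here, on white space.
    words = text.lower().split()
    # Stop word removal: drop noisy words with little meaning,
    # using a small dictionary of such words.
    words = [w for w in words if w not in stop_words]
    # Assign a number to each unique token, in order of first appearance.
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return [vocab[w] for w in words], vocab

ids, vocab = tokenize("the dog chased a dog")
# ids == [0, 1, 0]; vocab == {"dog": 0, "chased": 1}
```

Note that this sketch omits the **Stemming** step that the commit removes from the lesson; adding it would merge variants such as `chased`/`chase` before numbering.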

Binary image file changed (28.1 KB)
0 commit comments
