
Commit d96606f

Merge pull request #50707 from sherzyang/main
Update.
2 parents 0fd364a + bf37c1a

4 files changed (+17 additions, -6 deletions)
Binary file not shown.

learn-pr/wwl-data-ai/introduction-language/5-knowledge-check.yml

Lines changed: 12 additions & 1 deletion
@@ -35,4 +35,15 @@ quiz:
       explanation: "Correct. TF-IDF is a technique used to determine the importance of words in a document within the context of a larger collection of documents."
     - content: " Word2Vec"
       isCorrect: false
-      explanation: "Incorrect. Word2Vec is a technique for generating word embeddings, which are dense vector representations of words that capture semantic relationships between words. "
+      explanation: "Incorrect. Word2Vec is a technique for generating word embeddings, which are dense vector representations of words that capture semantic relationships between words. "
+  - content: "Which of the following best describes the role of embeddings in natural language processing (NLP)?"
+    choices:
+    - content: "They visualize text data in two-dimensional space for easier interpretation."
+      isCorrect: false
+      explanation: "Incorrect."
+    - content: "They summarize large text corpora into short, meaningful sentences."
+      isCorrect: false
+      explanation: "Incorrect."
+    - content: "They convert language tokens into vectors that capture semantic relationships."
+      isCorrect: true
+      explanation: "Correct."
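
The new question and the Word2Vec explanation both hinge on the idea of turning words into dense vectors. As an editor-added illustration (not part of this commit), the sketch below shows the general shape of generating Word2Vec embeddings with the gensim library; the toy corpus and the parameter values are assumptions made up for this example.

```python
# Hedged sketch, not from the changed files: training Word2Vec embeddings
# with gensim (assumes the gensim package is installed).
from gensim.models import Word2Vec

# A tiny, made-up corpus; a real training corpus would be vastly larger.
corpus = [
    ["the", "dog", "barks", "at", "the", "cat"],
    ["the", "cat", "meows", "at", "the", "dog"],
    ["kids", "ride", "the", "skateboard", "in", "the", "park"],
]

# vector_size sets how many dimensions each word embedding has.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, seed=1)

print(model.wv["dog"][:5])                # first few elements of the dense vector for "dog"
print(model.wv.similarity("dog", "cat"))  # cosine similarity between the two embeddings
```

With only three sentences the learned vectors carry little real signal; the point is just the workflow of token in, dense vector out.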

learn-pr/wwl-data-ai/introduction-language/includes/4-semantic-models.md

Lines changed: 5 additions & 5 deletions
@@ -1,20 +1,20 @@
 As the state of the art for NLP has advanced, the ability to train models that encapsulate the semantic relationship between tokens has led to the emergence of powerful deep learning language models. At the heart of these models is the encoding of language tokens as vectors (multi-valued arrays of numbers) known as *embeddings*.
 
-It can be useful to think of the elements in a token embedding vector as coordinates in multidimensional space, so that each token occupies a specific "location." The closer tokens are to one another along a particular dimension, the more semantically related they are. In other words, related words are grouped closer together. As a simple example, suppose the embeddings for our tokens consist of vectors with three elements, for example:
+Vectors represent lines in multidimensional space, describing direction and distance along multiple axes. Overall, the vector describes the direction and distance of the path from origin to end. Semantically similar tokens should result in vectors that have a similar orientation – in other words, they point in the same direction. As a simple example, suppose the embeddings for our tokens consist of vectors with three elements, for example:
 
 ```
 - 4 ("dog"): [10,3,2]
 - 5 ("bark"): [10,2,2]
 - 8 ("cat"): [10,3,1]
 - 9 ("meow"): [10,2,1]
-- 10 ("skateboard"): [3,3,1]
+- 10 ("skateboard"): [-3,3,2]
 ```
 
-We can plot the location of tokens based on these vectors in three-dimensional space, like this:
+In three-dimensional space, these vectors look like this:
 
-![A diagram of tokens plotted on a three-dimensional space.](../media/example-embeddings-graph.png)
+![A diagram of tokens plotted on a three-dimensional space.](../media/word-embeddings.png)
 
-The locations of the tokens in the embeddings space include some information about how closely the tokens are related to one another. For example, the token for `"dog"` is close to `"cat"` and also to `"bark"`. The tokens for `"cat"` and `"bark"` are close to `"meow"`. The token for `"skateboard"` is further away from the other tokens.
+The embedding vectors for "dog" and "bark" describe a path along an almost identical direction, which is also fairly similar to the direction for "cat". The embedding vector for "skateboard", however, describes a journey in a very different direction.
 
 The language models we use in industry are based on these principles but have greater complexity. For example, the vectors used generally have many more dimensions. There are also multiple ways you can calculate appropriate embeddings for a given set of tokens. Different methods result in different predictions from natural language processing models.
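
The rewritten paragraph describes semantic similarity in terms of vector orientation. A minimal sketch (editor-added, not part of the commit) that makes this measurable uses cosine similarity, the standard measure of how closely two vectors point in the same direction; the vector values are taken from the example above and the function is purely illustrative.

```python
# Minimal sketch: cosine similarity over the example embedding vectors above.
import math

embeddings = {
    "dog":        [10, 3, 2],
    "bark":       [10, 2, 2],
    "cat":        [10, 3, 1],
    "meow":       [10, 2, 1],
    "skateboard": [-3, 3, 2],   # the value updated in this commit
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["dog"], embeddings["bark"]))        # ~0.996, nearly identical direction
print(cosine_similarity(embeddings["dog"], embeddings["skateboard"]))  # ~-0.34, a very different direction
```

The near-1.0 value for "dog" and "bark" versus the negative value for "skateboard" mirrors the prose: related tokens point the same way, unrelated ones do not.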

Binary image file (54 KB); preview omitted.

0 commit comments
