
Commit 7b02054

Author: Sherry Yang
Commit message: Update for acrolinx.
1 parent 8f6ba36 commit 7b02054

File tree: 1 file changed, +11 -11 lines changed

learn-pr/wwl-data-ai/fundamentals-generative-ai/includes/3a-transformers.md

Lines changed: 11 additions & 11 deletions
@@ -1,17 +1,17 @@
The generative AI applications we use today are made possible by utilizing **Transformer architecture**. Transformers were introduced in the [*Attention is all you need* paper by Vaswani, et al. from 2017](https://arxiv.org/abs/1706.03762?azure-portal=true).

Transformer architecture introduced concepts that drastically improved a model's ability to understand and generate text. Different models have been trained using adaptations of the Transformer architecture to optimize for specific NLP tasks.

## Understand Transformer architecture

There are two main components in the original Transformer architecture:

- The **encoder**: Responsible for processing the input sequence and creating a representation that captures the context of each token.
- The **decoder**: Generates the output sequence by attending to the encoder's representation and predicting the next token in the sequence.

The most important innovations presented in the Transformer architecture were *positional encoding* and *multi-head attention*. A simplified representation of the architecture:

![A diagram of the Transformer architecture with the encoding and decoding layers.](../media/simplified-transformer-architecture.png)

- In the **encoder** layer, an input sequence is encoded with positional encoding, after which multi-head attention is used to create a representation of the text.
- In the **decoder** layer, an (incomplete) output sequence is encoded in a similar way: first with positional encoding, then with multi-head attention. The multi-head attention mechanism is then used a second time within the decoder, to combine the encoder's output with the encoded output sequence that was passed into the decoder. From that combination, the output can be generated.
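The two layers above can be sketched numerically. This is a minimal NumPy illustration, not code from the module: random vectors stand in for positionally encoded tokens, and `attention` is a bare scaled dot-product attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
input_seq = rng.normal(size=(5, d_model))    # stand-in for encoded input tokens
partial_out = rng.normal(size=(3, d_model))  # output tokens generated so far

def attention(q, k, v):
    # Scaled dot-product attention with a row-wise softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Encoder layer: self-attention over the (positionally encoded) input.
enc_repr = attention(input_seq, input_seq, input_seq)

# Decoder layer: self-attention over the partial output, then a second
# attention step that combines it with the encoder's representation.
dec_self = attention(partial_out, partial_out, partial_out)
dec_out = attention(dec_self, enc_repr, enc_repr)
print(dec_out.shape)  # (3, 8)
```

The second `attention` call inside the decoder is the "used a second time" step from the bullet above: its queries come from the decoder, while its keys and values come from the encoder.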
@@ -20,7 +20,7 @@ The most important innovations presented in the transformer architecture were *p
The position of a word and the order of words in a sentence are important for understanding the meaning of a text. To include this information, without having to process text sequentially, Transformers use **positional encoding**.

Before Transformers, language models used word embeddings to encode text into vectors. In the Transformer architecture, *positional encoding* is used to encode text into vectors. Positional encoding is the sum of word embedding vectors and positional vectors. By doing so, the encoded text includes information about the meaning *and* position of a word in a sentence.
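The sum described above can be sketched in a few lines. The positional vectors here follow the sinusoidal scheme from the *Attention is all you need* paper; the word embeddings are random placeholders for illustration.

```python
import numpy as np

def positional_vectors(seq_len, d_model):
    # Sinusoidal positional vectors: sine on even dimensions,
    # cosine on odd dimensions, with frequencies that decrease per dimension.
    pos = np.arange(seq_len)[:, None]
    dim = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (dim // 2)) / d_model)
    vectors = np.zeros((seq_len, d_model))
    vectors[:, 0::2] = np.sin(angles[:, 0::2])
    vectors[:, 1::2] = np.cos(angles[:, 1::2])
    return vectors

# Positional encoding = word embedding vectors + positional vectors,
# so each encoded vector carries both meaning and position.
word_embeddings = np.random.default_rng(0).normal(size=(4, 16))  # 4 tokens
positional_encoding = word_embeddings + positional_vectors(4, 16)
```

Because the positional vectors are bounded, the position information is mixed into the embedding without overwhelming the word's meaning, regardless of sequence length.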

To encode the position of a word in a sentence, you could use a single number to represent the index value. For example:

@@ -40,11 +40,11 @@ The longer a text or sequence, the larger the index values may become. Though us

## Understand attention

The most important technique used by Transformers to process text is the use of attention instead of recurrence. In this way, the Transformer architecture provides an alternative to RNNs. Whereas RNNs are compute-intensive because they process words sequentially, Transformers don't process words sequentially; instead, they process each word independently, in parallel, by using **attention**.

**Attention** (also referred to as self-attention or intra-attention) is a mechanism used to map new information to learned information in order to understand what the new information entails.

Transformers use an attention *function*, where a new word is encoded (using positional encoding) and represented as a **query**. The output of an encoded word is a **key** with an associated **value**.

To illustrate the three variables used by the attention function (the query, keys, and values), let's explore a simplified example. Imagine encoding the sentence `Vincent van Gogh is a painter, known for his stunning and emotionally expressive artworks.` When encoding the query `Vincent van Gogh`, the output may be `Vincent van Gogh` as the key, with `painter` as the associated value. The architecture stores keys and values in a table, which it can then use for future decoding:
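The table in this example can be pictured as a simple lookup. The toy sketch below uses exact string matching only to make the query/key/value roles concrete; real attention compares *vectors* by similarity rather than matching strings.

```python
# Toy key/value table for the sentence in the example above. Exact string
# lookup stands in for the vector similarity that real attention computes.
keys_to_values = {
    "Vincent van Gogh": "painter",  # key -> associated value
}

query = "Vincent van Gogh"
print(keys_to_values[query])  # painter
```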

@@ -60,6 +60,6 @@ To calculate the attention function, the query, keys, and values are all encoded
The *softmax* function is used within the attention function, over the scaled dot-product of the vectors, to create a probability distribution over possible outcomes. In other words, the softmax function's output indicates which keys are closest to the query. The key with the highest probability is then selected, and its associated value is the output of the attention function.
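A minimal numeric sketch of this step, with made-up vectors: softmax over the scaled dot-product turns the query/key scores into probabilities, and the key closest to the query dominates the result.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

query = np.array([1.0, 0.0, 1.0, 0.0])
keys = np.array([[1.0, 0.0, 1.0, 0.0],   # close to the query
                 [0.0, 1.0, 0.0, 1.0]])  # far from the query
values = np.array([10.0, 20.0])

# Softmax over the scaled dot-product gives a probability distribution
# over the keys; the closest key gets the highest probability.
weights = softmax(query @ keys.T / np.sqrt(len(query)))
output = weights @ values  # dominated by the value of the closest key
```

The scaling by the square root of the vector dimension keeps the dot-products in a range where softmax still produces useful gradients.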

The Transformer architecture uses multi-head attention, which means tokens are processed by the attention function several times in parallel. By doing so, a word or sentence can be processed multiple times, in various ways, to extract different kinds of information from the sentence.
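A sketch of those parallel heads, assuming random matrices stand in for the learned per-head projections:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 4
d_head = d_model // n_heads
tokens = rng.normal(size=(seq_len, d_model))  # stand-in for encoded tokens

def attention(q, k, v):
    # Scaled dot-product attention with a row-wise softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Each head has its own projections (learned in a real model, random here),
# so each head can extract a different kind of information; the heads are
# independent of each other and can run in parallel.
heads = []
for _ in range(n_heads):
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(tokens @ w_q, tokens @ w_k, tokens @ w_v))

# Concatenating the head outputs restores the model dimension.
multi_head = np.concatenate(heads, axis=-1)
print(multi_head.shape)  # (4, 16)
```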

The Transformer architecture has allowed us to train models in a more efficient way. Instead of processing each token in a sentence or sequence one at a time, attention allows a model to process tokens in parallel in various ways. Next, learn how different types of language models are available for building applications.
