`learn-pr/wwl-data-ai/fundamentals-generative-ai/includes/3a-transformers.md`
The generative AI applications we use today are made possible by utilizing **Transformer architecture**. Transformers were introduced in the [*Attention is all you need* paper by Vaswani, et al. from 2017](https://arxiv.org/abs/1706.03762?azure-portal=true).
Transformer architecture introduced concepts that drastically improved a model's ability to understand and generate text. Different models have been trained using adaptations of the Transformer architecture to optimize for specific NLP tasks.
## Understand Transformer architecture
There are two main components in the original Transformer architecture:
- The **encoder**: Responsible for processing the input sequence and creating a representation that captures the context of each token.
- The **decoder**: Generates the output sequence by attending to the encoder's representation and predicting the next token in the sequence.
The most important innovations presented in the Transformer architecture were *positional encoding* and *multi-head attention*. A simplified representation of the architecture:

- In the **encoder** layer, an input sequence is encoded with positional encoding, after which multi-head attention is used to create a representation of the text.
- In the **decoder** layer, an (incomplete) output sequence is encoded in a similar way: first with positional encoding, then with multi-head attention. A second multi-head attention mechanism within the decoder then combines the encoder's output with the encoded output sequence that was passed into the decoder. From this combination, the output can be generated.
The position of a word and the order of words in a sentence are important for understanding the meaning of a text. To include this information without having to process text sequentially, Transformers use **positional encoding**.
Before Transformers, language models used word embeddings to encode text into vectors. In the Transformer architecture, *positional encoding* is used to encode text into vectors. Positional encoding is the sum of word embedding vectors and positional vectors. By doing so, the encoded text includes information about the meaning *and* position of a word in a sentence.
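The sum described here can be sketched in a few lines of Python. The sinusoidal form of the positional vector follows the original *Attention is all you need* paper; the four-dimensional embedding below is a made-up toy value, not real model weights.

```python
import math

def positional_vector(position, dim):
    """Sinusoidal positional vector (illustrative sketch)."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def encode(word_embedding, position):
    """Positional encoding = word embedding vector + positional vector."""
    pos = positional_vector(position, len(word_embedding))
    return [e + p for e, p in zip(word_embedding, pos)]

# The same toy embedding encoded at two different positions gives
# two different vectors, so position survives parallel processing:
embedding = [0.1, 0.2, 0.3, 0.4]
print(encode(embedding, 0))
print(encode(embedding, 3))
```

Because the positional vector differs per position, the same word at two places in a sentence produces two distinct encoded vectors.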
To encode the position of a word in a sentence, you could use a single number to represent the index value. For example, the first word's position could be represented as 1, the second as 2, and so on.
## Understand attention
The most important technique used by Transformers to process text is the use of attention instead of recurrence. In this way, the Transformer architecture provides an alternative to RNNs. Whereas RNNs are compute-intensive since they process words sequentially, Transformers don't process the words sequentially, but instead process each word independently in parallel by using **attention**.
**Attention** (also referred to as self-attention or intra-attention) is a mechanism used to map new information to learned information in order to understand what the new information entails.
Transformers use an attention *function*, where a new word is encoded (using positional encoding) and represented as a **query**. The output of an encoded word is a **key** with an associated **value**.
To illustrate the three variables used by the attention function (the query, keys, and values), let's explore a simplified example. Imagine encoding the sentence `Vincent van Gogh is a painter, known for his stunning and emotionally expressive artworks.` When encoding the query `Vincent van Gogh`, the output may be `Vincent van Gogh` as the key with `painter` as the associated value. The architecture stores keys and values in a table, which it can then use for future decoding:
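As an analogy only, the key/value table can be pictured as a plain dictionary lookup. A real model stores dense vectors and scores them by similarity rather than by exact match, and the second entry below is a hypothetical addition for illustration.

```python
# Illustrative analogy: the attention key/value table as a dictionary.
# A real model compares vectors by similarity, not exact string match.
kv_table = {
    "Vincent van Gogh": "painter",  # key/value pair from the example sentence
    "artworks": "emotionally expressive",  # hypothetical extra entry
}

query = "Vincent van Gogh"
value = kv_table[query]
print(value)  # painter
```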
The *softmax* function is used within the attention function, over the scaled dot-product of the vectors, to create a probability distribution with possible outcomes. In other words, the softmax function's output includes which keys are closest to the query. The key with the highest probability is then selected, and the associated value is the output of the attention function.
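This calculation can be sketched without any ML framework. The following is a minimal single-query version of scaled dot-product attention; the two-dimensional vectors are toy values chosen purely for illustration.

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query (illustrative sketch)."""
    d = len(query)
    # Dot product of the query with each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # which keys are closest to the query
    # Weighted sum of the value vectors; the closest key contributes most.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]       # the first key matches the query best
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention(query, keys, values))  # dominated by the first value vector
```

Note that this sketch returns a softmax-weighted blend of the values rather than selecting only the single highest-probability key, which is how the mechanism is computed in practice.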
The Transformer architecture uses multi-head attention, which means tokens are processed by the attention function several times in parallel. By doing so, a word or sentence can be processed multiple times, in various ways, to extract different kinds of information from the sentence.
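Running the attention function several times in parallel with different projections can be sketched as follows. Everything here is a toy illustration: the fixed 2×2 projection matrices stand in for weights that a real model learns, and real models also apply a final linear layer after concatenating the heads.

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Single-query scaled dot-product attention (illustrative sketch)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def project(vec, weights):
    """Multiply a vector by a per-head projection matrix (rows = output dims)."""
    return [sum(w, ) if False else sum(w * x for w, x in zip(row, vec)) for row in weights]

def multi_head_attention(query, keys, values, head_projections):
    """Run attention once per head with that head's projection, then concatenate."""
    outputs = []
    for proj in head_projections:
        q = project(query, proj)
        ks = [project(k, proj) for k in keys]
        vs = [project(v, proj) for v in values]
        outputs.extend(attention(q, ks, vs))  # each head extracts a different view
    return outputs

# Two toy heads with fixed 2x2 projections (learned in a real model):
heads = [
    [[1.0, 0.0], [0.0, 1.0]],  # head 1: identity projection
    [[0.0, 1.0], [1.0, 0.0]],  # head 2: swaps the two dimensions
]
out = multi_head_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]], heads)
print(len(out))  # 4: two heads, each producing a two-dimensional output
```

Because each head projects the tokens differently before attending, the concatenated output captures several different views of the same sequence at once.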
The Transformer architecture has allowed us to train models in a more efficient way. Instead of processing each token in a sentence or sequence, attention allows a model to process tokens in parallel in various ways. Next, learn how different types of language models are available for building applications.