Commit d4b7f9c

Sherry Yang committed
Add language concept module.
1 parent ea5bad7 commit d4b7f9c

14 files changed: +326 −0 lines changed
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-language.introduction
title: Introduction
metadata:
  title: Introduction
  description: "Introduction"
  ms.date: 5/21/2025
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  ms.custom:
  - N/A
durationInMinutes: 1
content: |
  [!include[](includes/1-introduction.md)]
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-language.how-it-works
title: How it works
metadata:
  title: How it works
  description: "How it works"
  ms.date: 5/21/2025
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  ms.custom:
  - N/A
durationInMinutes: 6
content: |
  [!include[](includes/2-how-it-works.md)]
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-language.text-analysis
title: Understand text analysis
metadata:
  title: Understand text analysis
  description: "Understand text analysis"
  ms.date: 5/21/2025
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  ms.custom:
  - N/A
durationInMinutes: 5
content: |
  [!include[](includes/3-text-analysis.md)]
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-language.knowledge-check
title: Module assessment
metadata:
  title: Module assessment
  description: "Knowledge check"
  ms.date: 5/21/2025
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  ms.custom:
  - N/A
durationInMinutes: 3
quiz:
  title: "Check your knowledge"
  questions:
  - content: "What is the primary purpose of tokenization in natural language processing (NLP)?"
    choices:
    - content: "To translate text into another language."
      isCorrect: false
      explanation: "Incorrect. Translating text into another language is the role of machine translation, not tokenization."
    - content: "To summarize large documents."
      isCorrect: false
      explanation: "Incorrect. Identifying the most important information in a document is the role of summarization."
    - content: "To break down text into smaller units for analysis."
      isCorrect: true
      explanation: "Correct. Tokenization breaks text down into smaller units (tokens) that can then be analyzed."
  - content: "What is the primary purpose of named entity recognition in text analysis?"
    choices:
    - content: "To detect the sentiment of a document."
      isCorrect: false
      explanation: "Incorrect. Determining whether text is positive or negative is the role of sentiment analysis."
    - content: "To identify and categorize entities such as people, places, and organizations."
      isCorrect: true
      explanation: "Correct. Named entity recognition identifies and categorizes entities such as people, places, and organizations."
    - content: "To summarize long documents."
      isCorrect: false
      explanation: "Incorrect. Summarizing documents is the role of summarization."
  - content: "Which of the following best describes the function of key phrase extraction?"
    choices:
    - content: "It links entities to external sources like Wikipedia."
      isCorrect: false
      explanation: "Incorrect. Linking entities to external references such as Wikipedia is the role of entity linking."
    - content: "It identifies the most important concepts from unstructured text."
      isCorrect: true
      explanation: "Correct. Key phrase extraction lists the main concepts found in unstructured text."
    - content: "It removes irrelevant information from the text."
      isCorrect: false
      explanation: "Incorrect. Key phrase extraction identifies important concepts; it doesn't remove text."
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-language.summary
title: Summary
metadata:
  title: Summary
  description: "Summary"
  ms.date: 5/21/2025
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  ms.custom:
  - N/A
durationInMinutes: 1
content: |
  [!include[](includes/5-summary.md)]
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
For computer systems to interpret the subject of a text in a similar way to humans, they use **natural language processing** (NLP), an area within AI that deals with understanding written or spoken language, and responding in kind. *Text analysis* describes NLP processes that extract information from unstructured text.

Natural language processing might be used to create:

- A social media feed analyzer that detects sentiment for a product marketing campaign.
- A document search application that summarizes documents in a catalog.
- An application that extracts brands and company names from text.

In this module, you'll explore natural language processing.
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
Let's examine some general principles and common techniques used to perform text analysis and other natural language processing (NLP) tasks.

Some of the earliest techniques used to analyze text with computers involve statistical analysis of a body of text (a *corpus*) to infer some kind of semantic meaning. Put simply, if you can determine the most commonly used words in a given document, you can often get a good idea of what the document is about.

## Tokenization

The first step in analyzing a corpus is to break it down into *tokens*. For the sake of simplicity, you can think of each distinct word in the training text as a token, though in reality, tokens can be generated for partial words, or combinations of words and punctuation.

For example, consider this phrase from a famous US presidential speech: `"we choose to go to the moon"`. The phrase can be broken down into the following tokens, with numeric identifiers:

```
1. we
2. choose
3. to
4. go
5. the
6. moon
```

Notice that `"to"` (token number 3) is used twice in the phrase. The phrase `"we choose to go to the moon"` can be represented by the tokens :::no-loc text="{1,2,3,4,3,5,6}":::.

> [!NOTE]
> We've used a simple example in which tokens are identified for each distinct word in the text. However, consider the following concepts that may apply to tokenization depending on the specific kind of NLP problem you're trying to solve:
>
> - **Text normalization**: Before generating tokens, you may choose to *normalize* the text by removing punctuation and changing all words to lowercase. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning may be lost - for example, consider the sentence `"Mr Banks has worked in many banks."`. You may want your analysis to differentiate between the person `"Mr Banks"` and the `"banks"` in which he has worked. You may also want to consider `"banks."` as a separate token from `"banks"`, because the inclusion of a period provides the information that the word comes at the end of a sentence.
> - **Stop word removal**: Stop words are words that should be excluded from the analysis. For example, `"the"`, `"a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution may be better able to identify the important words.
> - **n-grams** are multi-term phrases such as `"I have"` or `"he walked"`. A single-word phrase is a *unigram*, a two-word phrase is a *bigram*, a three-word phrase is a *trigram*, and so on. By considering words as groups, a machine learning model can make better sense of the text.
> - **Stemming** is a technique in which algorithms are applied to consolidate words before counting them, so that words with the same root, like `"power"`, `"powered"`, and `"powerful"`, are interpreted as being the same token.
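To make these steps concrete, here's a minimal Python sketch of the simple word-level approach described above - lowercase the text, drop punctuation, and assign each distinct word a numeric identifier. The `tokenize` helper is illustrative, not part of any particular library:

```python
import re

def tokenize(text, stop_words=None):
    """Normalize to lowercase, drop punctuation, split into word tokens,
    and optionally remove stop words."""
    stop_words = stop_words or set()
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words]

tokens = tokenize("we choose to go to the moon")

# Assign each distinct token a numeric identifier, in order of first appearance.
vocab = {}
for token in tokens:
    vocab.setdefault(token, len(vocab) + 1)

ids = [vocab[t] for t in tokens]  # [1, 2, 3, 4, 3, 5, 6]
```

Passing `stop_words={"to", "the"}` would drop the low-value tokens before identifiers are assigned.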
## Frequency analysis

After tokenizing the words, you can perform some analysis to count the number of occurrences of each token. The most commonly used words (other than *stop words* such as `"a"`, `"the"`, and so on) can often provide a clue as to the main subject of a text corpus. For example, the most common words in the entire text of the `"go to the moon"` speech we considered previously include `"new"`, `"go"`, `"space"`, and `"moon"`. If we were to tokenize the text as bigrams (word pairs), the most common bigram in the speech is `"the moon"`. From this information, we can easily surmise that the text is primarily concerned with space travel and going to the moon.

> [!TIP]
> Simple frequency analysis, in which you count the number of occurrences of each token, can be an effective way to analyze a single document, but when you need to differentiate across multiple documents within the same corpus, you need a way to determine which tokens are most relevant in each document. *Term frequency - inverse document frequency* (TF-IDF) is a common technique in which a score is calculated based on how often a word or term appears in one document compared to its more general frequency across the entire collection of documents. Using this technique, a high degree of relevance is assumed for words that appear frequently in a particular document, but relatively infrequently across a wide range of other documents.
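The TF-IDF idea can be sketched in a few lines of Python. This uses the basic tf × log(N/df) formulation; production libraries typically add smoothing and normalization on top:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each token in each document: term frequency weighted by the
    log of (number of documents / documents containing the term)."""
    doc_tokens = [doc.lower().split() for doc in docs]
    n_docs = len(doc_tokens)
    df = Counter()                      # document frequency per token
    for tokens in doc_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        scores.append({t: (count / len(tokens)) * math.log(n_docs / df[t])
                       for t, count in tf.items()})
    return scores

docs = ["the moon is far", "the cat sat", "the cat ran"]
scores = tf_idf(docs)
# "the" appears in every document, so it scores 0 everywhere;
# "moon" is unique to the first document, so it scores highest there.
```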
## Machine learning for text classification

Another useful text analysis technique is to use a classification algorithm, such as *logistic regression*, to train a machine learning model that classifies text based on a known set of categorizations. A common application of this technique is to train a model that classifies text as *positive* or *negative* in order to perform *sentiment analysis* or *opinion mining*.

For example, consider the following restaurant reviews, which are already labeled as **0** (*negative*) or **1** (*positive*):

```
- *The food and service were both great*: 1
- *A really terrible experience*: 0
- *Mmm! tasty food and a fun vibe*: 1
- *Slow service and substandard food*: 0
```

With enough labeled reviews, you can train a classification model using the tokenized text as *features* and the sentiment (0 or 1) as a *label*. The model will encapsulate a relationship between tokens and sentiment - for example, reviews with tokens for words like `"great"`, `"tasty"`, or `"fun"` are more likely to return a sentiment of **1** (*positive*), while reviews with words like `"terrible"`, `"slow"`, and `"substandard"` are more likely to return **0** (*negative*).
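The token-sentiment relationship can be illustrated with a toy count-based scorer over the labeled reviews - a deliberately simple stand-in for a trained classifier, not logistic regression itself:

```python
from collections import Counter

# The labeled reviews from the example: 1 = positive, 0 = negative.
reviews = [
    ("the food and service were both great", 1),
    ("a really terrible experience", 0),
    ("mmm tasty food and a fun vibe", 1),
    ("slow service and substandard food", 0),
]

# Count how often each token appears in positive vs. negative reviews.
pos, neg = Counter(), Counter()
for text, label in reviews:
    (pos if label == 1 else neg).update(text.split())

def predict(text):
    """Label a review 1 (positive) if its tokens lean positive, else 0."""
    score = sum(pos[t] - neg[t] for t in text.split())
    return 1 if score > 0 else 0
```

Here `predict("tasty food and great service")` returns 1, while `predict("slow and terrible food")` returns 0 - the same token-to-sentiment association a trained model would learn, with weights in place of raw counts.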
## Semantic language models

As the state of the art for NLP has advanced, the ability to train models that encapsulate the semantic relationship between tokens has led to the emergence of powerful language models. At the heart of these models is the encoding of language tokens as vectors (multi-valued arrays of numbers) known as *embeddings*.

It can be useful to think of the elements in a token embedding vector as coordinates in multidimensional space, so that each token occupies a specific "location." The closer tokens are to one another along a particular dimension, the more semantically related they are. In other words, related words are grouped closer together. As a simple example, suppose the embeddings for our tokens consist of vectors with three elements, for example:

```
- 4 ("dog"): [10,3,2]
- 5 ("bark"): [10,2,2]
- 8 ("cat"): [10,3,1]
- 9 ("meow"): [10,2,1]
- 10 ("skateboard"): [3,3,1]
```

We can plot the location of tokens based on these vectors in three-dimensional space, like this:

![A diagram of tokens plotted on a three-dimensional space.](../media/example-embeddings-graph.png)

The locations of the tokens in the embeddings space include some information about how closely the tokens are related to one another. For example, the token for `"dog"` is close to `"cat"` and also to `"bark"`. The tokens for `"cat"` and `"bark"` are close to `"meow"`. The token for `"skateboard"` is further away from the other tokens.
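This "closeness" can be quantified directly - for instance, with the Euclidean distance between the example embedding vectors:

```python
import math

# The three-element example embeddings from above.
embeddings = {
    "dog":        [10, 3, 2],
    "bark":       [10, 2, 2],
    "cat":        [10, 3, 1],
    "meow":       [10, 2, 1],
    "skateboard": [3, 3, 1],
}

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

dog_cat = distance(embeddings["dog"], embeddings["cat"])  # 1.0
dog_skateboard = distance(embeddings["dog"], embeddings["skateboard"])
# "dog" is far closer to "cat" than to "skateboard".
```

(Real systems more often compare embeddings with cosine similarity, which measures the angle between vectors rather than their positions.)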
The language models we use in industry are based on these principles but have greater complexity. For example, the vectors used generally have many more dimensions. There are also multiple ways you can calculate appropriate embeddings for a given set of tokens. Different methods result in different predictions from natural language processing models.

A generalized view of most modern natural language processing solutions is shown in the following diagram. A large corpus of raw text is tokenized and used to train language models, which can support many different types of natural language processing task.

![A diagram of the process to tokenize text and train a language model that supports natural language processing tasks.](../media/language-model.png)

Common NLP tasks supported by language models include:

- Text analysis, such as extracting key terms or identifying named entities in text.
- Sentiment analysis and opinion mining to categorize text as *positive* or *negative*.
- Machine translation, in which text is automatically translated from one language to another.
- Summarization, in which the main points of a large body of text are summarized.
- Conversational AI solutions such as *bots* or *digital assistants*, in which the language model can interpret natural language input and return an appropriate response.

Next, let's learn more about the capabilities made possible by language models.
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
Text analysis includes:

- **Named entity recognition** identifies people, places, events, and more. This feature can also be customized to extract custom categories.
- **Entity linking** identifies known entities together with a link to Wikipedia.
- **Personal identifying information (PII) detection** identifies personally sensitive information, including personal health information (PHI).
- **Language detection** identifies the language of the text and returns a language code such as "en" for English.
- **Sentiment analysis and opinion mining** identifies whether text is positive or negative.
- **Summarization** summarizes text by identifying the most important information.
- **Key phrase extraction** lists the main concepts from unstructured text.

## Entity recognition and linking

An entity is an item of a particular type or category - and, in some cases, subtype - such as those shown in the following table.

|Type|SubType|Example|
|---|---|---|
|Person||"Bill Gates", "John"|
|Location||"Paris", "New York"|
|Organization||"Microsoft"|
|Quantity|Number|"6" or "six"|
|Quantity|Percentage|"25%" or "fifty percent"|

*Entity linking* helps disambiguate entities by linking to a specific reference.

For example, suppose you want to detect entities in the following restaurant review:

> "*I ate at the restaurant in Seattle last week.*"

The entities detected might include:

|Entity|Type|SubType|
|---|---|---|
|Seattle|Location||
|last week|DateTime|DateRange|
## Sentiment analysis and opinion mining

Sentiment analysis identifies whether text is positive or negative, returning scores that indicate how likely the provided text is a particular sentiment.

For example, the following two restaurant reviews could be analyzed for sentiment:

> *Review 1*: "*We had dinner at this restaurant last night and the first thing I noticed was how courteous the staff was. We were greeted in a friendly manner and taken to our table right away. The table was clean, the chairs were comfortable, and the food was amazing.*"

and

> *Review 2*: "*Our dining experience at this restaurant was one of the worst I've ever had. The service was slow, and the food was awful. I'll never eat at this establishment again.*"

The sentiment score for the first review might be:

- Document sentiment: positive
- Positive score: .90
- Neutral score: .10
- Negative score: .00

The second review might return a response:

- Document sentiment: negative
- Positive score: .00
- Neutral score: .00
- Negative score: .99
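A response like this is typically consumed by picking the label with the highest confidence score. A minimal sketch, using plain dictionaries that mirror the hypothetical scores above (not real service output):

```python
# Hypothetical confidence scores, mirroring the two example reviews.
review_1 = {"positive": 0.90, "neutral": 0.10, "negative": 0.00}
review_2 = {"positive": 0.00, "neutral": 0.00, "negative": 0.99}

def overall_sentiment(scores):
    """Return the sentiment label with the highest confidence score."""
    return max(scores, key=scores.get)
```

An application might also apply a threshold - for example, treating anything below 0.6 confidence as "mixed" rather than trusting the top label.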
## Key phrase extraction

Key phrase extraction identifies the main points from text. Consider the restaurant scenario discussed previously. If you have a large number of surveys, it can take a long time to read through the reviews. Instead, you can use the key phrase extraction capabilities of the Language service to summarize the main points.

You might receive a review such as:

> "*We had dinner here for a birthday celebration and had a fantastic experience. We were greeted by a friendly hostess and taken to our table right away. The ambiance was relaxed, the food was amazing, and service was terrific. If you like great food and attentive service, you should try this place.*"

Key phrase extraction can provide some context to this review by extracting the following phrases:

- birthday celebration
- fantastic experience
- friendly hostess
- great food
- attentive service
- dinner
- table
- ambiance
- place

In addition to using sentiment analysis to determine that this is a positive review, you can use the key phrase service to identify important elements of the review.
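As a very rough local approximation of the idea, you can count the non-stop-word terms in a review. A real key phrase service uses trained models and returns multi-word phrases; the stop word list here is an arbitrary illustration:

```python
import re
from collections import Counter

# An arbitrary, illustrative stop word list.
STOP_WORDS = {"we", "had", "for", "a", "and", "the", "was", "were", "by",
              "to", "our", "right", "away", "if", "you", "like", "should",
              "try", "this", "here"}

def frequent_terms(review, top_n=5):
    """Return the most frequent non-stop-word tokens as rough key terms."""
    tokens = re.findall(r"[a-z]+", review.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]

terms = frequent_terms("If you like great food and attentive service, "
                       "you should try this place.")
# terms contains "great", "food", "attentive", "service", and "place"
```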
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
In this module, you learned about text analytics and concepts such as tokenization, frequency analysis, and text classification. You were also introduced to semantic language models that encode language tokens as vectors for grouping related words. Finally, the module covered the application of these techniques in natural language processing tasks like text analysis, sentiment analysis, machine translation, summarization, and conversational AI solutions.
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
### YamlMime:Module
uid: learn.wwl.introduction-language
metadata:
  title: Introduction to natural language processing
  description: "Introduction to natural language processing"
  ms.date: 5/21/2025
  author: wwlpublish
  ms.author: sheryang
  ms.topic: module
  ms.collection: wwl-ai-copilot
  ms.custom:
  - N/A
  ms.service: azure-ai-language
  ai-usage: ai-assisted
title: Introduction to natural language processing
summary: Explore natural language processing (NLP).
abstract: Explore natural language processing (NLP).
prerequisites: Ability to navigate the Azure portal
iconUrl: /learn/achievements/analyze-text-with-text-analytics-service.svg
levels:
- beginner
roles:
- ai-engineer
- data-scientist
- developer
- solution-architect
- student
products:
- ai-services
subjects:
- natural-language-processing
units:
- learn.wwl.introduction-language.introduction
- learn.wwl.introduction-language.how-it-works
- learn.wwl.introduction-language.text-analysis
- learn.wwl.introduction-language.knowledge-check
- learn.wwl.introduction-language.summary
badge:
  uid: learn.wwl.introduction-language.badge
