
Commit 2f71115

Merge pull request #50577 from sherzyang/main
Add Introduction to computer vision module with fixes.
2 parents 4dc91b3 + d39333d commit 2f71115

23 files changed: +400 −0 lines changed
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
```yaml
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-computer-vision.introduction
title: Introduction
metadata:
  title: Introduction
  description: Introduction
  author: wwlpublish
  ms.author: sheryang
  ms.date: 5/20/2025
  ms.topic: unit
  ms.collection:
    - wwl-ai-copilot
durationInMinutes: 1
content: |
  [!include[](includes/1-introduction.md)]
```
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
```yaml
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-computer-vision.overview
title: Overview
metadata:
  title: Overview
  description: Overview
  author: wwlpublish
  ms.author: sheryang
  ms.date: 5/20/2025
  ms.topic: unit
  ms.collection:
    - wwl-ai-copilot
durationInMinutes: 4
content: |
  [!include[](includes/2-overview.md)]
```
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
```yaml
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-computer-vision.understand-image-processing
title: Understand image processing
metadata:
  title: Understand image processing
  description: Understand how computers process images.
  author: wwlpublish
  ms.author: sheryang
  ms.date: 5/20/2025
  ms.topic: unit
  ms.collection:
    - wwl-ai-copilot
durationInMinutes: 4
content: |
  [!include[](includes/3-understand-image-processing.md)]
```
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
```yaml
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-computer-vision.computer-vision-models
title: Machine learning for computer vision
metadata:
  title: Machine learning for computer vision
  description: Understand machine learning for computer vision
  author: wwlpublish
  ms.author: sheryang
  ms.date: 5/20/2025
  ms.topic: unit
  ms.collection:
    - wwl-ai-copilot
durationInMinutes: 5
content: |
  [!include[](includes/4-computer-vision-models.md)]
```
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
```yaml
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-computer-vision.modern-vision-models
title: Understand modern vision models
metadata:
  title: Understand modern vision models
  description: Understand transformers and multimodal models
  author: wwlpublish
  ms.author: sheryang
  ms.date: 5/20/2025
  ms.topic: unit
  ms.collection:
    - wwl-ai-copilot
durationInMinutes: 5
content: |
  [!include[](includes/5-modern-vision-models.md)]
```
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
```yaml
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-computer-vision.knowledge-check
title: Module assessment
metadata:
  title: Module assessment
  description: Knowledge check
  author: wwlpublish
  ms.author: sheryang
  ms.date: 5/20/2025
  ms.topic: unit
  ms.collection:
    - wwl-ai-copilot
durationInMinutes: 3
quiz:
  title: "Check your knowledge"
  questions:
  - content: "Computer vision is based on the manipulation and analysis of what kinds of values in an image?"
    choices:
    - content: "Timestamps in photograph metadata"
      isCorrect: false
      explanation: "Incorrect. Timestamps in the image metadata do not enable computer vision."
    - content: "Pixels"
      isCorrect: true
      explanation: "Correct. Pixels are numeric values that represent shade intensity for points in the image."
    - content: "Image file names"
      isCorrect: false
      explanation: "Incorrect. While file names might offer some clues as to the image subject, they do not inherently enable computer vision."
  - content: "What is the primary role of filters in a convolutional neural network (CNN) used for image classification?"
    choices:
    - content: "To apply visual effects to enhance image appearance."
      isCorrect: false
      explanation: "Incorrect."
    - content: "To extract numeric features from images for use in a neural network."
      isCorrect: true
      explanation: "Correct."
    - content: "To compress image size for faster processing."
      isCorrect: false
      explanation: "Incorrect."
  - content: "What is the primary function of a multi-modal model in computer vision?"
    choices:
    - content: "To generate random captions for unlabeled images."
      isCorrect: false
      explanation: "Incorrect."
    - content: "To replace CNNs entirely in all vision tasks."
      isCorrect: false
      explanation: "Incorrect."
    - content: "To combine image features with natural language embeddings for richer understanding."
      isCorrect: true
      explanation: "Correct."
```
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
```yaml
### YamlMime:ModuleUnit
uid: learn.wwl.introduction-computer-vision.summary
title: Summary
metadata:
  title: Summary
  description: Summary
  author: wwlpublish
  ms.author: sheryang
  ms.date: 5/20/2025
  ms.topic: unit
  ms.collection:
    - wwl-ai-copilot
durationInMinutes: 1
content: |
  [!include[](includes/7-summary.md)]
```
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@

**Computer vision** is one of the core areas of artificial intelligence (AI), and focuses on creating solutions that enable AI applications to "see" the world and make sense of it.

Consider these scenarios:

- A hospital wants to detect and track surgical instruments in real time during operations.
- A retail company needs to classify products in images, such as shoes, shirts, and electronics, into categories.
- A wildlife preservation organization needs to identify the animals that appear in video footage.
- A city's transportation department needs to read and extract text from images of license plates.
- A manufacturing company wants to analyze visual patterns for defects.

Of course, computers don't have biological eyes that work the way ours do, but they're capable of processing images, whether from a live camera feed or from digital photographs and videos. This ability to process images is the key to creating software that can emulate human visual perception. In this module, we'll examine the building blocks that underlie modern computer vision solutions.
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@

Computer vision capabilities can be categorized into a few main types:

|**Type**|**Description**|
|-|-|
|**Image analysis**|The ability to detect, classify, caption, and generate insights.|
|**Spatial analysis**|The ability to understand people's presence and movements within physical areas in real time.|
|**Facial recognition**|The ability to recognize and verify human identity.|
|**Optical character recognition (OCR)**|The ability to extract printed and handwritten text from images with varied languages and writing styles.|

To understand these computer vision capabilities, it's useful to consider what an image actually *is* in the context of data for a computer program.

## Images as pixel arrays

To a computer, an image is an array of numeric *pixel* values. For example, consider the following array:

```
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
```

The array consists of seven rows and seven columns, representing the pixel values for a 7x7 pixel image (known as the image's *resolution*). Each pixel has a value between 0 (black) and 255 (white), with values between these bounds representing shades of gray. The image represented by this array looks similar to the following (magnified) image:

![Diagram of a grayscale image.](../media/white-square.png)
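To make this concrete, here's a minimal sketch of the same 7x7 grayscale image as an array in code. (This uses NumPy as an illustrative assumption; the module itself doesn't prescribe a library.)

```python
import numpy as np

# A 7x7 grayscale image: 0 is black, 255 is white.
image = np.zeros((7, 7), dtype=np.uint8)
image[2:5, 2:5] = 255  # the white square in the center

print(image.shape)  # (7, 7) -- the image's resolution
print(image[3, 3])  # 255 -- a pixel inside the white square
print(image[0, 0])  # 0 -- a background pixel
```

Indexing the array by row and column retrieves individual pixel values, which is exactly how image-processing code reads an image.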

The array of pixel values for this image is two-dimensional (representing rows and columns, or *x* and *y* coordinates) and defines a single rectangle of pixel values. A single layer of pixel values like this represents a grayscale image. In reality, most digital images are multidimensional and consist of three layers (known as *channels*) that represent red, green, and blue (RGB) color hues. For example, we could represent a color image by defining three channels of pixel values that create the same square shape as the previous grayscale example:

```
Red:
150 150 150 150 150 150 150
150 150 150 150 150 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 150 150 150 150 150
150 150 150 150 150 150 150

Green:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0

Blue:
255 255 255 255 255 255 255
255 255 255 255 255 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 255 255 255 255 255
255 255 255 255 255 255 255
```

Here's the resulting image:

![Diagram of a color image.](../media/color-square.png)

The purple squares are represented by the combination:

```
Red: 150
Green: 0
Blue: 255
```

The yellow squares in the center are represented by the combination:

```
Red: 255
Green: 255
Blue: 0
```
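The three channels can be sketched as a single stacked array. (Again, NumPy is an illustrative assumption; the key idea is that a color image is a height x width x 3 array.)

```python
import numpy as np

# Build each 7x7 channel: a uniform background with a different center square.
red = np.full((7, 7), 150, dtype=np.uint8)
red[2:5, 2:5] = 255
green = np.zeros((7, 7), dtype=np.uint8)
green[2:5, 2:5] = 255
blue = np.full((7, 7), 255, dtype=np.uint8)
blue[2:5, 2:5] = 0

# Stack the channels into one color image with shape (height, width, 3).
color_image = np.dstack([red, green, blue])

print(color_image.shape)  # (7, 7, 3)
print(color_image[0, 0])  # the purple background: R=150, G=0, B=255
print(color_image[3, 3])  # the yellow center: R=255, G=255, B=0
```

Each pixel is now a triple of channel values, which is why the purple and yellow combinations above fully describe the colors in the image.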
Next, let's explore how images are processed.
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@

A common way to perform image processing tasks is to apply *filters* that modify the pixel values of the image to create a visual effect. A filter is defined by one or more arrays of pixel values, called filter *kernels*. For example, you could define a filter with a 3x3 kernel as shown in this example:

```
-1 -1 -1
-1  8 -1
-1 -1 -1
```

The kernel is then *convolved* across the image, calculating a weighted sum for each 3x3 patch of pixels and assigning the result to a new image. It's easier to understand how the filtering works by exploring a step-by-step example.

Let's start with the grayscale image we explored previously:

```
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
```

First, we apply the filter kernel to the top-left patch of the image, multiplying each pixel value by the corresponding weight value in the kernel and adding the results:

```
(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (0 x -1) + (255 x -1) = -255
```

The result (-255) becomes the first value in a new array. Then we move the filter kernel along one pixel to the right and repeat the operation:

```
(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (255 x -1) + (255 x -1) = -510
```

Again, the result is added to the new array, which now contains two values:

```
-255 -510
```

The process is repeated until the filter has been convolved across the entire image, as shown in this animation:

![Diagram of a filter.](../media/filter.gif)

The filter is convolved across the image, calculating a new array of values. Some of the values might be outside of the 0 to 255 pixel value range, so the values are adjusted to fit into that range. Because of the shape of the filter, the outside edge of pixels isn't calculated, so a padding value (usually 0) is applied. The resulting array represents a new image in which the filter has transformed the original image. In this case, the filter has had the effect of highlighting the *edges* of shapes in the image.
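The steps above can be sketched in code. This is a minimal illustration using NumPy (an assumption, not the exact implementation behind the animation): the kernel slides over every 3x3 patch, the weighted sums fill a new array, border pixels are left at the padding value 0, and the results are clipped to the 0 to 255 range.

```python
import numpy as np

def convolve(image, kernel):
    """Convolve a 2D kernel across a grayscale image as described above.

    Border pixels can't be covered by the kernel, so they keep the
    padding value 0; results are clipped to the 0-255 pixel range.
    """
    h, w = image.shape
    kh, kw = kernel.shape
    output = np.zeros((h, w), dtype=np.int32)
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            patch = image[y:y + kh, x:x + kw]
            # Weighted sum of the patch, stored at the patch's center.
            output[y + kh // 2, x + kw // 2] = np.sum(patch * kernel)
    return np.clip(output, 0, 255).astype(np.uint8)

# Use a signed dtype so the weighted sums can go negative.
image = np.zeros((7, 7), dtype=np.int32)
image[2:5, 2:5] = 255

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

first = np.sum(image[0:3, 0:3] * kernel)  # -255, matching the first step above

filtered = convolve(image, kernel)
print(filtered)
```

In the output, the interior of the white square becomes 0 (the weights cancel out on uniform regions), while pixels along the square's boundary produce large positive sums that clip to 255: the edges are highlighted, as described above.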

To see the effect of the filter more clearly, here's an example of the same filter applied to a real image:

| Original Image | Filtered Image |
|--|--|
|![Diagram of a banana.](../media/banana-grayscale.png)|![Diagram of a filtered banana.](../media/laplace.png)|

Because the filter is convolved across the image, this kind of image manipulation is often referred to as *convolutional filtering*. The filter used in this example is a particular type of filter (called a *Laplace* filter) that highlights the edges of objects in an image. There are many other kinds of filters that you can use to create blurring, sharpening, color inversion, and other effects.
Next, let's connect concepts of convolutional filtering to modern vision models.
