Commit a5a0295

Merge branch 'microsoft:main' into main
2 parents b3aabac + 7da9964 commit a5a0295

8 files changed

+581
-56
lines changed

1-Introduction/01-defining-data-science/README.md

Lines changed: 9 additions & 9 deletions
@@ -1,8 +1,8 @@
11
# Defining Data Science
22

3-
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/01-Definitions.png)|
4-
|:---:|
5-
|Defining Data Science - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
3+
| ![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/01-Definitions.png) |
4+
| :----------------------------------------------------------------------------------------------------: |
5+
| Defining Data Science - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
66

77
---
88

@@ -69,11 +69,11 @@ Vast amounts of data are incomprehensible for a human being, but once we create
6969

7070
As we have already mentioned, data is everywhere; we just need to capture it in the right way! It is useful to distinguish between **structured** and **unstructured** data. The former is typically represented in some well-structured form, often as a table or a number of tables, while the latter is just a collection of files. Sometimes we can also talk about **semi-structured** data, which has some sort of structure that may vary greatly.
7171

72-
| Structured | Semi-structured | Unstructured |
73-
|----------- |-----------------|--------------|
74-
| List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopaedia Britannica |
75-
| Temperature in all rooms of a building at every minute for the last 20 years | Collection of scientific papers in JSON format with authors, date of publication, and abstract | File share with corporate documents |
76-
| Data for age and gender of all people entering the building | Internet pages | Raw video feed from surveillance camera |
72+
| Structured | Semi-structured | Unstructured |
73+
| ---------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------- |
74+
| List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopaedia Britannica |
75+
| Temperature in all rooms of a building at every minute for the last 20 years | Collection of scientific papers in JSON format with authors, date of publication, and abstract | File share with corporate documents |
76+
| Data for age and gender of all people entering the building | Internet pages | Raw video feed from surveillance camera |
7777

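To make the distinction concrete, here is a minimal Python sketch. The sample records, field names, and text are invented for illustration and are not part of the lesson; the point is only that structured records share one set of fields, semi-structured records may not, and unstructured data has no fields at all.

```python
import json

# Structured: every record has the same fields, so the data fits a table.
phone_book = [
    {"name": "Alice", "phone": "555-0100"},
    {"name": "Bob", "phone": "555-0101"},
]

# Semi-structured: JSON documents whose fields may vary between records.
papers_json = """
[
  {"title": "Paper A", "authors": ["X", "Y"], "abstract": "Short summary"},
  {"title": "Paper B", "authors": ["Z"], "published": "2020-05-01"}
]
"""
papers = json.loads(papers_json)

# Unstructured: raw text with no predefined fields.
raw_text = "Data science uses scientific methods to extract knowledge from data."

# Collect the distinct sets of field names seen in each collection.
structured_fields = {frozenset(record) for record in phone_book}
semi_fields = {frozenset(paper) for paper in papers}

print(len(structured_fields))  # one shared layout
print(len(semi_fields))        # more than one layout
```

Here the structured records all share a single layout, while the two hypothetical paper records differ in which fields they carry.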
7878
## Where to get Data
7979

@@ -107,7 +107,7 @@ First step is to collect the data. While in many cases it can be a straightforwa
107107
Storing the data can be challenging, especially if we are talking about big data. When deciding how to store data, it makes sense to anticipate how you will want to query it later. There are several ways data can be stored:
108108
<ul>
109109
<li>A relational database stores a collection of tables, and uses a special language called SQL to query them. Typically, tables are connected to each other using a schema. In many cases we need to convert the data from its original form to fit the schema.</li>
110-
<li>A <a href="https://en.wikipedia.org/wiki/NoSQL">NoSQL</a> database, such as <a href="https://azure.microsoft.com/services/cosmos-db/?WT.mc_id=acad-31812-dmitryso">CosmosDB</a>, does not enforce a schema on data, and allows storing more complex data, for example, hierarchical JSON documents or graphs. However, NoSQL databases typically do not have the rich querying capabilities of SQL, and cannot enforce referential integrity between data.</li>
110+
<li>A <a href="https://en.wikipedia.org/wiki/NoSQL">NoSQL</a> database, such as <a href="https://azure.microsoft.com/services/cosmos-db/?WT.mc_id=academic-31812-dmitryso">CosmosDB</a>, does not enforce a schema on data, and allows storing more complex data, for example, hierarchical JSON documents or graphs. However, NoSQL databases typically do not have the rich querying capabilities of SQL, and cannot enforce referential integrity between data.</li>
111111
<li><a href="https://en.wikipedia.org/wiki/Data_lake">Data Lake</a> storage is used for large collections of data in raw form. Data lakes are often used with big data, where the data cannot fit on one machine and has to be stored and processed by a cluster. <a href="https://en.wikipedia.org/wiki/Apache_Parquet">Parquet</a> is a data format that is often used in conjunction with big data.</li>
112112
</ul>
113113
</dd>

1-Introduction/02-ethics/translations/README.hi.md

Lines changed: 260 additions & 0 deletions
Large diffs are not rendered by default.

2-Working-With-Data/07-python/README.md

Lines changed: 45 additions & 45 deletions
@@ -1,8 +1,8 @@
11
# Working with Data: Python and the Pandas Library
22

3-
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/07-WorkWithPython.png)|
4-
|:---:|
5-
|Working With Python - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
3+
| ![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/07-WorkWithPython.png) |
4+
| :-------------------------------------------------------------------------------------------------------: |
5+
| Working With Python - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
66

77
[![Intro Video](images/video-ds-python.png)](https://youtu.be/dZjWOGbsN4Y)
88

@@ -16,7 +16,7 @@ Data processing can be programmed in any programming language, but there are cer
1616
In this lesson, we will focus on using Python for simple data processing. We will assume basic familiarity with the language. If you want a deeper tour of Python, you can refer to one of the following resources:
1717

1818
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse) - GitHub-based quick intro course into Python Programming
19-
* [Take your First Steps with Python](https://docs.microsoft.com/en-us/learn/paths/python-first-steps/?WT.mc_id=acad-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=acad-31812-dmitryso)
19+
* [Take your First Steps with Python](https://docs.microsoft.com/en-us/learn/paths/python-first-steps/?WT.mc_id=academic-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=academic-31812-dmitryso)
2020

2121
Data can come in many forms. In this lesson, we will consider three forms of data - **tabular data**, **text** and **images**.
2222

@@ -97,28 +97,28 @@ b = pd.Series(["I","like","to","play","games","and","will","not","change"],index
9797
df = pd.DataFrame([a,b])
9898
```
9999
This will create a horizontal table like this:
100-
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
101-
|---|---|---|---|---|---|---|---|---|---|
102-
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
103-
| 1 | I | like | to | use | Python | and | Pandas | very | much |
100+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
101+
| --- | --- | ---- | --- | --- | ------ | --- | ------ | ---- | ---- |
102+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
103+
| 1 | I | like | to | use | Python | and | Pandas | very | much |
104104

105105
We can also use Series as columns, and specify column names using a dictionary:
106106
```python
107107
df = pd.DataFrame({ 'A' : a, 'B' : b })
108108
```
109109
This will give us a table like this:
110110

111-
| | A | B |
112-
|---|---|---|
113-
| 0 | 1 | I |
114-
| 1 | 2 | like |
115-
| 2 | 3 | to |
116-
| 3 | 4 | use |
117-
| 4 | 5 | Python |
118-
| 5 | 6 | and |
119-
| 6 | 7 | Pandas |
120-
| 7 | 8 | very |
121-
| 8 | 9 | much |
111+
| | A | B |
112+
| --- | --- | ------ |
113+
| 0 | 1 | I |
114+
| 1 | 2 | like |
115+
| 2 | 3 | to |
116+
| 3 | 4 | use |
117+
| 4 | 5 | Python |
118+
| 5 | 6 | and |
119+
| 6 | 7 | Pandas |
120+
| 7 | 8 | very |
121+
| 8 | 9 | much |
122122

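As a self-contained sketch of the two layouts shown above, the snippet below rebuilds both tables. The Series values are taken from the tables in this lesson; the variable names `df_rows` and `df_cols` are chosen here for illustration only.

```python
import pandas as pd

# Sample Series matching the values shown in the tables.
a = pd.Series(range(1, 10))
b = pd.Series(["I", "like", "to", "use", "Python", "and", "Pandas", "very", "much"])

# A list of Series becomes rows: 2 rows by 9 columns (the horizontal table).
df_rows = pd.DataFrame([a, b])

# A dictionary of Series becomes named columns: 9 rows by 2 columns.
df_cols = pd.DataFrame({"A": a, "B": b})

print(df_rows.shape)
print(df_cols.shape)
```

Comparing the two shapes makes it easy to see which orientation each constructor call produces.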
123123
**Note** that we can also get this table layout by transposing the previous table, e.g. by writing
124124
```python
@@ -154,17 +154,17 @@ df['LenB'] = df['B'].apply(len)
154154

155155
After the operations above, we will end up with the following DataFrame:
156156

157-
| | A | B | DivA | LenB |
158-
|---|---|---|---|---|
159-
| 0 | 1 | I | -4.0 | 1 |
160-
| 1 | 2 | like | -3.0 | 4 |
161-
| 2 | 3 | to | -2.0 | 2 |
162-
| 3 | 4 | use | -1.0 | 3 |
163-
| 4 | 5 | Python | 0.0 | 6 |
164-
| 5 | 6 | and | 1.0 | 3 |
165-
| 6 | 7 | Pandas | 2.0 | 6 |
166-
| 7 | 8 | very | 3.0 | 4 |
167-
| 8 | 9 | much | 4.0 | 4 |
157+
| | A | B | DivA | LenB |
158+
| --- | --- | ------ | ---- | ---- |
159+
| 0 | 1 | I | -4.0 | 1 |
160+
| 1 | 2 | like | -3.0 | 4 |
161+
| 2 | 3 | to | -2.0 | 2 |
162+
| 3 | 4 | use | -1.0 | 3 |
163+
| 4 | 5 | Python | 0.0 | 6 |
164+
| 5 | 6 | and | 1.0 | 3 |
165+
| 6 | 7 | Pandas | 2.0 | 6 |
166+
| 7 | 8 | very | 3.0 | 4 |
167+
| 8 | 9 | much | 4.0 | 4 |
168168

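The table above can be reproduced with a short sketch. The `LenB` line follows the `apply(len)` call from the lesson; the `DivA` line is an assumption inferred from the table values, which match each value of `A` minus the column mean (5.0).

```python
import pandas as pd

# Rebuild the sample DataFrame used in this lesson.
a = pd.Series(range(1, 10))
b = pd.Series(["I", "like", "to", "use", "Python", "and", "Pandas", "very", "much"])
df = pd.DataFrame({"A": a, "B": b})

# Assumption: DivA is the deviation of each value in A from the column mean.
df["DivA"] = df["A"] - df["A"].mean()

# LenB is the string length of each entry in B.
df["LenB"] = df["B"].apply(len)

print(df)
```

Both new columns are computed for the whole column at once, without writing an explicit loop over rows.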
169169
**Selecting rows based on numbers** can be done using the `iloc` construct. For example, to select the first 5 rows from the DataFrame:
170170
```python
@@ -183,13 +183,13 @@ df.groupby(by='LenB') \
183183
```
184184
This gives us the following table:
185185

186-
| LenB | Count | Mean |
187-
|------|-------|------|
188-
| 1 | 1 | 1.000000 |
189-
| 2 | 1 | 3.000000 |
190-
| 3 | 2 | 5.000000 |
191-
| 4 | 3 | 6.333333 |
192-
| 6 | 2 | 6.000000 |
186+
| LenB | Count | Mean |
187+
| ---- | ----- | -------- |
188+
| 1 | 1 | 1.000000 |
189+
| 2 | 1 | 3.000000 |
190+
| 3 | 2 | 5.000000 |
191+
| 4 | 3 | 6.333333 |
192+
| 6 | 2 | 6.000000 |
193193

194194
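One possible way to arrive at the table above is sketched below. It uses pandas named aggregation, which may differ from the exact chain of calls in the lesson; the name `summary` is chosen here for illustration.

```python
import pandas as pd

# Rebuild the sample DataFrame and the LenB column used in this lesson.
a = pd.Series(range(1, 10))
b = pd.Series(["I", "like", "to", "use", "Python", "and", "Pandas", "very", "much"])
df = pd.DataFrame({"A": a, "B": b})
df["LenB"] = df["B"].apply(len)

# Group rows by string length, then count rows and average column A per group.
summary = df.groupby(by="LenB")["A"].agg(Count="count", Mean="mean")

print(summary)
```

Each row of `summary` corresponds to one distinct value of `LenB`, with the number of rows in that group and the mean of `A` over them.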
### Getting Data
195195

@@ -230,7 +230,7 @@ While data very often comes in tabular form, in some cases we need to deal with
230230

231231
In this challenge, we will continue with the topic of the COVID pandemic, and focus on processing scientific papers on the subject. There is the [CORD-19 Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) with more than 7000 (at the time of writing) papers on COVID, available with metadata and abstracts (and for about half of them the full text is also provided).
232232

233-
A full example of analyzing this dataset using the [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health/?WT.mc_id=acad-31812-dmitryso) cognitive service is described [in this blog post](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/). We will discuss a simplified version of this analysis.
233+
A full example of analyzing this dataset using the [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health/?WT.mc_id=academic-31812-dmitryso) cognitive service is described [in this blog post](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/). We will discuss a simplified version of this analysis.
234234

235235
> **NOTE**: We do not provide a copy of the dataset as part of this repository. You may first need to download the [`metadata.csv`](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv) file from [this dataset on Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). Registration with Kaggle may be required. You may also download the dataset without registration [from here](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html), but it will include all full texts in addition to metadata file.
236236
@@ -242,15 +242,15 @@ Open [`notebook-papers.ipynb`](notebook-papers.ipynb) and read it from top to bo
242242

243243
Recently, very powerful AI models have been developed that allow us to understand images. There are many tasks that can be solved using pre-trained neural networks, or cloud services. Some examples include:
244244

245-
* **Image Classification**, which can help you categorize the image into one of the pre-defined classes. You can easily train your own image classifiers using services such as [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=acad-31812-dmitryso)
246-
* **Object Detection** to detect different objects in the image. Services such as [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=acad-31812-dmitryso) can detect a number of common objects, and you can train a [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=acad-31812-dmitryso) model to detect some specific objects of interest.
247-
* **Face Detection**, including Age, Gender and Emotion detection. This can be done via [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=acad-31812-dmitryso).
245+
* **Image Classification**, which can help you categorize the image into one of the pre-defined classes. You can easily train your own image classifiers using services such as [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-31812-dmitryso)
246+
* **Object Detection** to detect different objects in the image. Services such as [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-31812-dmitryso) can detect a number of common objects, and you can train a [Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-31812-dmitryso) model to detect some specific objects of interest.
247+
* **Face Detection**, including Age, Gender and Emotion detection. This can be done via [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-31812-dmitryso).
248248

249-
All those cloud services can be called using [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=acad-31812-dmitryso), and thus can be easily incorporated into your data exploration workflow.
249+
All those cloud services can be called using [Python SDKs](https://docs.microsoft.com/samples/azure-samples/cognitive-services-python-sdk-samples/cognitive-services-python-sdk-samples/?WT.mc_id=academic-31812-dmitryso), and thus can be easily incorporated into your data exploration workflow.
250250

251251
Here are some examples of exploring data from Image data sources:
252-
* In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/) we explore Instagram photos, trying to understand what makes people give more likes to a photo. We first extract as much information from pictures as possible using [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=acad-31812-dmitryso), and then use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=acad-31812-dmitryso) to build an interpretable model.
253-
* In [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies) we use [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=acad-31812-dmitryso) to extract the emotions of people in photographs from events, in order to try to understand what makes people happy.
252+
* In the blog post [How to Learn Data Science without Coding](https://soshnikov.com/azure/how-to-learn-data-science-without-coding/) we explore Instagram photos, trying to understand what makes people give more likes to a photo. We first extract as much information from pictures as possible using [computer vision](https://azure.microsoft.com/services/cognitive-services/computer-vision/?WT.mc_id=academic-31812-dmitryso), and then use [Azure Machine Learning AutoML](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml/?WT.mc_id=academic-31812-dmitryso) to build an interpretable model.
253+
* In [Facial Studies Workshop](https://github.com/CloudAdvocacy/FaceStudies) we use [Face API](https://azure.microsoft.com/services/cognitive-services/face/?WT.mc_id=academic-31812-dmitryso) to extract the emotions of people in photographs from events, in order to try to understand what makes people happy.
254254

255255
## Conclusion
256256

@@ -271,7 +271,7 @@ Whether you already have structured or unstructured data, using Python you can p
271271

272272
**Learning Python**
273273
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse)
274-
* [Take your First Steps with Python](https://docs.microsoft.com/learn/paths/python-first-steps/?WT.mc_id=acad-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=acad-31812-dmitryso)
274+
* [Take your First Steps with Python](https://docs.microsoft.com/learn/paths/python-first-steps/?WT.mc_id=academic-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=academic-31812-dmitryso)
275275

276276
## Assignment
277277

2-Working-With-Data/07-python/notebook-papers.ipynb

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
77
"\r\n",
88
"In this challenge, we will continue with the topic of the COVID pandemic, and focus on processing scientific papers on the subject. There is the [CORD-19 Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) with more than 7000 (at the time of writing) papers on COVID, available with metadata and abstracts (and for about half of them the full text is also provided).\r\n",
99
"\r\n",
10-
"A full example of analyzing this dataset using the [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health/?WT.mc_id=acad-31812-dmitryso) cognitive service is described [in this blog post](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/). We will discuss a simplified version of this analysis."
10+
"A full example of analyzing this dataset using the [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health/?WT.mc_id=academic-31812-dmitryso) cognitive service is described [in this blog post](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/). We will discuss a simplified version of this analysis."
1111
],
1212
"metadata": {}
1313
},
