Commit 52ef598

Merge pull request #105804 from likebupt/update-datastore
Update article of access to datastore
2 parents 5ccbf1a + dc9220b commit 52ef598

4 files changed

+174
-3
lines changed

Lines changed: 165 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,165 @@
1+
---
2+
title: Use the sample datasets
3+
titleSuffix: Azure Machine Learning
4+
description: Descriptions of the datasets used in sample models included in Machine Learning designer. You can use these sample datasets for your pipelines.
5+
services: machine-learning
6+
ms.service: machine-learning
7+
ms.subservice: core
8+
ms.topic: conceptual
9+
10+
author: likebupt
11+
ms.author: keli19
12+
ms.date: 02/19/2020
13+
---
14+
15+
# Use the sample datasets in Azure Machine Learning designer (preview)
16+
17+
When you create a new pipeline in Azure Machine Learning designer (preview), several sample datasets are included by default. Many of them are used by the sample models on the designer homepage. Others are included as examples of the types of data typically used in machine learning.
18+
19+
Some of these datasets are available in Azure Blob storage. For these datasets, the following table provides a direct link. You can use these datasets in your pipelines by using the [Import Data](./algorithm-module-reference/import-data.md) module.
20+
21+
The rest of these sample datasets are available under the **Datasets** > **Samples** category, which you can find in the module palette to the left of the canvas in the designer. To use any of these datasets in your own pipeline, drag it to the canvas.
22+
23+
## Datasets
24+
25+
<table>
26+
27+
<tr>
28+
<th>Dataset name</th>
29+
<th>Dataset description</th>
30+
</tr>
31+
32+
<tr>
33+
<td>Adult Census Income Binary Classification dataset</td>
34+
<td>
35+
A subset of the 1994 Census database, using working adults over the age of 16 with an adjusted income index of > 100.
36+
<p></p>
37+
<b>Usage:</b> Classify people using demographics to predict whether a person earns over 50K a year.
38+
<p></p>
39+
<b>Related Research:</b> Kohavi, R., Becker, B., (1996). UCI Machine Learning Repository <a href="https://archive.ics.uci.edu/ml">https://archive.ics.uci.edu/ml</a>. Irvine, CA: University of California, School of Information and Computer Science
40+
</td>
41+
</tr>
42+
43+
<tr>
44+
<td>Automobile price data (Raw)</td>
45+
<td>
46+
Information about automobiles by make and model, including the price, features such as the number of cylinders and MPG, as well as an insurance risk score.
47+
<p></p>
48+
The risk score is initially associated with auto price. It is then adjusted for actual risk in a process known to actuaries as symboling. A value of +3 indicates that the auto is risky, and a value of -3 that it is probably safe.
49+
<p></p>
50+
<b>Usage:</b> Predict the risk score from the automobile's features, using regression or multivariate classification.
51+
<p></p>
52+
<b>Related Research:</b> Schlimmer, J.C. (1987). UCI Machine Learning Repository <a href="https://archive.ics.uci.edu/ml">https://archive.ics.uci.edu/ml</a>. Irvine, CA: University of California, School of Information and Computer Science
53+
</td>
54+
</tr>
55+
56+
57+
<tr>
58+
<td>CRM Appetency Labels Shared</td>
59+
<td>
60+
Labels from the KDD Cup 2009 customer relationship prediction challenge (<a href="http://www.sigkdd.org/site/2009/files/orange_small_train_appetency.labels">orange_small_train_appetency.labels</a>).
61+
</td>
62+
</tr>
63+
64+
<tr>
65+
<td>CRM Churn Labels Shared</td>
66+
<td>
67+
Labels from the KDD Cup 2009 customer relationship prediction challenge (<a href="http://www.sigkdd.org/site/2009/files/orange_small_train_churn.labels">orange_small_train_churn.labels</a>).
68+
</td>
69+
</tr>
70+
71+
<tr>
72+
<td>CRM Dataset Shared</td>
73+
<td>
74+
This data comes from the KDD Cup 2009 customer relationship prediction challenge (<a href="http://www.sigkdd.org/site/2009/files/orange_small_train.data.zip">orange_small_train.data.zip</a>).
75+
<p></p>
76+
The dataset contains 50K customers from the French telecom company Orange. Each customer has 230 anonymized features, of which 190 are numeric and 40 are categorical. The features are very sparse.
77+
</td>
78+
</tr>
79+
80+
<tr>
81+
<td>CRM Upselling Labels Shared</td>
82+
<td>
83+
Labels from the KDD Cup 2009 customer relationship prediction challenge (<a href="http://www.sigkdd.org/site/2009/files/orange_large_train_upselling.labels">orange_large_train_upselling.labels</a>).
84+
</td>
85+
</tr>
86+
87+
<tr>
88+
<td>Flight Delays Data</td>
89+
<td>
90+
Passenger flight on-time performance data taken from the TranStats data collection of the U.S. Department of Transportation (<a href="https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time">On-Time</a>).
91+
<p></p>
92+
The dataset covers the time period April-October 2013. Before uploading to the designer, the dataset was processed as follows:
93+
<ul>
94+
<li>The dataset was filtered to cover only the 70 busiest airports in the continental US</li>
95+
<li>Canceled flights were labeled as delayed by more than 15 minutes</li>
96+
<li>Diverted flights were filtered out</li>
97+
<li>The following columns were selected: Year, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepDelay, DepDel15, CRSArrTime, ArrDelay, ArrDel15, Canceled</li>
98+
</ul>
99+
</td>
100+
</tr>
101+
102+
<tr>
103+
<td>German Credit Card UCI dataset</td>
104+
<td>
105+
The UCI Statlog (German Credit Card) dataset (<a href="https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)">Statlog+German+Credit+Data</a>), using the german.data file.
106+
<p></p>
107+
The dataset classifies people, described by a set of attributes, as low or high credit risks. Each example represents a person. There are 20 features, both numerical and categorical, and a binary label (the credit risk value). High credit risk entries have label = 2; low credit risk entries have label = 1. The cost of misclassifying a low risk example as high is 1, whereas the cost of misclassifying a high risk example as low is 5.
108+
</td>
109+
</tr>
110+
111+
<tr>
112+
<td>IMDB Movie Titles</td>
113+
<td>
114+
The dataset contains information about movies that were rated in tweets on Twitter: IMDB movie ID, movie name, genre, and production year. There are 17K movies in the dataset. The dataset was introduced in the paper "S. Dooms, T. De Pessemier and L. Martens. MovieTweetings: a Movie Rating Dataset Collected From Twitter. Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys 2013."
115+
</td>
116+
</tr>
117+
118+
<tr>
119+
<td>Movie Ratings</td>
120+
<td>
121+
The dataset is an extended version of the Movie Tweetings dataset. The dataset has 170K ratings for movies, extracted from well-structured tweets on Twitter. Each instance represents a tweet and is a tuple: user ID, IMDB movie ID, rating, timestamp, number of favorites for this tweet, and number of retweets of this tweet. The dataset was made available by A. Said, S. Dooms, B. Loni and D. Tikk for Recommender Systems Challenge 2014.
122+
</td>
123+
</tr>
124+
125+
126+
<tr>
127+
<td>Weather Dataset</td>
128+
<td>
129+
Hourly land-based weather observations from NOAA (<a href="https://az754797.vo.msecnd.net/data/WeatherDataset.csv">merged data from 201304 to 201310</a>).
130+
<p></p>
131+
The weather data covers observations made from airport weather stations, covering the time period April-October 2013. Before uploading to the designer, the dataset was processed as follows:
132+
<ul>
133+
<li>Weather station IDs were mapped to corresponding airport IDs</li>
134+
<li>Weather stations not associated with the 70 busiest airports were filtered out</li>
135+
<li>The Date column was split into separate Year, Month, and Day columns</li>
136+
<li>The following columns were selected: AirportID, Year, Month, Day, Time, TimeZone, SkyCondition, Visibility, WeatherType, DryBulbFarenheit, DryBulbCelsius, WetBulbFarenheit, WetBulbCelsius, DewPointFarenheit, DewPointCelsius, RelativeHumidity, WindSpeed, WindDirection, ValueForWindCharacter, StationPressure, PressureTendency, PressureChange, SeaLevelPressure, RecordType, HourlyPrecip, Altimeter</li>
137+
</ul>
138+
</td>
139+
</tr>
140+
141+
<tr>
142+
<td>Wikipedia SP 500 Dataset</td>
143+
<td>
144+
Data is derived from Wikipedia (<a href="https://www.wikipedia.org/">https://www.wikipedia.org/</a>) based on articles of each S&P 500 company, stored as XML data.
145+
<p></p>
146+
Before uploading to the designer, the dataset was processed as follows:
147+
<ul>
148+
<li>Text content was extracted for each company</li>
149+
<li>Wiki formatting was removed</li>
150+
<li>Non-alphanumeric characters were removed</li>
151+
<li>All text was converted to lowercase</li>
152+
<li>Known company categories were added</li>
153+
</ul>
154+
<p></p>
155+
An article could not be found for some companies, so the dataset contains fewer than 500 records.
156+
</td>
157+
</tr>
158+
159+
</table>
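
The asymmetric misclassification costs described for the German Credit Card UCI dataset can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the designer: the names `MISCLASSIFICATION_COST` and `total_cost` and the example label lists are hypothetical, and only the label values (1 = low risk, 2 = high risk) and the costs (1 and 5) come from the dataset description above.

```python
# Labels as described for the dataset: 1 = low credit risk, 2 = high credit risk.
LOW_RISK, HIGH_RISK = 1, 2

# Cost keyed by (actual label, predicted label). Correct predictions cost 0.
# The dictionary name and layout are illustrative assumptions.
MISCLASSIFICATION_COST = {
    (LOW_RISK, HIGH_RISK): 1,  # low-risk person classified as high risk
    (HIGH_RISK, LOW_RISK): 5,  # high-risk person classified as low risk
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired label sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(
        MISCLASSIFICATION_COST.get((a, p), 0)
        for a, p in zip(actual, predicted)
    )


# Made-up labels for illustration only: one error of each kind.
actual = [1, 1, 2, 2, 1]
predicted = [1, 2, 2, 1, 1]
print(total_cost(actual, predicted))  # 1 + 5 = 6
```

Because a missed high-risk example costs five times as much as a false alarm, two models with the same accuracy can have very different total costs on this dataset.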
160+
161+
## Next steps
162+
163+
* Learn the basics of predictive analytics and machine learning with [Tutorial: Predict automobile price with the designer](tutorial-designer-automobile-price-train-score.md)
164+
165+
* Use the [Import Data](./algorithm-module-reference/import-data.md) module to import sample datasets
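
For a dataset that has a direct link, it can help to look at the first few rows before wiring it into a pipeline. The sketch below is one way to do that with only the Python standard library. It is independent of the designer and of the Import Data module, the helper name `preview_csv` is hypothetical, and it assumes the file is UTF-8 encoded CSV with a header row.

```python
import csv
import itertools
import urllib.request


def preview_csv(url, n_rows=5):
    """Return (header, rows) for the first n_rows data rows of a CSV at url.

    Assumes UTF-8 text with a header row. Only the lines that are needed
    are read, so large files are not loaded in full.
    """
    with urllib.request.urlopen(url) as response:
        # Decode the byte stream line by line; csv.reader accepts any
        # iterable of strings.
        lines = (line.decode("utf-8") for line in response)
        reader = csv.reader(lines)
        header = next(reader, [])
        rows = list(itertools.islice(reader, n_rows))
    return header, rows
```

For example, calling `preview_csv` with the Weather Dataset link from the table would return the column names and the first five observations, provided that link is still reachable.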

articles/machine-learning/how-to-access-data.md

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,10 +6,10 @@ services: machine-learning
66
ms.service: machine-learning
77
ms.subservice: core
88
ms.topic: conceptual
9-
ms.author: sihhu
10-
author: MayMSFT
9+
ms.author: keli19
10+
author: likebupt
1111
ms.reviewer: nibaccam
12-
ms.date: 01/15/2020
12+
ms.date: 02/27/2020
1313
ms.custom: seodec18
1414

1515
# Customer intent: As an experienced Python developer, I need to make my data in Azure Storage available to my remote compute to train my machine learning models.
@@ -186,6 +186,10 @@ The following example demonstrates what the form looks like when you create an A
186186

187187
## Get datastores from your workspace
188188

189+
> [!IMPORTANT]
190+
> Azure Machine Learning designer (preview) automatically creates a datastore named **azureml_globaldatasets** when you open a sample in the designer homepage. This datastore contains only sample datasets. **Do not** use this datastore to access confidential data.
191+
> ![Auto-created datastore for designer sample datasets](media/how-to-access-data/datastore-designer-sample.png)
192+
189193
To get a specific datastore registered in the current workspace, use the [`get()`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)?view=azure-ml-py#get-workspace--datastore-name-) static method on the `Datastore` class:
190194

191195
```Python

articles/machine-learning/toc.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -79,6 +79,8 @@
7979
href: /azure/open-datasets/samples?context=azure/machine-learning/service/context/ml-context
8080
- name: End-to-end MLOps examples
8181
href: https://github.com/microsoft/MLOps
82+
- name: Designer sample datasets
83+
href: designer-sample-datasets.md
8284

8385
- name: Concepts
8486
items:
