
Commit 55d306a

sync with notebooks

1 parent e06d872 commit 55d306a

File tree

2 files changed: +40 -64 lines changed

articles/machine-learning/tutorial-explore-data.md

Lines changed: 17 additions & 15 deletions
@@ -121,22 +121,23 @@ Data asset creation also creates a *reference* to the data source location, alon
 
 The next notebook cell creates the data asset. The code sample uploads the raw data file to the designated cloud storage resource.
 
-Each time you create a data asset, you need a unique version for it. If the version already exists, you'll get an error. This code uses time to generate a unique version, each time the cell is run.
+Each time you create a data asset, you need a unique version for it. If the version already exists, you'll get an error. In this code, we're using "initial" for the first read of the data. If that version already exists, we'll skip creating it again.
 
-You can also omit the **version** parameter, and a version number is generated for you, starting with 1 and then incrementing from there. In this tutorial, we want to refer to specific version numbers, so we create a version number instead.
+You can also omit the **version** parameter, and a version number is generated for you, starting with 1 and then incrementing from there.
+
+In this tutorial, we use the name "initial" as the first version. The [Create production machine learning pipelines](pipeline.ipynb) tutorial will also use this version of the data, so here we are using a value that you'll see again in that tutorial.
 
 
 ```python
 from azure.ai.ml.entities import Data
 from azure.ai.ml.constants import AssetTypes
-import time
 
 # update the 'my_path' variable to match the location of where you downloaded the data on your
 # local filesystem
 
 my_path = "./data/default_of_credit_card_clients.csv"
-# set the version number of the data asset to the current UTC time
-v1 = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
+# set the version number of the data asset
+v1 = "initial"
 
 my_data = Data(
     name="credit-card",
@@ -146,10 +147,15 @@ my_data = Data(
     type=AssetTypes.URI_FILE,
 )
 
-# create data asset
-ml_client.data.create_or_update(my_data)
-
-print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")
+## create data asset if it doesn't already exist:
+try:
+    data_asset = ml_client.data.get(name="credit-card", version=v1)
+    print(
+        f"Data asset already exists. Name: {my_data.name}, version: {my_data.version}"
+    )
+except:
+    ml_client.data.create_or_update(my_data)
+    print(f"Data asset created. Name: {my_data.name}, version: {my_data.version}")
 ```
 
 You can see the uploaded data by selecting **Data** on the left. You'll see the data is uploaded and a data asset is created:
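For reference, here's a minimal sketch of the omitted-**version** behavior the changed prose describes, assuming the `ml_client` and `my_path` values already defined in the tutorial cell:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Omitting `version` lets the service assign one automatically,
# starting at "1" and incrementing with each registration.
auto_data = Data(
    name="credit-card",
    description="Same asset with a service-assigned version (illustrative)",
    path=my_path,
    type=AssetTypes.URI_FILE,
)
registered = ml_client.data.create_or_update(auto_data)
print(f"Service-assigned version: {registered.version}")
```

Auto-versioning is convenient for scratch work; the commit pins "initial" so the pipelines tutorial can reference the same version by name.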
@@ -240,11 +246,7 @@ This table shows the structure of the data in the original **default_of_credit_c
 |X18-23 | Explanatory | Amount of previous payment (NT dollar) from April to September 2005. |
 |Y | Response | Default payment (Yes = 1, No = 0) |
 
-Next, create a new _version_ of the data asset (the data automatically uploads to cloud storage):
-
-> [!NOTE]
->
-> This Python code cell sets **name** and **version** values for the data asset it creates. As a result, the code in this cell will fail if executed more than once, without a change to these values. Fixed **name** and **version** values offer a way to pass values that work for specific situations, without concern for auto-generated or randomly-generated values.
+Next, create a new _version_ of the data asset (the data automatically uploads to cloud storage). For this version, we'll add a time value, so that each time this code is run, a different version number will be created.
 
 
@@ -254,7 +256,7 @@ from azure.ai.ml.constants import AssetTypes
 import time
 
 # Next, create a new *version* of the data asset (the data is automatically uploaded to cloud storage):
-v2 = v1 + "_cleaned"
+v2 = "cleaned" + time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())
 my_path = "./data/cleaned-credit-card.parquet"
 
 # Define the data asset, and use tags to make it clear the asset can be used in training
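The hunk ends at the comment that introduces the cleaned-data asset. For context, a plausible continuation mirroring the v1 registration earlier in the file; the description and tags here are illustrative assumptions, not the commit's exact values:

```python
# Illustrative continuation (description/tags are assumptions):
my_data = Data(
    name="credit-card",
    version=v2,
    description="Default of credit card clients data, cleaned version",
    tags={"training_data": "true", "format": "parquet"},
    path=my_path,
    type=AssetTypes.URI_FILE,
)
ml_client.data.create_or_update(my_data)
```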

articles/machine-learning/tutorial-pipeline-python-sdk.md

Lines changed: 23 additions & 49 deletions
@@ -51,6 +51,8 @@ The two steps are first data preparation and second training.
 
 1. [!INCLUDE [sign in](includes/prereq-sign-in.md)]
 
+1. Complete the tutorial [Upload, access and explore your data](tutorial-explore-data.md) to create the data asset you need in this tutorial. Make sure you run all the code to create the initial data asset. Explore the data and revise it if you wish, but you'll only need the initial data in this tutorial.
+
 1. [!INCLUDE [open or create notebook](includes/prereq-open-or-create.md)]
    * [!INCLUDE [new notebook](includes/prereq-new-notebook.md)]
    * Or, open **tutorials/get-started-notebooks/pipeline.ipynb** from the **Samples** section of studio. [!INCLUDE [clone notebook](includes/prereq-clone-notebook.md)]
@@ -95,56 +97,30 @@ ml_client = MLClient(
     resource_group_name="<RESOURCE_GROUP>",
     workspace_name="<AML_WORKSPACE_NAME>",
 )
+cpu_cluster = None
 ```
 
 > [!NOTE]
 > Creating MLClient will not connect to the workspace. The client initialization is lazy, it will wait for the first time it needs to make a call (this will happen when creating the `credit_data` data asset, two code cells from here).
 
-## Register data from an external url
-
-If you have been following along with the other tutorials in this series and already registered the data, you can fetch the same dataset from the workspace using `credit_dataset = ml_client.data.get("<DATA ASSET NAME>", version='<VERSION>')`. Then you may skip this section. To learn about data more in depth or if you would rather complete the data tutorial first, see [Upload, access and explore your data in Azure Machine Learning](tutorial-explore-data.md).
-
-* Azure Machine Learning uses a `Data` object to register a reusable definition of data, and consume data within a pipeline. In the next section, you consume some data from web url as one example. `Data` assets from other sources can be created as well.
-
-```python
-from azure.ai.ml.entities import Data
-from azure.ai.ml.constants import AssetTypes
-
-web_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
-
-credit_data = Data(
-    name="creditcard_defaults",
-    path=web_path,
-    type=AssetTypes.URI_FILE,
-    description="Dataset for credit card defaults",
-    tags={"source_type": "web", "source": "UCI ML Repo"},
-    version="1.0.0",
-)
-```
+## Access the registered data asset
 
-This code just created a `Data` asset, ready to be consumed as an input by the pipeline that you'll define in the next sections. In addition, you can register the data to your workspace so it becomes reusable across pipelines.
+Start by getting the data that you previously registered in the [Upload, access and explore your data](tutorial-explore-data.md) tutorial.
 
+* Azure Machine Learning uses a `Data` object to register a reusable definition of data, and consume data within a pipeline.
 Since this is the first time that you're making a call to the workspace, you may be asked to authenticate. Once the authentication is complete, you then see the dataset registration completion message.
 
 
 ```python
-credit_data = ml_client.data.create_or_update(credit_data)
-print(
-    f"Dataset with name {credit_data.name} was registered to workspace, the dataset version is {credit_data.version}"
-)
+# get a handle of the data asset and print the URI
+credit_data = ml_client.data.get(name="credit-card", version="initial")
+print(f"Data asset URI: {credit_data.path}")
 ```
 
-In the future, you can fetch the same dataset from the workspace using `credit_dataset = ml_client.data.get("<DATA ASSET NAME>", version='<VERSION>')`.
-
-## Create a compute resource to run your pipeline
+## Create a compute resource to run your pipeline (Optional)
 
 > [!NOTE]
-> To try [serverless compute (preview)](./how-to-use-serverless-compute.md), skip this step and proceed to [create a job environment](#create-a-job-environment-for-pipeline-steps).
+> To use [serverless compute (preview)](./how-to-use-serverless-compute.md) to run this pipeline, you can skip this compute creation step and proceed directly to [create a job environment](#create-a-job-environment-for-pipeline-steps).
 
 Each step of an Azure Machine Learning pipeline can use a different compute resource for running the specific job of that step. It can be single or multi-node machines with Linux or Windows OS, or a specific compute fabric like Spark.
 
@@ -172,9 +148,8 @@ except Exception:
     print("Creating a new cpu compute target...")
 
     # Let's create the Azure Machine Learning compute object with the intended parameters
-    # if you run into an out of quota error, change the size to a comparable VM that is available.\
+    # if you run into an out of quota error, change the size to a comparable VM that is available.
    # Learn more on https://azure.microsoft.com/en-us/pricing/details/machine-learning/.
-
     cpu_cluster = AmlCompute(
         name=cpu_compute_target,
         # Azure Machine Learning Compute is the on-demand VM service
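This hunk joins the code mid-block, inside an `except Exception:` handler, and cuts off inside the `AmlCompute(...)` call. For orientation, a sketch of the full lookup-or-create pattern; the cluster name, VM size, and scale settings are illustrative assumptions:

```python
from azure.ai.ml.entities import AmlCompute

cpu_compute_target = "cpu-cluster"  # assumed name; any workspace-unique name works

try:
    # reuse an existing cluster if the workspace already has one by this name
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(f"You already have a cluster named {cpu_compute_target}, we'll reuse it.")
except Exception:
    print("Creating a new cpu compute target...")
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        type="amlcompute",  # the on-demand VM service
        size="STANDARD_DS3_V2",  # swap for an available size if you hit quota limits
        min_instances=0,  # scale to zero when idle to save cost
        max_instances=4,
        idle_time_before_scale_down=180,  # seconds of idleness before scale-down
    )
    # begin_create_or_update returns a poller; .result() blocks until provisioned
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()
```

Note how the `cpu_cluster = None` default added near the top of the file, together with this assignment, lets the `@dsl.pipeline` decorator later in the file fall back to serverless compute when no cluster was created.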
@@ -229,8 +204,8 @@ dependencies:
   - pip:
     - inference-schema[numpy-support]==1.3.0
     - xlrd==2.0.1
-    - mlflow== 1.26.1
-    - azureml-mlflow==1.42.0
+    - mlflow== 2.4.1
+    - azureml-mlflow==1.51.0
 ```
 
 The specification contains some usual packages that you use in your pipeline (numpy, pip), together with some Azure Machine Learning specific packages (azureml-mlflow).
@@ -252,7 +227,7 @@ pipeline_job_env = Environment(
     tags={"scikit-learn": "0.24.2"},
     conda_file=os.path.join(dependencies_dir, "conda.yaml"),
     image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
-    version="0.1.0",
+    version="0.2.0",
 )
 pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)
 
@@ -325,7 +300,7 @@ def main():
 
     print("input data:", args.data)
 
-    credit_df = pd.read_excel(args.data, header=1, index_col=0)
+    credit_df = pd.read_csv(args.data, header=1, index_col=0)
 
     mlflow.log_metric("num_samples", credit_df.shape[0])
     mlflow.log_metric("num_features", credit_df.shape[1] - 1)
@@ -515,7 +490,7 @@ First, create the *yaml* file describing the component:
 
 
 ```python
-%%writefile {train_src_dir}/train.yaml
+%%writefile {train_src_dir}/train.yml
 # <component>
 name: train_credit_defaults_model
 display_name: Train Credit Defaults Model
@@ -555,8 +530,8 @@ Now create and register the component. Registering it allows you to re-use it i
 # importing the Component Package
 from azure.ai.ml import load_component
 
-# Loading the component from the yaml file
-train_component = load_component(source=os.path.join(train_src_dir, "train.yaml"))
+# Loading the component from the yml file
+train_component = load_component(source=os.path.join(train_src_dir, "train.yml"))
 
 # Now we register the component to the workspace
 train_component = ml_client.create_or_update(train_component)
@@ -581,16 +556,15 @@ To code the pipeline, you use a specific `@dsl.pipeline` decorator that identifi
 
 Here, we used *input data*, *split ratio* and *registered model name* as input variables. We then call the components and connect them via their inputs/outputs identifiers. The outputs of each step can be accessed via the `.outputs` property.
 
-> [!NOTE]
-> To use [serverless compute (preview)](./how-to-use-serverless-compute.md), replace `compute=cpu_compute_target` with `compute=azureml:serverless` in this code.
-
-```pythons
+```python
 # the dsl decorator tells the sdk that we are defining an Azure Machine Learning pipeline
 from azure.ai.ml import dsl, Input, Output
 
 
 @dsl.pipeline(
-    compute=cpu_compute_target, # to use serverless compute, change this to: compute=azureml:serverless
+    compute=cpu_compute_target
+    if (cpu_cluster)
+    else "serverless", # "serverless" value runs pipeline on serverless compute
     description="E2E data_perp-train pipeline",
 )
 def credit_defaults_pipeline(
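The hunk stops at the function signature. For context, a sketch of how the decorated pipeline body plausibly continues, chaining the two components registered earlier; parameter names are assumptions based on the tutorial's conventions:

```python
def credit_defaults_pipeline(
    pipeline_job_data_input,
    pipeline_job_test_train_ratio,
    pipeline_job_learning_rate,
    pipeline_job_registered_model_name,
):
    # run the data prep component on the input data asset
    data_prep_job = data_prep_component(
        data=pipeline_job_data_input,
        test_train_ratio=pipeline_job_test_train_ratio,
    )

    # feed the prep outputs into the training component
    train_job = train_component(
        train_data=data_prep_job.outputs.train_data,
        test_data=data_prep_job.outputs.test_data,
        learning_rate=pipeline_job_learning_rate,
        registered_model_name=pipeline_job_registered_model_name,
    )
```

With the conditional `compute` argument above, the same pipeline definition works whether `cpu_cluster` was provisioned or left as `None` for serverless compute.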
