Commit 9876132

committed
Merge branch 'master' of https://github.com/Microsoft/azure-docs-pr into lbnovupdates
2 parents 340d8b4 + 94dbaac commit 9876132

2 files changed

+89
-3
lines changed

articles/machine-learning/service/how-to-data-prep.md

Lines changed: 89 additions & 3 deletions
@@ -14,13 +14,97 @@ ms.date: 09/24/2018
1414

1515
# Prepare data for modeling with Azure Machine Learning
1616

17-
Data preparation is an important part of a machine learning workflow. Your models will be more accurate and efficient if they have access to clean data in a format that is easier to consume.
17+
In this article, you learn about the use cases and unique features of the Azure Machine Learning Data Prep SDK. Data preparation is the most important part of a machine learning workflow. Real-world data is often broken, inconsistent, or unusable as training data without significant cleansing and transformation. Correcting errors and anomalies in raw data, and building new features that are relevant to the problem you're trying to solve, can increase model accuracy.
1818

19-
You can prepare your data in Python using the [Azure Machine Learning Data Prep SDK](https://docs.microsoft.com/python/api/overview/azure/dataprep?view=azure-dataprep-py).
19+
You can prepare your data in Python using the [Azure Machine Learning Data Prep SDK](https://docs.microsoft.com/python/api/overview/azure/dataprep?view=azure-dataprep-py).
20+
21+
## Azure Machine Learning Data Prep SDK
22+
23+
The Azure Machine Learning Data Prep SDK is a Python library that includes many common data preprocessing tools. It also adds advanced functionality such as automated feature engineering and transformations derived from examples. The SDK is similar in core functionality to popular libraries such as Pandas and PySpark, yet offers more flexibility. Pandas is typically most useful on smaller data sets (under roughly 2-5 GB), before memory constraints affect performance. In contrast, PySpark is generally used for big-data applications but carries an overhead that makes working with small data sets much slower.
24+
25+
The SDK offers:
26+
27+
- Practicality and convenience when working with small data sets
28+
- Scalability for modern big-data applications
29+
- The ability to use and scale the same code for both use cases
30+
31+
The following examples highlight some of the unique functionality of the SDK.
32+
33+
### Install the SDK
34+
35+
Install the SDK in your Python environment using the following command.
36+
37+
```shell
pip install azureml-dataprep
```
40+
41+
Use the following code to import the package.
42+
43+
```python
import azureml.dataprep as dprep
```
46+
47+
### Automatic file type detection
48+
49+
Use the `smart_read_file()` function to load your data without specifying the file type. The function automatically recognizes the file type and parses the file accordingly.
50+
51+
```python
dataflow = dprep.smart_read_file(path="<your-file-path>")
```
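
How `smart_read_file()` detects the file type is not described here. As a rough illustration of content-based detection in general, the following standard-library sketch guesses whether a text sample is JSON or delimited text. The `guess_format` helper and the list of candidate delimiters are hypothetical and are not part of the SDK.

```python
import csv
import json


def guess_format(text):
    """Guess whether a text sample is JSON or delimited text.

    Hypothetical helper for illustration only; not part of azureml-dataprep.
    Returns a (format_name, delimiter) tuple.
    """
    stripped = text.strip()
    # Try JSON first: a sample that parses as an object or array is JSON.
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return ("json", None)
        except ValueError:
            pass
    # Otherwise ask csv.Sniffer to pick one of a few candidate delimiters.
    try:
        dialect = csv.Sniffer().sniff(stripped, delimiters=",;\t|")
    except csv.Error:
        return ("unknown", None)
    return ("delimited", dialect.delimiter)


sample = "id;name;score\n1;ann;3.5\n2;bob;4.0\n"
print(guess_format(sample))
print(guess_format('[{"id": 1}, {"id": 2}]'))
```

A real detector would also need to handle encodings, binary formats, and ambiguous samples, which this sketch ignores.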
54+
55+
### Automated feature engineering
56+
57+
Use the SDK to split and derive columns by both example and inference to automate feature engineering. Assume you have a field in your dataflow object called `datetime` with a value of `2018-09-15 14:30:00`.
58+
59+
To automatically split the `datetime` field, call the following function.
60+
61+
```python
new_dataflow = dataflow.split_column_by_example(source_column="datetime")
```
64+
65+
When you don't define the example parameter, the function automatically splits the `datetime` field into two new fields, `datetime_1` and `datetime_2`, with the values `2018-09-15` and `14:30:00`, respectively. You can also provide example pairs, and the SDK will predict and execute your intended transformation. Using the same `datetime` field, the following code creates a new column, `datetime_weekday`, that holds the weekday inferred from the provided examples.
66+
67+
```python
new_dataflow = dataflow.derive_column_by_example(
    source_columns="datetime",
    new_column_name="datetime_weekday",
    example_data=[("2009-01-04 10:12:00", "Sunday"), ("2013-08-22 17:00:00", "Thursday")]
)
```
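
To make the intended results concrete, the following plain-Python sketch computes the same values by hand using only the standard library. It does not use the SDK or reproduce its inference. The helper names `split_datetime` and `weekday_name` are hypothetical, and the input format is assumed to be `YYYY-MM-DD HH:MM:SS`.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def split_datetime(value):
    """Split 'YYYY-MM-DD HH:MM:SS' into its date and time parts."""
    date_part, time_part = value.split(" ", 1)
    return date_part, time_part


def weekday_name(value):
    """Return the English weekday name for a 'YYYY-MM-DD HH:MM:SS' string."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return WEEKDAYS[parsed.weekday()]


print(split_datetime("2018-09-15 14:30:00"))  # ('2018-09-15', '14:30:00')
print(weekday_name("2009-01-04 10:12:00"))    # Sunday
print(weekday_name("2013-08-22 17:00:00"))    # Thursday
```

The two weekday calls reproduce the example pairs passed to `derive_column_by_example()`, which is a quick way to check that the examples themselves are consistent.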
74+
75+
### Summary statistics
76+
77+
You can generate quick summary statistics for a dataflow with one line of code. This method offers a convenient way to understand your data and how it's distributed.
78+
79+
```python
dataflow.get_profile()
```
82+
83+
Calling this method on a dataflow object produces output like the following table.
84+
85+
![Summary Statistics Output](./media/concept-data-preparation/output-example.png)
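
The exact columns that `get_profile()` reports are only shown in the image above. As a rough idea of what per-column summary statistics involve, here is a minimal standard-library sketch. The `profile_column` helper and its choice of statistics are assumptions for illustration and are not the SDK's output format.

```python
import statistics


def profile_column(values):
    """Summarize one numeric column; missing values are given as None.

    Hypothetical helper for illustration only; not part of azureml-dataprep.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


print(profile_column([3.0, None, 1.0, 4.0, 2.0]))
```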
86+
87+
## Multi-environment compatibility
88+
89+
The SDK also lets you serialize dataflow objects and open them in *any* Python environment. The environment where a dataflow is opened can differ from the environment where it was saved. This makes it easy to move work between Python environments and to integrate quickly with Azure Machine Learning models.
90+
91+
Use the following code to save your dataflow objects.
92+
93+
```python
package = dprep.Package([dataflow_1, dataflow_2])
package.save("<your-local-path>")
```
97+
98+
Use the following code to reopen your package in any environment and retrieve a list of dataflow objects.
99+
100+
```python
package = dprep.Package.open("<your-local-path>")
dataflow_list = package.dataflows
```
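
The on-disk format that `Package.save()` uses is not described here. The general save-then-reopen idea can still be sketched with the standard library, assuming (purely for illustration) that a dataflow is represented as an ordered list of step descriptions stored as JSON. The `steps` structure and both helper functions are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical stand-in for a dataflow: an ordered list of step descriptions.
steps = [
    {"op": "read", "path": "<your-file-path>"},
    {"op": "split_column", "source_column": "datetime"},
]


def save_steps(step_list, path):
    """Write the step list to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(step_list, handle)


def open_steps(path):
    """Read a step list back from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round trip through a temporary folder so the example cleans up after itself.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "steps.json")
    save_steps(steps, target)
    restored = open_steps(target)

print(restored == steps)  # True
```

Storing a description of the steps rather than the data itself is what would make such a file portable between environments; whether the SDK works this way is an assumption of this sketch.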
20104

21105
## Data preparation pipeline
22106

23-
The main data preparation steps are:
107+
To see detailed examples and code for each preparation step, use the following how-to guides:
24108

25109
1. [Load data](how-to-load-data.md), which can be in various formats
26110
2. [Transform](how-to-transform-data.md) it into a more usable structure
@@ -30,3 +114,5 @@ The main data preparation steps are:
30114

31115
## Next steps
32116
Review an [example notebook](https://github.com/Microsoft/AMLDataPrepDocs/tree/master/tutorials/getting-started/getting-started.ipynb) of data preparation using the Azure Machine Learning Data Prep SDK.
117+
118+
For API details, see the Azure Machine Learning Data Prep SDK [reference documentation](https://docs.microsoft.com/python/api/overview/azure/dataprep/intro?view=azure-dataprep-py).