Commit 04e03b4

Merge pull request #104576 from nibaccam/data-ingestion
New concept article | Data ingestion
2 parents 7bb5de5 + 227076a commit 04e03b4

4 files changed

+417
-1
lines changed

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
1+
---
2+
title: Data ingestion options
3+
titleSuffix: Azure Machine Learning
4+
description: Learn about data ingestion options for training your machine learning models.
5+
services: machine-learning
6+
ms.service: machine-learning
7+
ms.subservice: core
8+
ms.topic: conceptual
9+
ms.reviewer: nibaccam
10+
author: nibaccam
11+
ms.author: nibaccam
12+
ms.date: 02/26/2020
13+
14+
---
15+
16+
# Data ingestion in Azure Machine Learning
17+
18+
In this article, you learn the pros and cons of the following data ingestion options available with Azure Machine Learning.
19+
20+
1. [Azure Data Factory](#use-azure-data-factory) pipelines
21+
2. [Azure Machine Learning Python SDK](#use-the-python-sdk)
22+
23+
Data ingestion is the process in which unstructured data is extracted from one or multiple sources and then prepared for training machine learning models. It's also time-intensive, especially if done manually or if you have large amounts of data from multiple sources. Automating this effort frees up resources and ensures your models use the most recent and applicable data.
24+
25+
We recommend that you evaluate using Azure Data Factory (ADF) initially, as it is specifically built to extract, load, and transform data. If you cannot meet your requirements using ADF, you can use the Python SDK to develop a custom code solution, or use ADF and the Python SDK together to create an overall data ingestion workflow that meets your needs.
26+
27+
## Use Azure Data Factory
28+
29+
[Azure Data Factory](https://docs.microsoft.com/azure/data-factory/introduction) offers native support for data source monitoring and triggers for data ingestion pipelines.
30+
31+
The following table summarizes the pros and cons for using Azure Data Factory for your data ingestion workflows.
32+
33+
|Pros|Cons
34+
---|---
35+
Specifically built to extract, load, and transform data.|Currently offers a limited set of Azure Data Factory pipeline tasks
36+
Allows you to create data-driven workflows for orchestrating data movement and transformations at scale.|Expensive to construct and maintain. See Azure Data Factory's [pricing page](https://azure.microsoft.com/pricing/details/data-factory/data-pipeline/) for more information.
37+
Integrated with various Azure tools like [Azure Databricks](https://docs.microsoft.com/azure/data-factory/transform-data-using-databricks-notebook) and [Azure Functions](https://docs.microsoft.com/azure/data-factory/control-flow-azure-function-activity) | Doesn't natively run scripts; instead, it relies on separate compute for script runs
38+
Natively supports data source triggered data ingestion|
39+
Data preparation and model training processes are separate.|
40+
Embedded data lineage capability for Azure Data Factory dataflows|
41+
Provides a low code experience [user interface](https://docs.microsoft.com/azure/data-factory/quickstart-create-data-factory-portal) for non-scripting approaches |
42+
43+
These steps and the following diagram illustrate Azure Data Factory's data ingestion workflow.
44+
45+
1. Pull the data from its sources
46+
1. Transform and save the data to an output blob container, which serves as data storage for Azure Machine Learning
47+
1. With prepared data stored, the Azure Data Factory pipeline invokes a training Machine Learning pipeline that receives the prepared data for model training
48+
49+
50+
![ADF Data ingestion](media/concept-data-ingestion/data-ingest-option-one.svg)
51+
52+
## Use the Python SDK
53+
54+
With the [Python SDK](https://docs.microsoft.com/python/api/overview/azureml-sdk/?view=azure-ml-py), you can incorporate data ingestion tasks into an [Azure Machine Learning pipeline](how-to-create-your-first-pipeline.md) step.
55+
56+
The following table summarizes the pros and cons of using the SDK and an ML pipeline step for data ingestion tasks.
57+
58+
Pros| Cons
59+
---|---
60+
Configure your own Python scripts | Does not natively support data source change triggering. Requires Logic App or Azure Function implementations
61+
Data preparation as part of every model training execution|Requires development skills to create a data ingestion script
62+
Supports data preparation scripts on various compute targets, including [Azure Machine Learning compute](concept-compute-target.md#azure-machine-learning-compute-managed) |Does not provide a user interface for creating the ingestion mechanism
63+
64+
In the following diagram, the Azure Machine Learning pipeline consists of two steps: data ingestion and model training. The data ingestion step encompasses tasks that can be accomplished using Python libraries and the Python SDK, such as extracting data from local/web sources, and basic data transformations, like missing value imputation. The training step then uses the prepared data as input to your training script to train your machine learning model.
65+
66+
![Azure pipeline + SDK data ingestion](media/concept-data-ingestion/data-ingest-option-two.png)
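
As a rough illustration of the two steps described above, the following sketch uses only the Python standard library, not the Azure Machine Learning SDK. The function names `ingest` and `train`, the sample CSV, and the column names are all hypothetical. The ingestion function fills missing values with the column mean, and the "training" function is a placeholder that stands in for a real training script.

```python
import csv
import io
from statistics import mean
from typing import Dict, List

# Hypothetical raw data with two missing values (empty fields).
RAW_CSV = "temperature,load\n20.0,100\n,120\n24.0,\n"


def ingest(raw_csv: str) -> List[Dict[str, float]]:
    """Data ingestion step: parse CSV text and impute missing values.

    Assumes every column is numeric. Missing values are replaced
    with the mean of the observed values in the same column.
    """
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    if not rows:
        return []
    prepared: List[Dict[str, float]] = [{} for _ in rows]
    for col in rows[0].keys():
        observed = [float(r[col]) for r in rows if r[col] not in ("", None)]
        fill = mean(observed) if observed else 0.0
        for src, dst in zip(rows, prepared):
            dst[col] = float(src[col]) if src[col] not in ("", None) else fill
    return prepared


def train(prepared: List[Dict[str, float]], target: str) -> float:
    """Training step placeholder: return the mean of the target column.

    A real training script would fit a model here instead.
    """
    return mean(row[target] for row in prepared)


# Chain the two steps: the output of ingestion is the input to training.
prepared_rows = ingest(RAW_CSV)
baseline = train(prepared_rows, target="load")
print(prepared_rows)
print(baseline)
```

In an actual Azure Machine Learning pipeline, each function would typically live in its own script and run as a separate pipeline step, with the prepared data passed between steps through pipeline storage rather than in memory.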
67+
68+
## Next steps
69+
70+
* Learn how to automate and manage the development life cycles of your data ingestion pipelines with [Azure Pipelines](how-to-cicd-data-ingestion.md).
