---
title: Data ingestion options
titleSuffix: Azure Machine Learning
description: Learn about data ingestion options for training your machine learning models.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.reviewer: nibaccam
author: nibaccam
ms.author: nibaccam
ms.date: 02/26/2020

---

# Data ingestion in Azure Machine Learning

In this article, you learn the pros and cons of the following data ingestion options available with Azure Machine Learning:

1. [Azure Data Factory](#use-azure-data-factory) pipelines
2. [Azure Machine Learning Python SDK](#use-the-python-sdk)

Data ingestion is the process of extracting unstructured data from one or more sources and preparing it for training machine learning models. The process is time intensive, especially if done manually and if you have large amounts of data from multiple sources. Automating this effort frees up resources and ensures your models use the most recent and applicable data.

We recommend that you evaluate Azure Data Factory (ADF) first, because it's specifically built to extract, load, and transform data. If ADF alone can't meet your requirements, you can use the Python SDK to develop a custom code solution, or combine ADF with the Python SDK into an overall data ingestion workflow that meets your needs.

## Use Azure Data Factory

[Azure Data Factory](https://docs.microsoft.com/azure/data-factory/introduction) offers native support for data source monitoring and triggers for data ingestion pipelines.

The following table summarizes the pros and cons of using Azure Data Factory for your data ingestion workflows.

Pros|Cons
---|---
Specifically built to extract, load, and transform data.|Currently offers a limited set of Azure Data Factory pipeline tasks
Allows you to create data-driven workflows for orchestrating data movement and transformations at scale.|Expensive to construct and maintain. See Azure Data Factory's [pricing page](https://azure.microsoft.com/pricing/details/data-factory/data-pipeline/) for more information.
Integrated with various Azure tools like [Azure Databricks](https://docs.microsoft.com/azure/data-factory/transform-data-using-databricks-notebook) and [Azure Functions](https://docs.microsoft.com/azure/data-factory/control-flow-azure-function-activity)|Doesn't natively run scripts; instead, it relies on separate compute for script runs
Natively supports data ingestion triggered by changes in the data source|
Data preparation and model training processes are separate.|
Embedded data lineage capability for Azure Data Factory dataflows|
Provides a low-code [user interface](https://docs.microsoft.com/azure/data-factory/quickstart-create-data-factory-portal) for non-scripting approaches|

These steps and the following diagram illustrate Azure Data Factory's data ingestion workflow.

1. Pull the data from its sources.
1. Transform and save the data to an output blob container, which serves as data storage for Azure Machine Learning (a sketch of registering this container follows the diagram).
1. With the prepared data stored, the Azure Data Factory pipeline invokes a training Azure Machine Learning pipeline that receives the prepared data for model training.

*Diagram: an Azure Data Factory pipeline pulls and prepares the data, stores it in blob storage, and then invokes an Azure Machine Learning training pipeline.*
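
Once the prepared data lands in the output blob container, you can make it visible to Azure Machine Learning by registering the container as a datastore and referencing the prepared files as a dataset. The following is a minimal sketch, assuming the Azure Machine Learning Python SDK v1 (`azureml-core`); the storage account, container, key, path, and dataset names are placeholders for your own values.

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()

# Register the blob container that the Azure Data Factory pipeline writes prepared data to.
# The account, container, and key values are placeholders.
adf_output = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="adf_prepared_data",
    container_name="prepared-data",
    account_name="<storage-account-name>",
    account_key="<storage-account-key>",
)

# Reference the prepared files as a tabular dataset and register it for training runs.
prepared_ds = Dataset.Tabular.from_delimited_files(path=(adf_output, "training/*.csv"))
prepared_ds.register(workspace=ws, name="prepared-training-data", create_new_version=True)
```

Registering the dataset lets the training pipeline retrieve the prepared data by name instead of hard-coding storage paths.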

## Use the Python SDK

With the [Python SDK](https://docs.microsoft.com/python/api/overview/azureml-sdk/?view=azure-ml-py), you can incorporate data ingestion tasks into an [Azure Machine Learning pipeline](how-to-create-your-first-pipeline.md) step.
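
For example, the ingestion work can be wrapped in a `PythonScriptStep` that runs your preparation script and hands its output to later steps. This is a minimal sketch, assuming SDK v1 (`azureml-core`, `azureml-pipeline-*`); the compute cluster name, `scripts` folder, `prepare.py` script, and experiment name are hypothetical.

```python
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
compute_target = ComputeTarget(workspace=ws, name="cpu-cluster")  # assumed existing cluster
run_config = RunConfiguration()  # customize dependencies/environment as needed

# Intermediate location that passes the prepared data on to the next pipeline step.
prepared_data = PipelineData("prepared_data", datastore=ws.get_default_datastore())

ingest_step = PythonScriptStep(
    name="ingest data",
    source_directory="scripts",     # hypothetical folder containing prepare.py
    script_name="prepare.py",       # hypothetical ingestion script (sketched later in this article)
    arguments=["--output-dir", prepared_data],
    outputs=[prepared_data],
    compute_target=compute_target,
    runconfig=run_config,
)

# A training step would consume prepared_data as an input; here the pipeline has just the ingestion step.
pipeline = Pipeline(workspace=ws, steps=[ingest_step])
Experiment(ws, "data-ingestion-demo").submit(pipeline)
```

Because the ingestion runs as a pipeline step, the same preparation logic executes before every training run, on whichever compute target you choose.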

The following table summarizes the pros and cons of using the SDK and an ML pipeline step for data ingestion tasks.

Pros|Cons
---|---
Configure your own Python scripts|Doesn't natively support triggering on data source changes; requires an Azure Logic App or Azure Function implementation
Data preparation as part of every model training execution|Requires development skills to create a data ingestion script
Supports data preparation scripts on various compute targets, including [Azure Machine Learning compute](concept-compute-target.md#azure-machine-learning-compute-managed)|Doesn't provide a user interface for creating the ingestion mechanism

In the following diagram, the Azure Machine Learning pipeline consists of two steps: data ingestion and model training. The data ingestion step covers tasks that can be accomplished with Python libraries and the Python SDK, such as extracting data from local or web sources and basic data transformations like missing-value imputation. The training step then uses the prepared data as input to your training script to train your machine learning model. A sketch of such an ingestion script follows the diagram.

*Diagram: an Azure Machine Learning pipeline with a data ingestion step followed by a model training step.*
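
The ingestion script itself is ordinary Python. The following hypothetical `prepare.py` sketches the tasks described above: it pulls a file from a web source, imputes missing values with pandas, and writes the prepared data to the output location the pipeline passes in; a second `PythonScriptStep` would then consume that output as the training step's input. The source URL, file names, and imputation strategy are placeholders.

```python
# prepare.py - hypothetical ingestion script run by the pipeline's ingestion step
import argparse
import os

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--source-url", default="https://example.com/raw-data.csv")  # placeholder source
parser.add_argument("--output-dir", required=True)  # supplied by the pipeline (PipelineData path)
args = parser.parse_args()

# Extract: pull the raw data from a local or web source.
raw = pd.read_csv(args.source_url)

# Transform: basic preparation, here a simple missing-value imputation on numeric columns.
prepared = raw.fillna(raw.median(numeric_only=True))

# Load: write the prepared data where the training step can read it.
os.makedirs(args.output_dir, exist_ok=True)
prepared.to_csv(os.path.join(args.output_dir, "prepared.csv"), index=False)
```

In the training step, your training script would read `prepared.csv` from the same `prepared_data` location and fit the model.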

## Next steps

* Learn how to automate and manage the development life cycles of your data ingestion pipelines with [Azure Pipelines](how-to-cicd-data-ingestion.md).