# Extract, transform, and load (ETL) at scale
Extract, transform, and load (ETL) is the process by which data is acquired from various sources. The data is collected in a standard location, cleaned, and processed. Ultimately, the data is loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With Azure HDInsight, a wide variety of Apache Hadoop environment components support ETL at scale.

The use of HDInsight in the ETL process can be summarized by this pipeline:

The following sections explore each of the ETL phases and their associated components.

## Orchestration
Orchestration spans all phases of the ETL pipeline. ETL jobs in HDInsight often involve several different products working in conjunction with each other. For example:

- You might use Apache Hive to clean a portion of the data, and Apache Pig to clean another portion.
- You might use Azure Data Factory to load data into Azure SQL Database from Azure Data Lake Store.

Orchestration is needed to run the appropriate job at the appropriate time.
### Apache Oozie
Apache Oozie is a workflow coordination system that manages Hadoop jobs. Oozie runs within an HDInsight cluster and is integrated with the Hadoop stack. Oozie supports Hadoop jobs for Apache Hadoop MapReduce, Pig, Hive, and Sqoop. You can use Oozie to schedule jobs that are specific to a system, such as Java programs or shell scripts.

For more information, see [Use Apache Oozie with Apache Hadoop to define and run a workflow on HDInsight](../hdinsight-use-oozie-linux-mac.md). See also [Operationalize the Data Pipeline](../hdinsight-operationalize-data-pipeline.md).

### Azure Data Factory
Azure Data Factory provides orchestration capabilities in the form of platform as a service (PaaS). It's a cloud-based data integration service that allows you to create data-driven workflows for orchestrating and automating data movement and data transformation.

Use Azure Data Factory to:

1. Create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
1. Process and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, Azure Batch, and Azure Machine Learning.
1. Publish output data to data stores, such as Azure SQL Data Warehouse, for business intelligence (BI) applications to consume.

For more information on Azure Data Factory, see the [documentation](../../data-factory/introduction.md).
## Ingest file storage and result storage
Source data files are typically loaded into a location in Azure Storage or Azure Data Lake Storage. The files are usually in a flat format, like CSV, but they can be in any format.

### Azure Storage
Azure Storage has specific scalability targets. For more information, see [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md). For most analytic nodes, Azure Storage scales best when dealing with many smaller files. As long as you're within your account limits, Azure Storage guarantees the same performance no matter how large the files are. You can store terabytes of data and still get consistent performance, whether you're using a subset of the data or all of it.

Azure Storage has several different types of blobs. An *append blob* is a great option for storing web logs or sensor data.

Multiple blobs can be distributed across many servers to scale out access to them. But a single blob can only be served by a single server. While blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.

Azure Storage has a WebHDFS API layer for the blob storage. All HDInsight services can access files in Azure Blob storage for data cleaning and data processing, similar to how those services would use Hadoop Distributed File System (HDFS).

Data is typically ingested into Azure Storage by using PowerShell, the Azure Storage SDK, or AzCopy.

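As a minimal sketch of an SDK-based upload, the following Python snippet assumes the `azure-storage-blob` package; the connection string, container name, and file name are placeholders, not values from this article.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder values - replace with your own storage account details.
CONNECTION_STRING = "<storage-account-connection-string>"
CONTAINER_NAME = "ingest"
LOCAL_FILE = "sales-2020-04.csv"

# Connect to the storage account and get a client for the target container.
service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER_NAME)

# Upload the local file as a block blob; overwrite it if it already exists.
with open(LOCAL_FILE, "rb") as data:
    container.upload_blob(name=LOCAL_FILE, data=data, overwrite=True)
```

For bulk or scripted transfers of many files, AzCopy or PowerShell is usually a better fit than per-file SDK calls.
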
### Azure Data Lake Storage
Azure Data Lake Storage is a managed, hyperscale repository for analytics data. It's compatible with, and uses a design paradigm similar to, HDFS. Data Lake Storage offers unlimited scalability for total capacity and the size of individual files. It's a good choice when working with large files, because they can be stored across multiple nodes. Partitioning data in Data Lake Storage is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.

Data is usually ingested into Data Lake Storage by using Azure Data Factory. The Data Lake Storage SDKs, the AdlCopy service, Apache DistCp, or Apache Sqoop can also be used. The service you choose depends on where the data is located. If it's in an existing Hadoop cluster, you might use Apache DistCp, the AdlCopy service, or Azure Data Factory. For data in Azure Blob storage, you might use the Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.

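As an illustration of the SDK path, the following sketch assumes a Data Lake Storage Gen2 account and the `azure-storage-filedatalake` Python package; the account URL, file system, and file paths are placeholders.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder values - replace with your own account, file system, and paths.
ACCOUNT_URL = "https://<account-name>.dfs.core.windows.net"
CREDENTIAL = "<account-key>"

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=CREDENTIAL)
file_system = service.get_file_system_client("raw")

# Create (or overwrite) a file under a dated folder and upload local data to it.
file_client = file_system.get_file_client("logs/2020/04/28/events.csv")
with open("events.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```
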
Data Lake Storage is also optimized for event ingestion by using Azure Event Hubs or Apache Storm.

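As an illustration of the event-ingestion path, the following sketch assumes the `azure-eventhub` Python package and a hypothetical `sensor-readings` event hub; landing the captured events in Data Lake Storage is configured separately (for example, with Event Hubs Capture or a downstream job).

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder values - replace with your Event Hubs namespace details.
CONNECTION_STRING = "<event-hubs-namespace-connection-string>"
EVENT_HUB_NAME = "sensor-readings"

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STRING, eventhub_name=EVENT_HUB_NAME
)

# Batch a few sensor readings and send them in a single call.
with producer:
    batch = producer.create_batch()
    for reading in ('{"device": "d1", "temp": 21.7}', '{"device": "d2", "temp": 19.4}'):
        batch.add(EventData(reading))
    producer.send_batch(batch)
```
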
#### Considerations for using Azure Storage and Azure Data Lake Storage

For uploading datasets in the terabyte range, network latency can be a major problem, particularly if the data is coming from an on-premises location. In such cases, you can use these options:

- **Azure ExpressRoute:** Create private connections between Azure datacenters and your on-premises infrastructure. These connections provide a reliable option for transferring large amounts of data. For more information, see the [Azure ExpressRoute documentation](../../expressroute/expressroute-introduction.md).

- **Data upload from hard disk drives:** You can use the [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure datacenter. Your data is first uploaded to Azure Blob storage. You can then use Azure Data Factory or the AdlCopy tool to copy data from Azure Blob storage to Data Lake Storage.

### Azure SQL Data Warehouse
Azure SQL Data Warehouse is a great choice for storing prepared results. Azure HDInsight can be used to prepare those results for SQL Data Warehouse.

Azure SQL Data Warehouse is a relational database store optimized for analytic workloads. It scales based on partitioned tables. Tables can be partitioned across multiple nodes. The nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see [SQL Data Warehouse - Manage Compute](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md).

### Apache HBase
Apache HBase is a key-value store available in Azure HDInsight. It's an open-source, NoSQL database that's built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semi-structured data.

Because HBase is a schemaless database, columns and data types don't need to be defined before using them. Data is stored in the rows of a table, and is grouped by column family.

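A minimal sketch of this data model, assuming the cluster's HBase Thrift server is running and reachable and the `happybase` Python package is installed; the `sensor_readings` table, `d` column family, and row keys are hypothetical.

```python
import happybase

# Placeholder host - the HBase Thrift server must be reachable from this machine.
connection = happybase.Connection("<hbase-thrift-host>")

# Create a table with a single column family for sensor readings (run once).
if b"sensor_readings" not in connection.tables():
    connection.create_table("sensor_readings", {"d": dict()})

table = connection.table("sensor_readings")

# Row keys carry the device ID and timestamp; columns live under the "d" family.
table.put(b"device42-20200428T120000", {b"d:temp": b"21.7", b"d:humidity": b"40"})

# Random access by row key is one of HBase's strengths.
print(table.row(b"device42-20200428T120000"))
```
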
The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop environment.

HBase is an excellent destination for sensor and log data for future analysis.

HBase scalability depends on the number of nodes in the HDInsight cluster.

### Azure SQL Database and Azure Database for MySQL and PostgreSQL
Azure offers three PaaS relational databases:

- [Azure SQL Database](../../sql-database/sql-database-technical-overview.md) is an implementation of Microsoft SQL Server. For more information on performance, see [Tuning performance in Azure SQL Database](../../sql-database/sql-database-performance-guidance.md).
- [Azure Database for MySQL](../../mysql/overview.md) is an implementation of Oracle MySQL.
- [Azure Database for PostgreSQL](../../postgresql/quickstart-create-server-database-portal.md) is an implementation of PostgreSQL.

These products scale up by adding more CPU and memory. You can also choose to use premium disks with the products for better I/O performance.

## Azure Analysis Services
Azure Analysis Services is an analytical data engine used in decision support and business analytics. It provides the analytical data for business reports and client applications such as Power BI. Excel, SQL Server Reporting Services reports, and other data visualization tools can also use the data provided by Azure Analysis Services.

Analysis cubes can scale by changing tiers for each individual cube. For more information, see [Azure Analysis Services pricing](https://azure.microsoft.com/pricing/details/analysis-services/).

## Extract and load
Once the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.

### Apache Sqoop
Apache Sqoop is a tool for efficiently transferring data between Hadoop and structured data stores such as relational databases. Sqoop uses MapReduce to import and export the data, to provide parallel operation and fault tolerance.

### Apache Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its flexible architecture is based on streaming data flows. Flume is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. Flume uses a simple, extensible data model that allows for online analytic applications.

Apache Flume can't be used with Azure HDInsight. But an on-premises Hadoop installation can use Flume to send data to either Azure Blob storage or Azure Data Lake Storage. For more information, see [Using Apache Flume with HDInsight](https://web.archive.org/web/20190217104751/https://blogs.msdn.microsoft.com/bigdatasupport/2014/03/18/using-apache-flume-with-hdinsight/).

## Transform
Once data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.

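As a minimal Spark SQL sketch of such a cleanup step on an HDInsight Spark cluster, the following PySpark snippet reads raw CSV data from cluster storage, cleans it with SQL, and writes the prepared result back; the storage paths, view name, and column names are illustrative only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-transform").getOrCreate()

# Illustrative path - on HDInsight, the cluster's default storage is addressable
# with a wasbs:// (Azure Storage) or abfs:// (Data Lake Storage Gen2) URI.
raw = spark.read.csv(
    "wasbs://ingest@<account-name>.blob.core.windows.net/sales-2020-04.csv",
    header=True,
    inferSchema=True,
)

# Register the raw data as a temporary view so it can be cleaned with SQL.
raw.createOrReplaceTempView("raw_sales")

cleaned = spark.sql("""
    SELECT TRIM(customer_id) AS customer_id,
           CAST(amount AS DOUBLE) AS amount,
           TO_DATE(order_date) AS order_date
    FROM raw_sales
    WHERE amount IS NOT NULL
""")

# Write the prepared result back to storage for downstream loading.
cleaned.write.mode("overwrite").parquet(
    "wasbs://curated@<account-name>.blob.core.windows.net/sales-2020-04/"
)
```

The same cleanup could be expressed as a Hive or Pig script; Spark SQL is shown here only as one of the supported options.
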
## Next steps
- [Using Apache Hive as an ETL tool](apache-hadoop-using-apache-hive-as-an-etl-tool.md)
- [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](../hdinsight-hadoop-use-data-lake-storage-gen2.md)
- [Move data from Azure SQL Database to an Apache Hive table](./apache-hadoop-use-sqoop-mac-linux.md)