articles/hdinsight/hadoop/apache-hadoop-etl-at-scale.md
Lines changed: 24 additions & 24 deletions
@@ -1,5 +1,5 @@
1
1
---
2
-
title: Extract, transform, and load (ETL) at Scale - Azure HDInsight
2
+
title: Extract, transform, and load (ETL) at scale - Azure HDInsight
3
3
description: Learn how extract, transform, and load is used in HDInsight with Apache Hadoop.
4
4
author: ashishthaps
5
5
ms.author: ashishth
@@ -14,7 +14,7 @@ ms.date: 04/28/2020
14
14
15
15
Extract, transform, and load (ETL) is the process by which data is acquired from various sources. The data is collected in a standard location, cleaned, and processed. Ultimately, the data is loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With Azure HDInsight, a wide variety of Apache Hadoop environment components support ETL at scale.
16
16
17
-
The use of HDInsight in the ETL process can be summarized by this pipeline:
17
+
The use of HDInsight in the ETL process is summarized by this pipeline:
18
18
19
19

20
20
@@ -55,67 +55,67 @@ Source data files are typically loaded into a location on Azure Storage or Azure
55
55
56
56
Azure Storage has specific scalability targets. See [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md) for more information. For most analytic nodes, Azure Storage scales best when dealing with many smaller files. As long as you're within your account limits, Azure Storage guarantees the same performance, no matter how large the files are. You can store terabytes of data and still get consistent performance. This statement is true whether you're using a subset or all of the data.
57
57
58
-
Azure Storage has several different types of blobs. An *append blob* is a great option for storing web logs or sensor data.
58
+
Azure Storage has several types of blobs. An *append blob* is a great option for storing web logs or sensor data.
59
59
60
-
Multiple blobs can be distributed across many servers to scale out access to them. But a single blob can only be served by a single server. While blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.
60
+
Multiple blobs can be distributed across many servers to scale out access to them. But a single blob is only served by a single server. Although blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.
61
61
62
-
Azure Storage has a WebHDFS API layer for the blob storage. All HDInsight services can access files in Azure Blob storage for data cleaning and data processing. This is similar to how those services would use a Hadoop Distributed Files System (HDFS).
62
+
Azure Storage has a WebHDFS API layer for the blob storage. All HDInsight services can access files in Azure Blob storage for data cleaning and data processing. This is similar to how those services would use Hadoop Distributed File System (HDFS).
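To illustrate the HDFS-style access described above: HDInsight components commonly address blobs through a `wasbs://` URI that names the container, the storage account, and the path. The following is a minimal sketch of building such a URI. The helper name `wasbs_uri` and the account, container, and file names are hypothetical placeholders, not part of the article.

```python
from urllib.parse import quote


def wasbs_uri(account: str, container: str, path: str = "") -> str:
    """Build a wasbs:// URI for a blob path.

    The account, container, and path values passed in are placeholders
    chosen for illustration only.
    """
    # Strip leading slashes so the path is relative to the container root,
    # and percent-encode characters that are not safe in a URI path.
    clean_path = quote(path.lstrip("/"))
    return f"wasbs://{container}@{account}.blob.core.windows.net/{clean_path}"


# Hypothetical names: storage account "mystorageaccount", container "rawdata".
print(wasbs_uri("mystorageaccount", "rawdata", "/logs/2020/04/28/web.log"))
```

A URI of this shape can typically be passed wherever a Hive, Pig, or Spark job on the cluster expects an HDFS path.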
63
63
64
-
Data is typically ingested into Azure Storage by using PowerShell, the Azure Storage SDK, or AZCopy.
64
+
Data is typically ingested into Azure Storage through PowerShell, the Azure Storage SDK, or AzCopy.
65
65
66
66
### Azure Data Lake Storage
67
67
68
-
Azure Data Lake Storage is a managed, hyperscale repository for analytics data. It's compatible with, and uses a design paradigm that's similar to HDFS. Data Lake Storage offers unlimited adaptability for total capacity and the size of individual files. It's a good choice when working with large files, because they can be stored across multiple nodes. Partitioning data in Data Lake Storage is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors, that efficiently read and write hundreds of terabytes of data.
68
+
Azure Data Lake Storage is a managed, hyperscale repository for analytics data. It's compatible with HDFS and uses a similar design paradigm. Data Lake Storage offers unlimited scalability for total capacity and the size of individual files. It's a good choice when working with large files, because they can be stored across multiple nodes. Partitioning data in Data Lake Storage is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.
69
69
70
-
Data is usually ingested into Data Lake Storage by using Azure Data Factory. Data Lake Storage SDKs, AdlCopy service, Apache DistCp, or Apache Sqoop can also be used. The service you choose depends on where the data is located. If it's in an existing Hadoop cluster, you might use Apache DistCp, AdlCopy service, or Azure Data Factory. For data in Azure Blob storage, you might use Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
70
+
Data is usually ingested into Data Lake Storage through Azure Data Factory. You can also use Data Lake Storage SDKs, the AdlCopy service, Apache DistCp, or Apache Sqoop. The service you choose depends on where the data is. If it's in an existing Hadoop cluster, you might use Apache DistCp, the AdlCopy service, or Azure Data Factory. For data in Azure Blob storage, you might use Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
71
71
72
-
Data Lake Storage is optimized for event ingestion by using Azure Event Hub or Apache Storm.
72
+
Data Lake Storage is optimized for event ingestion through Azure Event Hubs or Apache Storm.
73
73
74
-
####Considerations for using Azure Storage and Azure Data Lake Storage
74
+
### Considerations for both storage options
75
75
76
-
For uploading datasets in the terabyte range, network latency can be a major problem. This is particularly true if the data is coming from an on-premises location. In such cases, you can use the options below:
76
+
For uploading datasets in the terabyte range, network latency can be a major problem. This is particularly true if the data is coming from an on-premises location. In such cases, you can use these options:
77
77
78
78
- **Azure ExpressRoute:** Create private connections between Azure datacenters and your on-premises infrastructure. These connections provide a reliable option for transferring large amounts of data. For more information, see [Azure ExpressRoute documentation](../../expressroute/expressroute-introduction.md).
79
79
80
-
-**Data upload from hard disk drives:** You can use [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure data center. Your data is first uploaded to Azure Blob storage. You can then use Azure Data Factory or the AdlCopy tool to copy data from Azure Blob storage to Data Lake Storage.
80
+
- **Data upload from hard disk drives:** You can use [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure datacenter. Your data is first uploaded to Azure Blob storage. You can then use Azure Data Factory or the AdlCopy tool to copy data from Azure Blob storage to Data Lake Storage.
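As a rough way to decide between a network upload and shipping drives, the sketch below estimates how long a bandwidth-bound transfer would take. It is a back-of-the-envelope calculation only. The function name `transfer_days`, the default 80% link efficiency, and the example sizes are assumptions for illustration, not figures from the article.

```python
def transfer_days(size_tb: float, bandwidth_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the days needed to upload `size_tb` terabytes.

    bandwidth_mbps is the nominal link speed in megabits per second.
    efficiency is an assumed fraction of that speed actually achieved.
    """
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    total_bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    effective_bps = bandwidth_mbps * 1e6 * efficiency    # usable bits per second
    return total_bits / effective_bps / 86_400           # seconds to days


# Hypothetical scenario: 10 TB over a 100 Mbps link and over a 1 Gbps link.
for mbps in (100, 1000):
    print(f"{mbps} Mbps: {transfer_days(10, mbps):.1f} days")
```

If the estimate runs to many days, a dedicated connection or the drive-shipping option listed above is probably worth evaluating.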
81
81
82
82
### Azure SQL Data Warehouse
83
83
84
-
Azure SQL Data Warehouse is a great choice to store prepared results. Azure HDInsight can be used to perform those services for SQL Data Warehouse.
84
+
Azure SQL Data Warehouse is an appropriate choice to store prepared results. You can use Azure HDInsight to prepare that data for SQL Data Warehouse.
85
85
86
-
Azure SQL Data Warehouse is a relational database store optimized for analytic workloads. It scales based on partitioned tables. Tables can be partitioned across multiple nodes. The nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see [SQL Data Warehouse - Manage Compute](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md).
86
+
Azure SQL Data Warehouse is a relational database store optimized for analytic workloads. It scales based on partitioned tables. Tables can be partitioned across multiple nodes. The nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see [Manage compute in SQL Data Warehouse](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md).
87
87
88
88
### Apache HBase
89
89
90
-
Apache HBase is a key-value store available in Azure HDInsight. It's an open-source, NoSQL database that's built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semi-structured data.
90
+
Apache HBase is a key/value store available in Azure HDInsight. It's an open-source, NoSQL database that's built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semi-structured data.
91
91
92
-
Because HBase is a schemaless database, columns and data types don't need to be defined before using them. Data is stored in the rows of a table, and is grouped by column family.
92
+
Because HBase is a schemaless database, you don't need to define columns and data types before you use them. Data is stored in the rows of a table, and is grouped by column family.
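The row and column-family layout can be pictured with a small in-memory model. This is a conceptual sketch in plain Python, not the HBase client API. The class `SketchTable` and the sensor data are hypothetical. It assumes the usual HBase convention that column families are declared when the table is created, while column qualifiers inside a family can be added freely.

```python
class SketchTable:
    """Toy model of an HBase-style table: row key -> family -> qualifier -> value."""

    def __init__(self, families):
        # Column families are fixed up front; qualifiers are not.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # New qualifiers need no prior declaration.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family):
        # Return a copy of all qualifier/value pairs in one family of one row.
        return dict(self.rows.get(row_key, {}).get(family, {}))


table = SketchTable(families=["reading", "meta"])
table.put("sensor-001", "reading", "temperature", "21.5")
table.put("sensor-001", "reading", "humidity", "40")
table.put("sensor-002", "meta", "firmware", "1.2.0")  # different columns per row

print(table.get("sensor-001", "reading"))
print(table.get("sensor-002", "meta"))
```

Each row carries only the columns it actually uses, which is why sparse sensor and log data maps well onto this model.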
93
93
94
-
The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop environment.
94
+
The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase relies on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop environment.
95
95
96
96
HBase is an excellent destination for sensor and log data for future analysis.
97
97
98
98
HBase scalability depends on the number of nodes in the HDInsight cluster.
99
99
100
-
### Azure SQL Database and Azure Database
100
+
### Azure SQL databases
101
101
102
102
Azure offers three PaaS relational databases:
103
103
104
104
- [Azure SQL Database](../../sql-database/sql-database-technical-overview.md) is an implementation of Microsoft SQL Server. For more information on performance, see [Tuning performance in Azure SQL Database](../../sql-database/sql-database-performance-guidance.md).
105
105
- [Azure Database for MySQL](../../mysql/overview.md) is an implementation of Oracle MySQL.
106
106
- [Azure Database for PostgreSQL](../../postgresql/quickstart-create-server-database-portal.md) is an implementation of PostgreSQL.
107
107
108
-
These products scale up by adding more CPU and memory. You can also choose to use premium disks with the products for better I/O performance.
108
+
Add more CPU and memory to scale up these products. You can also choose to use premium disks with the products for better I/O performance.
109
109
110
110
## Azure Analysis Services
111
111
112
-
Azure Analysis Services is an analytical data engine used in decision support and business analytics. It provides the analytical data for business reports and client applications such as Power BI. Excel, SQL Server Reporting Services reports, and other data visualization tools can also use the data provided by Azure Analysis Services.
112
+
Azure Analysis Services is an analytical data engine used in decision support and business analytics. It provides the analytical data for business reports and client applications such as Power BI. The analytical data also works with Excel, SQL Server Reporting Services reports, and other data visualization tools.
113
113
114
-
Analysis cubes can scale by changing tiers for each individual cube. For more information, see [Azure Analysis Services pricing](https://azure.microsoft.com/pricing/details/analysis-services/).
114
+
Scale analysis cubes by changing tiers for each individual cube. For more information, see [Azure Analysis Services pricing](https://azure.microsoft.com/pricing/details/analysis-services/).
115
115
116
-
## Extract and Load
116
+
## Extract and load
117
117
118
-
Once the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.
118
+
After the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.
119
119
120
120
### Apache Sqoop
121
121
@@ -131,7 +131,7 @@ Apache Flume can't be used with Azure HDInsight. But, an on-premises Hadoop inst
131
131
132
132
## Transform
133
133
134
-
Once data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.
134
+
After data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.
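The clean-and-combine step can be sketched with a SQL query. The example below uses Python's built-in `sqlite3` module purely as a local stand-in so that it runs anywhere. On HDInsight, a similar query would be written in HiveQL or Spark SQL against data in cluster storage. The table name `raw_events` and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory database standing in for a staging table of raw sensor events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (device TEXT, temp TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [
        (" Sensor-1 ", "21.5"),  # stray whitespace and mixed case
        ("sensor-1", "22.5"),
        ("sensor-2", ""),        # missing reading, dropped during cleaning
        ("SENSOR-2", "18.0"),
    ],
)

# Clean: normalize device names, skip empty readings, cast text to numbers.
# Combine: aggregate the readings per device.
cleaned = conn.execute(
    """
    SELECT lower(trim(device)) AS device_id,
           avg(CAST(temp AS REAL)) AS avg_temp
    FROM raw_events
    WHERE trim(temp) <> ''
    GROUP BY lower(trim(device))
    ORDER BY device_id
    """
).fetchall()

for device_id, avg_temp in cleaned:
    print(device_id, avg_temp)
```

The same pattern of normalizing, filtering, and aggregating carries over to the larger engines. Only the execution engine and the storage location change.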