Commit 682aab6: edit pass: apache-hadoop-etl-at-scale (parent d63d33a)

1 file changed: 43 additions, 38 deletions

articles/hdinsight/hadoop/apache-hadoop-etl-at-scale.md

ms.date: 04/28/2020

# Extract, transform, and load (ETL) at scale

Extract, transform, and load (ETL) is the process by which data is acquired from various sources. The data is collected in a standard location, cleaned, and processed. Ultimately, the data is loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With Azure HDInsight, a wide variety of Apache Hadoop environment components support ETL at scale.

The use of HDInsight in the ETL process can be summarized by this pipeline:

The following sections explore each of the ETL phases and their associated components.

## Orchestration

Orchestration spans all phases of the ETL pipeline. ETL jobs in HDInsight often involve several different products working in conjunction with each other. For example:

- You might use Apache Hive to clean a portion of the data, and Apache Pig to clean another portion.
- You might use Azure Data Factory to load data into Azure SQL Database from Azure Data Lake Store.

Orchestration is needed to run the appropriate job at the appropriate time.

### Apache Oozie

Apache Oozie is a workflow coordination system that manages Hadoop jobs. Oozie runs within an HDInsight cluster and is integrated with the Hadoop stack. Oozie supports Hadoop jobs for Apache Hadoop MapReduce, Pig, Hive, and Sqoop. You can also use Oozie to schedule jobs that are specific to a system, such as Java programs or shell scripts.

For more information, see [Use Apache Oozie with Apache Hadoop to define and run a workflow on HDInsight](../hdinsight-use-oozie-linux-mac.md). See also [Operationalize the Data Pipeline](../hdinsight-operationalize-data-pipeline.md).
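
As a concrete sketch of how an Oozie job can be parameterized, the snippet below renders job properties in the XML configuration format that Oozie job submission expects. The property values, workflow path, and endpoint mentioned in the comments are illustrative assumptions, not taken from this article.

```python
from xml.sax.saxutils import escape

def oozie_config(props: dict) -> str:
    """Render job properties as the XML configuration Oozie submission expects."""
    entries = "".join(
        f"<property><name>{escape(k)}</name><value>{escape(str(v))}</value></property>"
        for k, v in sorted(props.items())
    )
    return f"<configuration>{entries}</configuration>"

# Hypothetical usage: the workflow path points at a deployed workflow.xml.
config = oozie_config({
    "user.name": "etl",
    "oozie.wf.application.path": "/workflows/nightly-etl",
})
# This XML body could then be POSTed to the cluster's Oozie REST endpoint
# (for example, https://<cluster>/oozie/v1/jobs) to submit the job.
```

In practice you would store these properties in a `job.properties` file or pass them through the Oozie CLI; the XML form above is what the REST interface accepts.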

### Azure Data Factory

Azure Data Factory provides orchestration capabilities in the form of platform as a service (PaaS). It's a cloud-based data integration service that lets you create data-driven workflows for orchestrating and automating data movement and data transformation.

Use Azure Data Factory to:

1. Create and schedule data-driven workflows (pipelines) that ingest data from disparate data stores.
1. Process and transform the data by using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, Azure Batch, and Azure Machine Learning.
1. Publish output data to data stores, such as Azure SQL Data Warehouse, for business intelligence (BI) applications to consume.
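
The three phases above can be sketched as a plain pipeline in code. The function names are purely illustrative stand-ins; a real Data Factory pipeline declares these steps as linked activities rather than Python functions.

```python
# Illustrative stand-ins for the three phases; names are hypothetical.
def ingest(sources):
    """Gather rows from disparate stores into one working set."""
    return [row for source in sources for row in source]

def transform(rows):
    """Stand-in for a compute service (Hive, Spark, and so on) cleaning the data."""
    return [row.strip().lower() for row in rows]

def publish(rows, warehouse):
    """Load prepared results into the destination store."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
publish(transform(ingest([[" WebLog-1 "], ["SENSOR-7 "]])), warehouse)
# warehouse now holds the cleaned rows: ["weblog-1", "sensor-7"]
```

The value of the orchestration layer is that each stage can be scheduled, retried, and monitored independently, which the plain function composition above does not capture.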

For more information on Azure Data Factory, see the [documentation](../../data-factory/introduction.md).

## Ingest file storage and result storage

Source data files are typically loaded into a location on Azure Storage or Azure Data Lake Storage. The files are usually in a flat format like CSV, but they can be in any format.

### Azure Storage

Azure Storage has specific scalability targets. For more information, see [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md). For most analytic workloads, Azure Storage scales best when dealing with many smaller files. As long as you're within your account limits, Azure Storage guarantees the same performance no matter how large the files are. You can store terabytes of data and still get consistent performance, whether you're using a subset of the data or all of it.

Azure Storage has several different types of blobs. An *append blob* is a great option for storing web logs or sensor data.

Multiple blobs can be distributed across many servers to scale out access to them, but a single blob is served by only a single server. Although blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.

Azure Storage has a WebHDFS API layer for the blob storage. All HDInsight services can access files in Azure Blob storage for data cleaning and data processing, similarly to how those services would use Hadoop Distributed File System (HDFS).

Data is typically ingested into Azure Storage by using PowerShell, the Azure Storage SDK, or AzCopy.
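
As a hedged example of SDK-based ingestion, the sketch below uses the `azure-storage-blob` Python package; the connection string, container, and blob names are placeholders. The block-splitting helper is plain Python and mirrors how large block blob uploads are chunked.

```python
def split_into_blocks(data: bytes, block_size: int = 4 * 1024 * 1024):
    """Split a payload into fixed-size chunks, as block blob uploads do."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def upload_blob(connection_string: str, container: str, name: str, data: bytes):
    # azure-storage-blob is assumed to be installed; the import is kept local
    # so the helper above stays usable without the SDK.
    from azure.storage.blob import BlobClient
    client = BlobClient.from_connection_string(connection_string, container, name)
    client.upload_blob(data, overwrite=True)  # the SDK chunks large payloads
```

For one-off bulk copies, AzCopy is usually simpler than writing SDK code; the SDK route fits when ingestion is part of a larger application.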

### Azure Data Lake Storage

Azure Data Lake Storage is a managed, hyperscale repository for analytics data. It's compatible with HDFS and uses a similar design paradigm. Data Lake Storage offers unlimited scalability for total capacity and the size of individual files. It's a good choice when working with large files, because a large file can be stored across multiple nodes. Partitioning data in Data Lake Storage is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.

Data is usually ingested into Data Lake Storage by using Azure Data Factory. You can also use Data Lake Storage SDKs, the AdlCopy service, Apache DistCp, or Apache Sqoop. The service you choose depends on where the data is located. If it's in an existing Hadoop cluster, you might use Apache DistCp, the AdlCopy service, or Azure Data Factory. For data in Azure Blob storage, you might use the Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
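
A common layout for ingested files is a date-partitioned folder tree. The helper below builds such a path; the `/data/...` layout is an illustrative convention, not mandated by Data Lake Storage, and the DistCp command in the comment is only the general shape of a copy from an existing cluster.

```python
import datetime

def landing_path(dataset: str, day: datetime.date) -> str:
    """Date-partitioned landing folder for ingested files (hypothetical layout)."""
    return f"/data/{dataset}/{day:%Y/%m/%d}/"

path = landing_path("weblogs", datetime.date(2020, 4, 28))
# path == "/data/weblogs/2020/04/28/"

# From an existing Hadoop cluster, DistCp can copy into such a folder, roughly:
#   hadoop distcp /logs/2020-04-28 adl://<account>.azuredatalakestore.net/data/weblogs/2020/04/28/
```

Partitioning paths by date keeps analytic jobs able to prune to the days they need, which matters at terabyte scale.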

Data Lake Storage is also optimized for event ingestion by using Azure Event Hubs or Apache Storm.

#### Considerations for using Azure Storage and Azure Data Lake Storage

For uploading datasets in the terabyte range, network latency can be a major problem, particularly if the data is coming from an on-premises location. In such cases, you can use these options:

- **Azure ExpressRoute:** Create private connections between Azure datacenters and your on-premises infrastructure. These connections provide a reliable option for transferring large amounts of data. For more information, see the [Azure ExpressRoute documentation](../../expressroute/expressroute-introduction.md).

- **Data upload from hard disk drives:** You can use the [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure datacenter. Your data is first uploaded to Azure Blob storage. You can then use Azure Data Factory or the AdlCopy tool to copy the data from Azure Blob storage to Data Lake Storage.

### Azure SQL Data Warehouse

Azure SQL Data Warehouse is a great choice for storing prepared results. Azure HDInsight can be used to prepare that data for SQL Data Warehouse.

Azure SQL Data Warehouse is a relational database store optimized for analytic workloads. It scales based on partitioned tables, which can be partitioned across multiple nodes. The nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see [SQL Data Warehouse - Manage Compute](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md).

### Apache HBase

Apache HBase is a key-value store available in Azure HDInsight. It's an open-source, NoSQL database that's built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semi-structured data.

Because HBase is a schemaless database, columns and data types don't need to be defined before using them. Data is stored in the rows of a table, and is grouped by column family.
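
To make the column-family model concrete, here's a sketch using the third-party `happybase` client; the table name, column family, and row key are hypothetical. The `to_cells` helper shows how each value is addressed as `family:qualifier`.

```python
def to_cells(family: str, values: dict) -> dict:
    """Qualify each column with its family, the way HBase addresses cells."""
    return {f"{family}:{col}".encode(): str(val).encode()
            for col, val in values.items()}

def write_reading(connection, row_key: str, reading: dict):
    # `connection` is a happybase.Connection to the cluster's Thrift endpoint.
    table = connection.table("sensor_readings")
    table.put(row_key.encode(), to_cells("metrics", reading))

cells = to_cells("metrics", {"temperature": 21.5, "humidity": 40})
# cells == {b"metrics:temperature": b"21.5", b"metrics:humidity": b"40"}
```

New qualifiers (`metrics:pressure`, say) can be written at any time without a schema change, which is what "schemaless" means in practice here.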

The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop environment.

HBase is an excellent destination for sensor and log data for future analysis.

HBase scalability depends on the number of nodes in the HDInsight cluster.

### Azure SQL Database and Azure Database

Azure offers three PaaS relational databases:

- [Azure SQL Database](../../sql-database/sql-database-technical-overview.md) is an implementation of Microsoft SQL Server. For more information on performance, see [Tuning performance in Azure SQL Database](../../sql-database/sql-database-performance-guidance.md).
- [Azure Database for MySQL](../../mysql/overview.md) is an implementation of Oracle MySQL.
- [Azure Database for PostgreSQL](../../postgresql/quickstart-create-server-database-portal.md) is an implementation of PostgreSQL.

These products scale up by adding more CPU and memory. You can also choose to use premium disks with the products for better I/O performance.

## Azure Analysis Services
106111

107-
Azure Analysis Services (AAS) is an analytical data engine used in decision support and business analytics. AAS provides the analytical data for business reports and client applications such as Power BI. Also, Excel, Reporting Services reports, and other data visualization tools.
112+
Azure Analysis Services is an analytical data engine used in decision support and business analytics. It provides the analytical data for business reports and client applications such as Power BI. Excel, SQL Server Reporting Services reports, and other data visualization tools can also use the data provided by Azure Analysis Services.

Analysis cubes can scale by changing tiers for each individual cube. For more information, see [Azure Analysis Services pricing](https://azure.microsoft.com/pricing/details/analysis-services/).

## Extract and load

Once the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.

### Apache Sqoop

Sqoop uses MapReduce to import and export the data, to provide parallel operation.
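
As an illustration of the parallelism Sqoop exposes, the helper below assembles a `sqoop import` command line; the JDBC URL, table, and target directory are placeholders, and `-m` sets the number of parallel map tasks.

```python
def sqoop_import_cmd(jdbc_url: str, table: str, target_dir: str, mappers: int = 4):
    """Build a `sqoop import` invocation as an argument list."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "-m", str(mappers),  # parallel map tasks used for the transfer
    ]

cmd = sqoop_import_cmd("jdbc:sqlserver://example:1433;database=sales",
                       "Orders", "/data/orders")
# A real run would pass `cmd` to subprocess.run on a cluster head node.
```

Raising `-m` spreads the transfer over more map tasks, which is where Sqoop's MapReduce-based parallelism comes from.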

### Apache Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its flexible architecture is based on streaming data flows. Flume is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

Apache Flume can't be used with Azure HDInsight, but an on-premises Hadoop installation can use Flume to send data to either Azure Blob storage or Azure Data Lake Storage. For more information, see [Using Apache Flume with HDInsight](https://web.archive.org/web/20190217104751/https://blogs.msdn.microsoft.com/bigdatasupport/2014/03/18/using-apache-flume-with-hdinsight/).

## Transform
128133

129-
Once data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.
134+
Once data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.
130135

131136
## Next steps

- [Using Apache Hive as an ETL tool](apache-hadoop-using-apache-hive-as-an-etl-tool.md)
- [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](../hdinsight-hadoop-use-data-lake-storage-gen2.md)
- [Move data from Azure SQL Database to an Apache Hive table](./apache-hadoop-use-sqoop-mac-linux.md)
