
Commit 28d7f1d (parent 2c55c12)

edit pass: apache-hadoop-etl-at-scale

1 file changed: articles/hdinsight/hadoop/apache-hadoop-etl-at-scale.md (+24, -24 lines)
---
title: Extract, transform, and load (ETL) at scale - Azure HDInsight
description: Learn how extract, transform, and load is used in HDInsight with Apache Hadoop.
author: ashishthaps
ms.author: ashishth
ms.date: 04/28/2020
---

Extract, transform, and load (ETL) is the process by which data is acquired from various sources. The data is collected in a standard location, cleaned, and processed. Ultimately, the data is loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With Azure HDInsight, a wide variety of Apache Hadoop environment components support ETL at scale.

The use of HDInsight in the ETL process is summarized by this pipeline:

![HDInsight ETL at scale overview](./media/apache-hadoop-etl-at-scale/hdinsight-etl-at-scale-overview.png)

Source data files are typically loaded into a location on Azure Storage or Azure Data Lake Storage.

Azure Storage has specific scalability targets. See [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md) for more information. For most analytic nodes, Azure Storage scales best when dealing with many smaller files. As long as you're within your account limits, Azure Storage guarantees the same performance, no matter how large the files are. You can store terabytes of data and still get consistent performance. This statement is true whether you're using a subset or all of the data.

Azure Storage has several types of blobs. An *append blob* is a great option for storing web logs or sensor data, because new entries can be appended to the end of the blob without rewriting it.

Multiple blobs can be distributed across many servers to scale out access to them. But a single blob is only served by a single server. Although blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.

Azure Storage has a WebHDFS API layer for the blob storage. All HDInsight services can access files in Azure Blob storage for data cleaning and data processing. This is similar to how those services would use Hadoop Distributed File System (HDFS).
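
Because of that WebHDFS-compatible layer, ordinary HDFS commands on an HDInsight cluster work directly against blob storage. A minimal sketch, run from a cluster head node, with a hypothetical storage account and container:

```bash
# List and read files in Azure Blob storage by using standard HDFS commands.
# "mycontainer" and "myaccount" are placeholder names.
hdfs dfs -ls wasbs://mycontainer@myaccount.blob.core.windows.net/example/data/
hdfs dfs -cat wasbs://mycontainer@myaccount.blob.core.windows.net/example/data/sample.log
```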

Data is typically ingested into Azure Storage through PowerShell, the Azure Storage SDK, or AzCopy.
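
For example, an AzCopy (v10) upload of a local log directory might look like the following sketch. The storage account, container, and SAS token are placeholders:

```bash
# Recursively upload a local directory of log files to a blob container.
# Replace the account, container, and <SAS-token> values before running.
azcopy copy "/var/logs/weblogs" \
  "https://myaccount.blob.core.windows.net/mycontainer/weblogs?<SAS-token>" \
  --recursive
```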

### Azure Data Lake Storage

Azure Data Lake Storage is a managed, hyperscale repository for analytics data. It's compatible with and uses a design paradigm that's similar to HDFS. Data Lake Storage offers unlimited scalability for total capacity and the size of individual files. It's a good choice when working with large files, because they can be stored across multiple nodes. Partitioning data in Data Lake Storage is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.

Data is usually ingested into Data Lake Storage through Azure Data Factory. You can also use Data Lake Storage SDKs, the AdlCopy service, Apache DistCp, or Apache Sqoop. The service you choose depends on where the data is. If it's in an existing Hadoop cluster, you might use Apache DistCp, the AdlCopy service, or Azure Data Factory. For data in Azure Blob storage, you might use Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
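
For example, a DistCp run that copies data from an existing cluster's blob storage into Data Lake Storage might look like this sketch; the account names and paths are hypothetical:

```bash
# Distributed copy from Azure Blob storage to Azure Data Lake Storage (Gen1).
# Storage account names and paths are placeholders.
hadoop distcp \
  wasbs://mycontainer@myaccount.blob.core.windows.net/raw/events \
  adl://mydatalake.azuredatalakestore.net/raw/events
```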

Data Lake Storage is optimized for event ingestion through Azure Event Hubs or Apache Storm.

### Considerations for both storage options

For uploading datasets in the terabyte range, network latency can be a major problem. This is particularly true if the data is coming from an on-premises location. In such cases, you can use these options:

- **Azure ExpressRoute:** Create private connections between Azure datacenters and your on-premises infrastructure. These connections provide a reliable option for transferring large amounts of data. For more information, see [Azure ExpressRoute documentation](../../expressroute/expressroute-introduction.md).

- **Data upload from hard disk drives:** You can use [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure datacenter. Your data is first uploaded to Azure Blob storage. You can then use Azure Data Factory or the AdlCopy tool to copy data from Azure Blob storage to Data Lake Storage.

### Azure SQL Data Warehouse

Azure SQL Data Warehouse is an appropriate choice to store prepared results. You can use Azure HDInsight to perform those services for SQL Data Warehouse.

Azure SQL Data Warehouse is a relational database store optimized for analytic workloads. It scales based on partitioned tables. Tables can be partitioned across multiple nodes. The nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see [Manage compute in SQL Data Warehouse](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md).
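
To illustrate the creation-time choice, the following sketch creates a hash-distributed table through sqlcmd; the distribution column controls how rows are spread across the underlying nodes. The server, database, credentials, and table names are hypothetical:

```bash
# Create a hash-distributed columnstore table in SQL Data Warehouse.
# Server, database, login, and table names are placeholders.
sqlcmd -S myserver.database.windows.net -d mydw -U sqladmin -P '<password>' -Q "
CREATE TABLE dbo.FactSensorReadings
(
    SensorId     INT NOT NULL,
    ReadingTime  DATETIME2 NOT NULL,
    ReadingValue FLOAT
)
WITH (DISTRIBUTION = HASH(SensorId), CLUSTERED COLUMNSTORE INDEX);"
```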

### Apache HBase

Apache HBase is a key/value store available in Azure HDInsight. It's an open-source, NoSQL database that's built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semi-structured data.

Because HBase is a schemaless database, you don't need to define columns and data types before you use them. Data is stored in the rows of a table, and is grouped by column family.
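
A minimal HBase shell sketch of that model: create a table with one column family, write a cell, and read it back. The table name, row key, and values are hypothetical:

```bash
# Create a table with one column family, write one cell, and read it back.
# Table name, row key, and values are placeholders.
hbase shell <<'EOF'
create 'sensor_readings', 'metrics'
put 'sensor_readings', 'sensor42#2020-04-28T10:00', 'metrics:temperature', '21.5'
get 'sensor_readings', 'sensor42#2020-04-28T10:00'
EOF
```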

The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase relies on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop environment.

HBase is an excellent destination for sensor and log data for future analysis.

HBase scalability depends on the number of nodes in the HDInsight cluster.

### Azure SQL databases

Azure offers three PaaS relational databases:

- [Azure SQL Database](../../sql-database/sql-database-technical-overview.md) is an implementation of Microsoft SQL Server. For more information on performance, see [Tuning performance in Azure SQL Database](../../sql-database/sql-database-performance-guidance.md).
- [Azure Database for MySQL](../../mysql/overview.md) is an implementation of Oracle MySQL.
- [Azure Database for PostgreSQL](../../postgresql/quickstart-create-server-database-portal.md) is an implementation of PostgreSQL.

Add more CPU and memory to scale up these products. You can also choose to use premium disks with the products for better I/O performance.
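
For Azure SQL Database, for example, scaling up can be a single Azure CLI call. A sketch with hypothetical resource names:

```bash
# Move an Azure SQL Database to a larger service objective (scale up).
# Resource group, server, and database names are placeholders.
az sql db update \
  --resource-group my-resource-group \
  --server my-sql-server \
  --name my-database \
  --service-objective S3
```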

## Azure Analysis Services

Azure Analysis Services is an analytical data engine used in decision support and business analytics. It provides the analytical data for business reports and client applications such as Power BI. The analytical data also works with Excel, SQL Server Reporting Services reports, and other data visualization tools.

Scale analysis cubes by changing tiers for each individual cube. For more information, see [Azure Analysis Services pricing](https://azure.microsoft.com/pricing/details/analysis-services/).

## Extract and load

After the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.

### Apache Sqoop

Apache Sqoop is a tool for efficiently transferring data between Hadoop and structured datastores such as relational databases.
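
For example, a Sqoop export of cleaned results from cluster storage into an Azure SQL Database table might look like this sketch. The connection string, credentials, table, and directory are hypothetical:

```bash
# Export processed records from cluster storage into an Azure SQL Database table.
# Server, database, credentials, table, and export directory are placeholders.
sqoop export \
  --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" \
  --username sqladmin \
  --password '<password>' \
  --table cleaned_events \
  --export-dir /example/data/cleaned_events
```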

### Apache Flume

Apache Flume can't be used with Azure HDInsight. But an on-premises Hadoop installation can run Flume to send data to Azure Blob storage or Azure Data Lake Storage.

## Transform

After data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.
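
As a sketch of that kind of work, the following Hive query, run through Beeline on the cluster, deduplicates and normalizes raw records into a new ORC-backed table. The table and column names are hypothetical:

```bash
# Clean raw records into a new ORC table with HiveQL, run through Beeline.
# Table and column names are placeholders.
beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -e "
CREATE TABLE cleaned_events
STORED AS ORC
AS
SELECT DISTINCT event_id, lower(event_type) AS event_type, event_time
FROM raw_events
WHERE event_id IS NOT NULL;"
```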

## Next steps