Commit 4bd270d

Merge pull request #113080 from dagiro/freshness_c58
freshness_c58
2 parents 632f2ea + a2123a3 commit 4bd270d

1 file changed

+19
-19
lines changed

articles/hdinsight/hadoop/apache-hadoop-etl-at-scale.md

Lines changed: 19 additions & 19 deletions
@@ -6,13 +6,13 @@ ms.author: ashishth
 ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
-ms.custom: hdinsightactive
-ms.date: 03/03/2020
+ms.custom: hdinsightactive,seoapr2020
+ms.date: 04/28/2020
 ---
 
 # Extract, transform, and load (ETL) at scale
 
-Extract, transform, and load (ETL) is the process by which data is acquired from various sources, collected in a standard location, cleaned and processed, and ultimately loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With HDInsight, a wide variety of Apache Hadoop ecosystem components support performing ETL at scale.
+Extract, transform, and load (ETL) is the process by which data is acquired from various sources. Collected in a standard location, cleaned and processed. Ultimately loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With HDInsight, a wide variety of Apache Hadoop environment components support ETL at scale.
 
 The use of HDInsight in the ETL process can be summarized by this pipeline:
 
@@ -30,41 +30,41 @@ Orchestration is needed to run the appropriate job at the appropriate time.
 
 Apache Oozie is a workflow coordination system that manages Hadoop jobs. Oozie runs within an HDInsight cluster and is integrated with the Hadoop stack. Oozie supports Hadoop jobs for Apache Hadoop MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie can also be used to schedule jobs that are specific to a system, such as Java programs or shell scripts.
 
-For more information, see [Use Apache Oozie with Apache Hadoop to define and run a workflow on HDInsight](../hdinsight-use-oozie-linux-mac.md) For a deep dive showing how to use Oozie to drive an end-to-end pipeline, see [Operationalize the Data Pipeline](../hdinsight-operationalize-data-pipeline.md).
+For more information, see [Use Apache Oozie with Apache Hadoop to define and run a workflow on HDInsight](../hdinsight-use-oozie-linux-mac.md). See also, [Operationalize the Data Pipeline](../hdinsight-operationalize-data-pipeline.md).
 
 ### Azure Data Factory
 
-Azure Data Factory provides orchestration capabilities in the form of platform-as-a-service. It's a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
+Azure Data Factory provides orchestration capabilities in the form of platform-as-a-service. It's a cloud-based data integration service that allows you to create data-driven workflows in the cloud. Workflows for orchestrating and automating data movement and data transformation.
 
 Using Azure Data Factory, you can:
 
 1. Create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
-2. Process and transform the data using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, Azure Batch, and Azure Machine Learning.
+2. Process and transform the data using compute services such as Azure HDInsight Hadoop. Or Spark, Azure Data Lake Analytics, Azure Batch, and Azure Machine Learning.
 3. Publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume.
 
 For more information on Azure Data Factory, see the [documentation](../../data-factory/introduction.md).
 
 ## Ingest file storage and result storage
 
-Source data files are typically loaded into a location in Azure Storage or Azure Data Lake Storage. Files can be in any format, but typically they're flat files like CSVs.
+Source data files are typically loaded into a location in Azure Storage or Azure Data Lake Storage. Files can be in any format, but are typically flat files like CSVs.
 
 ### Azure Storage
 
-[Azure Storage](https://azure.microsoft.com/services/storage/blobs/) has specific scalability targets. For more information, see [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md). For most analytic nodes, Azure Storage scales best when dealing with many smaller files. Azure Storage guarantees the same performance, no matter how many files or how large the files (as long as you are within your limits). This means that you can store terabytes of data and still get consistent performance, whether you're using a subset of the data or all of the data.
+Azure Storage has specific adaptability targets. See [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md). For most analytic nodes, Azure Storage scales best when dealing with many smaller files. Azure Storage guarantees the same performance, no matter how, how large the files (as long as you are within your limits). This guarantee means you can store terabytes and still get consistent performance, whether you're using a subset of the data or all of the data.
 
 Azure Storage has several different types of blobs. An *append blob* is a great option for storing web logs or sensor data.
 
-Multiple blobs can be distributed across many servers to scale out access to them, but a single blob can only be served by a single server. While blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.
+Multiple blobs can be distributed across many servers to scale out access to them. But a single blob can only be served by a single server. While blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.
 
-Azure Storage also has a WebHDFS API layer for the blob storage. All the services in HDInsight can access files in Azure Blob Storage for data cleaning and data processing, similarly to how those services would use Hadoop Distributed Files System (HDFS).
+Azure Storage also has a WebHDFS API layer for the blob storage. All the services in HDInsight can access files in Azure Blob Storage for data cleaning and data processing. Similar to how those services would use Hadoop Distributed Files System (HDFS).
 
 Data is typically ingested into Azure Storage using either PowerShell, the Azure Storage SDK, or AZCopy.
 
 ### Azure Data Lake Storage
 
-Azure Data Lake Storage (ADLS) is a managed, hyperscale repository for analytics data that is compatible with HDFS. ADLS uses a design paradigm that is similar to HDFS, and offers unlimited scalability in terms of total capacity and the size of individual files. ADLS is very good when working with large files, since a large file can be stored across multiple nodes. Partitioning data in ADLS is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.
+Azure Data Lake Storage (ADLS) is a managed, hyperscale repository. A repository for analytics data that is compatible with HDFS. ADLS uses a design paradigm that is similar to HDFS. ADLS offers unlimited adaptability for total capacity and the size of individual files. ADLS is good when working with large files, since a large file can be stored across multiple nodes. Partitioning data in ADLS is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.
 
-Data is typically ingested into ADLS using Azure Data Factory, ADLS SDKs, AdlCopy Service, Apache DistCp, or Apache Sqoop. Which of these services to use largely depends on where the data is. If the data is currently in an existing Hadoop cluster, you might use Apache DistCp, AdlCopy Service, or Azure Data Factory. If it's in Azure Blob Storage, you might use Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
+Data is typically ingested into ADLS using Azure Data Factory. Or ADLS SDKs, AdlCopy Service, Apache DistCp, or Apache Sqoop. Which of these services to use largely depends on where the data is. If the data is currently in an existing Hadoop cluster, you might use Apache DistCp, AdlCopy Service, or Azure Data Factory. For data in Azure Blob Storage, you might use Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
 
 ADLS is also optimized for event ingestion using Azure Event Hub or Apache Storm.
 
@@ -74,23 +74,23 @@ For uploading datasets in the terabyte range, network latency can be a major pro
 
 * Azure ExpressRoute: Azure ExpressRoute lets you create private connections between Azure datacenters and your on-premises infrastructure. These connections provide a reliable option for transferring large amounts of data. For more information, see [Azure ExpressRoute documentation](../../expressroute/expressroute-introduction.md).
 
-* "Offline" upload of data. You can use [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure data center. Your data is first uploaded to Azure Storage Blobs. You can then use [Azure Data Factory](../../data-factory/connector-azure-data-lake-store.md) or the [AdlCopy](../../data-lake-store/data-lake-store-copy-data-azure-storage-blob.md) tool to copy data from Azure Storage blobs to Data Lake Storage.
+* "Offline" upload of data. You can use [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure data center. Your data is first uploaded to Azure Storage Blobs. You can then use Azure Data Factory or the AdlCopy tool to copy data from Azure Storage blobs to Data Lake Storage.
 
 ### Azure SQL Data Warehouse
 
-Azure SQL DW is a great choice to store cleaned and prepared results for future analytics. Azure HDInsight can be used to perform those services for Azure SQL DW.
+Azure SQL DW is a great choice to store prepared results. Azure HDInsight can be used to do those services for Azure SQL DW.
 
 Azure SQL Data Warehouse (SQL DW) is a relational database store optimized for analytic workloads. Azure SQL DW scales based on partitioned tables. Tables can be partitioned across multiple nodes. Azure SQL DW nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see [SQL Data Warehouse - Manage Compute](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md).
 
 ### Apache HBase
 
-Apache HBase is a key-value store available in Azure HDInsight. Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families.
+Apache HBase is a key-value store available in Azure HDInsight. Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semistructured data. Data in a schemaless database organized by column families.
 
-Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schemaless database in the sense that neither the columns nor the type of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
+Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schemaless database. The columns and data types stored in them don't need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop environment.
 
 HBase is an excellent destination for sensor and log data for future analysis.
 
-HBase scalability is dependent on the number of nodes in the HDInsight cluster.
+HBase adaptability is dependent on the number of nodes in the HDInsight cluster.
 
 ### Azure SQL Database and Azure Database
 
@@ -104,7 +104,7 @@ These products scale up, which means that they're scaled by adding more CPU and
 
 ## Azure Analysis Services
 
-Azure Analysis Services (AAS) is an analytical data engine used in decision support and business analytics, providing the analytical data for business reports and client applications such as Power BI, Excel, Reporting Services reports, and other data visualization tools.
+Azure Analysis Services (AAS) is an analytical data engine used in decision support and business analytics. AAS provides the analytical data for business reports and client applications such as Power BI. Also, Excel, Reporting Services reports, and other data visualization tools.
 
 Analysis cubes can scale by changing tiers for each individual cube. For more information, see [Azure Analysis Services Pricing](https://azure.microsoft.com/pricing/details/analysis-services/).
 
@@ -120,7 +120,7 @@ Sqoop uses MapReduce to import and export the data, to provide parallel operatio
 
 ### Apache Flume
 
-Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Flume has a simple and flexible architecture based on streaming data flows. Flume is robust and fault-tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. Flume uses a simple extensible data model that allows for online analytic application.
+`Apache Flume` is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Flume has a flexible architecture based on streaming data flows. Flume is robust and fault-tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. Flume uses a simple extensible data model that allows for online analytic application.
 
 Apache Flume can't be used with Azure HDInsight. An on-premises Hadoop installation can use Flume to send data to either Azure Storage Blobs or Azure Data Lake Storage. For more information, see [Using Apache Flume with HDInsight](https://web.archive.org/web/20190217104751/https://blogs.msdn.microsoft.com/bigdatasupport/2014/03/18/using-apache-flume-with-hdinsight/).
 