articles/hdinsight/hadoop/apache-hadoop-etl-at-scale.md
19 additions & 19 deletions
@@ -6,13 +6,13 @@ ms.author: ashishth
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
-ms.custom: hdinsightactive
-ms.date: 03/03/2020
+ms.custom: hdinsightactive,seoapr2020
+ms.date: 04/28/2020
---

# Extract, transform, and load (ETL) at scale

-Extract, transform, and load (ETL) is the process by which data is acquired from various sources, collected in a standard location, cleaned and processed, and ultimately loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With HDInsight, a wide variety of Apache Hadoop ecosystem components support performing ETL at scale.
+Extract, transform, and load (ETL) is the process by which data is acquired from various sources. The data is collected in a standard location, cleaned, and processed. Ultimately, it's loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With HDInsight, a wide variety of Apache Hadoop ecosystem components support ETL at scale.
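The three stages just described can be sketched in miniature with plain Python (purely illustrative, not HDInsight-specific): a CSV extract, an in-memory clean, and a load into a queryable SQLite table.

```python
import csv
import io
import sqlite3

# Extract: read raw records from a source (here, an in-memory CSV).
raw = "name,visits\nalice,3\nbob,\ncarol,5\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean the data (drop records with missing values, cast types).
cleaned = [(r["name"], int(r["visits"])) for r in rows if r["visits"]]

# Load: store the result in a datastore from which it can be queried.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (name TEXT, count INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", cleaned)

total = conn.execute("SELECT SUM(count) FROM visits").fetchone()[0]
print(total)  # sum of visits across the cleaned records
```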

The use of HDInsight in the ETL process can be summarized by this pipeline:
@@ -30,41 +30,41 @@ Orchestration is needed to run the appropriate job at the appropriate time.

Apache Oozie is a workflow coordination system that manages Hadoop jobs. Oozie runs within an HDInsight cluster and is integrated with the Hadoop stack. Oozie supports Hadoop jobs for Apache Hadoop MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie can also be used to schedule jobs that are specific to a system, such as Java programs or shell scripts.
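For illustration, a minimal Oozie workflow definition with a single Hive action might look like the following sketch; the workflow, action, and script names are hypothetical:

```xml
<workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="hive-clean"/>
  <action name="hive-clean">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>clean.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive job failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```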

-For more information, see [Use Apache Oozie with Apache Hadoop to define and run a workflow on HDInsight](../hdinsight-use-oozie-linux-mac.md). For a deep dive showing how to use Oozie to drive an end-to-end pipeline, see [Operationalize the Data Pipeline](../hdinsight-operationalize-data-pipeline.md).
+For more information, see [Use Apache Oozie with Apache Hadoop to define and run a workflow on HDInsight](../hdinsight-use-oozie-linux-mac.md). See also [Operationalize the Data Pipeline](../hdinsight-operationalize-data-pipeline.md).

### Azure Data Factory

-Azure Data Factory provides orchestration capabilities in the form of platform-as-a-service. It's a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
+Azure Data Factory provides orchestration capabilities in the form of platform-as-a-service. It's a cloud-based data integration service that allows you to create data-driven workflows in the cloud. These workflows orchestrate and automate data movement and data transformation.

Using Azure Data Factory, you can:

1. Create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
-2. Process and transform the data using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, Azure Batch, and Azure Machine Learning.
+2. Process and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, Azure Batch, and Azure Machine Learning.
3. Publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume.

For more information on Azure Data Factory, see the [documentation](../../data-factory/introduction.md).
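As a sketch, a Data Factory pipeline that runs a Hive script on HDInsight is defined in JSON along these lines; the linked-service and script names here are hypothetical:

```json
{
  "name": "EtlPipeline",
  "properties": {
    "activities": [
      {
        "name": "CleanWithHive",
        "type": "HDInsightHive",
        "linkedServiceName": {
          "referenceName": "MyHDInsightCluster",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "scriptPath": "scripts/clean.hql",
          "scriptLinkedService": {
            "referenceName": "MyStorageAccount",
            "type": "LinkedServiceReference"
          }
        }
      }
    ]
  }
}
```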

## Ingest file storage and result storage

-Source data files are typically loaded into a location in Azure Storage or Azure Data Lake Storage. Files can be in any format, but typically they're flat files like CSVs.
+Source data files are typically loaded into a location in Azure Storage or Azure Data Lake Storage. The files can be in any format, but they're typically flat files like CSVs.

### Azure Storage

-[Azure Storage](https://azure.microsoft.com/services/storage/blobs/) has specific scalability targets. For more information, see [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md). For most analytic nodes, Azure Storage scales best when dealing with many smaller files. Azure Storage guarantees the same performance, no matter how many files or how large the files (as long as you are within your limits). This means that you can store terabytes of data and still get consistent performance, whether you're using a subset of the data or all of the data.
+Azure Storage has specific scalability targets. For more information, see [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md). For most analytic nodes, Azure Storage scales best when dealing with many smaller files. Azure Storage guarantees the same performance no matter how many files there are or how large they are (as long as you're within your limits). This guarantee means you can store terabytes of data and still get consistent performance, whether you're using a subset of the data or all of it.

Azure Storage has several different types of blobs. An *append blob* is a great option for storing web logs or sensor data.

-Multiple blobs can be distributed across many servers to scale out access to them, but a single blob can only be served by a single server. While blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.
+Multiple blobs can be distributed across many servers to scale out access to them. But a single blob can only be served by a single server. While blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.

-Azure Storage also has a WebHDFS API layer for the blob storage. All the services in HDInsight can access files in Azure Blob Storage for data cleaning and data processing, similarly to how those services would use Hadoop Distributed File System (HDFS).
+Azure Storage also has a WebHDFS API layer for the blob storage. All the services in HDInsight can access files in Azure Blob Storage for data cleaning and data processing, similar to how those services would use Hadoop Distributed File System (HDFS).

Data is typically ingested into Azure Storage by using PowerShell, the Azure Storage SDK, or AzCopy.
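For example, with AzCopy (v10 syntax, a hedged sketch), an upload from a local folder to a blob container looks roughly like this; the account, container, and SAS token are placeholders:

```sh
azcopy copy "/data/raw" \
  "https://<account>.blob.core.windows.net/<container>/raw?<SAS-token>" \
  --recursive
```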

### Azure Data Lake Storage

-Azure Data Lake Storage (ADLS) is a managed, hyperscale repository for analytics data that is compatible with HDFS. ADLS uses a design paradigm that is similar to HDFS, and offers unlimited scalability in terms of total capacity and the size of individual files. ADLS is very good when working with large files, since a large file can be stored across multiple nodes. Partitioning data in ADLS is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.
+Azure Data Lake Storage (ADLS) is a managed, hyperscale repository for analytics data that is compatible with HDFS. ADLS uses a design paradigm that is similar to HDFS, and it offers unlimited scalability in terms of total capacity and the size of individual files. ADLS is a good fit for large files, because a large file can be stored across multiple nodes. Partitioning data in ADLS is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.

-Data is typically ingested into ADLS using Azure Data Factory, ADLS SDKs, AdlCopy Service, Apache DistCp, or Apache Sqoop. Which of these services to use largely depends on where the data is. If the data is currently in an existing Hadoop cluster, you might use Apache DistCp, AdlCopy Service, or Azure Data Factory. If it's in Azure Blob Storage, you might use Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
+Data is typically ingested into ADLS by using Azure Data Factory, ADLS SDKs, the AdlCopy service, Apache DistCp, or Apache Sqoop. Which of these services to use largely depends on where the data is. If the data is currently in an existing Hadoop cluster, you might use Apache DistCp, the AdlCopy service, or Azure Data Factory. For data in Azure Blob Storage, you might use the Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
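As a sketch, copying from an existing cluster's blob-backed storage into a Data Lake Storage (Gen1) account with Apache DistCp looks roughly like this; the container, account, and path names are placeholders:

```sh
hadoop distcp \
  wasbs://<container>@<storage-account>.blob.core.windows.net/raw \
  adl://<datalake-account>.azuredatalakestore.net/ingest
```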

ADLS is also optimized for event ingestion using Azure Event Hubs or Apache Storm.
@@ -74,23 +74,23 @@ For uploading datasets in the terabyte range, network latency can be a major pro

* Azure ExpressRoute: Azure ExpressRoute lets you create private connections between Azure datacenters and your on-premises infrastructure. These connections provide a reliable option for transferring large amounts of data. For more information, see [Azure ExpressRoute documentation](../../expressroute/expressroute-introduction.md).

-* "Offline" upload of data. You can use [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure data center. Your data is first uploaded to Azure Storage Blobs. You can then use [Azure Data Factory](../../data-factory/connector-azure-data-lake-store.md) or the [AdlCopy](../../data-lake-store/data-lake-store-copy-data-azure-storage-blob.md) tool to copy data from Azure Storage blobs to Data Lake Storage.
+* "Offline" upload of data. You can use [Azure Import/Export service](../../storage/common/storage-import-export-service.md) to ship hard disk drives with your data to an Azure data center. Your data is first uploaded to Azure Storage Blobs. You can then use Azure Data Factory or the AdlCopy tool to copy data from Azure Storage blobs to Data Lake Storage.

### Azure SQL Data Warehouse

-Azure SQL DW is a great choice to store cleaned and prepared results for future analytics. Azure HDInsight can be used to perform those services for Azure SQL DW.
+Azure SQL DW is a great choice to store cleaned and prepared results. Azure HDInsight can be used to perform those services for Azure SQL DW.

Azure SQL Data Warehouse (SQL DW) is a relational database store optimized for analytic workloads. Azure SQL DW scales based on partitioned tables. Tables can be partitioned across multiple nodes. Azure SQL DW nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see [SQL Data Warehouse - Manage Compute](../../synapse-analytics/sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md).

### Apache HBase

-Apache HBase is a key-value store available in Azure HDInsight. Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families.
+Apache HBase is a key-value store available in Azure HDInsight. Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semistructured data. The data is stored in a schemaless database organized by column families.

-Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schemaless database in the sense that neither the columns nor the type of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
+Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schemaless database in the sense that the columns and the data types stored in them don't need to be defined before you use them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
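Conceptually, the data model described here maps a row key to column families, each holding dynamically named columns. A plain-Python illustration of that shape (not an HBase client; names are made up):

```python
from collections import defaultdict

# table[row_key][column_family][qualifier] -> value
# No columns are declared up front: the "schema" is just row keys and
# column families; qualifiers appear as data arrives.
table = defaultdict(lambda: defaultdict(dict))

row_key = "sensor-42#2020-04-28T10:00"
table[row_key]["metrics"]["temp_c"] = 21.5
table[row_key]["metrics"]["humidity"] = 0.4
table[row_key]["meta"]["firmware"] = "1.3.0"

# Random access by row key, then by family and qualifier.
row = table[row_key]
print(sorted(row["metrics"]))  # qualifiers present in the 'metrics' family
```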

HBase is an excellent destination for sensor and log data for future analysis.

-HBase scalability is dependent on the number of nodes in the HDInsight cluster.
+HBase scalability depends on the number of nodes in the HDInsight cluster.

### Azure SQL Database and Azure Database
@@ -104,7 +104,7 @@ These products scale up, which means that they're scaled by adding more CPU and

## Azure Analysis Services

-Azure Analysis Services (AAS) is an analytical data engine used in decision support and business analytics, providing the analytical data for business reports and client applications such as Power BI, Excel, Reporting Services reports, and other data visualization tools.
+Azure Analysis Services (AAS) is an analytical data engine used in decision support and business analytics. AAS provides the analytical data for business reports and client applications such as Power BI, Excel, Reporting Services reports, and other data visualization tools.

Analysis cubes can scale by changing tiers for each individual cube. For more information, see [Azure Analysis Services Pricing](https://azure.microsoft.com/pricing/details/analysis-services/).
@@ -120,7 +120,7 @@ Sqoop uses MapReduce to import and export the data, to provide parallel operatio
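The parallel Sqoop import mentioned in the hunk context above might be invoked roughly as follows; the connection string, table, and paths are placeholders:

```sh
sqoop import \
  --connect "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>" \
  --username <user> --password-file /user/<user>/.password \
  --table Sales \
  --target-dir /ingest/sales \
  --num-mappers 4
```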

### Apache Flume

-Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Flume has a simple and flexible architecture based on streaming data flows. Flume is robust and fault-tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. Flume uses a simple extensible data model that allows for online analytic application.
+Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Flume has a flexible architecture based on streaming data flows. Flume is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. Flume uses a simple, extensible data model that allows for online analytic applications.
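A minimal Flume agent configuration expressing one source-channel-sink flow might look like this sketch; the agent name, log path, and sink path are hypothetical:

```properties
tier1.sources  = src1
tier1.channels = ch1
tier1.sinks    = sink1

tier1.sources.src1.type = exec
tier1.sources.src1.command = tail -F /var/log/app/app.log
tier1.sources.src1.channels = ch1

tier1.channels.ch1.type = memory
tier1.channels.ch1.capacity = 10000

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /ingest/logs/%Y-%m-%d
tier1.sinks.sink1.channel = ch1
```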

Apache Flume can't be used with Azure HDInsight. An on-premises Hadoop installation can use Flume to send data to either Azure Storage Blobs or Azure Data Lake Storage. For more information, see [Using Apache Flume with HDInsight](https://web.archive.org/web/20190217104751/https://blogs.msdn.microsoft.com/bigdatasupport/2014/03/18/using-apache-flume-with-hdinsight/).