articles/hdinsight/hadoop/apache-hadoop-etl-at-scale.md (16 additions, 17 deletions)
@@ -2,13 +2,12 @@
title: Extract, transform, and load (ETL) at Scale - Azure HDInsight
description: Learn how extract, transform, and load is used in HDInsight with Apache Hadoop.
author: ashishthaps
+ms.author: ashishth
ms.reviewer: jasonh
-
ms.service: hdinsight
-ms.custom: hdinsightactive
ms.topic: conceptual
-ms.date: 06/13/2019
-ms.author: ashishth
+ms.custom: hdinsightactive
+ms.date: 01/27/2020
---

# Extract, transform, and load (ETL) at scale
@@ -35,7 +34,7 @@ For more information, see [Use Apache Oozie with Apache Hadoop to define and run

### Azure Data Factory

-Azure Data Factory provides orchestration capabilities in the form of platform-as-a-service. It is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
+Azure Data Factory provides orchestration capabilities in the form of platform-as-a-service. It's a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.

Using Azure Data Factory, you can:
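To make the orchestration concrete, here is a minimal sketch of triggering an existing Data Factory pipeline run with the Azure SDK for Python (`azure-identity` and `azure-mgmt-datafactory`). The subscription, resource group, factory, pipeline, and parameter names are placeholders, not part of the article.

```python
# Hedged sketch: start a run of an existing Data Factory pipeline.
# Assumes the azure-identity and azure-mgmt-datafactory packages, and that
# the resource group, factory, and pipeline named below already exist.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"  # placeholder
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Kick off a pipeline run, passing runtime parameters to the pipeline.
run = adf_client.pipelines.create_run(
    resource_group_name="my-rg",        # hypothetical names
    factory_name="my-data-factory",
    pipeline_name="CopyRawToStaging",
    parameters={"inputFolder": "raw/2020-01-27"},
)
print(f"Started pipeline run {run.run_id}")
```

The `parameters` dictionary is how a data-driven workflow is parameterized per run, for example to point one pipeline definition at a different input folder each day.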
@@ -47,11 +46,11 @@ For more information on Azure Data Factory, see the [documentation](../../data-f

## Ingest file storage and result storage

-Source data files are typically loaded into a location in Azure Storage or Azure Data Lake Storage. Files can be in any format, but typically they are flat files like CSVs.
+Source data files are typically loaded into a location in Azure Storage or Azure Data Lake Storage. Files can be in any format, but typically they're flat files like CSVs.

### Azure Storage

-[Azure Storage](https://azure.microsoft.com/services/storage/blobs/) has specific scalability targets. For more information, see [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md). For most analytic nodes, Azure Storage scales best when dealing with many smaller files. Azure Storage guarantees the same performance, no matter how many files or how large the files (as long as you are within your limits). This means that you can store terabytes of data and still get consistent performance, whether you are using a subset of the data or all of the data.
+[Azure Storage](https://azure.microsoft.com/services/storage/blobs/) has specific scalability targets. For more information, see [Scalability and performance targets for Blob storage](../../storage/blobs/scalability-targets.md). For most analytic nodes, Azure Storage scales best when dealing with many smaller files. Azure Storage guarantees the same performance, no matter how many files or how large the files (as long as you are within your limits). This means that you can store terabytes of data and still get consistent performance, whether you're using a subset of the data or all of the data.

Azure Storage has several different types of blobs. An *append blob* is a great option for storing web logs or sensor data.
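As a sketch of that append-blob pattern, assuming the `azure-storage-blob` (v12) package; the connection string, container, and blob names below are placeholders:

```python
# Hedged sketch: append log records to an Azure Storage append blob,
# the blob type called out above for web logs and sensor data.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="logs", blob="sensors/device-42.log")

if not blob.exists():
    blob.create_append_blob()  # append blobs must be created explicitly

# Each append_block call atomically adds a block to the end of the blob,
# which is what makes this blob type a good fit for ever-growing logs.
blob.append_block(b"2020-01-27T10:00:00Z,device-42,temp=21.3\n")
```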
@@ -81,13 +80,13 @@ For uploading datasets in the terabyte range, network latency can be a major pro

Azure SQL DW is a great choice to store cleaned and prepared results for future analytics. Azure HDInsight can be used to perform those services for Azure SQL DW.

-Azure SQL Data Warehouse (SQL DW) is a relational database store optimized for analytic workloads. Azure SQL DW scales based on partitioned tables. Tables can be partitioned across multiple nodes. Azure SQL DW nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. See [SQL Data Warehouse - Manage Compute](../../sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md) for more information.
+Azure SQL Data Warehouse (SQL DW) is a relational database store optimized for analytic workloads. Azure SQL DW scales based on partitioned tables. Tables can be partitioned across multiple nodes. Azure SQL DW nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see [SQL Data Warehouse - Manage Compute](../../sql-data-warehouse/sql-data-warehouse-manage-compute-overview.md).
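To illustrate that partitioned-table model, here is a hedged sketch that creates a hash-distributed table over `pyodbc`. The server, database, credentials, and schema are hypothetical; `DISTRIBUTION = HASH` is the SQL DW mechanism that spreads a table's rows across compute nodes.

```python
# Hedged sketch: create a hash-distributed table in Azure SQL DW.
# Assumes the pyodbc package and ODBC Driver 17 for SQL Server.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydw;"
    "UID=loader;PWD=<password>"  # placeholders
)
cursor = conn.cursor()

# DISTRIBUTION = HASH spreads rows across nodes by the hash of CustomerId,
# so large joins and aggregations on that key run in parallel.
cursor.execute("""
    CREATE TABLE dbo.CleanedSales (
        CustomerId INT NOT NULL,
        SaleDate   DATE,
        Amount     DECIMAL(18, 2)
    )
    WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
""")
conn.commit()
```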
### Apache HBase

Apache HBase is a key-value store available in Azure HDInsight. Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families.

-Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schemaless database in the sense that neither the columns nor the type of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
+Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schemaless database in the sense that neither the columns nor the type of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.

HBase is an excellent destination for sensor and log data for future analysis.
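One way to exercise the row and column-family model described above is the community `happybase` client, which talks to HBase through a Thrift gateway; the host, table, and `d` column family below are placeholders, and the availability of a Thrift endpoint is an assumption, not something the article states.

```python
# Hedged sketch: write and read a sensor reading via HBase's Thrift gateway,
# using the community happybase package. The table is assumed to exist.
import happybase

connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("sensor_readings")

# Row keys are typically designed for scan locality, e.g. deviceId + timestamp.
table.put(
    b"device42#20200127100000",
    {b"d:temp": b"21.3", b"d:humidity": b"40"},  # cells grouped by family "d"
)

# Random reads by key are fast; the returned dict is keyed by family:qualifier.
row = table.row(b"device42#20200127100000")
print(row[b"d:temp"])
```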
@@ -101,36 +100,36 @@ Azure offers three different relational databases as platform-as-a-service (PAAS
* [Azure Database for MySQL](../../mysql/overview.md) is an implementation of Oracle MySQL.
* [Azure Database for PostgreSQL](../../postgresql/quickstart-create-server-database-portal.md) is an implementation of PostgreSQL.

-These products scale up, which means that they are scaled by adding more CPU and memory. You can also choose to use premium disks with the products for better I/O performance.
+These products scale up, which means that they're scaled by adding more CPU and memory. You can also choose to use premium disks with the products for better I/O performance.
-## Azure Analysis Services
+## Azure Analysis Services

Azure Analysis Services (AAS) is an analytical data engine used in decision support and business analytics, providing the analytical data for business reports and client applications such as Power BI, Excel, Reporting Services reports, and other data visualization tools.

Analysis cubes can scale by changing tiers for each individual cube. For more information, see [Azure Analysis Services Pricing](https://azure.microsoft.com/pricing/details/analysis-services/).
## Extract and Load

-Once the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.
+Once the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.

### Apache Sqoop

-Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured, and unstructured data sources.
+Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured, and unstructured data sources.

Sqoop uses MapReduce to import and export the data, to provide parallel operation and fault tolerance.
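As a sketch of such a parallel import: Sqoop is a command-line tool, so the example below simply shells out to it from Python to stay in one language. The JDBC URL, credentials, table, and paths are placeholders; `--num-mappers` is the knob that controls how many parallel MapReduce map tasks carry the transfer.

```python
# Hedged sketch: launch a parallel Sqoop import from a cluster head node.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
        "--username", "loader",
        "--password-file", "/user/loader/.sqoop.pwd",  # keeps the secret out of argv
        "--table", "Sales",
        "--target-dir", "/data/raw/sales",
        "--num-mappers", "4",  # four parallel map tasks
    ],
    check=True,  # raise if the Sqoop job fails
)
```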
### Apache Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Flume has a simple and flexible architecture based on streaming data flows. Flume is robust and fault-tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. Flume uses a simple extensible data model that allows for online analytic application.

-Apache Flume cannot be used with Azure HDInsight. An on-premises Hadoop installation can use Flume to send data to either Azure Storage Blobs or Azure Data Lake Storage. For more information, see [Using Apache Flume with HDInsight](https://web.archive.org/web/20190217104751/https://blogs.msdn.microsoft.com/bigdatasupport/2014/03/18/using-apache-flume-with-hdinsight/).
+Apache Flume can't be used with Azure HDInsight. An on-premises Hadoop installation can use Flume to send data to either Azure Storage Blobs or Azure Data Lake Storage. For more information, see [Using Apache Flume with HDInsight](https://web.archive.org/web/20190217104751/https://blogs.msdn.microsoft.com/bigdatasupport/2014/03/18/using-apache-flume-with-hdinsight/).
## Transform

-Once data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They are all supported on HDInsight.
+Once data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.
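Of the three, here is a minimal Spark SQL sketch of such a cleanup step in PySpark; the `wasbs` paths and column names are placeholders.

```python
# Hedged sketch: read raw CSVs, clean and reshape them, and write a prepared
# result to the cluster's default storage for downstream analytics.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-sales").getOrCreate()

raw = spark.read.csv("wasbs:///data/raw/sales", header=True, inferSchema=True)

cleaned = (
    raw.dropna(subset=["CustomerId"])                  # drop incomplete rows
       .withColumn("SaleDate", F.to_date("SaleDate"))  # normalize types
       .filter(F.col("Amount") > 0)                    # remove bad records
)

cleaned.write.mode("overwrite").parquet("wasbs:///data/curated/sales")
```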
## Next steps

-* [Use Apache Pig with Apache Hadoop on HDInsight](hdinsight-use-pig.md)
-* [Using Apache Hive as an ETL Tool](apache-hadoop-using-apache-hive-as-an-etl-tool.md)
+* [Using Apache Hive as an ETL Tool](apache-hadoop-using-apache-hive-as-an-etl-tool.md)
* [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](../hdinsight-hadoop-use-data-lake-storage-gen2.md)
+* [Move data from Azure SQL Database To Apache Hive table](./apache-hadoop-use-sqoop-mac-linux.md)