---
title: Using Apache Hive as an ETL Tool - Azure HDInsight
description: Use Apache Hive to extract, transform, and load (ETL) data in Azure HDInsight.
author: ashishthaps
ms.author: ashishth
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,seoapr2020
ms.date: 04/28/2020
---

# Use Apache Hive as an Extract, Transform, and Load (ETL) tool

You typically need to clean and transform incoming data before loading it into a destination suitable for analytics. Extract, Transform, and Load (ETL) operations are used to prepare data and load it into a data destination. Apache Hive on HDInsight can read in unstructured data, process the data as needed, and then load the data into a relational data warehouse for decision support systems. In this approach, data is extracted from the source and stored in scalable storage, such as Azure Storage blobs or Azure Data Lake Storage. The data is then transformed using a sequence of Hive queries, and is finally staged within Hive in preparation for bulk loading into the destination data store.

## Use case and model overview

The following figure shows an overview of the use case and model for ETL automation. Input data is transformed to generate the appropriate output. During that transformation, the data can change shape, data type, and even language. ETL processes can convert Imperial to metric, change time zones, and improve precision to align properly with existing data in the destination. ETL processes can also combine new data with existing data to keep reporting up to date, or to provide further insight into existing data. Applications such as reporting tools and services can then consume this data in the desired format.

![Apache Hive ETL architecture](./media/apache-hadoop-using-apache-hive-as-an-etl-tool/hdinsight-etl-architecture.png)

Hadoop is typically used in ETL processes that import either a massive number of text files (like CSVs), or a smaller but frequently changing number of text files, or both. Hive is a great tool to use to prepare the data before loading it into the data destination. Hive allows you to create a schema over the CSV files and use a SQL-like language to generate MapReduce programs that interact with the data.
The typical steps to using Hive to perform ETL are as follows:

1. Load data into Azure Data Lake Storage or Azure Blob Storage.
2. Create a Metadata Store database (using Azure SQL Database) for use by Hive in storing your schemas.
3. Create an HDInsight cluster and connect the data store.
4. Define the schema to apply at read-time over data in the data store:

    ```hql
    DROP TABLE IF EXISTS hvac;

    --create the hvac table on comma-separated sensor data stored in Azure Storage blobs
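    --NOTE: the CREATE statement below is a sketch completing the step above;
    --the exact columns, storage account, container, and path are assumptions
    --based on the HDInsight sensor sample data, so adjust them for your environment
    CREATE EXTERNAL TABLE hvac(`date` STRING, time STRING, targettemp BIGINT,
        actualtemp BIGINT, system BIGINT, systemage BIGINT, buildingid BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'wasbs://<containername>@<accountname>.blob.core.windows.net/SensorSampleData/hvac/';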
    ```

## Output targets

You can use Hive to output data to a variety of targets, including:

* A relational database, such as SQL Server or Azure SQL Database.
* A data warehouse, such as Azure SQL Data Warehouse.
* Excel.
* Azure table and blob storage.
* Applications or services that require data to be processed into specific formats, or as files that contain specific types of information structure.
* A JSON Document Store like [Azure Cosmos DB](https://azure.microsoft.com/services/cosmos-db/).

## Considerations

The ETL model is typically used when you want to:

* Load stream data or large volumes of semi-structured or unstructured data from external sources into an existing database or information system.
* Clean, transform, and validate the data before loading it, perhaps by using more than one transformation pass through the cluster.
* Generate reports and visualizations that are regularly updated. For example, if the report takes too long to generate during the day, you can schedule the report to run at night. To automatically run a Hive query, you can use [Azure Logic Apps](../../logic-apps/logic-apps-overview.md) and PowerShell.

If the target for the data isn't a database, you can generate a file in the appropriate format within the query, for example a CSV. This file can then be imported into Excel or Power BI.

If you need to execute several operations on the data as part of the ETL process, consider how you manage them. If the operations are controlled by an external program, rather than as a workflow within the solution, you need to decide whether some operations can be executed in parallel, and to detect when each job completes. Using a workflow mechanism such as Oozie within Hadoop may be easier than trying to orchestrate a sequence of operations using external scripts or custom programs. For more information about Oozie, see [Workflow and job orchestration](https://msdn.microsoft.com/library/dn749829.aspx).

## Next steps

* [ETL at scale](apache-hadoop-etl-at-scale.md)
* [Operationalize a data pipeline](../hdinsight-operationalize-data-pipeline.md)