Commit 5e1b25a

Merge pull request #113129 from dagiro/freshness_c60
freshness_c60
2 parents 8ecb599 + 1c3bad3 commit 5e1b25a

1 file changed: +15 -17 lines changed

articles/hdinsight/hadoop/apache-hadoop-using-apache-hive-as-an-etl-tool.md

Lines changed: 15 additions & 17 deletions
@@ -1,35 +1,35 @@
 ---
 title: Using Apache Hive as an ETL Tool - Azure HDInsight
 description: Use Apache Hive to extract, transform, and load (ETL) data in Azure HDInsight.
-ms.service: hdinsight
 author: ashishthaps
 ms.author: ashishth
 ms.reviewer: jasonh
-ms.custom: hdinsightactive
+ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 11/22/2019
+ms.custom: hdinsightactive,seoapr2020
+ms.date: 04/28/2020
 ---
 
 # Use Apache Hive as an Extract, Transform, and Load (ETL) tool
 
-You typically need to clean and transform incoming data before loading it into a destination suitable for analytics. Extract, Transform, and Load (ETL) operations are used to prepare data and load it into a data destination. Apache Hive on HDInsight can read in unstructured data, process the data as needed, and then load the data into a relational data warehouse for decision support systems. In this approach, data is extracted from the source and stored in scalable storage, such as Azure Storage blobs or Azure Data Lake Storage. The data is then transformed using a sequence of Hive queries and is finally staged within Hive in preparation for bulk loading into the destination data store.
+You typically need to clean and transform incoming data before loading it into a destination suitable for analytics. Extract, Transform, and Load (ETL) operations are used to prepare data and load it into a data destination. Apache Hive on HDInsight can read in unstructured data, process the data as needed, and then load the data into a relational data warehouse for decision support systems. In this approach, data is extracted from the source. Then stored in adaptable storage, such as Azure Storage blobs or Azure Data Lake Storage. The data is then transformed using a sequence of Hive queries. Then staged within Hive in preparation for bulk loading into the destination data store.
 
 ## Use case and model overview
 
-The following figure shows an overview of the use case and model for ETL automation. Input data is transformed to generate the appropriate output. During that transformation, the data can change shape, data type, and even language. ETL processes can convert Imperial to metric, change time zones, and improve precision to properly align with existing data in the destination. ETL processes can also combine new data with existing data to keep reporting up to date, or to provide further insight into existing data. Applications such as reporting tools and services can then consume this data in the desired format.
+The following figure shows an overview of the use case and model for ETL automation. Input data is transformed to generate the appropriate output. During that transformation, the data changes shape, data type, and even language. ETL processes can convert Imperial to metric, change time zones, and improve precision to properly align with existing data in the destination. ETL processes can also combine new data with existing data to keep reporting up to date, or to provide further insight into existing data. Applications such as reporting tools and services can then consume this data in the wanted format.
 
 ![Apache Hive as ETL architecture](./media/apache-hadoop-using-apache-hive-as-an-etl-tool/hdinsight-etl-architecture.png)
 
-Hadoop is typically used in ETL processes that import either a massive number of text files (like CSVs) or a smaller but frequently changing number of text files, or both. Hive is a great tool to use to prepare the data before loading it into the data destination. Hive allows you to create a schema over the CSV and use a SQL-like language to generate MapReduce programs that interact with the data.
+Hadoop is typically used in ETL processes that import either a massive number of text files (like CSVs). Or a smaller but frequently changing number of text files, or both. Hive is a great tool to use to prepare the data before loading it into the data destination. Hive allows you to create a schema over the CSV and use a SQL-like language to generate MapReduce programs that interact with the data.
 
-The typical steps to using Hive to perform ETL are as follows:
+The typical steps to using Hive to do ETL are as follows:
 
 1. Load data into Azure Data Lake Storage or Azure Blob Storage.
 2. Create a Metadata Store database (using Azure SQL Database) for use by Hive in storing your schemas.
 3. Create an HDInsight cluster and connect the data store.
 4. Define the schema to apply at read-time over data in the data store:
 
-```
+```hql
 DROP TABLE IF EXISTS hvac;
 
 --create the hvac table on comma-separated sensor data stored in Azure Storage blobs
@@ -61,30 +61,28 @@ Data sources are typically external data that can be matched to existing data in
 
 ## Output targets
 
-You can use Hive to output data to a variety of targets including:
+You can use Hive to output data to different kinds of targets including:
 
 * A relational database, such as SQL Server or Azure SQL Database.
 * A data warehouse, such as Azure SQL Data Warehouse.
 * Excel.
 * Azure table and blob storage.
 * Applications or services that require data to be processed into specific formats, or as files that contain specific types of information structure.
-* A JSON Document Store like [Azure Cosmos DB](https://azure.microsoft.com/services/cosmos-db/).
+* A JSON Document Store like Azure Cosmos DB.
 
 ## Considerations
 
 The ETL model is typically used when you want to:
 
-* Load stream data or large volumes of semi-structured or unstructured data from external sources into an existing database or information system.
-* Clean, transform, and validate the data before loading it, perhaps by using more than one transformation pass through the cluster.
-* Generate reports and visualizations that are regularly updated. For example, if the report takes too long to generate during the day, you can schedule the report to run at night. To automatically run a Hive query, you can use [Azure Logic Apps](../../logic-apps/logic-apps-overview.md) and PowerShell.
+`*` Load stream data or large volumes of semi-structured or unstructured data from external sources into an existing database or information system.
+`*` Clean, transform, and validate the data before loading it, perhaps by using more than one transformation pass through the cluster.
+`*` Generate reports and visualizations that are regularly updated. For example, if the report takes too long to generate during the day, you can schedule the report to run at night. To automatically run a Hive query, you can use [Azure Logic Apps](../../logic-apps/logic-apps-overview.md) and PowerShell.
 
 If the target for the data isn't a database, you can generate a file in the appropriate format within the query, for example a CSV. This file can then be imported into Excel or Power BI.
 
-If you need to execute several operations on the data as part of the ETL process, consider how you manage them. If the operations are controlled by an external program, rather than as a workflow within the solution, you need to decide whether some operations can be executed in parallel, and to detect when each job completes. Using a workflow mechanism such as Oozie within Hadoop may be easier than trying to orchestrate a sequence of operations using external scripts or custom programs. For more information about Oozie, see [Workflow and job orchestration](https://msdn.microsoft.com/library/dn749829.aspx).
+If you need to execute several operations on the data as part of the ETL process, consider how you manage them. With operations controlled by an external program, rather than as a workflow within the solution, decide whether some operations can be executed in parallel. And to detect when each job completes. Using a workflow mechanism such as Oozie within Hadoop may be easier than trying to orchestrate a sequence of operations using external scripts or custom programs.
 
 ## Next steps
 
 * [ETL at scale](apache-hadoop-etl-at-scale.md)
-* [Operationalize a data pipeline](../hdinsight-operationalize-data-pipeline.md)
-
-<!-- * [ETL Deep Dive](../hdinsight-etl-deep-dive.md) -->
+* [`Operationalize a data pipeline`](../hdinsight-operationalize-data-pipeline.md)
