7. The **Hive Job Summary** appears and displays information about the running job. Use the **Refresh** link to update the job information until the **Job Status** changes to **Completed**.
articles/hdinsight/hadoop/apache-hadoop-using-apache-hive-as-an-etl-tool.md (+2 −2)
@@ -8,8 +8,8 @@ ms.reviewer: jasonh
ms.custom: hdinsightactive
ms.topic: conceptual
ms.date: 11/14/2017
-
---
+
# Use Apache Hive as an Extract, Transform, and Load (ETL) tool
You typically need to clean and transform incoming data before loading it into a destination suitable for analytics. Extract, Transform, and Load (ETL) operations are used to prepare data and load it into a data destination. Apache Hive on HDInsight can read in unstructured data, process the data as needed, and then load the data into a relational data warehouse for decision support systems. In this approach, data is extracted from the source and stored in scalable storage, such as Azure Storage blobs or Azure Data Lake Storage. The data is then transformed using a sequence of Hive queries and is finally staged within Hive in preparation for bulk loading into the destination data store.
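The extract-stage-transform pattern described above can be sketched in HiveQL. This is a minimal, hypothetical example — the storage path, table names, and columns are illustrative, not from the article. An external table is declared over raw CSV files in Azure Storage, and a transformation query stages the cleaned data in ORC format, ready for bulk loading into the destination store:

```hql
-- Hypothetical example: the wasbs:// path, tables, and columns are
-- illustrative. Extract: define a schema over raw CSV files in storage.
CREATE EXTERNAL TABLE rawlogs (
    logdate       STRING,
    userid        STRING,
    duration_min  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasbs://data@mystorageaccount.blob.core.windows.net/rawlogs/';

-- Stage: an ORC-backed table suitable for efficient bulk loading.
CREATE TABLE stagedlogs (
    logdate       DATE,
    userid        STRING,
    duration_min  INT
)
STORED AS ORC;

-- Transform and load into the staging table, dropping invalid rows.
INSERT OVERWRITE TABLE stagedlogs
SELECT CAST(logdate AS DATE), userid, duration_min
FROM rawlogs
WHERE userid IS NOT NULL;
```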
@@ -18,7 +18,7 @@ You typically need to clean and transform incoming data before loading it into a
The following figure shows an overview of the use case and model for ETL automation. Input data is transformed to generate the appropriate output. During that transformation, the data can change shape, data type, and even language. ETL processes can convert Imperial to metric, change time zones, and improve precision to properly align with existing data in the destination. ETL processes can also combine new data with existing data to keep reporting up-to-date, or to provide further insight into existing data. Applications such as reporting tools and services can then consume this data in the desired format.
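As a sketch of such a transformation (the table and column names below are hypothetical, chosen only to illustrate the idea), a single Hive query can convert Imperial units to metric and normalize time zones:

```hql
-- Hypothetical source table and columns; shown only to illustrate
-- unit and time-zone conversion during the transform step.
SELECT
    deviceid,
    distance_miles * 1.60934                            AS distance_km,
    (temp_f - 32.0) * 5.0 / 9.0                         AS temp_c,
    to_utc_timestamp(eventtime, 'America/Los_Angeles')  AS eventtime_utc
FROM sensor_readings;
```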
-
+
Hadoop is typically used in ETL processes that import either a massive number of text files (like CSVs) or a smaller but frequently changing number of text files, or both. Hive is a great tool to use to prepare the data before loading it into the data destination. Hive allows you to create a schema over the CSV and use a SQL-like language to generate MapReduce programs that interact with the data.
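For example, a schema can be declared over CSV files already in storage, and a SQL-like query against it compiles into a distributed MapReduce (or Tez) job. The table, columns, and path below are hypothetical:

```hql
-- Hypothetical schema over comma-delimited text files; the path and
-- columns are illustrative.
CREATE EXTERNAL TABLE sales (
    saledate  STRING,
    region    STRING,
    amount    DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/example/data/sales/';

-- This SQL-like query is compiled into a distributed job over the CSV data.
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;
```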
articles/hdinsight/hadoop/apache-hadoop-visual-studio-tools-get-started.md (+8 −9)
@@ -78,7 +78,7 @@ To connect to your Azure subscription:
4. From Server Explorer, a list of existing HDInsight clusters appears. If you don't have any clusters, you can create one by using the Azure portal, Azure PowerShell, or the HDInsight SDK. For more information, see [Create HDInsight clusters](../hdinsight-hadoop-provision-linux-clusters.md).
-
+
5. Expand an HDInsight cluster. **Hive Databases**, a default storage account, linked storage accounts, and **Hadoop Service Log** appear. You can further expand the entities.
@@ -108,11 +108,11 @@ Right-click the linked cluster, select **Edit**, and update the cluster
## Explore linked resources
From Server Explorer, you can see the default storage account and any linked storage accounts. If you expand the default storage account, you can see the containers on the storage account. The default storage account and the default container are marked. Right-click any of the containers to view the container contents.
-
+
After opening a container, you can use the following buttons to upload, delete, and download blobs:
-
+
## Run interactive Apache Hive queries
[Apache Hive](https://hive.apache.org) is a data warehouse infrastructure that's built on Hadoop. Hive is used for data summarization, queries, and analysis. You can use Data Lake Tools for Visual Studio to run Hive queries from Visual Studio. For more information about Hive, see [Use Apache Hive with HDInsight](hdinsight-use-hive.md).
@@ -196,7 +196,7 @@ To create and run ad-hoc queries:
Ensure **Batch** is selected and then select **Submit**. If you select the advanced submit option, configure **Job Name**, **Arguments**, **Additional Configurations**, and **Status Directory** for the script.
-
+
@@ -219,15 +219,15 @@ To create and run a Hive solution:
The job summary varies slightly between **Batch** and **Interactive** mode.
### View job graph
@@ -237,15 +237,14 @@ To view all the operators inside the vertex, double-click on the vertices of the
Even if Tez is specified as the execution engine, the job graph may not appear when no Tez application is launched. This can happen because the job contains no DML statements, or because the DML statements return without launching a Tez application. For example, `SELECT * FROM table1` does not launch a Tez application.
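To illustrate the distinction (the column name `col1` below is hypothetical), a simple pass-through query can be answered with a direct fetch, while an aggregation forces a distributed plan:

```hql
-- Answered by a direct fetch from storage: no Tez application,
-- so no job graph is produced.
SELECT * FROM table1;

-- Requires a distributed execution plan, so a Tez application is
-- launched and a job graph becomes available. `col1` is hypothetical.
SELECT col1, COUNT(*) AS cnt
FROM table1
GROUP BY col1;
```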

### Task Execution Detail
From the job graph, you can select **Task Execution Detail** to get structured and visualized information for Hive jobs, along with more job details. If performance issues occur, use this view to investigate: for example, you can see how each task operates and get detailed information about each task (data read/write, schedule/start/end time, and so on). Use this information to tune job configurations or system architecture.
-
+