Commit 11716e1

Merge pull request #110387 from dagiro/freshness37
freshness37
2 parents c751440 + 5ef6379 commit 11716e1

1 file changed: +11 -8 lines changed

articles/hdinsight/hadoop/using-json-in-hive.md

Lines changed: 11 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -6,7 +6,7 @@ ms.author: hrasheed
66
ms.reviewer: jasonh
77
ms.service: hdinsight
88
ms.topic: conceptual
9-
ms.date: 10/29/2019
9+
ms.date: 04/06/2020
1010
---
1111

1212
# Process and analyze JSON documents by using Apache Hive in Azure HDInsight
@@ -54,9 +54,12 @@ The file can be found at `wasb://[email protected]
5454

5555
In this article, you use the Apache Hive console. For instructions on how to open the Hive console, see [Use Apache Ambari Hive View with Apache Hadoop in HDInsight](apache-hadoop-use-hive-ambari-view.md).
5656

57+
> [!NOTE]
58+
> Hive View is no longer available in HDInsight 4.0.
59+
5760
## Flatten JSON documents
5861

59-
The methods listed in the next section require that the JSON document be composed of a single row. So, you must flatten the JSON document to a string. If your JSON document is already flattened, you can skip this step and go straight to the next section on analyzing JSON data. To flatten the JSON document, run the following script:
62+
The methods listed in the next section require the JSON document to be composed of a single row. So, you must flatten the JSON document to a string. If your JSON document is already flattened, you can skip this step and go straight to the next section on analyzing JSON data. To flatten the JSON document, run the following script:
6063

6164
```sql
6265
DROP TABLE IF EXISTS StudentsRaw;
@@ -100,7 +103,7 @@ Hive provides three different mechanisms to run queries on JSON documents, or yo
100103

101104
### Use the get_json_object UDF
102105

103-
Hive provides a built-in UDF called [get_json_object](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object) that can perform JSON querying during runtime. This method takes two arguments--the table name and method name, which has the flattened JSON document and the JSON field that needs to be parsed. Lets look at an example to see how this UDF works.
106+
Hive provides a built-in UDF called [get_json_object](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object) that queries JSON during runtime. This method takes two arguments: the table name and method name. The method name has the flattened JSON document and the JSON field that needs to be parsed. Let's look at an example to see how this UDF works.
104107

105108
The following query returns the first name and last name for each student:
106109

@@ -113,18 +116,18 @@ FROM StudentsOneLine;
113116

114117
Here is the output when you run this query in the console window:
115118

116-
![Apache Hive get json object UDF](./media/using-json-in-hive/hdinsight-get-json-object.png)
119+
![Apache Hive gets json object UDF](./media/using-json-in-hive/hdinsight-get-json-object.png)
117120

118121
There are limitations of the get_json_object UDF:
119122

120123
* Because each field in the query requires reparsing of the query, it affects the performance.
121124
* **GET\_JSON_OBJECT()** returns the string representation of an array. To convert this array to a Hive array, you have to use regular expressions to replace the square brackets "[" and "]", and then you also have to call split to get the array.
122125

123-
This is why the Hive wiki recommends that you use **json_tuple**.
126+
This conversion is why the Hive wiki recommends that you use **json_tuple**.
124127

125128
### Use the json_tuple UDF
126129

127-
Another UDF provided by Hive is called [json_tuple](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple), which performs better than [get_ json _object](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object). This method takes a set of keys and a JSON string, and returns a tuple of values by using one function. The following query returns the student ID and the grade from the JSON document:
130+
Another UDF provided by Hive is called [json_tuple](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple), which does better than [get_ json _object](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object). This method takes a set of keys and a JSON string. Then returns a tuple of values. The following query returns the student ID and the grade from the JSON document:
128131

129132
```sql
130133
SELECT q1.StudentId, q1.Grade
@@ -137,15 +140,15 @@ The output of this script in the Hive console:
137140

138141
![Apache Hive json query results](./media/using-json-in-hive/hdinsight-json-tuple.png)
139142

140-
The json_tuple UDF uses the [lateral view](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView) syntax in Hive, which enables json\_tuple to create a virtual table by applying the UDT function to each row of the original table. Complex JSONs become too unwieldy because of the repeated use of **LATERAL VIEW**. Furthermore, **JSON_TUPLE** can't handle nested JSONs.
143+
The `json_tuple` UDF uses the [lateral view](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView) syntax in Hive, which enables json\_tuple to create a virtual table by applying the UDT function to each row of the original table. Complex JSONs become too unwieldy because of the repeated use of **LATERAL VIEW**. Furthermore, **JSON_TUPLE** can't handle nested JSONs.
141144

142145
### Use a custom SerDe
143146

144147
SerDe is the best choice for parsing nested JSON documents. It lets you define the JSON schema, and then you can use the schema to parse the documents. For instructions, see [How to use a custom JSON SerDe with Microsoft Azure HDInsight](https://web.archive.org/web/20190217104719/https://blogs.msdn.microsoft.com/bigdatasupport/2014/06/18/how-to-use-a-custom-json-serde-with-microsoft-azure-hdinsight/).
145148

146149
## Summary
147150

148-
In conclusion, the type of JSON operator in Hive that you choose depends on your scenario. If you have a simple JSON document and you have only one field to look up on, you can choose to use the Hive UDF **get_json_object**. If you've more than one key to look up on, then you can use **json_tuple**. If you have a nested document, then you should use the **JSON SerDe**.
151+
The type of JSON operator in Hive that you choose depends on your scenario. With a simple JSON document and one field to look up, choose the Hive UDF **get_json_object**. If you've more than one key to look up on, then you can use **json_tuple**. For nested documents, use the **JSON SerDe**.
149152

150153
## Next steps
151154
