articles/hdinsight/hadoop/using-json-in-hive.md
Lines changed: 11 additions & 8 deletions
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@ ms.author: hrasheed
6
6
ms.reviewer: jasonh
7
7
ms.service: hdinsight
8
8
ms.topic: conceptual
9
-
ms.date: 10/29/2019
9
+
ms.date: 04/06/2020
10
10
---
11
11
12
12
# Process and analyze JSON documents by using Apache Hive in Azure HDInsight
@@ -54,9 +54,12 @@ The file can be found at `wasb://[email protected]
54
54
55
55
In this article, you use the Apache Hive console. For instructions on how to open the Hive console, see [Use Apache Ambari Hive View with Apache Hadoop in HDInsight](apache-hadoop-use-hive-ambari-view.md).
56
56
57
+
> [!NOTE]
58
+
> Hive View is no longer available in HDInsight 4.0.
59
+
57
60
## Flatten JSON documents
58
61
59
-
The methods listed in the next section require that the JSON document be composed of a single row. So, you must flatten the JSON document to a string. If your JSON document is already flattened, you can skip this step and go straight to the next section on analyzing JSON data. To flatten the JSON document, run the following script:
62
+
The methods listed in the next section require the JSON document to be composed of a single row. So, you must flatten the JSON document to a string. If your JSON document is already flattened, you can skip this step and go straight to the next section on analyzing JSON data. To flatten the JSON document, run the following script:
60
63
61
64
```sql
62
65
DROP TABLE IF EXISTS StudentsRaw;
@@ -100,7 +103,7 @@ Hive provides three different mechanisms to run queries on JSON documents, or yo
100
103
101
104
### Use the get_json_object UDF
102
105
103
-
Hive provides a built-in UDF called [get_json_object](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object) that can perform JSON querying during runtime. This method takes two arguments--the table name and method name, which has the flattened JSON document and the JSON field that needs to be parsed. Let’s look at an example to see how this UDF works.
106
+
Hive provides a built-in UDF called [get_json_object](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object) that queries JSON at runtime. This method takes two arguments: the column that holds the flattened JSON document, written as table name.column name, and a path to the JSON field that needs to be parsed. Let's look at an example to see how this UDF works.
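As a minimal sketch of the two arguments, the following statement calls the UDF on an inline JSON literal rather than a table column. The sample document and its field names are made up for illustration and aren't part of the article's dataset.

```sql
-- Hypothetical sample document, passed inline as the first argument.
-- The second argument is a path expression: $ is the root object,
-- .name selects a child field, and [0] selects an array element.
SELECT get_json_object(
         '{"student": {"id": 7, "names": ["Ana", "Bo"]}}',
         '$.student.names[0]'
       ) AS first_name;
```

In a real query, the first argument is normally a column reference instead of a literal.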
104
107
105
108
The following query returns the first name and last name for each student:
106
109
@@ -113,18 +116,18 @@ FROM StudentsOneLine;
113
116
114
117
Here is the output when you run this query in the console window:
115
118
116
-

* Because each field in the query requires reparsing of the query, it affects the performance.
121
124
* **GET\_JSON_OBJECT()** returns the string representation of an array. To convert this array to a Hive array, you have to use regular expressions to replace the square brackets "[" and "]", and then you also have to call split to get the array.
122
125
123
-
This is why the Hive wiki recommends that you use **json_tuple**.
126
+
This conversion is why the Hive wiki recommends that you use **json_tuple**.
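The bracket-stripping conversion described above might look like the following sketch. It assumes the flattened documents sit in a string column named `json_body` and that each document has a top-level array field named `Scores`. Both names are assumptions for illustration.

```sql
-- Assumed names: json_body (string column) and Scores (top-level JSON array).
-- get_json_object returns the array as text, such as [90,85].
-- regexp_replace removes the square brackets, then split cuts on commas.
-- The result is array<string>, and string elements keep their JSON quotes.
SELECT split(
         regexp_replace(get_json_object(json_body, '$.Scores'), '\\[|\\]', ''),
         ','
       ) AS scores
FROM StudentsOneLine;
```

Splitting on commas only works for arrays of simple scalar values. Elements that themselves contain commas, such as nested objects, would be cut apart.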
124
127
125
128
### Use the json_tuple UDF
126
129
127
-
Another UDF provided by Hive is called [json_tuple](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple), which performs better than [get_ json _object](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object). This method takes a set of keys and a JSON string, and returns a tuple of values by using one function. The following query returns the student ID and the grade from the JSON document:
130
+
Another UDF provided by Hive is called [json_tuple](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-json_tuple), which performs better than [get_json_object](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object). This method takes a JSON string and a set of keys, and returns a tuple of values. The following query returns the student ID and the grade from the JSON document:
128
131
129
132
```sql
130
133
SELECT q1.StudentId, q1.Grade
@@ -137,15 +140,15 @@ The output of this script in the Hive console:
The json_tuple UDF uses the [lateral view](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView) syntax in Hive, which enables json\_tuple to create a virtual table by applying the UDT function to each row of the original table. Complex JSONs become too unwieldy because of the repeated use of **LATERAL VIEW**. Furthermore, **JSON_TUPLE** can't handle nested JSONs.
143
+
The `json_tuple` UDF uses the [lateral view](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView) syntax in Hive, which enables `json_tuple` to create a virtual table by applying the UDTF (user-defined table-generating function) to each row of the original table. Queries over complex JSON documents become unwieldy because of the repeated use of **LATERAL VIEW**. Furthermore, **JSON_TUPLE** can't handle nested JSON.
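To illustrate why repeated **LATERAL VIEW** clauses get unwieldy, here is a sketch that reaches one level into a nested object. Because `json_tuple` reads only top-level keys, each extra level of nesting needs another lateral view. The column name `json_body` and the fields `Address` and `City` are hypothetical.

```sql
-- Assumed names: json_body (string column), plus hypothetical fields
-- StudentId (top level) and Address.City (one level down).
SELECT s.StudentId, a.City
FROM StudentsOneLine jt
-- First pass: read top-level keys. Address comes back as a JSON string.
LATERAL VIEW json_tuple(jt.json_body, 'StudentId', 'Address') s AS StudentId, Address
-- Second pass: parse that string to reach the nested key.
LATERAL VIEW json_tuple(s.Address, 'City') a AS City;
```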
141
144
142
145
### Use a custom SerDe
143
146
144
147
SerDe is the best choice for parsing nested JSON documents. It lets you define the JSON schema, and then you can use the schema to parse the documents. For instructions, see [How to use a custom JSON SerDe with Microsoft Azure HDInsight](https://web.archive.org/web/20190217104719/https://blogs.msdn.microsoft.com/bigdatasupport/2014/06/18/how-to-use-a-custom-json-serde-with-microsoft-azure-hdinsight/).
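As a rough sketch of the schema-first approach, the following table definition uses the HCatalog `JsonSerDe` class from the Apache Hive project, which may differ from the custom SerDe that the linked post describes. The table name, column layout, and storage location are all hypothetical, and the sketch assumes the JAR that provides the SerDe class is available on the cluster.

```sql
-- Sketch only. Assumes the hive-hcatalog-core JAR that provides this SerDe
-- class is available to Hive, and that each line of the input files holds
-- one complete JSON document. Table name, columns, and location are hypothetical.
CREATE EXTERNAL TABLE StudentsNested (
  StudentId      STRING,
  Grade          INT,
  StudentDetails ARRAY<STRUCT<FirstName:STRING, LastName:STRING>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/example/data/students-json/';

-- Nested values are then addressed with ordinary array and struct syntax.
SELECT StudentId, StudentDetails[0].FirstName
FROM StudentsNested;
```

Declaring the schema once in the table definition keeps the queries themselves free of path expressions and lateral views.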
145
148
146
149
## Summary
147
150
148
-
In conclusion, the type of JSON operator in Hive that you choose depends on your scenario. If you have a simple JSON document and you have only one field to look up on, you can choose to use the Hive UDF **get_json_object**. If you've more than one key to look up on, then you can use **json_tuple**. If you have a nested document, then you should use the **JSON SerDe**.
151
+
The type of JSON operator in Hive that you choose depends on your scenario. With a simple JSON document and one field to look up, choose the Hive UDF **get_json_object**. If you have more than one key to look up, use **json_tuple**. For nested documents, use the **JSON SerDe**.