Commit 60a799a

Merge pull request #212305 from sreekzz/patch-104
New page
2 parents ec6ea9e + 07371df

5 files changed: +261 −0 lines

articles/hdinsight/TOC.yml (2 additions, 0 deletions)

```diff
@@ -877,6 +877,8 @@ items:
       href: ./interactive-query/hive-workload-management.md
   - name: Manage
     items:
+    - name: Significant version changes in HDInsight 4.0 and advantages
+      href: ./benefits-of-migrating-to-hdinsight-40.md
   - name: Copy Hive tables across Storage Accounts
     href: ./interactive-query/hive-migration-across-storage-accounts.md
   - name: Migrate default Hive metastore to external metastore on Azure HDInsight
```
New file (259 additions, 0 deletions):
---
title: Benefits of migrating to Azure HDInsight 4.0
description: Learn the benefits of migrating to Azure HDInsight 4.0.
ms.service: hdinsight
ms.topic: conceptual
ms.date: 09/23/2022
---
# Significant version changes in HDInsight 4.0 and advantages

HDInsight 4.0 has several advantages over HDInsight 3.6. Here's an overview of what's new in Azure HDInsight 4.0.
| # | OSS component | HDInsight 4.0 Version | HDInsight 3.6 Version |
| --- | --- | --- | --- |
| 1 | Apache Hadoop | 3.1.1 | 2.7.3 |
| 2 | Apache HBase | 2.1.6 | 1.1.2 |
| 3 | Apache Hive | 3.1.0 | 1.2.1, 2.1 (LLAP) |
| 4 | Apache Kafka | 2.1.1, 2.4.1 (Preview) | 1.1 |
| 5 | Apache Phoenix | 5 | 4.7.0 |
| 6 | Apache Spark | 2.4.4, 3.0.0 (Preview) | 2.2 |
| 7 | Apache TEZ | 0.9.1 | 0.7.0 |
| 8 | Apache ZooKeeper | 3.4.6 | 3.4.6 |
| 9 | Apache Ranger | 1.1.0 | 0.7.0 |
## Workloads and Features

**Hive**
- Advanced features
  - LLAP workload management
  - LLAP support for JDBC, Druid, and Kafka connectors
  - Better SQL features: constraints and default values
  - Surrogate keys
  - Information schema
- Performance advantages
  - Result caching: caching query results allows a previously computed query result to be reused
  - Dynamic materialized views: precomputation of summaries
  - ACID V2 performance improvements in both the storage format and the execution engine
- Security
  - GDPR compliance enabled on Apache Hive transactions
  - Hive UDF execution authorization in Ranger

**HBase**
- Advanced features
  - Procedure V2 (procv2), an updated framework for executing multistep HBase administrative operations
  - Fully off-heap read/write path
  - In-memory compactions
  - HBase cluster support for Premium ADLS Gen2
- Performance advantages
  - Accelerated Writes, which uses Azure premium SSD managed disks to improve performance of the Apache HBase Write Ahead Log (WAL)
- Security
  - Hardening of both local and global secondary indexes

**Kafka**
- Advanced features
  - Kafka partition distribution on Azure fault domains
  - Zstd compression support
  - Kafka consumer incremental rebalance
  - MirrorMaker 2.0 support
- Performance advantages
  - Improved windowed aggregation performance in Kafka Streams
  - Improved broker resiliency by reducing the memory footprint of message conversion
  - Replication protocol improvements for fast leader failover
- Security
  - Access control for topic creation, for specific topics or topic prefixes
  - Hostname verification to prevent SSL configuration man-in-the-middle attacks
  - Improved encryption support with faster Transport Layer Security (TLS) and CRC32C implementations

**Spark**
- Advanced features
  - Structured Streaming support for ORC
  - Capability to integrate with the new metastore catalog feature
  - Structured Streaming support for the Hive Streaming library
  - Transparent writes to the Hive warehouse
  - SparkCruise, an automatic computation reuse system for Spark
- Performance advantages
  - Result caching: caching query results allows a previously computed query result to be reused
  - Dynamic materialized views: precomputation of summaries
- Security
  - GDPR compliance enabled for Spark transactions
## Hive Partition Discovery and Repair

Hive automatically discovers and synchronizes partition metadata in the Hive Metastore.

The `discover.partitions` table property enables and disables synchronization of the file system with partitions. In external partitioned tables, this property is enabled (`true`) by default.
When the Hive Metastore Service (HMS) is started in remote service mode, a background thread (`PartitionManagementTask`) is scheduled periodically every 300 seconds (configurable via the `metastore.partition.management.task.frequency` config). It looks for tables with the `discover.partitions` table property set to `true` and performs `msck` repair in sync mode.

If the table is a transactional table, an exclusive lock is obtained for that table before performing `msck repair`. With this table property, you no longer need to run `MSCK REPAIR TABLE table_name SYNC PARTITIONS` manually.
If you have an external table created with a version of Hive that doesn't support partition discovery, enable partition discovery for the table:

```sql
ALTER TABLE exttbl SET TBLPROPERTIES ('discover.partitions' = 'true');
```

To set synchronization of partitions to occur every 10 minutes (expressed in seconds): in Ambari > Hive > Configs, set `metastore.partition.management.task.frequency` to 600.

:::image type="content" source="./media/hdinsight-migrate-to-40/ambari-hive-config.png" alt-text="Screenshot showing Ambari Hive config file with frequency value.":::

> [!WARNING]
> With the `PartitionManagementTask` running every 10 minutes, there is pressure on the SQL Server DTU.

You can verify the output from the Microsoft Azure portal.

:::image type="content" source="./media/hdinsight-migrate-to-40/hive-verify-output.png" alt-text="Screenshot showing compute utilization graph.":::

Hive drops the metadata and corresponding data in any partition older than the retention period. You express the retention time using a numeral and the following character or characters:

```
ms (milliseconds)
s (seconds)
m (minutes)
d (days)
```

For example, to configure a partition retention period of one week:

```sql
ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');
```

The partition metadata and the actual data for `employees` are automatically dropped after a week.
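The background repair loop described above can be sketched as follows. This is a hypothetical illustration of the `PartitionManagementTask` behavior, not the actual HMS code; the table object shape and the callables are assumptions:

```python
def partition_management_cycle(tables, run_msck_sync, acquire_exclusive_lock):
    """One cycle of a simplified PartitionManagementTask (illustrative only).

    tables: iterable of objects with .name, .properties (dict), .is_transactional
    run_msck_sync: callable performing `MSCK REPAIR ... SYNC PARTITIONS`
    acquire_exclusive_lock: context-manager factory for transactional tables
    """
    repaired = []
    for table in tables:
        # Only tables that opted in via the table property are considered.
        if table.properties.get("discover.partitions") != "true":
            continue
        if table.is_transactional:
            # Transactional tables require an exclusive lock before repair.
            with acquire_exclusive_lock(table.name):
                run_msck_sync(table.name)
        else:
            run_msck_sync(table.name)
        repaired.append(table.name)
    return repaired
```

In the real service this cycle runs every `metastore.partition.management.task.frequency` seconds; here the scheduling is left to the caller.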
## Hive 3

### Performance optimizations available under Hive 3

- OLAP vectorization
- Dynamic semijoin reduction
- Parquet support for vectorization with LLAP
- Automatic query cache

**New SQL features**

- Materialized views
- Surrogate keys
- Constraints
- Metastore `CachedStore`
**OLAP Vectorization**

Vectorization allows Hive to process a batch of rows together instead of processing one row at a time. Each batch is usually an array of primitive types. Operations are performed on the entire column vector, which improves the instruction pipelines and cache usage.
Hive 3 adds vectorized execution of PTFs, rollups, and grouping sets.
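As an illustration only (not Hive's internals), the difference between row-at-a-time and columnar batch processing can be sketched like this; the batch size and column layout are assumptions:

```python
# Row-at-a-time: one loop iteration and branch per row.
def sum_filtered_rows(rows, threshold):
    total = 0
    for row in rows:                      # rows: list of (value,) tuples
        if row[0] > threshold:
            total += row[0]
    return total

# Vectorized: operate on a whole column vector (batch) at once.
def sum_filtered_vector(column, threshold, batch_size=1024):
    total = 0
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]   # one column vector
        # A real engine exploits SIMD and tight primitive arrays here;
        # the point is per-batch work instead of per-row work.
        total += sum(v for v in batch if v > threshold)
    return total
```

Both produce the same result; the vectorized form amortizes dispatch cost over each batch, which is where the pipeline and cache benefits come from.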
**Dynamic `Semijoin` reduction**

Dynamic semijoin reduction dramatically improves performance for selective joins.
It builds a Bloom filter from one side of the join and uses it to filter rows from the other side, skipping the scan and further evaluation of rows that wouldn't qualify for the join.
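A minimal sketch of the Bloom-filter idea behind semijoin reduction, assuming a toy filter and in-memory rows (Hive's actual implementation differs):

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter (not Hive's implementation)."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def semijoin_reduce(build_keys, probe_rows, key_of):
    """Filter probe-side rows using a Bloom filter built from the join keys
    of the (smaller) build side. False positives pass through, but the join
    itself still removes them, so the result is never wrong."""
    bf = BloomFilter()
    for key in build_keys:
        bf.add(key)
    return [row for row in probe_rows if bf.might_contain(key_of(row))]
```

Because a Bloom filter has no false negatives, every qualifying row survives the reduction; non-qualifying rows are usually (though not always) skipped before the expensive join.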
**Parquet support for vectorization with LLAP**

Vectorized query execution is a feature that greatly reduces the CPU usage for typical query operations such as:

* scans
* filters
* aggregates
* joins

Vectorization is also implemented for the ORC format. Spark has used whole-stage code generation and this vectorization (for Parquet) since Spark 2.0.
Support for a timestamp column was added for Parquet vectorization and the Parquet format under LLAP.

> [!WARNING]
> Parquet writes are slow because of the conversion from timestamps to zoned times. For more information, see [HIVE-24693](https://issues.apache.org/jira/browse/HIVE-24693).
### Automatic query cache

1. With `hive.query.results.cache.enabled=true`, every query that runs in Hive 3 stores its result in a cache.
1. If the input table changes, Hive evicts invalid data from the cache. For example, if you perform aggregation and the base table changes, queries you run most frequently stay in the cache, but stale results are evicted.
1. The query result cache works with managed tables only, because Hive can't track changes to an external table.
1. If you join external and managed tables, Hive falls back to executing the full query. The query result cache works with ACID tables. If you update an ACID table, Hive reruns the query automatically.
1. You can enable and disable the query result cache from the command line. You might want to do so to debug a query.
1. Disable the query result cache by setting the following parameter to false: `hive.query.results.cache.enabled=false`
1. Hive stores the query result cache in `/tmp/hive/__resultcache__/`. By default, Hive allocates 2 GB for the query result cache. You can change this setting by configuring the following parameter in bytes: `hive.query.results.cache.max.size`
1. Changes to query processing: during query compilation, Hive checks the results cache to see whether it already has the query results. On a cache hit, the query plan is set to a `FetchTask` that reads from the cached location.

During query execution:

1. If the results cache can be used for this query:
    1. The query is a `FetchTask` reading from the cached results directory.
    1. No cluster tasks are required.
1. If the results cache can't be used, the cluster tasks run as normal:
    1. Hive checks whether the computed query results are eligible to be added to the results cache.
    1. If the results can be cached, the temporary results generated for the query are saved to the results cache. Steps may be needed here to ensure that query clean-up doesn't delete the query results directory.

> [!NOTE]
> Parquet `DataWriteableWriter` relies on `NanoTimeUtils` to convert a timestamp object into a binary value. It calls `toString()` on the timestamp object and then parses the string.
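The invalidation behavior described above can be modeled as follows. This is a deliberately simplified in-memory sketch (per-table version counters instead of Hive's on-disk cache and ACID change tracking; all names here are illustrative):

```python
class QueryResultCache:
    """Toy model of a query result cache with table-version invalidation."""
    def __init__(self, enabled=True):
        self.enabled = enabled
        self._cache = {}          # query text -> (table snapshot, result)
        self._versions = {}       # table name -> version counter

    def table_changed(self, table):
        # Any write to a table bumps its version, staling dependent entries.
        self._versions[table] = self._versions.get(table, 0) + 1

    def _snapshot(self, tables):
        return tuple(self._versions.get(t, 0) for t in sorted(tables))

    def lookup(self, query, tables):
        if not self.enabled:
            return None
        entry = self._cache.get(query)
        if entry and entry[0] == self._snapshot(tables):
            return entry[1]       # cache hit: analogous to a FetchTask
        return None               # miss or stale: execute the query

    def store(self, query, tables, result):
        if self.enabled:
            self._cache[query] = (self._snapshot(tables), result)
```

A stale entry is simply ignored on lookup, which mirrors how an updated ACID table causes Hive to rerun the query instead of serving cached results.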
## SQL features

**Materialized Views**

The initial implementation, introduced in Apache Hive 3.0.0, focuses on introducing materialized views and automatic query rewriting based on those materializations. Materialized views can be stored natively in Hive or in other custom storage handlers (ORC), and they can seamlessly exploit new Hive features such as LLAP acceleration.

For more information, see [Hive - Materialized Views - Microsoft Tech Community](https://techcommunity.microsoft.com/t5/analytics-on-azure-blog/hive-materialized-views/ba-p/2502785).
## Surrogate Keys

Use the built-in `SURROGATE_KEY` user-defined function (UDF) to automatically generate numerical IDs for rows as you enter data into a table. The generated surrogate keys can replace wide, multiple composite keys.

Hive supports surrogate keys on ACID tables only. The table you want to join using surrogate keys can't have column types that need casting. These data types must be primitives, such as `INT` or `STRING`.

Joins using the generated keys are faster than joins using strings. Using generated keys doesn't force data into a single node by a row number. You can generate keys as abstractions of natural keys. Surrogate keys have an advantage over UUIDs, which are slower and probabilistic.

The `SURROGATE_KEY` UDF generates a unique ID for every row that you insert into a table.
It generates keys based on the execution environment in a distributed system, which includes many factors, such as:

1. Internal data structures
2. State of the table
3. Last transaction ID

Surrogate key generation doesn't require any coordination between compute tasks. The UDF takes either no arguments or two arguments:

1. Write ID bits
1. Task ID bits
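The no-coordination property comes from partitioning the key's bit space between the write ID, the task ID, and a per-task row counter. A hypothetical illustration (the real UDF's exact layout and default bit widths may differ):

```python
def make_surrogate_key(write_id, task_id, row_seq,
                       write_id_bits=24, task_id_bits=16):
    """Pack (write_id, task_id, per-task row counter) into one 64-bit key.

    Because each task owns a disjoint (write_id, task_id) prefix, tasks can
    hand out keys independently with no coordination. These bit widths are
    illustrative assumptions, not Hive's actual defaults.
    """
    row_bits = 64 - write_id_bits - task_id_bits
    assert write_id < (1 << write_id_bits)
    assert task_id < (1 << task_id_bits)
    assert row_seq < (1 << row_bits)
    return (write_id << (64 - write_id_bits)) \
         | (task_id << row_bits) \
         | row_seq
```

Two different tasks can never collide because their task-ID fields differ, and within one task the row counter is strictly increasing.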
### Constraints

Use SQL constraints to enforce data integrity and improve performance. The optimizer uses the constraint information to make smart decisions. Constraints can make data predictable and easy to locate.

|Constraint|Description|
|---|---|
|CHECK|Limits the range of values you can place in a column.|
|PRIMARY KEY|Identifies each row in a table using a unique identifier.|
|FOREIGN KEY|Identifies a row in another table using a unique identifier.|
|UNIQUE KEY|Checks that values stored in a column are different.|
|NOT NULL|Ensures that a column can't be set to NULL.|
|ENABLE|Ensures that all incoming data conforms to the constraint.|
|DISABLE|Doesn't ensure that all incoming data conforms to the constraint.|
|VALIDATE|Checks that all existing data in the table conforms to the constraint.|
|NOVALIDATE|Doesn't check that all existing data in the table conforms to the constraint.|
|ENFORCED|Maps to ENABLE NOVALIDATE.|
|NOT ENFORCED|Maps to DISABLE NOVALIDATE.|
|RELY|Specifies abiding by a constraint; used by the optimizer to apply further optimizations.|
|NORELY|Specifies not abiding by a constraint.|

For more information, see https://cwiki.apache.org/confluence/display/Hive/Supported+Features%3A++Apache+Hive+3.1
### Metastore `CachedStore`

Hive metastore operations take a lot of time and can slow down Hive compilation, in some extreme cases taking longer than the actual query run time. In particular, the latency of a cloud database is high, and a large share of total query runtime can be spent waiting for metastore SQL database operations. Metastore operation performance is greatly enhanced by a memory structure that caches the database query results:

`hive.metastore.rawstore.impl=org.apache.hadoop.hive.metastore.cache.CachedStore`

:::image type="content" source="./media/hdinsight-migrate-to-40/hive-metastore-properties.png" alt-text="Screenshot showing Hive metastore property file value for the hive.metastore.rawstore.impl field.":::
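The principle behind `CachedStore` is a read-through cache in front of the slow SQL-backed raw store. A minimal sketch under that assumption (the real class also handles invalidation and background refresh, which are omitted here):

```python
class CachedStoreSketch:
    """Read-through cache in front of a slow raw store (illustrative only)."""
    def __init__(self, raw_store):
        self.raw_store = raw_store      # e.g. a slow SQL-backed lookup callable
        self._cache = {}
        self.backend_calls = 0

    def get_table(self, name):
        if name not in self._cache:
            self.backend_calls += 1     # only a cache miss hits the database
            self._cache[name] = self.raw_store(name)
        return self._cache[name]
```

Repeated metadata lookups for the same table are then served from memory, which is where the compilation-time win comes from.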
## Troubleshooting guide

The [HDInsight 3.6 to 4.0 troubleshooting guide for Hive workloads](/azure/hdinsight/interactive-query/interactive-query-troubleshoot-migrate-36-to-40) provides answers to common issues faced when migrating Hive workloads from HDInsight 3.6 to HDInsight 4.0.
## References

**Hive 3.1.0**

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.0/hive-overview/content/hive_whats_new_in_this_release_hive.html

**HBase 2.1.6**

https://apache.googlesource.com/hbase/+/ba26a3e1fd5bda8a84f99111d9471f62bb29ed1d/RELEASENOTES.md

**Hadoop 3.1.1**

https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/release/3.1.1/RELEASENOTES.3.1.1.html

## Further reading

* [HDInsight 4.0 Announcement](/azure/hdinsight/hdinsight-version-release)
* [HDInsight 4.0 deep dive](https://azure.microsoft.com/blog/deep-dive-into-azure-hdinsight-4-0/)