HDInsight 4.0 has several advantages over HDInsight 3.6. Here's an overview of what's new.
### HBase

- Advanced features:
  - Procedure V2 (`procv2`), an updated framework for executing multistep HBase administrative operations.
  - Fully off-heap read/write path.
  - In-memory compactions.
  - HBase cluster support of the Azure Data Lake Storage Gen2 Premium tier.
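
Of these, in-memory compaction is configurable per cluster. As a hedged sketch, an `hbase-site.xml` fragment might look like the following; the property name comes from Apache HBase 2.x, and `BASIC` is one of its documented policies (`NONE`, `BASIC`, `EAGER`):

```xml
<!-- hbase-site.xml sketch: set the default in-memory compaction policy -->
<property>
  <name>hbase.hregion.compacting.memstore.type</name>
  <value>BASIC</value>
</property>
```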

### Kafka

- Advanced features:
  - Kafka partition distribution on Azure fault domains.
  - Zstandard (`zstd`) compression support.
  - Kafka Consumer Incremental Rebalance.
  - Support for MirrorMaker 2.0.
- Performance advantage:
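
The Zstandard support listed above is enabled per producer. A minimal sketch of a producer configuration follows; `compression.type` is a standard Kafka client property, and `zstd` requires Kafka 2.1 or later on both clients and brokers:

```properties
# Producer configuration sketch: compress batches with Zstandard.
compression.type=zstd
# Optional batching settings that typically pair with compression.
linger.ms=20
batch.size=65536
```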

Vectorized query execution is a feature that greatly reduces the CPU usage for typical query operations such as:

- Aggregate
- Join

Vectorization is also implemented for the ORC format. Spark has also used whole-stage code generation and this vectorization (for Parquet) since Spark 2.0. There's an added time-stamp column for Parquet vectorization and format under LLAP.

> [!WARNING]
> Parquet writes are slow when you convert to zoned times from the time stamp. For more information, see the [issue details](https://issues.apache.org/jira/browse/HIVE-24693) on the Apache Hive site.
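
As a sketch, vectorized execution is controlled by standard Apache Hive configuration properties, which can be set per session:

```sql
-- Enable vectorized query execution for the current Hive session.
SET hive.vectorized.execution.enabled = true;
-- Also vectorize the reduce side of queries (aggregates, joins).
SET hive.vectorized.execution.reduce.enabled = true;
```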
For more information, see the Azure blog post on Hive materialized views.

## Surrogate keys

Use the built-in `SURROGATE_KEY` user-defined function (UDF) to automatically generate numerical IDs for rows as you enter data into a table. The generated surrogate keys can replace wide, multiple composite keys.

Hive supports surrogate keys on ACID tables only. The table that you want to join by using surrogate keys can't have column types that need casting. These data types must be primitives, such as `INT` or `STRING`.
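
As an illustrative sketch (the table and column names here are hypothetical), a surrogate key can be declared as a default value on an ACID table:

```sql
-- Hypothetical ACID table; row_id is filled in automatically on insert.
CREATE TABLE students (
  row_id BIGINT NOT NULL DEFAULT SURROGATE_KEY(),
  name STRING,
  gpa DOUBLE)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Omit row_id so the SURROGATE_KEY UDF generates it.
INSERT INTO students (name, gpa) VALUES ('Alice', 3.8);
```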

Joins that use the generated keys are faster than joins that use strings. Using generated keys doesn't force data into a single node by a row number. You can generate keys as abstractions of natural keys. Surrogate keys have an advantage over universally unique identifiers (UUIDs), which are slower and probabilistic.

The `SURROGATE_KEY` UDF generates a unique ID for every row that you insert into a table. It generates keys based on the execution environment in a distributed system, which includes many factors such as:

- Internal data structures
- State of a table
- Last transaction ID

Surrogate key generation doesn't require any coordination between compute tasks. The UDF takes either no arguments or two arguments: