articles/hdinsight/hbase/apache-hbase-phoenix-performance.md

---
title: Phoenix performance in Azure HDInsight
description: Best practices to optimize Apache Phoenix performance for Azure HDInsight clusters
author: ashishthaps
ms.author: ashishth
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive
ms.date: 12/27/2019
---

# Apache Phoenix performance best practices
The most important aspect of [Apache Phoenix](https://phoenix.apache.org/) performance is to optimize the underlying [Apache HBase](https://hbase.apache.org/). Phoenix creates a relational data model atop HBase that converts SQL queries into HBase operations, such as scans. The design of your table schema, the selection and ordering of the fields in your primary key, and your use of indexes all affect Phoenix performance.

The schema design of a Phoenix table includes the primary key design, column family design, individual column design, and how the data is partitioned.

### Primary key design

The primary key defined on a table in Phoenix determines how data is stored within the rowkey of the underlying HBase table. In HBase, the only way to access a particular row is with the rowkey. In addition, data stored in an HBase table is sorted by the rowkey. Phoenix builds the rowkey value by concatenating the values of each of the columns in the row, in the order they're defined in the primary key.

For example, a table for contacts has the first name, last name, phone number, and address, all in the same column family. You could define a primary key based on an increasing sequence number:
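
The original DDL isn't shown in this excerpt; a minimal sketch of such a definition, using the column names that appear in the rowkey table below, might look like this:

```sql
-- Hypothetical contact table keyed on an increasing sequence number;
-- all columns land in the default column family.
CREATE TABLE CONTACTS (
    ID                BIGINT NOT NULL PRIMARY KEY,  -- increasing sequence number
    FIRSTNAME         VARCHAR,
    LASTNAME          VARCHAR,
    PHONE             VARCHAR,
    ADDRESS           VARCHAR,
    SOCIALSECURITYNUM VARCHAR
);
```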

A key like this is easy to generate, but it makes a poor rowkey: monotonically increasing values concentrate writes on a single region server, and the key says nothing about the data it points to. You could instead build the primary key from the columns you search by, such as last name, first name, and social security number.

With this new primary key, the row keys generated by Phoenix would be values such as `Dole-John-111`.

For that row, the data for the rowkey is represented as shown:

|rowkey|key|value|
|------|---|-----|
|Dole-John-111|address|1111 San Gabriel Dr.|
|Dole-John-111|phone|1-425-000-0002|
|Dole-John-111|firstName|John|
|Dole-John-111|lastName|Dole|
|Dole-John-111|socialSecurityNum|111|

This rowkey now stores a duplicate copy of the data. Consider the size and number of columns you include in your primary key, because this value is included with every cell in the underlying HBase table.

Also, if certain columns tend to be accessed together, put those columns in the same column family.

### Column design

* Keep VARCHAR columns under about 1 MB because of the I/O costs of large columns. When processing queries, HBase materializes cells in full before sending them over to the client, and the client receives them in full before handing them off to the application code.
* Store column values using a compact format such as protobuf, Avro, msgpack, or BSON. JSON isn't recommended, as it's larger.
* Consider compressing data before storage to cut latency and I/O costs.

### Partition data

Phoenix can spread data evenly across region servers when you salt the table with SALT_BUCKETS, or you can control partitioning yourself by pre-splitting the table on chosen key boundaries at creation time.

## Index design

Secondary indexes can improve read performance by turning what would be a full table scan into a point lookup, at the cost of storage space and write speed.

### Use covered indexes

Covered indexes are indexes that include data from the row in addition to the values that are indexed. After finding the desired index entry, there's no need to access the primary table.

For example, in the contact table described earlier, you could create a secondary index on just the socialSecurityNum column. That index speeds up queries that filter by socialSecurityNum values, but retrieving any other field requires another read against the main table.
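
A sketch of that index, extended into a covered index (the index name and column spellings are illustrative, following the contact example):

```sql
-- Covered index: matching rows also carry the INCLUDE'd columns, so queries
-- selecting them are answered from the index without touching the data table.
CREATE INDEX SSN_IDX ON CONTACTS (SOCIALSECURITYNUM)
    INCLUDE (FIRSTNAME, LASTNAME, PHONE, ADDRESS);
```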

To check whether a query actually benefits from your schema and indexes, examine its execution plan. In [SQLLine](http://sqlline.sourceforge.net/), use EXPLAIN followed by your SQL query to get the execution plan without actually running the query.

As an example, say you have a table called FLIGHTS that stores flight delay information.
To select all the flights with an airlineid of `19805`, where airlineid is a field that isn't in the primary key or in any index:

```sql
select * from "FLIGHTS" where airlineid = '19805';
```

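As a sketch, prefixing that query with EXPLAIN would surface a plan like the following; the exact text is illustrative rather than captured from a real cluster, but a FULL SCAN entry is the signal that no index applies:

```sql
explain select * from "FLIGHTS" where airlineid = '19805';
-- Illustrative plan; a FULL SCAN line means every row is read and the
-- filter runs server-side instead of using an index:
--   CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER FLIGHTS
--       SERVER FILTER BY AIRLINEID = '19805'
```
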
## Scenarios

The following guidelines describe some common patterns.

### Read-heavy workloads
For read-heavy use cases, make sure you're using indexes. Additionally, to save read-time overhead, consider creating covered indexes.
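
If the optimizer doesn't choose an index for a given query, Phoenix accepts an index hint; this sketch reuses the hypothetical SSN_IDX covered index from earlier:

```sql
-- The /*+ INDEX(table index) */ hint asks Phoenix to resolve the query
-- through the named index instead of scanning the data table.
SELECT /*+ INDEX(CONTACTS SSN_IDX) */ FIRSTNAME, LASTNAME
FROM CONTACTS
WHERE SOCIALSECURITYNUM = '111';
```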
### Write-heavy workloads
For write-heavy workloads where the primary key is monotonically increasing, create salt buckets to help avoid write hotspots, at the expense of overall read throughput because of the additional scans needed. Also, when using UPSERT to write a large number of records, turn off autoCommit and batch up the records.
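
As a sketch (the table and bucket count are assumptions, not from this article), salting is declared when the table is created:

```sql
-- Hypothetical write-heavy table with a monotonically increasing key.
-- SALT_BUCKETS prefixes each rowkey with a hashed byte, spreading writes
-- across region servers; a value near the region server count is a common start.
CREATE TABLE EVENTS (
    EVENT_ID BIGINT NOT NULL PRIMARY KEY,
    PAYLOAD  VARCHAR
) SALT_BUCKETS = 10;
```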
### Bulk deletes
When deleting a large data set, turn on autoCommit before issuing the DELETE query, so that the client doesn't need to remember the row keys for all deleted rows. AutoCommit prevents the client from buffering the rows affected by the DELETE, so that Phoenix can delete them directly on the region servers without the expense of returning them to the client.
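
In SQLLine, the pattern might look like this sketch (the table and filter are hypothetical):

```sql
-- !autocommit is a SQLLine command. With autocommit on, Phoenix can run the
-- DELETE directly on the region servers instead of buffering rowkeys client-side.
!autocommit on
DELETE FROM EVENTS WHERE EVENT_TIME < TO_DATE('2019-01-01 00:00:00');
```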