articles/hdinsight/kafka/apache-kafka-performance-tuning.md
# Performance optimization for Apache Kafka HDInsight clusters
This article gives some suggestions for optimizing the performance of your Apache Kafka workloads in HDInsight. The focus is on adjusting producer, broker, and consumer configuration. Sometimes you also need to adjust operating system (OS) settings to tune performance under heavy workloads. There are different ways of measuring performance, and the optimizations that you apply depend on your business needs.
## Architecture overview
Kafka topics are used to organize records. Producers produce records, and consumers consume them. Producers send records to Kafka brokers, which then store the data. Each worker node in your HDInsight cluster is a Kafka broker.
Topics partition records across brokers. When consuming records, you can use up to one consumer per partition to achieve parallel processing of the data.
Replication is used to duplicate partitions across nodes. This replication protects against node (broker) outages. A single partition among the group of replicas is designated as the partition leader. Producer traffic is routed to the leader of each partition, using the state managed by ZooKeeper.
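As a concrete illustration, partitions and replicas are declared when a topic is created. The following is a minimal sketch using the Kafka `AdminClient`; the broker address, topic name, partition count, and replication factor are placeholder assumptions, not recommendations:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "<broker1>:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Eight partitions allow up to eight consumers to read in parallel;
            // a replication factor of three keeps two follower copies of each partition.
            NewTopic topic = new NewTopic("example-topic", 8, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```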
## Identify your scenario
Apache Kafka performance has two main aspects: throughput and latency. Throughput is the maximum rate at which data can be processed. Higher throughput is usually better. Latency is the time it takes for data to be stored or retrieved. Lower latency is usually better. Finding the right balance between throughput, latency, and the cost of the application's infrastructure can be challenging. Your performance requirements likely match one of the following three common situations, based on whether you require high throughput, low latency, or both:
* High throughput, low latency. This scenario requires both high throughput and low latency (~100 milliseconds). An example of this type of application is service availability monitoring.
* High throughput, high latency. This scenario requires high throughput (~1.5 GBps) but can tolerate higher latency (< 250 ms). An example of this type of application is telemetry data ingestion for near real-time processes like security and intrusion detection applications.
* Low throughput, low latency. This scenario requires low latency (< 10 ms) for real-time processing, but can tolerate lower throughput. An example of this type of application is online spelling and grammar checks.
## Producer configurations
The following sections highlight some of the most important generic configuration properties for optimizing the performance of your Kafka producers. For a detailed explanation of all configuration properties, see [Apache Kafka documentation on producer configurations](https://kafka.apache.org/documentation/#producerconfigs).
### Batch size
A Kafka producer batches records before sending them to brokers; the `batch.size` property sets an upper bound, in bytes, on how much data is sent per partition in one request. Larger batches mean fewer requests to the broker, at the cost of holding individual records slightly longer.

### Compression

A Kafka producer can be configured to compress messages before sending them to brokers.
Of the two commonly used compression codecs, `gzip` and `snappy`, `gzip` has a higher compression ratio, which results in lower disk usage at the cost of higher CPU load. The `snappy` codec provides less compression with less CPU overhead. You can decide which codec to use based on broker disk or producer CPU limitations. `gzip` can compress data about five times as much as `snappy`.
Data compression increases the number of records that can be stored on a disk. It may also increase CPU overhead if there's a mismatch between the compression formats used by the producer and the broker, as the data must be compressed before sending and then decompressed before processing.
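To make these producer-side knobs concrete, here's a minimal Java sketch that applies the batching and compression settings discussed above. The broker address, topic name, and the specific `batch.size`, `linger.ms`, and `compression.type` values are illustrative assumptions, not tuning recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "<broker1>:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);        // batch up to 64 KB per partition
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);            // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip"); // or "snappy" for lower CPU cost

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-topic", "key", "value"));
        }
    }
}
```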
## Broker settings
The following sections highlight some of the most important settings for optimizing the performance of your Kafka brokers. For a detailed explanation of all broker settings, see [Apache Kafka documentation on broker configurations](https://kafka.apache.org/documentation/#brokerconfigs).
### Number of disks
Each Kafka partition is a log file on the system, and producer threads can write to multiple logs simultaneously.
Increasing the partition density (the number of partitions per broker) adds an overhead related to metadata operations and per partition request/response between the partition leader and its followers. Even in the absence of data flowing through, partition replicas still fetch data from leaders, which results in extra processing for send and receive requests over the network.
For Apache Kafka clusters 2.1 and 2.4 and above in HDInsight, we recommend a maximum of 2,000 partitions per broker, including replicas. Increasing the number of partitions per broker decreases throughput and may also cause topic unavailability. For more information on Kafka partition support, see [the official Apache Kafka blog post on the increase in the number of supported partitions in version 1.1.0](https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions). For details on modifying topics, see [Apache Kafka: modifying topics](https://kafka.apache.org/documentation/#basic_ops_modify_topic).
### Number of replicas
For more information on replication, see [Apache Kafka: replication](https://kafka.apache.org/documentation/#replication).
## Consumer configurations
The following section highlights some important generic configurations for optimizing the performance of your Kafka consumers. For a detailed explanation of all configurations, see [Apache Kafka documentation on consumer configurations](https://kafka.apache.org/documentation/#consumerconfigs).
### Number of consumers
It's a good practice to have the number of partitions equal to the number of consumers. If the number of consumers is less than the number of partitions, some consumers read from multiple partitions, increasing consumer latency.
If the number of consumers is greater than the number of partitions, the extra consumers sit idle, wasting consumer resources.
### Avoid frequent consumer rebalance
Consumer rebalance is triggered by a partition ownership change (that is, consumers scale out or scale in), a broker crash (since brokers act as the group coordinators for consumer groups), a consumer crash, or the addition of a new topic or new partitions. During rebalancing, consumers can't consume, which increases latency.
A consumer is considered alive if it can send a heartbeat to the broker within `session.timeout.ms`. Otherwise, the consumer is considered dead or failed, which triggers a consumer rebalance. The lower the consumer's `session.timeout.ms`, the faster those failures can be detected.
If the `session.timeout.ms` is too low, a consumer could experience repeated unnecessary rebalances due to scenarios such as a batch of messages taking longer to process or a JVM GC pause taking too long. If you have a consumer that spends too much time processing messages, you can address this either by increasing `max.poll.interval.ms`, the upper bound on the amount of time that a consumer can be idle before fetching more records, or by reducing the maximum size of batches returned with the configuration parameter `max.poll.records`.
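As a hedged illustration of how these settings fit together, the following Java sketch configures a consumer with the three knobs just discussed. The broker address, group ID, and the specific values are placeholders for demonstration, not recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "<broker1>:9092"); // placeholder broker
props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");           // placeholder group ID
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

// Lower values detect dead consumers sooner; higher values tolerate brief
// pauses (GC, slow batches) without triggering a rebalance.
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 10000);

// Upper bound on the time between poll() calls before the consumer is
// considered failed and its partitions are reassigned.
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);

// Smaller poll batches shorten per-batch processing time, which reduces
// the chance of exceeding max.poll.interval.ms.
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
```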
### Batching
Like producers, we can add batching for consumers. The amount of data consumers can get in each fetch request can be configured by changing the `fetch.min.bytes` setting. This parameter defines the minimum number of bytes expected from a consumer's fetch response. Increasing this value reduces the number of fetch requests made to the broker, and therefore reduces extra overhead. By default, this value is 1. Similarly, there's another configuration, `fetch.max.wait.ms`: if a fetch request doesn't have enough messages to satisfy `fetch.min.bytes`, the broker holds the response until the wait time defined by `fetch.max.wait.ms` expires.
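Continuing the sketch above (reusing the same `props` object before the consumer is constructed), the two fetch-batching settings map onto two properties; the 64 KB and 500 ms values are arbitrary illustrations:

```java
// Ask the broker to hold the fetch response until at least 64 KB accumulate...
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65536);
// ...or until 500 ms elapse, whichever comes first.
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
```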
> [!NOTE]
> In a few scenarios, consumers may seem slow when they fail to process a message. If you don't commit the offset after an exception, the consumer is stuck at a particular offset in an infinite loop and doesn't move forward, increasing the lag on the consumer side as a result.
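A minimal sketch of one way to handle this, assuming an already-configured `KafkaConsumer<String, String>` named `consumer` and a hypothetical `handle` method for the application logic: after a processing exception, commit the offset just past the failing record so the consumer doesn't stay stuck on it.

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

while (true) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
        try {
            handle(record); // hypothetical application logic
        } catch (Exception e) {
            // Commit the offset just past the failing record so the consumer
            // isn't stuck replaying it forever (for example, after a restart).
            TopicPartition tp = new TopicPartition(record.topic(), record.partition());
            consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(record.offset() + 1)));
        }
    }
    consumer.commitSync(); // commit normal progress for successfully handled records
}
```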
`vm.max_map_count` defines the maximum number of memory map areas (mmap) that a process can have. By default, on the Linux VMs of an HDInsight Apache Kafka cluster, the value is 65535.
In Apache Kafka, each log segment requires a pair of index/timeindex files, and each of these files consumes one mmap. In other words, each log segment uses two mmaps. Thus, if each partition hosts a single log segment, it requires a minimum of two mmaps. The number of log segments per partition varies depending on the **segment size, load intensity, retention policy, and rolling period**, and generally tends to be more than one. `Mmap value = 2 * ((partition size) / (segment size)) * (partitions)`
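As a hypothetical worked example, a broker hosting 1,000 partitions, each retaining about 10 GB of data in 1-GB segments, needs roughly `2 * (10 / 1) * 1000 = 20,000` mmaps, comfortably under the default `vm.max_map_count` of 65535. With 4,000 such partitions, the requirement grows to 80,000 and exceeds the default, so the limit would need to be raised.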
If the required mmap value exceeds `vm.max_map_count`, the broker raises a **"Map failed"** exception.