
Commit 35460a7

Merge pull request #103802 from piyush-gupta1999/users/piyushgupta/kafka-docs-2.4
Kafka Document updated for HDInsight 5.0 and Kafka 2.4.1
2 parents 5c9cc92 + 2bcc708 commit 35460a7

10 files changed: +60 −60 lines

articles/hdinsight/hdinsight-40-component-versioning.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ The Open-source component versions associated with HDInsight 4.0 are listed in t
 | Apache Phoenix | 5 |
 | Apache Spark | 2.4.4 |
 | Apache Livy | 0.5 |
-| Apache Kafka | 2.1.1, 2.4.1 |
+| Apache Kafka | 2.1.1 |
 | Apache Ambari | 2.7.0 |
 | Apache Zeppelin | 0.8.0 |

articles/hdinsight/hdinsight-50-component-versioning.md

Lines changed: 20 additions & 20 deletions
@@ -16,21 +16,21 @@ Starting June 1, 2022, we have started rolling out a new version of HDInsight 5.

 The Open-source component versions associated with HDInsight 5.0 are listed in the following table.

-| Component | HDInsight 5.0 | HDInsight 4.0 |
-|------------------------|---------------| --------------|
-|Apache Spark | 3.1.2 | 2.4.4|
-|Apache Hive | 3.1.2 | 3.1.2 |
-|Apache Kafka | - |2.1.1 and 2.4.1|
-|Apache Hadoop |3.1.1 | 3.1.1 |
-|Apache Tez |0.9.1 | 0.9.1 |
-|Apache Pig | 0.16.1 | 0.16.1 |
-|Apache Ranger | 1.1.0 | 1.1.0 |
-|Apache Sqoop | 1.5.0 | 1.5.0 |
-|Apache Oozie | 4.3.1 | 4.3.1 |
-|Apache Zookeeper | 3.4.6 | 3.4.6 |
-|Apache Livy | 0.5 | 0.5 |
-|Apache Ambari | 2.7.0 | 2.7.0 |
-|Apache Zeppelin | 0.8.0 | 0.8.0 |
+| Component        | HDInsight 5.0 | HDInsight 4.0 |
+|------------------|---------------|---------------|
+| Apache Spark     | 3.1.2         | 2.4.4         |
+| Apache Hive      | 3.1.2         | 3.1.2         |
+| Apache Kafka     | 2.4.1         | 2.1.1         |
+| Apache Hadoop    | 3.1.1         | 3.1.1         |
+| Apache Tez       | 0.9.1         | 0.9.1         |
+| Apache Pig       | 0.16.1        | 0.16.1        |
+| Apache Ranger    | 1.1.0         | 1.1.0         |
+| Apache Sqoop     | 1.5.0         | 1.5.0         |
+| Apache Oozie     | 4.3.1         | 4.3.1         |
+| Apache Zookeeper | 3.4.6         | 3.4.6         |
+| Apache Livy      | 0.5           | 0.5           |
+| Apache Ambari    | 2.7.0         | 2.7.0         |
+| Apache Zeppelin  | 0.8.0         | 0.8.0         |

 This table lists certain HDInsight 4.0 cluster types that have retired or will be retired soon.

@@ -44,8 +44,8 @@ This table lists certain HDInsight 4.0 cluster types that have retired or will b

 > [!NOTE]
 > * If you are using the Azure user interface to create a Spark cluster for HDInsight, the dropdown list shows an additional version, Spark 3.1 (HDI 5.0), along with the older versions. This version is a renamed version of Spark 3.1 (HDI 4.0) and is backward compatible.
-> * This is only an UI level change, which doesn’t impact anything for the existing users and users who are already using the ARM template to build their clusters.
-> * For backward compatibility, ARM supports creating Spark 3.1 with HDI 4.0 and 5.0 versions which maps to same versions Sspark 3.1 (HDI 5.0)
+> * This is only a UI-level change; it doesn’t affect existing users or users who are already using the ARM template to build their clusters.
+> * For backward compatibility, ARM supports creating Spark 3.1 with the HDI 4.0 and 5.0 versions, which map to the same version, Spark 3.1 (HDI 5.0).
 > * The Spark 3.1 (HDI 5.0) cluster comes with HWC 2.0, which works well together with the Interactive Query (HDI 5.0) cluster.

 ## Interactive Query
@@ -61,11 +61,11 @@ you need to select this version Interactive Query 3.1 (HDI 5.0).

 ## Kafka

-**Known Issue –** Current ARM template supports only 4.0 even though it shows 5.0 image in portal Cluster creation may fail with the following error message if you select version 5.0 in the UI.
+The current ARM template supports HDI 5.0 for Kafka 2.4.1.

-`HDI Version'5.0" is not supported for clusterType ''Kafka" and component Version 2.4'.,Cluster component version is not applicable for HDI version: 5.0 cluster type: KAFKA (Code: BadRequest)`
+HDI version '5.0' is now supported for cluster type 'Kafka' with component version '2.4'.

-We're working on this issue, and a fix will be rolled out shortly.
+We have fixed the ARM template issue, so cluster creation no longer fails with the earlier `BadRequest` error.

 ### Upcoming version upgrades.
 HDInsight team is working on upgrading other open-source components.
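For readers updating their own deployments, a minimal sketch of the relevant ARM template fragment under these versions follows. The property names follow the `Microsoft.HDInsight/clusters` resource schema; the `apiVersion` and the `clusterName` parameter are illustrative assumptions, not taken from this commit:

```json
{
  "type": "Microsoft.HDInsight/clusters",
  "apiVersion": "2021-06-01",
  "name": "[parameters('clusterName')]",
  "properties": {
    "clusterVersion": "5.0",
    "osType": "Linux",
    "clusterDefinition": {
      "kind": "KAFKA",
      "componentVersion": {
        "Kafka": "2.4"
      }
    }
  }
}
```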

articles/hdinsight/hdinsight-apache-kafka-spark-structured-streaming.md

Lines changed: 19 additions & 19 deletions
@@ -82,15 +82,15 @@ kafkaStreamDF.select(from_json(col("value").cast("string"), schema) as "trip")

 In both snippets, data is read from Kafka and written to file. The differences between the examples are:

-| Batch | Streaming |
-| --- | --- |
-| `read` | `readStream` |
+| Batch   | Streaming     |
+|---------|---------------|
+| `read`  | `readStream`  |
 | `write` | `writeStream` |
-| `save` | `start` |
+| `save`  | `start`       |

 The streaming operation also uses `awaitTermination(30000)`, which stops the stream after 30,000 ms.

-To use Structured Streaming with Kafka, your project must have a dependency on the `org.apache.spark : spark-sql-kafka-0-10_2.11` package. The version of this package should match the version of Spark on HDInsight. For Spark 2.2.0 (available in HDInsight 3.6), you can find the dependency information for different project types at [https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar](https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar).
+To use Structured Streaming with Kafka, your project must have a dependency on the `org.apache.spark : spark-sql-kafka-0-10_2.11` package. The version of this package should match the version of Spark on HDInsight. For Spark 2.4 (available in HDInsight 4.0), you can find the dependency information for different project types at [https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar](https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar).

 For the Jupyter Notebook used with this tutorial, the following cell loads this package dependency:
@@ -125,26 +125,26 @@ To create an Azure Virtual Network, and then create the Kafka and Spark clusters

 This template creates the following resources:

-* A Kafka on HDInsight 3.6 cluster.
-* A Spark 2.2.0 on HDInsight 3.6 cluster.
+* A Kafka on HDInsight 4.0 or 5.0 cluster.
+* A Spark 2.4 or 3.1 on HDInsight 4.0 or 5.0 cluster.
 * An Azure Virtual Network, which contains the HDInsight clusters.

 > [!IMPORTANT]
-> The structured streaming notebook used in this tutorial requires Spark 2.2.0 on HDInsight 3.6. If you use an earlier version of Spark on HDInsight, you receive errors when using the notebook.
+> The structured streaming notebook used in this tutorial requires Spark 2.4 or 3.1 on HDInsight 4.0 or 5.0. If you use an earlier version of Spark on HDInsight, you receive errors when using the notebook.

 2. Use the following information to populate the entries on the **Customized template** section:

-| Setting | Value |
-| --- | --- |
-| Subscription | Your Azure subscription |
-| Resource group | The resource group that contains the resources. |
-| Location | The Azure region that the resources are created in. |
-| Spark Cluster Name | The name of the Spark cluster. The first six characters must be different than the Kafka cluster name. |
-| Kafka Cluster Name | The name of the Kafka cluster. The first six characters must be different than the Spark cluster name. |
-| Cluster Login User Name | The admin user name for the clusters. |
-| Cluster Login Password | The admin user password for the clusters. |
-| SSH User Name | The SSH user to create for the clusters. |
-| SSH Password | The password for the SSH user. |
+| Setting                 | Value |
+|-------------------------|-------|
+| Subscription            | Your Azure subscription |
+| Resource group          | The resource group that contains the resources. |
+| Location                | The Azure region that the resources are created in. |
+| Spark Cluster Name      | The name of the Spark cluster. The first six characters must be different than the Kafka cluster name. |
+| Kafka Cluster Name      | The name of the Kafka cluster. The first six characters must be different than the Spark cluster name. |
+| Cluster Login User Name | The admin user name for the clusters. |
+| Cluster Login Password  | The admin user password for the clusters. |
+| SSH User Name           | The SSH user to create for the clusters. |
+| SSH Password            | The password for the SSH user. |

 :::image type="content" source="./media/hdinsight-apache-kafka-spark-structured-streaming/spark-kafka-template.png" alt-text="Screenshot of the customized template":::
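To ground the batch-versus-streaming table earlier in this file, here is a minimal PySpark sketch of the streaming path. The tutorial's own snippets are Scala; the broker address, topic name, and output paths below are hypothetical placeholders, and the connector version assumes Spark 2.4.4 built for Scala 2.11:

```python
from pyspark.sql import SparkSession

# Pull in the Kafka connector when the session is created; the package
# version should match the cluster's Spark version (2.4.4 assumed here).
spark = (SparkSession.builder
    .appName("kafka-structured-streaming-sketch")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4")
    .getOrCreate())

# readStream instead of read: the source is an unbounded stream.
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "wn0-kafka:9092")  # hypothetical broker
    .option("subscribe", "tripdata")                      # hypothetical topic
    .load())

# writeStream/start instead of write/save: the sink runs continuously.
query = (kafka_df.selectExpr("CAST(value AS STRING) AS trip")
    .writeStream
    .format("parquet")
    .option("path", "/example/tripoutput")                # hypothetical path
    .option("checkpointLocation", "/example/checkpoint")  # required for streaming sinks
    .start())

# Stop the stream after 30,000 ms, as the tutorial does.
query.awaitTermination(30000)
```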

articles/hdinsight/kafka/apache-kafka-get-started.md

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ To create an Apache Kafka cluster on HDInsight, use the following steps:

 1. Review the configuration for the cluster. Change any settings that are incorrect. Finally, select **Create** to create the cluster.

-   :::image type="content" source="./media/apache-kafka-get-started/azure-hdinsight-40-portal-cluster-review-create-kafka.png" alt-text="Screenshot showing kafka cluster configuration summary for HDI version 4.0." border="true":::
+   :::image type="content" source="./media/apache-kafka-get-started/azure-hdinsight-50-portal-cluster-review-create-kafka.png" alt-text="Screenshot showing kafka cluster configuration summary for HDI version 5.0." border="true":::

 It can take up to 20 minutes to create the cluster.

articles/hdinsight/kafka/apache-kafka-introduction.md

Lines changed: 11 additions & 11 deletions
@@ -54,17 +54,17 @@ Replication is employed to duplicate partitions across nodes, protecting against

 The following are common tasks and patterns that can be performed using Kafka on HDInsight:

-|Use |Description |
-|---|---|
-|Replication of Apache Kafka data|Kafka provides the MirrorMaker utility, which replicates data between Kafka clusters. For information on using MirrorMaker, see [Replicate Apache Kafka topics with Apache Kafka on HDInsight](apache-kafka-mirroring.md).|
-|Publish-subscribe messaging pattern|Kafka provides a Producer API for publishing records to a Kafka topic. The Consumer API is used when subscribing to a topic. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md).|
-|Stream processing|Kafka is often used with Spark for real-time stream processing. Kafka 0.10.0.0 (HDInsight version 3.5 and 3.6) introduced a streaming API that allows you to build streaming solutions without requiring Spark. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md).|
-|Horizontal scale|Kafka partitions streams across the nodes in the HDInsight cluster. Consumer processes can be associated with individual partitions to provide load balancing when consuming records. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md).|
-|In-order delivery|Within each partition, records are stored in the stream in the order that they were received. By associating one consumer process per partition, you can guarantee that records are processed in-order. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md).|
-|Messaging|Since it supports the publish-subscribe message pattern, Kafka is often used as a message broker.|
-|Activity tracking|Since Kafka provides in-order logging of records, it can be used to track and re-create activities. For example, user actions on a web site or within an application.|
-|Aggregation|Using stream processing, you can aggregate information from different streams to combine and centralize the information into operational data.|
-|Transformation|Using stream processing, you can combine and enrich data from multiple input topics into one or more output topics.|
+| Use | Description |
+|---|---|
+| Replication of Apache Kafka data | Kafka provides the MirrorMaker utility, which replicates data between Kafka clusters. For information on using MirrorMaker, see [Replicate Apache Kafka topics with Apache Kafka on HDInsight](apache-kafka-mirroring.md). |
+| Publish-subscribe messaging pattern | Kafka provides a Producer API for publishing records to a Kafka topic. The Consumer API is used when subscribing to a topic. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md). |
+| Stream processing | Kafka is often used with Spark for real-time stream processing. Kafka 2.1.1 and 2.4.1 (HDInsight versions 4.0 and 5.0) support streaming APIs that allow you to build streaming solutions without requiring Spark. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md). |
+| Horizontal scale | Kafka partitions streams across the nodes in the HDInsight cluster. Consumer processes can be associated with individual partitions to provide load balancing when consuming records. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md). |
+| In-order delivery | Within each partition, records are stored in the stream in the order that they were received. By associating one consumer process per partition, you can guarantee that records are processed in-order. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md). |
+| Messaging | Since it supports the publish-subscribe message pattern, Kafka is often used as a message broker. |
+| Activity tracking | Since Kafka provides in-order logging of records, it can be used to track and re-create activities. For example, user actions on a web site or within an application. |
+| Aggregation | Using stream processing, you can aggregate information from different streams to combine and centralize the information into operational data. |
+| Transformation | Using stream processing, you can combine and enrich data from multiple input topics into one or more output topics. |

 ## Next steps
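As a concrete illustration of the publish-subscribe row in the table above, here is a minimal sketch using the kafka-python client. The client library, broker host, and topic name are assumptions for illustration; the article itself doesn't prescribe a client:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish records to a topic.
producer = KafkaProducer(bootstrap_servers="wn0-kafka:9092")  # hypothetical broker
producer.send("tripdata", b"trip-record-1")                   # hypothetical topic
producer.flush()

# Consumer API: subscribe to the same topic and read records back.
consumer = KafkaConsumer(
    "tripdata",
    bootstrap_servers="wn0-kafka:9092",
    group_id="trip-consumers",     # consumers in one group share partitions
    auto_offset_reset="earliest",  # start from the beginning of the log
    consumer_timeout_ms=10000,     # stop iterating when no records arrive
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```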

articles/hdinsight/kafka/apache-kafka-performance-tuning.md

Lines changed: 2 additions & 2 deletions
@@ -62,7 +62,7 @@ Each Kafka partition is a log file on the system, and producer threads can write

 Increasing the partition density (the number of partitions per broker) adds an overhead related to metadata operations and per partition request/response between the partition leader and its followers. Even in the absence of data flowing through, partition replicas still fetch data from leaders, which results in extra processing for send and receive requests over the network.

-For Apache Kafka clusters 1.1 and above in HDInsight, we recommend you to have a maximum of 1000 partitions per broker, including replicas. Increasing the number of partitions per broker decreases throughput and may also cause topic unavailability. For more information on Kafka partition support, see [the official Apache Kafka blog post on the increase in the number of supported partitions in version 1.1.0](https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions). For details on modifying topics, see [Apache Kafka: modifying topics](https://kafka.apache.org/documentation/#basic_ops_modify_topic).
+For Apache Kafka clusters 2.1 and 2.4 and above in HDInsight, we recommend a maximum of 2000 partitions per broker, including replicas. Increasing the number of partitions per broker decreases throughput and may also cause topic unavailability. For more information on Kafka partition support, see [the official Apache Kafka blog post on the increase in the number of supported partitions in version 1.1.0](https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions). For details on modifying topics, see [Apache Kafka: modifying topics](https://kafka.apache.org/documentation/#basic_ops_modify_topic).

 ### Number of replicas
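To see where the partition and replica counts enter in practice, here is a sketch of topic creation with the kafka-python admin client. The client choice, broker host, topic name, and counts are illustrative assumptions, not recommendations from this article:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="wn0-kafka:9092")  # hypothetical broker

# 8 partitions x 3 replicas = 24 partition replicas spread across the brokers.
# Keep each broker's total (partitions including replicas) under the
# ~2000-per-broker guideline discussed above.
admin.create_topics([
    NewTopic(name="tripdata", num_partitions=8, replication_factor=3)
])
```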

@@ -74,7 +74,7 @@ For more information on replication, see [Apache Kafka: replication](https://kaf

 ## Consumer configurations

-The following section will highlight some of the important generic configurations to optimize the performance of your Kafka consumers. For a detailed explanation of all configurations, see [Apache Kafka documentation on consumer configurations](https://kafka.apache.org/documentation/#consumerconfigs).
+The following section highlights some important generic configurations for optimizing the performance of your Kafka consumers. For a detailed explanation of all configurations, see [Apache Kafka documentation on consumer configurations](https://kafka.apache.org/documentation/#consumerconfigs).

 ### Number of consumers
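A sketch of what adjusting those generic consumer settings can look like with the kafka-python client; the parameter values here are illustrative assumptions, not tuning advice from this article:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "tripdata",                          # hypothetical topic
    bootstrap_servers="wn0-kafka:9092",  # hypothetical broker
    group_id="trip-consumers",           # add instances to the group to scale out
    fetch_min_bytes=1024,                # trade latency for fewer, larger fetches
    fetch_max_wait_ms=500,               # cap how long the broker may hold a fetch
    max_partition_fetch_bytes=1048576,   # per-partition fetch ceiling (1 MiB)
    session_timeout_ms=10000,            # how quickly a dead consumer is evicted
)
for record in consumer:
    handle(record.value)                 # handle() is a placeholder for your logic
```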

articles/hdinsight/kafka/apache-kafka-quickstart-powershell.md

Lines changed: 3 additions & 3 deletions
@@ -94,18 +94,18 @@ New-AzStorageContainer -Name $containerName -Context $storageContext
 Create an Apache Kafka on HDInsight cluster with [New-AzHDInsightCluster](/powershell/module/az.HDInsight/New-azHDInsightCluster).

 ```azurepowershell-interactive
-# Create a Kafka 2.4.0 cluster
+# Create a Kafka 2.4.1 cluster
 $clusterName = Read-Host -Prompt "Enter the name of the Kafka cluster"
 $httpCredential = Get-Credential -Message "Enter the cluster login credentials" -UserName "admin"
 $sshCredentials = Get-Credential -Message "Enter the SSH user credentials" -UserName "sshuser"

 $numberOfWorkerNodes = "4"
-$clusterVersion = "4.0"
+$clusterVersion = "5.0"
 $clusterType="Kafka"
 $disksPerNode=2

 $kafkaConfig = New-Object "System.Collections.Generic.Dictionary``2[System.String,System.String]"
-$kafkaConfig.Add("kafka", "2.4.0")
+$kafkaConfig.Add("kafka", "2.4.1")

 New-AzHDInsightCluster `
     -ResourceGroupName $resourceGroup `
