
Commit 35460a7

Merge pull request #103802 from piyush-gupta1999/users/piyushgupta/kafka-docs-2.4
Kafka Document updated for HDInsight 5.0 and Kafka 2.4.1
2 parents 5c9cc92 + 2bcc708 commit 35460a7

10 files changed: +60 −60 lines

articles/hdinsight/hdinsight-40-component-versioning.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ The Open-source component versions associated with HDInsight 4.0 are listed in t
 | Apache Phoenix | 5 |
 | Apache Spark | 2.4.4 |
 | Apache Livy | 0.5 |
-| Apache Kafka | 2.1.1, 2.4.1 |
+| Apache Kafka | 2.1.1 |
 | Apache Ambari | 2.7.0 |
 | Apache Zeppelin | 0.8.0 |

articles/hdinsight/hdinsight-50-component-versioning.md

Lines changed: 20 additions & 20 deletions
@@ -16,21 +16,21 @@ Starting June 1, 2022, we have started rolling out a new version of HDInsight 5.

 The Open-source component versions associated with HDInsight 5.0 are listed in the following table.

-| Component | HDInsight 5.0 | HDInsight 4.0 |
-|------------------------|---------------| --------------|
-|Apache Spark | 3.1.2 | 2.4.4|
-|Apache Hive | 3.1.2 | 3.1.2 |
-|Apache Kafka | - |2.1.1 and 2.4.1|
-|Apache Hadoop |3.1.1 | 3.1.1 |
-|Apache Tez |0.9.1 | 0.9.1 |
-|Apache Pig | 0.16.1 | 0.16.1 |
-|Apache Ranger | 1.1.0 | 1.1.0 |
-|Apache Sqoop | 1.5.0 | 1.5.0 |
-|Apache Oozie | 4.3.1 | 4.3.1 |
-|Apache Zookeeper | 3.4.6 | 3.4.6 |
-|Apache Livy | 0.5 | 0.5 |
-|Apache Ambari | 2.7.0 | 2.7.0 |
-|Apache Zeppelin | 0.8.0 | 0.8.0 |
+| Component        | HDInsight 5.0 | HDInsight 4.0 |
+|------------------|---------------|---------------|
+| Apache Spark     | 3.1.2         | 2.4.4         |
+| Apache Hive      | 3.1.2         | 3.1.2         |
+| Apache Kafka     | 2.4.1         | 2.1.1         |
+| Apache Hadoop    | 3.1.1         | 3.1.1         |
+| Apache Tez       | 0.9.1         | 0.9.1         |
+| Apache Pig       | 0.16.1        | 0.16.1        |
+| Apache Ranger    | 1.1.0         | 1.1.0         |
+| Apache Sqoop     | 1.5.0         | 1.5.0         |
+| Apache Oozie     | 4.3.1         | 4.3.1         |
+| Apache Zookeeper | 3.4.6         | 3.4.6         |
+| Apache Livy      | 0.5           | 0.5           |
+| Apache Ambari    | 2.7.0         | 2.7.0         |
+| Apache Zeppelin  | 0.8.0         | 0.8.0         |

 This table lists certain HDInsight 4.0 cluster types that have retired or will be retired soon.

@@ -44,8 +44,8 @@ This table lists certain HDInsight 4.0 cluster types that have retired or will b

 > [!NOTE]
 > * If you are using the Azure user interface to create a Spark cluster for HDInsight, the dropdown list shows an additional version, Spark 3.1 (HDI 5.0), along with the older versions. This version is a renamed version of Spark 3.1 (HDI 4.0) and is backward compatible.
-> * This is only an UI level change, which doesn’t impact anything for the existing users and users who are already using the ARM template to build their clusters.
-> * For backward compatibility, ARM supports creating Spark 3.1 with HDI 4.0 and 5.0 versions which maps to same versions Sspark 3.1 (HDI 5.0)
+> * This is only a UI-level change; it doesn’t affect existing users or users who are already using the ARM template to build their clusters.
+> * For backward compatibility, ARM supports creating Spark 3.1 with the HDI 4.0 and 5.0 versions, which map to the same version, Spark 3.1 (HDI 5.0).
 > * The Spark 3.1 (HDI 5.0) cluster comes with HWC 2.0, which works well together with the Interactive Query (HDI 5.0) cluster.

 ## Interactive Query
@@ -61,11 +61,11 @@ you need to select this version Interactive Query 3.1 (HDI 5.0).

 ## Kafka

-**Known Issue –** Current ARM template supports only 4.0 even though it shows 5.0 image in portal Cluster creation may fail with the following error message if you select version 5.0 in the UI.
+The current ARM template supports HDI 5.0 for Kafka 2.4.1.

-`HDI Version'5.0" is not supported for clusterType ''Kafka" and component Version 2.4'.,Cluster component version is not applicable for HDI version: 5.0 cluster type: KAFKA (Code: BadRequest)`
+HDI version '5.0' is now supported for cluster type 'Kafka' with component version '2.4'.

-We're working on this issue, and a fix will be rolled out shortly.
+We have fixed the ARM template issue, so cluster creation no longer fails with the earlier `BadRequest` error.

 ### Upcoming version upgrades.
 HDInsight team is working on upgrading other open-source components.
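For readers updating their own deployments, a minimal sketch of the relevant ARM template fragment under these versions follows. The property names follow the `Microsoft.HDInsight/clusters` resource schema; the `apiVersion` and the `clusterName` parameter are illustrative assumptions, not taken from this commit:

```json
{
  "type": "Microsoft.HDInsight/clusters",
  "apiVersion": "2021-06-01",
  "name": "[parameters('clusterName')]",
  "properties": {
    "clusterVersion": "5.0",
    "osType": "Linux",
    "clusterDefinition": {
      "kind": "KAFKA",
      "componentVersion": {
        "Kafka": "2.4"
      }
    }
  }
}
```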

articles/hdinsight/hdinsight-apache-kafka-spark-structured-streaming.md

Lines changed: 19 additions & 19 deletions
@@ -82,15 +82,15 @@ kafkaStreamDF.select(from_json(col("value").cast("string"), schema) as "trip")

 In both snippets, data is read from Kafka and written to file. The differences between the examples are:

-| Batch | Streaming |
-| --- | --- |
-| `read` | `readStream` |
+| Batch   | Streaming     |
+|---------|---------------|
+| `read`  | `readStream`  |
 | `write` | `writeStream` |
-| `save` | `start` |
+| `save`  | `start`       |

 The streaming operation also uses `awaitTermination(30000)`, which stops the stream after 30,000 ms.

-To use Structured Streaming with Kafka, your project must have a dependency on the `org.apache.spark : spark-sql-kafka-0-10_2.11` package. The version of this package should match the version of Spark on HDInsight. For Spark 2.2.0 (available in HDInsight 3.6), you can find the dependency information for different project types at [https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar](https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar).
+To use Structured Streaming with Kafka, your project must have a dependency on the `org.apache.spark : spark-sql-kafka-0-10_2.11` package. The version of this package should match the version of Spark on HDInsight. For Spark 2.4 (available in HDInsight 4.0), you can find the dependency information for different project types at [https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar](https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-sql-kafka-0-10_2.11%7C2.2.0%7Cjar).

 For the Jupyter Notebook used with this tutorial, the following cell loads this package dependency:
@@ -125,26 +125,26 @@ To create an Azure Virtual Network, and then create the Kafka and Spark clusters

 This template creates the following resources:

-* A Kafka on HDInsight 3.6 cluster.
-* A Spark 2.2.0 on HDInsight 3.6 cluster.
+* A Kafka on HDInsight 4.0 or 5.0 cluster.
+* A Spark 2.4 or 3.1 on HDInsight 4.0 or 5.0 cluster.
 * An Azure Virtual Network, which contains the HDInsight clusters.

 > [!IMPORTANT]
-> The structured streaming notebook used in this tutorial requires Spark 2.2.0 on HDInsight 3.6. If you use an earlier version of Spark on HDInsight, you receive errors when using the notebook.
+> The structured streaming notebook used in this tutorial requires Spark 2.4 or 3.1 on HDInsight 4.0 or 5.0. If you use an earlier version of Spark on HDInsight, you receive errors when using the notebook.

 2. Use the following information to populate the entries on the **Customized template** section:

-| Setting | Value |
-| --- | --- |
-| Subscription | Your Azure subscription |
-| Resource group | The resource group that contains the resources. |
-| Location | The Azure region that the resources are created in. |
-| Spark Cluster Name | The name of the Spark cluster. The first six characters must be different than the Kafka cluster name. |
-| Kafka Cluster Name | The name of the Kafka cluster. The first six characters must be different than the Spark cluster name. |
-| Cluster Login User Name | The admin user name for the clusters. |
-| Cluster Login Password | The admin user password for the clusters. |
-| SSH User Name | The SSH user to create for the clusters. |
-| SSH Password | The password for the SSH user. |
+| Setting                 | Value |
+|-------------------------|-------|
+| Subscription            | Your Azure subscription |
+| Resource group          | The resource group that contains the resources. |
+| Location                | The Azure region that the resources are created in. |
+| Spark Cluster Name      | The name of the Spark cluster. The first six characters must be different than the Kafka cluster name. |
+| Kafka Cluster Name      | The name of the Kafka cluster. The first six characters must be different than the Spark cluster name. |
+| Cluster Login User Name | The admin user name for the clusters. |
+| Cluster Login Password  | The admin user password for the clusters. |
+| SSH User Name           | The SSH user to create for the clusters. |
+| SSH Password            | The password for the SSH user. |

 :::image type="content" source="./media/hdinsight-apache-kafka-spark-structured-streaming/spark-kafka-template.png" alt-text="Screenshot of the customized template":::
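To ground the batch-versus-streaming table earlier in this file, here is a minimal PySpark sketch of the streaming path. The tutorial's own snippets are Scala; the broker address, topic name, and output paths below are hypothetical placeholders, and the connector version assumes Spark 2.4.4 built for Scala 2.11:

```python
from pyspark.sql import SparkSession

# Pull in the Kafka connector when the session is created; the package
# version should match the cluster's Spark version (2.4.4 assumed here).
spark = (SparkSession.builder
    .appName("kafka-structured-streaming-sketch")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4")
    .getOrCreate())

# readStream instead of read: the source is an unbounded stream.
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "wn0-kafka:9092")  # hypothetical broker
    .option("subscribe", "tripdata")                      # hypothetical topic
    .load())

# writeStream/start instead of write/save: the sink runs continuously.
query = (kafka_df.selectExpr("CAST(value AS STRING) AS trip")
    .writeStream
    .format("parquet")
    .option("path", "/example/tripoutput")                # hypothetical path
    .option("checkpointLocation", "/example/checkpoint")  # required for streaming sinks
    .start())

# Stop the stream after 30,000 ms, as the tutorial does.
query.awaitTermination(30000)
```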

articles/hdinsight/kafka/apache-kafka-get-started.md

Lines changed: 1 addition & 1 deletion
@@ -89,7 +89,7 @@ To create an Apache Kafka cluster on HDInsight, use the following steps:

 1. Review the configuration for the cluster. Change any settings that are incorrect. Finally, select **Create** to create the cluster.

-   :::image type="content" source="./media/apache-kafka-get-started/azure-hdinsight-40-portal-cluster-review-create-kafka.png" alt-text="Screenshot showing kafka cluster configuration summary for HDI version 4.0." border="true":::
+   :::image type="content" source="./media/apache-kafka-get-started/azure-hdinsight-50-portal-cluster-review-create-kafka.png" alt-text="Screenshot showing kafka cluster configuration summary for HDI version 5.0." border="true":::

 It can take up to 20 minutes to create the cluster.

articles/hdinsight/kafka/apache-kafka-introduction.md

Lines changed: 11 additions & 11 deletions
@@ -54,17 +54,17 @@ Replication is employed to duplicate partitions across nodes, protecting against

 The following are common tasks and patterns that can be performed using Kafka on HDInsight:

-|Use |Description |
-|---|---|
-|Replication of Apache Kafka data|Kafka provides the MirrorMaker utility, which replicates data between Kafka clusters. For information on using MirrorMaker, see [Replicate Apache Kafka topics with Apache Kafka on HDInsight](apache-kafka-mirroring.md).|
-|Publish-subscribe messaging pattern|Kafka provides a Producer API for publishing records to a Kafka topic. The Consumer API is used when subscribing to a topic. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md).|
-|Stream processing|Kafka is often used with Spark for real-time stream processing. Kafka 0.10.0.0 (HDInsight version 3.5 and 3.6) introduced a streaming API that allows you to build streaming solutions without requiring Spark. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md).|
-|Horizontal scale|Kafka partitions streams across the nodes in the HDInsight cluster. Consumer processes can be associated with individual partitions to provide load balancing when consuming records. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md).|
-|In-order delivery|Within each partition, records are stored in the stream in the order that they were received. By associating one consumer process per partition, you can guarantee that records are processed in-order. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md).|
-|Messaging|Since it supports the publish-subscribe message pattern, Kafka is often used as a message broker.|
-|Activity tracking|Since Kafka provides in-order logging of records, it can be used to track and re-create activities. For example, user actions on a web site or within an application.|
-|Aggregation|Using stream processing, you can aggregate information from different streams to combine and centralize the information into operational data.|
-|Transformation|Using stream processing, you can combine and enrich data from multiple input topics into one or more output topics.|
+| Use | Description |
+|---|---|
+| Replication of Apache Kafka data | Kafka provides the MirrorMaker utility, which replicates data between Kafka clusters. For information on using MirrorMaker, see [Replicate Apache Kafka topics with Apache Kafka on HDInsight](apache-kafka-mirroring.md). |
+| Publish-subscribe messaging pattern | Kafka provides a Producer API for publishing records to a Kafka topic. The Consumer API is used when subscribing to a topic. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md). |
+| Stream processing | Kafka is often used with Spark for real-time stream processing. Kafka 2.1.1 and 2.4.1 (HDInsight versions 4.0 and 5.0) support streaming APIs that allow you to build streaming solutions without requiring Spark. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md). |
+| Horizontal scale | Kafka partitions streams across the nodes in the HDInsight cluster. Consumer processes can be associated with individual partitions to provide load balancing when consuming records. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md). |
+| In-order delivery | Within each partition, records are stored in the stream in the order that they were received. By associating one consumer process per partition, you can guarantee that records are processed in-order. For more information, see [Start with Apache Kafka on HDInsight](apache-kafka-get-started.md). |
+| Messaging | Since it supports the publish-subscribe message pattern, Kafka is often used as a message broker. |
+| Activity tracking | Since Kafka provides in-order logging of records, it can be used to track and re-create activities. For example, user actions on a web site or within an application. |
+| Aggregation | Using stream processing, you can aggregate information from different streams to combine and centralize the information into operational data. |
+| Transformation | Using stream processing, you can combine and enrich data from multiple input topics into one or more output topics. |

 ## Next steps
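As a concrete illustration of the publish-subscribe row in the table above, here is a minimal sketch using the kafka-python client. The client library, broker host, and topic name are assumptions for illustration; the article itself doesn't prescribe a client:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish records to a topic.
producer = KafkaProducer(bootstrap_servers="wn0-kafka:9092")  # hypothetical broker
producer.send("tripdata", b"trip-record-1")                   # hypothetical topic
producer.flush()

# Consumer API: subscribe to the same topic and read records back.
consumer = KafkaConsumer(
    "tripdata",
    bootstrap_servers="wn0-kafka:9092",
    group_id="trip-consumers",     # consumers in one group share partitions
    auto_offset_reset="earliest",  # start from the beginning of the log
    consumer_timeout_ms=10000,     # stop iterating when no records arrive
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```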

articles/hdinsight/kafka/apache-kafka-performance-tuning.md

Lines changed: 2 additions & 2 deletions
@@ -62,7 +62,7 @@ Each Kafka partition is a log file on the system, and producer threads can write

 Increasing the partition density (the number of partitions per broker) adds an overhead related to metadata operations and per partition request/response between the partition leader and its followers. Even in the absence of data flowing through, partition replicas still fetch data from leaders, which results in extra processing for send and receive requests over the network.

-For Apache Kafka clusters 1.1 and above in HDInsight, we recommend you to have a maximum of 1000 partitions per broker, including replicas. Increasing the number of partitions per broker decreases throughput and may also cause topic unavailability. For more information on Kafka partition support, see [the official Apache Kafka blog post on the increase in the number of supported partitions in version 1.1.0](https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions). For details on modifying topics, see [Apache Kafka: modifying topics](https://kafka.apache.org/documentation/#basic_ops_modify_topic).
+For Apache Kafka clusters 2.1 and 2.4 and above in HDInsight, we recommend a maximum of 2000 partitions per broker, including replicas. Increasing the number of partitions per broker decreases throughput and may also cause topic unavailability. For more information on Kafka partition support, see [the official Apache Kafka blog post on the increase in the number of supported partitions in version 1.1.0](https://blogs.apache.org/kafka/entry/apache-kafka-supports-more-partitions). For details on modifying topics, see [Apache Kafka: modifying topics](https://kafka.apache.org/documentation/#basic_ops_modify_topic).

 ### Number of replicas
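To see where the partition and replica counts enter in practice, here is a sketch of topic creation with the kafka-python admin client. The client choice, broker host, topic name, and counts are illustrative assumptions, not recommendations from this article:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="wn0-kafka:9092")  # hypothetical broker

# 8 partitions x 3 replicas = 24 partition replicas spread across the brokers.
# Keep each broker's total (partitions including replicas) under the
# ~2000-per-broker guideline discussed above.
admin.create_topics([
    NewTopic(name="tripdata", num_partitions=8, replication_factor=3)
])
```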

@@ -74,7 +74,7 @@ For more information on replication, see [Apache Kafka: replication](https://kaf

 ## Consumer configurations

-The following section will highlight some of the important generic configurations to optimize the performance of your Kafka consumers. For a detailed explanation of all configurations, see [Apache Kafka documentation on consumer configurations](https://kafka.apache.org/documentation/#consumerconfigs).
+The following section highlights some important generic configurations for optimizing the performance of your Kafka consumers. For a detailed explanation of all configurations, see [Apache Kafka documentation on consumer configurations](https://kafka.apache.org/documentation/#consumerconfigs).

 ### Number of consumers
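A sketch of what adjusting those generic consumer settings can look like with the kafka-python client; the parameter values here are illustrative assumptions, not tuning advice from this article:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "tripdata",                          # hypothetical topic
    bootstrap_servers="wn0-kafka:9092",  # hypothetical broker
    group_id="trip-consumers",           # add instances to the group to scale out
    fetch_min_bytes=1024,                # trade latency for fewer, larger fetches
    fetch_max_wait_ms=500,               # cap how long the broker may hold a fetch
    max_partition_fetch_bytes=1048576,   # per-partition fetch ceiling (1 MiB)
    session_timeout_ms=10000,            # how quickly a dead consumer is evicted
)
for record in consumer:
    handle(record.value)                 # handle() is a placeholder for your logic
```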

articles/hdinsight/kafka/apache-kafka-quickstart-powershell.md

Lines changed: 3 additions & 3 deletions
@@ -94,18 +94,18 @@ New-AzStorageContainer -Name $containerName -Context $storageContext
 Create an Apache Kafka on HDInsight cluster with [New-AzHDInsightCluster](/powershell/module/az.HDInsight/New-azHDInsightCluster).

 ```azurepowershell-interactive
-# Create a Kafka 2.4.0 cluster
+# Create a Kafka 2.4.1 cluster
 $clusterName = Read-Host -Prompt "Enter the name of the Kafka cluster"
 $httpCredential = Get-Credential -Message "Enter the cluster login credentials" -UserName "admin"
 $sshCredentials = Get-Credential -Message "Enter the SSH user credentials" -UserName "sshuser"

 $numberOfWorkerNodes = "4"
-$clusterVersion = "4.0"
+$clusterVersion = "5.0"
 $clusterType="Kafka"
 $disksPerNode=2

 $kafkaConfig = New-Object "System.Collections.Generic.Dictionary``2[System.String,System.String]"
-$kafkaConfig.Add("kafka", "2.4.0")
+$kafkaConfig.Add("kafka", "2.4.1")

 New-AzHDInsightCluster `
     -ResourceGroupName $resourceGroup `
