Commit ffe831c

Merge pull request #79536 from dagiro/mvc10

2 parents b1ac6bd + 6315c15

File tree

articles/hdinsight/kafka/apache-kafka-get-started.md

1 file changed: 18 additions, 32 deletions
@@ -18,10 +18,7 @@ In this quickstart, you learn how to create an [Apache Kafka](https://kafka.apac
 
 [!INCLUDE [delete-cluster-warning](../../../includes/hdinsight-delete-cluster-warning.md)]
 
-> [!IMPORTANT]
-> The Apache Kafka API can only be accessed by resources inside the same virtual network. In this quickstart, you access the cluster directly using SSH. To connect other services, networks, or virtual machines to Apache Kafka, you must first create a virtual network and then create the resources within the network.
->
-> For more information, see the [Connect to Apache Kafka using a virtual network](apache-kafka-connect-vpn-gateway.md) document.
+The Apache Kafka API can only be accessed by resources inside the same virtual network. In this quickstart, you access the cluster directly using SSH. To connect other services, networks, or virtual machines to Apache Kafka, you must first create a virtual network and then create the resources within the network. For more information, see the [Connect to Apache Kafka using a virtual network](apache-kafka-connect-vpn-gateway.md) document.
 
 If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.

@@ -71,10 +68,9 @@ To create an Apache Kafka on HDInsight cluster, use the following steps:
 | Resource Group | The resource group to create the cluster in. |
 | Location | The Azure region to create the cluster in. |
 
-> [!TIP]
-> Each Azure region (location) provides _fault domains_. A fault domain is a logical grouping of underlying hardware in an Azure data center. Each fault domain shares a common power source and network switch. The virtual machines and managed disks that implement the nodes within an HDInsight cluster are distributed across these fault domains. This architecture limits the potential impact of physical hardware failures.
->
-> For high availability of data, select a region (location) that contains __three fault domains__. For information on the number of fault domains in a region, see the [Availability of Linux virtual machines](../../virtual-machines/windows/manage-availability.md#use-managed-disks-for-vms-in-an-availability-set) document.
+Each Azure region (location) provides _fault domains_. A fault domain is a logical grouping of underlying hardware in an Azure data center. Each fault domain shares a common power source and network switch. The virtual machines and managed disks that implement the nodes within an HDInsight cluster are distributed across these fault domains. This architecture limits the potential impact of physical hardware failures.
+
+For high availability of data, select a region (location) that contains __three fault domains__. For information on the number of fault domains in a region, see the [Availability of Linux virtual machines](../../virtual-machines/windows/manage-availability.md#use-managed-disks-for-vms-in-an-availability-set) document.
 
 ![Select subscription](./media/apache-kafka-get-started/hdinsight-basic-configuration-2.png)

@@ -94,22 +90,19 @@ To create an Apache Kafka on HDInsight cluster, use the following steps:
 
 9. From __Cluster size__, select __Next__ to continue with the default settings.
 
-> [!IMPORTANT]
-> To guarantee availability of Apache Kafka on HDInsight, the __number of worker nodes__ entry must be set to 3 or greater. The default value is 4.
-
-> [!TIP]
-> The **disks per worker node** entry configures the scalability of Apache Kafka on HDInsight. Apache Kafka on HDInsight uses the local disk of the virtual machines in the cluster to store data. Apache Kafka is I/O heavy, so [Azure Managed Disks](../../virtual-machines/windows/managed-disks-overview.md) are used to provide high throughput and more storage per node. The type of managed disk can be either __Standard__ (HDD) or __Premium__ (SSD). The type of disk depends on the VM size used by the worker nodes (Apache Kafka brokers). Premium disks are used automatically with DS and GS series VMs. All other VM types use standard.
+To guarantee availability of Apache Kafka on HDInsight, the __number of worker nodes__ entry must be set to 3 or greater. The default value is 4.
+
+The **disks per worker node** entry configures the scalability of Apache Kafka on HDInsight. Apache Kafka on HDInsight uses the local disk of the virtual machines in the cluster to store data. Apache Kafka is I/O heavy, so [Azure Managed Disks](../../virtual-machines/windows/managed-disks-overview.md) are used to provide high throughput and more storage per node. The type of managed disk can be either __Standard__ (HDD) or __Premium__ (SSD). The type of disk depends on the VM size used by the worker nodes (Apache Kafka brokers). Premium disks are used automatically with DS and GS series VMs. All other VM types use standard.
 
 ![Set the Apache Kafka cluster size](./media/apache-kafka-get-started/kafka-cluster-size.png)
 
 10. From __Advanced settings__, select __Next__ to continue with the default settings.
 
 11. From the **Summary**, review the configuration for the cluster. Use the __Edit__ links to change any settings that are incorrect. Finally, select **Create** to create the cluster.
 
 ![Cluster configuration summary](./media/apache-kafka-get-started/kafka-configuration-summary.png)
-
-> [!NOTE]
-> It can take up to 20 minutes to create the cluster.
+
+It can take up to 20 minutes to create the cluster.
 
 ## Connect to the cluster

@@ -172,17 +165,13 @@ In this section, you get the host information from the Apache Ambari REST API on
     echo $clusterName, $clusterNameA
     ```
 
-4. To set an environment variable with Zookeeper host information, use the following command:
-
+4. To set an environment variable with Zookeeper host information, use the command below. The command retrieves all Zookeeper hosts, then returns only the first two entries. This is because you want some redundancy in case one host is unreachable.
+
     ```bash
     export KAFKAZKHOSTS=`curl -sS -u admin:$password -G http://headnodehost:8080/api/v1/clusters/$clusterName/services/ZOOKEEPER/components/ZOOKEEPER_SERVER | jq -r '["\(.host_components[].HostRoles.host_name):2181"] | join(",")' | cut -d',' -f1,2`
     ```
 
-   > [!TIP]
-   > This command directly queries the Ambari service on the cluster head node. You can also access Ambari using the public address of `https://$CLUSTERNAME.azurehdinsight.net:80/`. Some network configurations can prevent access to the public address. For example, using Network Security Groups (NSG) to restrict access to HDInsight in a virtual network.
-
-   > [!NOTE]
-   > This command retrieves all Zookeeper hosts, then returns only the first two entries. This is because you want some redundancy in case one host is unreachable.
+   This command directly queries the Ambari service on the cluster head node. You can also access Ambari using the public address of `https://$CLUSTERNAME.azurehdinsight.net:80/`. Some network configurations can prevent access to the public address. For example, using Network Security Groups (NSG) to restrict access to HDInsight in a virtual network.
 
 5. To verify that the environment variable is set correctly, use the following command:

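The trimming step in the pipeline above can be illustrated on its own. A minimal sketch with hypothetical Zookeeper host names (real values come from the Ambari REST query):

```shell
# Hypothetical host list; a real cluster returns its own host names.
all_hosts="zk0-kafka:2181,zk1-kafka:2181,zk2-kafka:2181"

# `cut -d',' -f1,2` keeps only the first two comma-separated entries,
# giving two Zookeeper hosts for redundancy.
KAFKAZKHOSTS=$(echo "$all_hosts" | cut -d',' -f1,2)
echo "$KAFKAZKHOSTS"
```

Taking two hosts rather than one means the clients still work if a single Zookeeper node is unreachable, without passing an unnecessarily long connection string.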
@@ -226,15 +215,13 @@ Kafka stores streams of data in *topics*. You can use the `kafka-topics.sh` util
 
 * Each partition is replicated across three worker nodes in the cluster.
 
-> [!IMPORTANT]
-> If you created the cluster in an Azure region that provides three fault domains, use a replication factor of 3. Otherwise, use a replication factor of 4.
+If you created the cluster in an Azure region that provides three fault domains, use a replication factor of 3. Otherwise, use a replication factor of 4.
 
 In regions with three fault domains, a replication factor of 3 allows replicas to be spread across the fault domains. In regions with two fault domains, a replication factor of 4 spreads the replicas evenly across the domains.
 
 For information on the number of fault domains in a region, see the [Availability of Linux virtual machines](../../virtual-machines/windows/manage-availability.md#use-managed-disks-for-vms-in-an-availability-set) document.
 
-> [!IMPORTANT]
-> Apache Kafka is not aware of Azure fault domains. When creating partition replicas for topics, it may not distribute replicas properly for high availability.
+Apache Kafka is not aware of Azure fault domains. When creating partition replicas for topics, it may not distribute replicas properly for high availability.
 
 To ensure high availability, use the [Apache Kafka partition rebalance tool](https://github.com/hdinsight/hdinsight-kafka-tools). This tool must be run from an SSH connection to the head node of your Apache Kafka cluster.

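The replication-factor guidance above can be sketched as a small shell check. The fault-domain count, partition count, and topic name here are assumptions for illustration; the `kafka-topics.sh` invocation is shown as a comment because it must run on the cluster head node:

```shell
# Hypothetical fault-domain count for the cluster's region; check your
# region's actual value before creating topics.
fault_domains=3

# Three fault domains -> replication factor 3; two -> replication factor 4,
# so replicas spread evenly across the domains.
if [ "$fault_domains" -ge 3 ]; then
  replication_factor=3
else
  replication_factor=4
fi
echo "replication factor: $replication_factor"

# On the head node, the topic would then be created along these lines:
# /usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create \
#     --replication-factor $replication_factor --partitions 8 \
#     --topic test --zookeeper $KAFKAZKHOSTS
```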
@@ -286,15 +273,14 @@ To store records into the test topic you created earlier, and then read them usi
 
 2. Type a text message on the empty line and hit enter. Enter a few messages this way, and then use **Ctrl + C** to return to the normal prompt. Each line is sent as a separate record to the Apache Kafka topic.
 
 3. To read records from the topic, use the `kafka-console-consumer.sh` utility from the SSH connection:
 
     ```bash
     /usr/hdp/current/kafka-broker/bin/kafka-console-consumer.sh --bootstrap-server $KAFKABROKERS --topic test --from-beginning
     ```
 
     This command retrieves the records from the topic and displays them. Using `--from-beginning` tells the consumer to start from the beginning of the stream, so all records are retrieved.
 
-   > [!NOTE]
-   > If you are using an older version of Kafka, replace `--bootstrap-server $KAFKABROKERS` with `--zookeeper $KAFKAZKHOSTS`.
+   If you are using an older version of Kafka, replace `--bootstrap-server $KAFKABROKERS` with `--zookeeper $KAFKAZKHOSTS`.
 
 4. Use __Ctrl + C__ to stop the consumer.

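The version-dependent flag switch noted above can be sketched as shell logic. The version string here is a hypothetical placeholder; on a real cluster you would read the installed Kafka version rather than hard-coding it:

```shell
# Hypothetical installed Kafka version; a real cluster reports its own.
kafka_version="1.1"

# Newer console consumers (roughly 0.10 and later) address the brokers
# directly with --bootstrap-server; older ones go through Zookeeper.
if printf '%s\n0.10\n' "$kafka_version" | sort -V | head -n1 | grep -qx '0.10'; then
  endpoint_flag="--bootstrap-server \$KAFKABROKERS"
else
  endpoint_flag="--zookeeper \$KAFKAZKHOSTS"
fi
echo "$endpoint_flag"
```

`sort -V` performs a version-aware comparison, so `0.9` sorts before `0.10` even though a plain lexical sort would not.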