Commit afd1baa

Merge pull request #284480 from sreekzz/aug-freshness

August Freshness Date Change

2 parents: c909362 + 1293fc2

9 files changed: +29 −29 lines

articles/hdinsight/hadoop/apache-hadoop-deep-dive-advanced-analytics.md

Lines changed: 12 additions & 12 deletions
@@ -4,7 +4,7 @@ description: Learn how advanced analytics uses algorithms to process big data in
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive
-ms.date: 08/22/2023
+ms.date: 08/13/2023
 ---

 # Deep dive - advanced analytics
@@ -17,17 +17,17 @@ HDInsight provides the ability to obtain valuable insight from large amounts of

 :::image type="content" source="./media/apache-hadoop-deep-dive-advanced-analytics/hdinsight-analytic-process.png" alt-text="Advanced analytics process flow." border="false":::

-After you've identified the business problem and have started collecting and processing your data, you need to create a model that represents the question you wish to predict. Your model will use one or more machine learning algorithms to make the type of prediction that best fits your business needs. The majority of your data should be used to train your model, with the rest used to test or evaluate it.
+After you identify the business problem and start collecting and processing your data, you need to create a model that represents the question you wish to predict. Your model uses one or more machine learning algorithms to make the type of prediction that best fits your business needs. Most of your data should be used to train your model, with the rest used to test or evaluate it.

 After you create, load, test, and evaluate your model, the next step is to deploy your model so that it begins supplying answers to your questions. The last step is to monitor your model's performance and tune it as necessary.

 ## Common types of algorithms

-Advanced analytics solutions provide a set of machine learning algorithms. Here is a summary of the categories of algorithms and associated common business use cases.
+Advanced analytics solutions provide a set of machine learning algorithms. Here's a summary of the categories of algorithms and associated common business use cases.

 :::image type="content" source="./media/apache-hadoop-deep-dive-advanced-analytics/machine-learning-use-cases.png" alt-text="Machine Learning category summaries." border="false":::

-Along with selecting the best-fitting algorithm(s), you need to consider whether or not you need to provide data for training. Machine learning algorithms are categorized as follows:
+Along with selecting one or more best-fitting algorithms, you need to consider whether you need to provide data for training. Machine learning algorithms are categorized as follows:

 * Supervised - algorithm needs to be trained on a set of labeled data before it can provide results
 * Semi-supervised - algorithm can be augmented by extra targets through interactive query by a trainer, which weren't available during initial stage of training
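(The category list continues past this hunk boundary.) The train/test guidance in the revised paragraph above corresponds to a one-liner in PySpark. A minimal sketch, assuming a hypothetical labeled DataFrame and an 80/20 split — neither of which is specified in the article:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-test-split").getOrCreate()

# Hypothetical labeled dataset; any DataFrame with feature and label columns works.
data = spark.read.parquet("wasbs:///example/data/labeled-events.parquet")

# Keep most of the data for training and hold the rest back for testing/evaluation.
train, test = data.randomSplit([0.8, 0.2], seed=42)
print(train.count(), test.count())
```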
@@ -58,7 +58,7 @@ There are three scalable machine learning libraries that bring algorithmic model

 * [**MLlib**](https://spark.apache.org/docs/latest/ml-guide.html) - MLlib contains the original API built on top of Spark RDDs.
 * [**SparkML**](https://spark.apache.org/docs/1.2.2/ml-guide.html) - SparkML is a newer package that provides a higher-level API built on top of Spark DataFrames for constructing ML pipelines.
-* [**MMLSpark**](https://github.com/Azure/mmlspark) - The Microsoft Machine Learning library for Apache Spark (MMLSpark) is designed to make data scientists more productive on Spark, to increase the rate of experimentation, and to leverage cutting-edge machine learning techniques, including deep learning, on very large datasets. The MMLSpark library simplifies common modeling tasks for building models in PySpark.
+* [**MMLSpark**](https://github.com/Azure/mmlspark) - The Microsoft Machine Learning library for Apache Spark (MMLSpark) is designed to make data scientists more productive on Spark, to increase the rate of experimentation, and to leverage cutting-edge machine learning techniques, including deep learning, on large datasets. The MMLSpark library simplifies common modeling tasks for building models in PySpark.

 ### Azure Machine Learning and Apache Hive
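The hunk above describes SparkML's DataFrame-based API for constructing ML pipelines. A minimal sketch of such a pipeline; the tiny inline dataset and column names are illustrative, not from the article:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkml-pipeline").getOrCreate()

# Tiny illustrative training set: free-text documents with binary labels.
df = spark.createDataFrame(
    [("spark runs on hdinsight", 1.0), ("cats sit on mats", 0.0)],
    ["text", "label"],
)

# Each stage consumes a DataFrame column and appends a new one.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(df)
model.transform(df).select("text", "prediction").show()
```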

@@ -72,28 +72,28 @@ There are three scalable machine learning libraries that bring algorithmic model

 Let's review an example of an advanced analytics machine learning pipeline using HDInsight.

-In this scenario you'll see how DNNs produced in a deep learning framework, Microsoft's Cognitive Toolkit (CNTK), can be operationalized for scoring large image collections stored in an Azure Blob Storage account using PySpark on an HDInsight Spark cluster. This approach is applied to a common DNN use case, aerial image classification, and can be used to identify recent patterns in urban development. You'll use a pre-trained image classification model. The model is pre-trained on the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) and has been applied to 10,000 withheld images.
+In this scenario, you see how DNNs produced in a deep learning framework, Microsoft's Cognitive Toolkit (CNTK), can be operationalized for scoring large image collections stored in an Azure Blob Storage account using PySpark on an HDInsight Spark cluster. This approach is applied to a common DNN use case, aerial image classification, and can be used to identify recent patterns in urban development. You use a pretrained image classification model. The model is pretrained on the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) and has been applied to 10,000 withheld images.

 There are three key tasks in this advanced analytics scenario:

 1. Create an Azure HDInsight Hadoop cluster with an Apache Spark 2.1.0 distribution.
 2. Run a custom script to install Microsoft Cognitive Toolkit on all nodes of an Azure HDInsight Spark cluster.
-3. Upload a pre-built Jupyter Notebook to your HDInsight Spark cluster to apply a trained Microsoft Cognitive Toolkit deep learning model to files in an Azure Blob Storage Account using the Spark Python API (PySpark).
+3. Upload a prebuilt Jupyter Notebook to your HDInsight Spark cluster to apply a trained Microsoft Cognitive Toolkit deep learning model to files in an Azure Blob Storage Account using the Spark Python API (PySpark).

-This example uses the CIFAR-10 image set compiled and distributed by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset contains 60,000 32×32 color images belonging to 10 mutually exclusive classes:
+This example uses the CIFAR-10 image set compiled and distributed by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset contains 60,000 32×32 color images belonging to 10 mutually exclusive classes:

 :::image type="content" source="./media/apache-hadoop-deep-dive-advanced-analytics/machine-learning-images.png" alt-text="Machine Learning example images." border="false":::

 For more information on the dataset, see Alex Krizhevsky's [Learning Multiple Layers of Features from Tiny Images](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf).

-The dataset was partitioned into a training set of 50,000 images and a test set of 10,000 images. The first set was used to train a twenty-layer-deep convolutional residual network (ResNet) model using Microsoft Cognitive Toolkit by following [this tutorial](https://github.com/Microsoft/CNTK/tree/master/Examples/Image/Classification/ResNet) from the Cognitive Toolkit GitHub repository. The remaining 10,000 images were used for testing the model's accuracy. This is where distributed computing comes into play: the task of pre-processing and scoring the images is highly parallelizable. With the saved trained model in hand, we used:
+The dataset was partitioned into a training set of 50,000 images and a test set of 10,000 images. The first set was used to train a twenty-layer-deep convolutional residual network (ResNet) model using Microsoft Cognitive Toolkit by following [this tutorial](https://github.com/Microsoft/CNTK/tree/master/Examples/Image/Classification/ResNet) from the Cognitive Toolkit GitHub repository. The remaining 10,000 images were used for testing the model's accuracy. This is where distributed computing comes into play: the task of preprocessing and scoring the images is highly parallelizable. With the saved trained model in hand, we used:

 * PySpark to distribute the images and trained model to the cluster's worker nodes.
-* Python to pre-process the images on each node of the HDInsight Spark cluster.
-* Cognitive Toolkit to load the model and score the pre-processed images on each node.
+* Python to preprocess the images on each node of the HDInsight Spark cluster.
+* Cognitive Toolkit to load the model and score the preprocessed images on each node.
 * Jupyter Notebooks to run the PySpark script, aggregate the results, and use [Matplotlib](https://matplotlib.org/) to visualize the model performance.

-The entire preprocessing/scoring of the 10,000 images takes less than one minute on a cluster with 4 worker nodes. The model accurately predicts the labels of ~9,100 (91%) images. A confusion matrix illustrates the most common classification errors. For example, the matrix shows that mislabeling dogs as cats and vice versa occurs more frequently than for other label pairs.
+The entire preprocessing/scoring of the 10,000 images takes less than one minute on a cluster with four worker nodes. The model accurately predicts the labels of ~9,100 (91%) images. A confusion matrix illustrates the most common classification errors. For example, the matrix shows that mislabeling dogs as cats and vice versa occurs more frequently than for other label pairs.

 :::image type="content" source="./media/apache-hadoop-deep-dive-advanced-analytics/machine-learning-results.png" alt-text="Machine Learning results chart." border="false":::
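The bullets in the hunk above outline the distributed scoring pattern: ship the image list and trained model to the workers, preprocess and score on each node, then aggregate. A rough PySpark sketch of that pattern; the model-loading and preprocessing functions are hypothetical stand-ins for the notebook's CNTK code, not the actual implementation:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-image-scoring").getOrCreate()
sc = spark.sparkContext

# Hypothetical stand-ins for the notebook's CNTK model loading and preprocessing.
def load_trained_model(path):
    class Model:                       # placeholder: returns random class scores
        def eval(self, pixels):
            return np.random.rand(10)
    return Model()

def preprocess_image(path):
    return np.zeros((3, 32, 32))       # placeholder: 32x32 RGB tensor

# Hypothetical list of image blobs to score.
image_paths = ["wasbs:///example/images/img-{}.png".format(i) for i in range(10000)]

def score_partition(paths):
    # Load the model once per partition, not once per image.
    model = load_trained_model("/tmp/cifar10-resnet20.model")
    for p in paths:
        yield p, int(np.argmax(model.eval(preprocess_image(p))))

predictions = sc.parallelize(image_paths, 48).mapPartitions(score_partition)
print(predictions.take(5))             # aggregate results on the driver for plotting
```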

articles/hdinsight/hadoop/apache-hadoop-use-sqoop-mac-linux.md

Lines changed: 3 additions & 3 deletions
@@ -4,7 +4,7 @@ description: Learn how to use Apache Sqoop to import and export between Apache H
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive, linux-related-content
-ms.date: 08/21/2023
+ms.date: 08/13/2023
 ---

 # Use Apache Sqoop to import and export data between Apache Hadoop on HDInsight and Azure SQL Database
@@ -138,9 +138,9 @@ From SQL to Azure storage.

 * Both HDInsight and SQL Server must be on the same Azure Virtual Network.

-  For an example, see the [Connect HDInsight to your on-premises network](./../connect-on-premises-network.md) document.
+  For an example, see [Connect HDInsight to your on-premises network](./../connect-on-premises-network.md).

-  For more information on using HDInsight with an Azure Virtual Network, see the [Extend HDInsight with Azure Virtual Network](../hdinsight-plan-virtual-network-deployment.md) document. For more information on Azure Virtual Network, see the [Virtual Network Overview](../../virtual-network/virtual-networks-overview.md) document.
+  For more information on using HDInsight with an Azure Virtual Network, see [Extend HDInsight with Azure Virtual Network](../hdinsight-plan-virtual-network-deployment.md). For more information on Azure Virtual Network, see [Virtual Network overview](../../virtual-network/virtual-networks-overview.md).

 * SQL Server must be configured to allow SQL authentication. For more information, see the [Choose an Authentication Mode](/sql/relational-databases/security/choose-an-authentication-mode) document.

articles/hdinsight/hbase/apache-hbase-accelerated-writes.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ title: Azure HDInsight Accelerated Writes for Apache HBase
 description: Gives an overview of the Azure HDInsight Accelerated Writes feature, which uses premium managed disks to improve performance of the Apache HBase Write Ahead Log.
 ms.service: azure-hdinsight
 ms.topic: how-to
-ms.date: 08/21/2023
+ms.date: 08/13/2023
 ---

 # Azure HDInsight Accelerated Writes for Apache HBase

articles/hdinsight/hdinsight-hadoop-create-linux-clusters-arm-templates.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ description: Learn how to create clusters for HDInsight by using Resource Manage
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive, devx-track-azurecli, linux-related-content
-ms.date: 08/22/2023
+ms.date: 08/13/2023
 ---

 # Create Apache Hadoop clusters in HDInsight by using Resource Manager templates

articles/hdinsight/hdinsight-hadoop-linux-information.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ description: Get implementation tips for using Linux-based HDInsight (Hadoop) cl
 ms.service: azure-hdinsight
 ms.custom: hdinsightactive, linux-related-content
 ms.topic: conceptual
-ms.date: 07/23/2023
+ms.date: 08/13/2023
 ---

 # Information about using HDInsight on Linux

articles/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2-portal.md

Lines changed: 3 additions & 3 deletions
@@ -6,12 +6,12 @@ ms.author: sairamyeturi
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive, subject-rbac-steps
-ms.date: 08/22/2023
+ms.date: 08/13/2023
 ---

 # Create a cluster with Data Lake Storage Gen2 using the Azure portal

-The Azure portal is a web-based management tool for services and resources hosted in the Microsoft Azure cloud. In this article, you learn how to create Linux-based Azure HDInsight clusters by using the portal. Additional details are available from [Create HDInsight clusters](./hdinsight-hadoop-provision-linux-clusters.md).
+The Azure portal is a web-based management tool for services and resources hosted in the Microsoft Azure cloud. In this article, you learn how to create Linux-based Azure HDInsight clusters by using the portal. More details are available in [Create HDInsight clusters](./hdinsight-hadoop-provision-linux-clusters.md).

 [!INCLUDE [delete-cluster-warning](includes/hdinsight-delete-cluster-warning.md)]

@@ -42,7 +42,7 @@ Create a storage account to use with Azure Data Lake Storage Gen2.
 1. In the upper-left click **Create a resource**.
 1. In the search box, type **storage** and click **storage account**.
 1. Click **Create**.
-1. On the **Create storage account** screen:
+1. On the **`Create storage account`** screen:
    1. Select the correct subscription and resource group.
    1. Enter a name for your storage account with Data Lake Storage Gen2.
    1. Click on the **Advanced** tab.
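The portal steps in the hunk above create a storage account for Data Lake Storage Gen2. For readers scripting the same step, a rough sketch with the Azure SDK for Python; the subscription, resource group, account name, and region are placeholders, and enabling the hierarchical namespace is what makes the account Gen2-capable:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# begin_create returns a poller; result() blocks until the account exists.
poller = client.storage_accounts.begin_create(
    "my-resource-group",          # placeholder resource group
    "mydatalakegen2acct",         # placeholder name: 3-24 lowercase alphanumerics
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,   # hierarchical namespace = Data Lake Storage Gen2
    },
)
account = poller.result()
print(account.name, account.provisioning_state)
```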

articles/hdinsight/interactive-query/apache-hive-warehouse-connector-operations.md

Lines changed: 6 additions & 6 deletions
@@ -5,12 +5,12 @@ author: apurbasroy
 ms.author: apsinhar
 ms.service: azure-hdinsight
 ms.topic: how-to
-ms.date: 08/21/2023
+ms.date: 08/13/2023
 ---

 # Apache Spark operations supported by Hive Warehouse Connector in Azure HDInsight

-This article shows spark-based operations supported by Hive Warehouse Connector (HWC). All examples shown below will be executed through the Apache Spark shell.
+This article shows Spark-based operations supported by Hive Warehouse Connector (HWC). All examples are executed through the Apache Spark shell.

 ## Prerequisite

@@ -20,7 +20,7 @@ Complete the [Hive Warehouse Connector setup](./apache-hive-warehouse-connector.

 To start a spark-shell session, do the following steps:

-1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your Apache Spark cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:
+1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your Apache Spark cluster. Edit the command by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ```cmd
@@ -32,15 +32,15 @@ To start a spark-shell session, do the following steps:
    ls /usr/hdp/current/hive_warehouse_connector
    ```

-1. Edit the code below with the `hive-warehouse-connector-assembly` version identified above. Then execute the command to start the spark shell:
+1. Edit the code with the `hive-warehouse-connector-assembly` version identified above. Then execute the command to start the spark shell:

    ```bash
    spark-shell --master yarn \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<STACK_VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false
    ```

-1. After starting the spark-shell, a Hive Warehouse Connector instance can be started using the following commands:
+1. After you start the spark-shell, you can start a Hive Warehouse Connector instance by using the following commands:

    ```scala
    import com.hortonworks.hwc.HiveWarehouseSession
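The Scala snippet above is cut off at the hunk boundary. For comparison, a sketch of the same session setup through HWC's Python binding, assuming the `pyspark_llap` package that ships with the connector is on the Python path (start `pyspark` with the same `--jars` connector argument); the `demo` table is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-session").getOrCreate()

# Build an HWC session bound to the interactive Hive (LLAP) endpoint.
hive = HiveWarehouseSession.session(spark).build()

hive.showDatabases().show()                               # list Hive databases
hive.executeQuery("SELECT * FROM demo LIMIT 10").show()   # hypothetical table
```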
@@ -91,7 +91,7 @@ Using Hive Warehouse Connector, you can use Spark streaming to write data into H
 > [!IMPORTANT]
 > Structured streaming writes are not supported in ESP enabled Spark 4.0 clusters.

-Follow the steps below to ingest data from a Spark stream on localhost port 9999 into a Hive table via. Hive Warehouse Connector.
+Follow these steps to ingest data from a Spark stream on localhost port 9999 into a Hive table via Hive Warehouse Connector.

 1. From your open Spark shell, begin a spark stream with the following command:
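The streaming walkthrough is truncated at this hunk boundary, but the first step is standard Structured Streaming. A minimal PySpark sketch of reading the localhost:9999 socket stream; the console sink here is a stand-in for the HWC Hive sink the article goes on to configure:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("socket-stream").getOrCreate()

# Read lines from the socket opened on localhost:9999 (e.g. with `nc -lk 9999`).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Write to the console here; the article's version targets a Hive table through HWC.
query = lines.writeStream.format("console").start()
query.awaitTermination()
```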

articles/hdinsight/kafka/connect-kafka-with-vnet.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ title: Connect HDInsight Kafka cluster with client VM in different VNet on Azure
 description: Learn how to connect HDInsight Kafka cluster with Client VM in different VNet on Azure HDInsight
 ms.service: azure-hdinsight
 ms.topic: tutorial
-ms.date: 08/10/2023
+ms.date: 08/13/2023
 ---

 # Connect HDInsight Kafka cluster with client VM in different VNet

articles/hdinsight/spark/apache-spark-job-debugging.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ description: Use YARN UI, Spark UI, and Spark History server to track and debug
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive
-ms.date: 08/22/2023
+ms.date: 08/13/2023
 ---

 # Debug Apache Spark jobs running on Azure HDInsight
