
Commit 7245fe3

Author: Sreekanth Iyer (Ushta Te Consultancy Services)
Commit message: Improved Correctness Score
1 parent 320ad81, commit 7245fe3

7 files changed: +25 -26 lines changed

articles/hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal.md

Lines changed: 4 additions & 4 deletions
@@ -41,9 +41,9 @@ In this section, you create a Hadoop cluster in HDInsight using the Azure portal
|Region | From the drop-down list, select a region where the cluster is created. Choose a location closer to you for better performance. |
|Cluster type| Select **Select cluster type**. Then select **Hadoop** as the cluster type.|
|Version|From the drop-down list, select a **version**. Use the default version if you don't know what to choose.|
-|Cluster login username and password | The default login name is **admin**. The password must be at least 10 characters in length and must contain at least one digit, one uppercase, and one lower case letter, one non-alphanumeric character (except characters ```' ` "```). Make sure you **do not provide** common passwords such as "Pass@word1".|
+|Cluster sign in username and password | The default sign in name is **admin**. The password must be at least 10 characters in length and must contain at least one digit, one uppercase, and one lower case letter, one nonalphanumeric character (except characters ```' ` "```). Make sure you **do not provide** common passwords such as "Pass@word1".|
|Secure Shell (SSH) username | The default username is `sshuser`. You can provide another name for the SSH username. |
-|Use cluster login password for SSH| Select this check box to use the same password for SSH user as the one you provided for the cluster login user.|
+|Use cluster sign in password for SSH| Select this check box to use the same password for SSH user as the one you provided for the cluster sign in user.|

:::image type="content" source="./media/apache-hadoop-linux-create-cluster-get-started-portal/azure-portal-cluster-basics.png" alt-text="HDInsight Linux get started provide cluster basic values." border="true":::

@@ -115,7 +115,7 @@ In this section, you create a Hadoop cluster in HDInsight using the Azure portal

:::image type="content" source="./media/apache-hadoop-linux-create-cluster-get-started-portal/hdinsight-linux-hive-view-save-results.png" alt-text="Save result of Apache Hive query." border="true":::

-After you've completed a Hive job, you can [export the results to Azure SQL Database or SQL Server database](apache-hadoop-use-sqoop-mac-linux.md), you can also [visualize the results using Excel](apache-hadoop-connect-excel-power-query.md). For more information about using Hive in HDInsight, see [Use Apache Hive and HiveQL with Apache Hadoop in HDInsight to analyze a sample Apache log4j file](hdinsight-use-hive.md).
+After you've completed a Hive job, you can [export the results to Azure SQL Database or SQL Server database](apache-hadoop-use-sqoop-mac-linux.md), you can also [visualize the results using Excel](apache-hadoop-connect-excel-power-query.md). For more information about using Hive in HDInsight, see [Use Apache Hive and HiveQL with Apache Hadoop in HDInsight to analyze a sample Apache Log4j file](hdinsight-use-hive.md).

## Clean up resources

@@ -130,7 +130,7 @@ After you complete the quickstart, you may want to delete the cluster. With HDIn

:::image type="content" source="./media/apache-hadoop-linux-create-cluster-get-started-portal/hdinsight-delete-cluster.png" alt-text="Azure HDInsight delete cluster." border="true":::

-2. If you want to delete the cluster as well as the default storage account, select the resource group name (highlighted in the previous screenshot) to open the resource group page.
+2. If you want to delete the cluster and the default storage account, select the resource group name (highlighted in the previous screenshot) to open the resource group page.

3. Select **Delete resource group** to delete the resource group, which contains the cluster and the default storage account. Note deleting the resource group deletes the storage account. If you want to keep the storage account, choose to delete the cluster only.

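The clean-up steps in this hunk can also be done from the Azure CLI. A minimal sketch, with hypothetical cluster and resource group names (not values from the article):

```bash
# Delete only the HDInsight cluster, keeping the resource group and its storage account
az hdinsight delete --name myhdicluster --resource-group myresourcegroup

# Or delete the whole resource group, which removes the cluster and the default storage account
az group delete --name myresourcegroup --yes --no-wait
```
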
articles/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2-azure-cli.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ ms.author: sairamyeturi
ms.service: hdinsight
ms.topic: how-to
ms.custom: hdinsightactive, devx-track-azurecli
-ms.date: 08/21/2023
+ms.date: 07/24/2024
---

# Create a cluster with Data Lake Storage Gen2 using Azure CLI

articles/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ Use the following links for detailed instructions on how to create HDInsight clu

## Access control for Data Lake Storage Gen2 in HDInsight

-### What kinds of permissions does Data Lake Storage Gen2 support?
+### What kinds of permissions do Data Lake Storage Gen2 support?

Data Lake Storage Gen2 uses an access control model that supports both Azure role-based access control (Azure RBAC) and POSIX-like access control lists (ACLs).

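As an illustration of the ACL side of that model, a POSIX-style ACL can be set on a Data Lake Storage Gen2 path with the Azure CLI. A sketch only; the account, file system, and path names below are placeholders:

```bash
# Grant the owning user rwx, the owning group r-x, and no access to others on one directory
az storage fs access set \
  --acl "user::rwx,group::r-x,other::---" \
  --path "example/data" \
  --file-system "myfilesystem" \
  --account-name "mystorageaccount" \
  --auth-mode login
```
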
articles/hdinsight/hdinsight-upload-data.md

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ Because the default file system for HDInsight is in Azure Storage, /example/data

`wasbs:///example/data/data.txt`

-or
+Or

`wasbs://<ContainerName>@<StorageAccountName>.blob.core.windows.net/example/data/davinci.txt`

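A small sketch of how either URI form from this hunk can be used with `hdfs dfs` on a cluster node; `<ContainerName>` and `<StorageAccountName>` are the same placeholders used in the article:

```bash
# Short form: resolves against the cluster's default file system (default container)
hdfs dfs -ls wasbs:///example/data/

# Fully qualified form: names the container and storage account explicitly
hdfs dfs -ls wasbs://<ContainerName>@<StorageAccountName>.blob.core.windows.net/example/data/
```
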
articles/hdinsight/interactive-query/apache-hive-migrate-workloads.md

Lines changed: 7 additions & 7 deletions
@@ -183,7 +183,7 @@ To convert external table (non-ACID) to Managed (ACID) table,

**Scenario 1**

-Consider table rt is external table (non-ACID). If the table is non-ORC table,
+Consider table `rt` is external table (non-ACID). If the table is non-ORC table,

```
alter table rt set TBLPROPERTIES ('transactional'='true');
@@ -199,7 +199,7 @@ ERROR:
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. work.rt can't be declared transactional because it's an external table (state=08S01,code=1)
```

-This error is occurring because the table rt is external table and you can't convert external table to ACID.
+This error is occurring because the table `rt` is external table and you can't convert external table to ACID.

**Scenario 3**

@@ -432,13 +432,13 @@ In certain situations when running a Hive query, you might receive `java.lang.Cl
```
The update command is to update the details manually in the backend DB and the alter command is used to alter the table with the new SerDe class from beeline or Hive.

-### Hive Backend DB schema compare Script
+### Hive Backend DB schema compares Script

You can run the following script after completing the migration.

There's a chance of missing few columns in the backend DB, which causes the query failures. If the schema upgrade wasn't happened properly, then there's chance that we may hit the invalid column name issue. The below script fetches the column name and datatype from customer backend DB and provides the output if there's any missing column or incorrect datatype.

-The following path contains the schemacompare_final.py and test.csv file. The script is present in "schemacompare_final.py" file and the file "test.csv" contains all the column name and the datatype for all the tables, which should be present in the hive backend DB.
+The following path contains the schemacompare_final.py and test.csv file. The script is present in `schemacompare_final.py` file and the file "test.csv" contains all the column name and the datatype for all the tables, which should be present in the hive backend DB.

https://hdiconfigactions2.blob.core.windows.net/hiveschemacompare/schemacompare_final.py
@@ -448,11 +448,11 @@ Download these two files from the link. And copy these files to one of the head

**Steps to execute the script:**

-Create a directory called "schemacompare" under "/tmp" directory.
+Create a directory called `schemacompare` under "/tmp" directory.

Put the "schemacompare_final.py" and "test.csv" into the folder "/tmp/schemacompare". Do "ls -ltrh /tmp/schemacompare/" and verify whether the files are present.

-To execute the Python script, use the command "python schemacompare_final.py". This script starts executing the script and it takes less than five minutes to complete. The above script automatically connects to your backend DB and fetches the details from each and every table, which Hive uses and update the details in the new csv file called "return.csv". After creating the file return.csv, it compares the data with the file "test.csv" and prints the column name or datatype if there's anything missing under the tablename.
+To execute the Python script, use the command "python schemacompare_final.py". This script starts executing the script and it takes less than five minutes to complete. The above script automatically connects to your backend DB and fetches the details from each and every table, which Hive uses and update the details in the new csv file called "return.csv". After you create the file return.csv, it compares the data with the file "test.csv" and prints the column name or datatype if there's anything missing under the tablename.

Once after executing the script you can see the following lines, which indicate that the details are fetched for the tables and the script is in progressing

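The steps in this hunk condense to a few shell commands on a head node. A sketch under the assumption that `wget` is available; `test.csv` must be copied in from its companion link in the article, which is not shown in this diff:

```bash
# Create the working directory the article expects
mkdir -p /tmp/schemacompare
cd /tmp/schemacompare

# Fetch the script from the URL quoted above, and copy test.csv into the same folder
wget https://hdiconfigactions2.blob.core.windows.net/hiveschemacompare/schemacompare_final.py
ls -ltrh /tmp/schemacompare/    # verify both files are present

# Run the comparison; it writes return.csv and reports any missing column or datatype mismatch
python schemacompare_final.py
```
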
@@ -550,7 +550,7 @@ Tune Metastore to reduce their CPU usage.
1. New value: `false`

1. Optimize the partition repair feature
-1. Disable partition repair - This feature is used to synchronize the partitions of Hive tables in storage location with Hive metastore. You may disable this feature if msck repair is used after the data ingestion.
+1. Disable partition repair - This feature is used to synchronize the partitions of Hive tables in storage location with Hive metastore. You may disable this feature if `msck repair` is used after the data ingestion.
1. To disable the feature **add "discover.partitions=false"** under table properties using ALTER TABLE.
OR (if the feature can't be disabled)
1. Increase the partition repair frequency.
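
To illustrate the two options in that list, both the table property and the manual repair can be issued from Beeline. A sketch only; `mytable` is a placeholder and the connection string is the usual head-node form, so adjust it for your cluster:

```bash
# Option 1: turn off automatic partition discovery for one table
beeline -u "jdbc:hive2://headnodehost:10001/;transportMode=http" \
  -e "ALTER TABLE mytable SET TBLPROPERTIES ('discover.partitions'='false');"

# Then synchronize partitions manually after each data ingestion
beeline -u "jdbc:hive2://headnodehost:10001/;transportMode=http" \
  -e "MSCK REPAIR TABLE mytable;"
```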

articles/hdinsight/interactive-query/hive-default-metastore-export-import.md

Lines changed: 8 additions & 9 deletions
@@ -14,13 +14,13 @@ This article shows how to migrate metadata from a [default metastore DB](../hdin

## Why migrate to external metastore DB

-* Default metastore DB is limited to basic SKU and cannot handle production scale workloads.
+* Default metastore DB is limited to basic SKU and can't handle production scale workloads.

* External metastore DB enables customer to horizontally scale Hive compute resources by adding new HDInsight clusters sharing the same metastore DB.

-* For HDInsight 3.6 to 4.0 migration, it is mandatory to migrate metadata to external metastore DB before upgrading the Hive schema version. See [migrating workloads from HDInsight 3.6 to HDInsight 4.0](./apache-hive-migrate-workloads.md).
+* For HDInsight 3.6 to 4.0 migration, it's mandatory to migrate metadata to external metastore DB before upgrading the Hive schema version. See [migrating workloads from HDInsight 3.6 to HDInsight 4.0](./apache-hive-migrate-workloads.md).

-Because the default metastore DB has limited compute capacity, we recommend low utilization from other jobs on the cluster while migrating metadata.
+Because the default metastore DB with limited compute capacity, we recommend low utilization from other jobs on the cluster while migrating metadata.

Source and target DBs must use the same HDInsight version and the same Storage Accounts. If upgrading HDInsight versions from 3.6 to 4.0, complete the steps in this article first. Then, follow the official upgrade steps [here](./apache-hive-migrate-workloads.md).

@@ -33,7 +33,7 @@ The action is similar to replacing symlinks with their full paths.
|Property | Value |
|---|---|
|Bash script URI|`https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive-adl-expand-location-v01.sh`|
-|Node type(s)|Head|
+|Node types|Head|
|Parameters|""|

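For reference, a script action with those property values could be submitted to a running cluster from the Azure CLI. A sketch only, with hypothetical cluster, resource group, and action names; the script URI is the one from the table above and no parameters are passed, matching the table:

```bash
# Run the expand-location script on the head nodes of an existing cluster
az hdinsight script-action execute \
  --cluster-name myhdicluster \
  --resource-group myresourcegroup \
  --name hive-adl-expand-location \
  --script-uri "https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive-adl-expand-location-v01.sh" \
  --roles headnode
```
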
## Migrate with Export/Import using sqlpackage
@@ -51,8 +51,7 @@ An HDInsight cluster created only after 2020-10-15 supports SQL Export/Import fo
sudo python hive_metastore_tool.py --sqlpackagefile $SQLPACKAGE_FILE --targetfile $TARGET_FILE
```

-3. Save the BACPAC file. Below is an option.
-
+3. Save the BACPAC file.
```bash
hdfs dfs -mkdir -p /bacpacs
hdfs dfs -put $TARGET_FILE /bacpacs/
6463

6564
## Migrate using Hive script
6665

67-
Clusters created before 2020-10-15 do not support export/import of the default metastore DB.
66+
Clusters created before 2020-10-15 don't support export/import of the default metastore DB.
6867
6968
For such clusters, follow the guide [Copy Hive tables across Storage Accounts](./hive-migration-across-storage-accounts.md), using a second cluster with an [external Hive metastore DB](../hdinsight-use-external-metadata-stores.md#select-a-custom-metastore-during-cluster-creation). The second cluster can use the same storage account but must use a new default filesystem.
7069
7170
### Option to "shallow" copy
72-
Storage consumption would double when tables are "deep" copied using the above guide. You need to manually clean the data in the source storage container.
73-
We can, instead, "shallow" copy the tables if they are non-transactional. All Hive tables in HDInsight 3.6 are non-transactional by default, but only external tables are non-transactional in HDInsight 4.0. Transactional tables must be deep copied. Follow these steps to shallow copy non-transactional tables:
71+
Storage consumption would double when tables are "deep" copied using the guide. You need to manually clean the data in the source storage container.
72+
We can, instead, "shallow" copy the tables if they're nontransactional. All Hive tables in HDInsight 3.6 are nontransactional by default, but only external tables are nontransactional in HDInsight 4.0. Transactional tables must be deep copied. Follow these steps to shallow copy nontransactional tables:
7473

7574
1. Execute script [hive-ddls.sh](https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive-ddls.sh) on the source cluster's primary headnode to generate the DDL for every Hive table.
7675
2. The DDL is written to a local Hive script named `/tmp/hdi_hive_ddls.hql`. Execute this on the target cluster that uses an external Hive metastore DB.

articles/hdinsight/spark/apache-spark-overview.md

Lines changed: 3 additions & 3 deletions
@@ -43,7 +43,7 @@ Spark clusters in HDInsight offer a fully managed Spark service. Benefits of cre
| Ease creation |You can create a new Spark cluster in HDInsight in minutes using the Azure portal, Azure PowerShell, or the HDInsight .NET SDK. See [Get started with Apache Spark cluster in HDInsight](apache-spark-jupyter-spark-sql-use-portal.md). |
| Ease of use |Spark cluster in HDInsight include Jupyter Notebooks and Apache Zeppelin Notebooks. You can use these notebooks for interactive data processing and visualization. See [Use Apache Zeppelin notebooks with Apache Spark](apache-spark-zeppelin-notebook.md) and [Load data and run queries on an Apache Spark cluster](apache-spark-load-data-run-query.md).|
| REST APIs |Spark clusters in HDInsight include [Apache Livy](https://github.com/cloudera/hue/tree/master/apps/spark/java#welcome-to-livy-the-rest-spark-server), a REST API-based Spark job server to remotely submit and monitor jobs. See [Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster](apache-spark-livy-rest-interface.md).|
-| Support for Azure Storage | Spark clusters in HDInsight can use Azure Data Lake Storage Gen2 as both the primary storage or additional storage. . For more information on Data Lake Storage Gen2, see [Azure Data Lake Storage Gen2](../../storage/blobs/data-lake-storage-introduction.md).|
+| Support for Azure Storage | Spark clusters in HDInsight can use Azure Data Lake Storage Gen2 as both the primary storage or additional storage. For more information on Data Lake Storage Gen2, see [Azure Data Lake Storage Gen2](../../storage/blobs/data-lake-storage-introduction.md).|
| Integration with Azure services |Spark cluster in HDInsight comes with a connector to Azure Event Hubs. You can build streaming applications using the Event Hubs. Including Apache Kafka, which is already available as part of Spark. |
| Integration with third-party IDEs | HDInsight provides several IDE plugins that are useful to create and submit applications to an HDInsight Spark cluster. For more information, see [Use Azure Toolkit for IntelliJ IDEA](apache-spark-intellij-tool-plugin.md), [Use Spark & Hive Tools for VSCode](../hdinsight-for-vscode.md), and [Use Azure Toolkit for Eclipse](apache-spark-eclipse-tool-plugin.md).|
| Concurrent Queries |Spark clusters in HDInsight support concurrent queries. This capability enables multiple queries from one user or multiple queries from various users and applications to share the same cluster resources. |
@@ -75,15 +75,15 @@ The SparkContext can connect to several types of cluster managers, which give re

The SparkContext runs the user's main function and executes the various parallel operations on the worker nodes. Then, the SparkContext collects the results of the operations. The worker nodes read and write data from and to the Hadoop distributed file system. The worker nodes also cache transformed data in-memory as Resilient Distributed Datasets (RDDs).

-The SparkContext connects to the Spark master and is responsible for converting an application to a directed graph (DAG) of individual tasks. Tasks that get executed within an executor process on the worker nodes. Each application gets its own executor processes. Which stay up during the whole application and run tasks in multiple threads.
+The SparkContext connects to the Spark master and is responsible for converting an application to a directed graph (DAG) of individual tasks. Tasks that get executed within an executor process on the worker nodes. Each application gets its own executor processes, which stay up during the whole application and run tasks in multiple threads.

## Spark in HDInsight use cases

Spark clusters in HDInsight enable the following key scenarios:

### Interactive data analysis and BI

-Apache Spark in HDInsight stores data in Azure Blob Storage, Azure Data Lake Azure Data Lake Storage Gen2. Business experts and key decision makers can analyze and build reports over that data. And use Microsoft Power BI to build interactive reports from the analyzed data. Analysts can start from unstructured/semi structured data in cluster storage, define a schema for the data using notebooks, and then build data models using Microsoft Power BI. Spark clusters in HDInsight also support many third-party BI tools. Such as Tableau, making it easier for data analysts, business experts, and key decision makers.
+Apache Spark in HDInsight stores data in Azure Blob Storage and Azure Data Lake Storage Gen2. Business experts and key decision makers can analyze and build reports over that data. And use Microsoft Power BI to build interactive reports from the analyzed data. Analysts can start from unstructured/semi structured data in cluster storage, define a schema for the data using notebooks, and then build data models using Microsoft Power BI. Spark clusters in HDInsight also support many third-party BI tools. Such as Tableau, making it easier for data analysts, business experts, and key decision makers.

* [Tutorial: Visualize Spark data using Power BI](apache-spark-use-bi-tools.md)
