articles/hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal.md (4 additions, 4 deletions)
@@ -41,9 +41,9 @@ In this section, you create a Hadoop cluster in HDInsight using the Azure portal
|Region | From the drop-down list, select a region where the cluster is created. Choose a location closer to you for better performance. |
|Cluster type| Select **Select cluster type**. Then select **Hadoop** as the cluster type.|
|Version|From the drop-down list, select a **version**. Use the default version if you don't know what to choose.|
-|Cluster login username and password | The default login name is **admin**. The password must be at least 10 characters in length and must contain at least one digit, one uppercase, and one lower case letter, one non-alphanumeric character (except characters ```' ` "```). Make sure you **do not provide** common passwords such as "Pass@word1".|
+|Cluster sign in username and password | The default sign in name is **admin**. The password must be at least 10 characters in length and must contain at least one digit, one uppercase, and one lower case letter, one nonalphanumeric character (except characters ```' ` "```). Make sure you **do not provide** common passwords such as "Pass@word1".|
|Secure Shell (SSH) username | The default username is `sshuser`. You can provide another name for the SSH username. |
-|Use cluster login password for SSH| Select this check box to use the same password for SSH user as the one you provided for the cluster login user.|
+|Use cluster sign in password for SSH| Select this check box to use the same password for SSH user as the one you provided for the cluster sign in user.|
:::image type="content" source="./media/apache-hadoop-linux-create-cluster-get-started-portal/azure-portal-cluster-basics.png" alt-text="HDInsight Linux get started provide cluster basic values." border="true":::
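As an illustrative aside, a minimal Python sketch of the password rule quoted in the changed row: at least 10 characters, with at least one digit, one uppercase letter, one lowercase letter, and one non-alphanumeric character other than the characters ' ` ". The function name is the editor's own.

```python
def is_valid_cluster_password(password: str) -> bool:
    """Check a candidate password against the rule quoted in the table row above."""
    if len(password) < 10:
        return False
    has_digit = any(c.isdigit() for c in password)
    has_upper = any(c.isupper() for c in password)
    has_lower = any(c.islower() for c in password)
    # The required special character must not be ' ` or ".
    has_special = any(not c.isalnum() and c not in "'`\"" for c in password)
    return has_digit and has_upper and has_lower and has_special

print(is_valid_cluster_password("Pass@word1"))  # True, but the article warns against this common password
print(is_valid_cluster_password("shortA1!"))    # False: fewer than 10 characters
```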
@@ -115,7 +115,7 @@ In this section, you create a Hadoop cluster in HDInsight using the Azure portal
:::image type="content" source="./media/apache-hadoop-linux-create-cluster-get-started-portal/hdinsight-linux-hive-view-save-results.png" alt-text="Save result of Apache Hive query." border="true":::
-After you've completed a Hive job, you can [export the results to Azure SQL Database or SQL Server database](apache-hadoop-use-sqoop-mac-linux.md), you can also [visualize the results using Excel](apache-hadoop-connect-excel-power-query.md). For more information about using Hive in HDInsight, see [Use Apache Hive and HiveQL with Apache Hadoop in HDInsight to analyze a sample Apache log4j file](hdinsight-use-hive.md).
+After you've completed a Hive job, you can [export the results to Azure SQL Database or SQL Server database](apache-hadoop-use-sqoop-mac-linux.md), you can also [visualize the results using Excel](apache-hadoop-connect-excel-power-query.md). For more information about using Hive in HDInsight, see [Use Apache Hive and HiveQL with Apache Hadoop in HDInsight to analyze a sample Apache Log4j file](hdinsight-use-hive.md).
## Clean up resources
@@ -130,7 +130,7 @@ After you complete the quickstart, you may want to delete the cluster. With HDIn
-2. If you want to delete the cluster as well as the default storage account, select the resource group name (highlighted in the previous screenshot) to open the resource group page.
+2. If you want to delete the cluster and the default storage account, select the resource group name (highlighted in the previous screenshot) to open the resource group page.
3. Select **Delete resource group** to delete the resource group, which contains the cluster and the default storage account. Note deleting the resource group deletes the storage account. If you want to keep the storage account, choose to delete the cluster only.
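As a hedged sketch of the clean-up step (not part of the original article), assuming the azure-identity and azure-mgmt-resource packages and placeholder subscription and resource group names, deleting the resource group programmatically removes the cluster together with its default storage account:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder values: substitute your own subscription ID and resource group name.
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Deleting the resource group removes the HDInsight cluster and its default storage account.
poller = client.resource_groups.begin_delete("<resource-group-name>")
poller.wait()  # block until the deletion finishes
```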
articles/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2.md (1 addition, 1 deletion)
@@ -32,7 +32,7 @@ Use the following links for detailed instructions on how to create HDInsight clu
## Access control for Data Lake Storage Gen2 in HDInsight
-### What kinds of permissions does Data Lake Storage Gen2 support?
+### What kinds of permissions do Data Lake Storage Gen2 support?
Data Lake Storage Gen2 uses an access control model that supports both Azure role-based access control (Azure RBAC) and POSIX-like access control lists (ACLs).
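As a hedged sketch of the POSIX-like ACL side of that model, assuming the azure-storage-file-datalake package and placeholder account and container names:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account and container names: replace with your own.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("<container>").get_directory_client("data")

# POSIX-style ACL string: owner read/write/execute, owning group read/execute, others no access.
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```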
articles/hdinsight/interactive-query/apache-hive-migrate-workloads.md (7 additions, 7 deletions)
@@ -183,7 +183,7 @@ To convert external table (non-ACID) to Managed (ACID) table,
**Scenario 1**
-Consider table rt is external table (non-ACID). If the table is non-ORC table,
+Consider table `rt` is external table (non-ACID). If the table is non-ORC table,
```
alter table rt set TBLPROPERTIES ('transactional'='true');
@@ -199,7 +199,7 @@ ERROR:
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. work.rt can't be declared transactional because it's an external table (state=08S01,code=1)
```
-This error is occurring because the table rt is external table and you can't convert external table to ACID.
+This error is occurring because the table `rt` is external table and you can't convert external table to ACID.
**Scenario 3**
@@ -432,13 +432,13 @@ In certain situations when running a Hive query, you might receive `java.lang.Cl
```
The update command is to update the details manually in the backend DB and the alter command is used to alter the table with the new SerDe class from beeline or Hive.
-### Hive Backend DB schema compare Script
+### Hive Backend DB schema compares Script
You can run the following script after completing the migration.
There's a chance of missing few columns in the backend DB, which causes the query failures. If the schema upgrade wasn't happened properly, then there's chance that we may hit the invalid column name issue. The below script fetches the column name and datatype from customer backend DB and provides the output if there's any missing column or incorrect datatype.
-The following path contains the schemacompare_final.py and test.csv file. The script is present in "schemacompare_final.py" file and the file "test.csv" contains all the column name and the datatype for all the tables, which should be present in the hive backend DB.
+The following path contains the schemacompare_final.py and test.csv file. The script is present in `schemacompare_final.py` file and the file "test.csv" contains all the column name and the datatype for all the tables, which should be present in the hive backend DB.
@@ -448,11 +448,11 @@ Download these two files from the link. And copy these files to one of the head
**Steps to execute the script:**
-Create a directory called "schemacompare" under "/tmp" directory.
+Create a directory called `schemacompare` under "/tmp" directory.
Put the "schemacompare_final.py" and "test.csv" into the folder "/tmp/schemacompare". Do "ls -ltrh /tmp/schemacompare/" and verify whether the files are present.
-To execute the Python script, use the command "python schemacompare_final.py". This script starts executing the script and it takes less than five minutes to complete. The above script automatically connects to your backend DB and fetches the details from each and every table, which Hive uses and update the details in the new csv file called "return.csv". After creating the file return.csv, it compares the data with the file "test.csv" and prints the column name or datatype if there's anything missing under the tablename.
+To execute the Python script, use the command "python schemacompare_final.py". This script starts executing the script and it takes less than five minutes to complete. The above script automatically connects to your backend DB and fetches the details from each and every table, which Hive uses and update the details in the new csv file called "return.csv". After you create the file return.csv, it compares the data with the file "test.csv" and prints the column name or datatype if there's anything missing under the tablename.
Once after executing the script you can see the following lines, which indicate that the details are fetched for the tables and the script is in progressing
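The diff doesn't show the internals of schemacompare_final.py, so the following is only a rough sketch of the comparison it describes, assuming hypothetical CSV columns named table, column, and datatype in both test.csv and return.csv:

```python
import csv

def load_schema(path):
    """Map (table, column) -> datatype from a CSV with table, column, datatype fields."""
    with open(path, newline="") as f:
        return {(r["table"], r["column"]): r["datatype"] for r in csv.DictReader(f)}

expected = load_schema("test.csv")    # columns every table should have
actual = load_schema("return.csv")    # columns fetched from the Hive backend DB

for (table, column), datatype in expected.items():
    if (table, column) not in actual:
        print(f"{table}: missing column {column}")
    elif actual[(table, column)] != datatype:
        print(f"{table}: column {column} has {actual[(table, column)]}, expected {datatype}")
```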
@@ -550,7 +550,7 @@ Tune Metastore to reduce their CPU usage.
1. New value: `false`
1. Optimize the partition repair feature
-1. Disable partition repair - This feature is used to synchronize the partitions of Hive tables in storage location with Hive metastore. You may disable this feature if “msck repair” is used after the data ingestion.
+1. Disable partition repair - This feature is used to synchronize the partitions of Hive tables in storage location with Hive metastore. You may disable this feature if `msck repair` is used after the data ingestion.
1. To disable the feature **add "discover.partitions=false"** under table properties using ALTER TABLE.
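As a hedged sketch of that last step, assuming Beeline is available on a head node and using placeholder JDBC URL and table names, the property can be applied like this:

```python
import subprocess

# Placeholder connection string and table name: replace with your own.
jdbc_url = "jdbc:hive2://localhost:10001/default;transportMode=http"
statement = "ALTER TABLE mydb.mytable SET TBLPROPERTIES ('discover.partitions'='false');"

# Run the ALTER TABLE statement through Beeline; check=True raises if Beeline exits non-zero.
subprocess.run(["beeline", "-u", jdbc_url, "-e", statement], check=True)
```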
articles/hdinsight/interactive-query/hive-default-metastore-export-import.md (8 additions, 9 deletions)
@@ -14,13 +14,13 @@ This article shows how to migrate metadata from a [default metastore DB](../hdin
## Why migrate to external metastore DB
-* Default metastore DB is limited to basic SKU and cannot handle production scale workloads.
+* Default metastore DB is limited to basic SKU and can't handle production scale workloads.
* External metastore DB enables customer to horizontally scale Hive compute resources by adding new HDInsight clusters sharing the same metastore DB.
-* For HDInsight 3.6 to 4.0 migration, it is mandatory to migrate metadata to external metastore DB before upgrading the Hive schema version. See [migrating workloads from HDInsight 3.6 to HDInsight 4.0](./apache-hive-migrate-workloads.md).
+* For HDInsight 3.6 to 4.0 migration, it's mandatory to migrate metadata to external metastore DB before upgrading the Hive schema version. See [migrating workloads from HDInsight 3.6 to HDInsight 4.0](./apache-hive-migrate-workloads.md).
-Because the default metastore DB has limited compute capacity, we recommend low utilization from other jobs on the cluster while migrating metadata.
+Because the default metastore DB with limited compute capacity, we recommend low utilization from other jobs on the cluster while migrating metadata.
Source and target DBs must use the same HDInsight version and the same Storage Accounts. If upgrading HDInsight versions from 3.6 to 4.0, complete the steps in this article first. Then, follow the official upgrade steps [here](./apache-hive-migrate-workloads.md).
@@ -33,7 +33,7 @@ The action is similar to replacing symlinks with their full paths.
@@ -64,13 +63,13 @@ An HDInsight cluster created only after 2020-10-15 supports SQL Export/Import fo
## Migrate using Hive script
-Clusters created before 2020-10-15 do not support export/import of the default metastore DB.
+Clusters created before 2020-10-15 don't support export/import of the default metastore DB.
For such clusters, follow the guide [Copy Hive tables across Storage Accounts](./hive-migration-across-storage-accounts.md), using a second cluster with an [external Hive metastore DB](../hdinsight-use-external-metadata-stores.md#select-a-custom-metastore-during-cluster-creation). The second cluster can use the same storage account but must use a new default filesystem.
### Option to "shallow" copy
-Storage consumption would double when tables are "deep" copied using the above guide. You need to manually clean the data in the source storage container.
-We can, instead, "shallow" copy the tables if they are non-transactional. All Hive tables in HDInsight 3.6 are non-transactional by default, but only external tables are non-transactionalin HDInsight 4.0. Transactional tables must be deep copied. Follow these steps to shallow copy non-transactional tables:
+Storage consumption would double when tables are "deep" copied using the guide. You need to manually clean the data in the source storage container.
+We can, instead, "shallow" copy the tables if they're nontransactional. All Hive tables in HDInsight 3.6 are nontransactional by default, but only external tables are nontransactionalin HDInsight 4.0. Transactional tables must be deep copied. Follow these steps to shallow copy nontransactional tables:
1. Execute script [hive-ddls.sh](https://hdiconfigactions.blob.core.windows.net/linuxhivemigrationv01/hive-ddls.sh) on the source cluster's primary headnode to generate the DDL for every Hive table.
2. The DDL is written to a local Hive script named `/tmp/hdi_hive_ddls.hql`. Execute this on the target cluster that uses an external Hive metastore DB.
articles/hdinsight/spark/apache-spark-overview.md (3 additions, 3 deletions)
@@ -43,7 +43,7 @@ Spark clusters in HDInsight offer a fully managed Spark service. Benefits of cre
| Ease creation |You can create a new Spark cluster in HDInsight in minutes using the Azure portal, Azure PowerShell, or the HDInsight .NET SDK. See [Get started with Apache Spark cluster in HDInsight](apache-spark-jupyter-spark-sql-use-portal.md). |
| Ease of use |Spark cluster in HDInsight include Jupyter Notebooks and Apache Zeppelin Notebooks. You can use these notebooks for interactive data processing and visualization. See [Use Apache Zeppelin notebooks with Apache Spark](apache-spark-zeppelin-notebook.md) and [Load data and run queries on an Apache Spark cluster](apache-spark-load-data-run-query.md).|
| REST APIs |Spark clusters in HDInsight include [Apache Livy](https://github.com/cloudera/hue/tree/master/apps/spark/java#welcome-to-livy-the-rest-spark-server), a REST API-based Spark job server to remotely submit and monitor jobs. See [Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster](apache-spark-livy-rest-interface.md).|
-| Support for Azure Storage | Spark clusters in HDInsight can use Azure Data Lake Storage Gen2 as both the primary storage or additional storage. . For more information on Data Lake Storage Gen2, see [Azure Data Lake Storage Gen2](../../storage/blobs/data-lake-storage-introduction.md).|
+| Support for Azure Storage | Spark clusters in HDInsight can use Azure Data Lake Storage Gen2 as both the primary storage or additional storage. For more information on Data Lake Storage Gen2, see [Azure Data Lake Storage Gen2](../../storage/blobs/data-lake-storage-introduction.md).|
| Integration with Azure services |Spark cluster in HDInsight comes with a connector to Azure Event Hubs. You can build streaming applications using the Event Hubs. Including Apache Kafka, which is already available as part of Spark. |
| Integration with third-party IDEs | HDInsight provides several IDE plugins that are useful to create and submit applications to an HDInsight Spark cluster. For more information, see [Use Azure Toolkit for IntelliJ IDEA](apache-spark-intellij-tool-plugin.md), [Use Spark & Hive Tools for VSCode](../hdinsight-for-vscode.md), and [Use Azure Toolkit for Eclipse](apache-spark-eclipse-tool-plugin.md).|
| Concurrent Queries |Spark clusters in HDInsight support concurrent queries. This capability enables multiple queries from one user or multiple queries from various users and applications to share the same cluster resources. |
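As a hedged sketch of the Livy row in the table above (not part of the diff), assuming the requests package and placeholder cluster name, credentials, and application path, a batch job can be submitted to the cluster's Livy endpoint like this:

```python
import requests

# Placeholder cluster name, credentials, and application path: replace with your own.
livy_url = "https://<cluster-name>.azurehdinsight.net/livy/batches"
payload = {
    "file": "wasbs:///example/jars/spark-examples.jar",
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["10"],
}

resp = requests.post(
    livy_url,
    json=payload,
    auth=("admin", "<cluster-password>"),
    headers={"X-Requested-By": "admin"},
)
resp.raise_for_status()
print(resp.json())  # batch id and state returned by Livy
```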
@@ -75,15 +75,15 @@ The SparkContext can connect to several types of cluster managers, which give re
The SparkContext runs the user's main function and executes the various parallel operations on the worker nodes. Then, the SparkContext collects the results of the operations. The worker nodes read and write data from and to the Hadoop distributed file system. The worker nodes also cache transformed data in-memory as Resilient Distributed Datasets (RDDs).
-The SparkContext connects to the Spark master and is responsible for converting an application to a directed graph (DAG) of individual tasks. Tasks that get executed within an executor process on the worker nodes. Each application gets its own executor processes. Which stay up during the whole application and run tasks in multiple threads.
+The SparkContext connects to the Spark master and is responsible for converting an application to a directed graph (DAG) of individual tasks. Tasks that get executed within an executor process on the worker nodes. Each application gets its own executor processes, which stay up during the whole application and run tasks in multiple threads.
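As a small PySpark sketch of that driver/executor split (illustrative only, not from the article): the SparkContext on the driver describes the work, executor processes on the worker nodes run the tasks, and the result comes back to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-demo").getOrCreate()
sc = spark.sparkContext

# The driver defines the computation; the tasks run inside executor processes on the workers.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
squares = rdd.map(lambda x: x * x)          # transformation, evaluated lazily on the executors
total = squares.reduce(lambda a, b: a + b)  # action, result collected back on the driver

print(total)
spark.stop()
```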
## Spark in HDInsight use cases
Spark clusters in HDInsight enable the following key scenarios:
### Interactive data analysis and BI
-Apache Spark in HDInsight stores data in Azure Blob Storage, Azure Data Lake Azure Data Lake Storage Gen2. Business experts and key decision makers can analyze and build reports over that data. And use Microsoft Power BI to build interactive reports from the analyzed data. Analysts can start from unstructured/semi structured data in cluster storage, define a schema for the data using notebooks, and then build data models using Microsoft Power BI. Spark clusters in HDInsight also support many third-party BI tools. Such as Tableau, making it easier for data analysts, business experts, and key decision makers.
+Apache Spark in HDInsight stores data in Azure Blob Storage and Azure Data Lake Storage Gen2. Business experts and key decision makers can analyze and build reports over that data. And use Microsoft Power BI to build interactive reports from the analyzed data. Analysts can start from unstructured/semi structured data in cluster storage, define a schema for the data using notebooks, and then build data models using Microsoft Power BI. Spark clusters in HDInsight also support many third-party BI tools. Such as Tableau, making it easier for data analysts, business experts, and key decision makers.
*[Tutorial: Visualize Spark data using Power BI](apache-spark-use-bi-tools.md)