articles/hdinsight/hdinsight-hadoop-use-data-lake-store.md
+18 -21 (18 additions & 21 deletions)
@@ -7,15 +7,15 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,hdiseo17may2017
-ms.date: 03/01/2020
+ms.date: 04/24/2020
 ---

 # Use Data Lake Storage Gen1 with Azure HDInsight clusters

 > [!Note]
 > Deploy new HDInsight clusters using [Azure Data Lake Storage Gen2](hdinsight-hadoop-use-data-lake-storage-gen2.md) for improved performance and new features.

-To analyze data in HDInsight cluster, you can store the data either in [Azure Storage](../storage/common/storage-introduction.md), [Azure Data Lake Storage Gen 1](../data-lake-store/data-lake-store-overview.md), or [Azure Data Lake Storage Gen 2](../storage/blobs/data-lake-storage-introduction.md). All storage options enable you to safely delete HDInsight clusters that are used for computation without losing user data.
+To analyze data in HDInsight cluster, you can store the data either in [`Azure Storage`](../storage/common/storage-introduction.md), [Azure Data Lake Storage Gen 1](../data-lake-store/data-lake-store-overview.md), or [Azure Data Lake Storage Gen 2](../storage/blobs/data-lake-storage-introduction.md). All storage options enable you to safely delete HDInsight clusters that are used for computation without losing user data.

 In this article, you learn how Data Lake Storage Gen1 works with HDInsight clusters. To learn how Azure Storage works with HDInsight clusters, see [Use Azure Storage with Azure HDInsight clusters](hdinsight-hadoop-use-blob-storage.md). For more information about creating an HDInsight cluster, see [Create Apache Hadoop clusters in HDInsight](hdinsight-hadoop-provision-linux-clusters.md).
@@ -26,20 +26,20 @@ In this article, you learn how Data Lake Storage Gen1 works with HDInsight clust

 ## Availability for HDInsight clusters

-Apache Hadoop supports a notion of the default file system. The default file system implies a default scheme and authority. It can also be used to resolve relative paths. During the HDInsight cluster creation process, you can specify a blob container in Azure Storage as the default file system, or with HDInsight 3.5 and newer versions, you can select either Azure Storage or Azure Data Lake Storage Gen1 as the default files system with a few exceptions. Note that the cluster and the storage account must be hosted in the same region.
+Apache Hadoop supports a notion of the default file system. The default file system implies a default scheme and authority. It can also be used to resolve relative paths. During the HDInsight cluster creation process, specify a blob container in Azure Storage as the default file system. Or with HDInsight 3.5 and newer versions, you can select either Azure Storage or Azure Data Lake Storage Gen1 as the default files system with a few exceptions. The cluster and the storage account must be hosted in the same region.

 HDInsight clusters can use Data Lake Storage Gen1 in two ways:

 * As the default storage
 * As additional storage, with Azure Storage Blob as default storage.

-As of now, only some of the HDInsight cluster types/versions support using Data Lake Storage Gen1 as default storage and additional storage accounts:
+Currently, only some of the HDInsight cluster types/versions support using Data Lake Storage Gen1 as default storage and additional storage accounts:

 | HDInsight cluster type | Data Lake Storage Gen1 as default storage | Data Lake Storage Gen1 as additional storage| Notes |
 | HDInsight version 4.0 | No | No |ADLS Gen1 isn't supported with HDInsight 4.0 |
-| HDInsight version 3.6 | Yes | Yes |With the exception of HBase|
-| HDInsight version 3.5 | Yes | Yes |With the exception of HBase|
+| HDInsight version 3.6 | Yes | Yes |Except HBase|
+| HDInsight version 3.5 | Yes | Yes |Except HBase|
 | HDInsight version 3.4 | No | Yes ||
 | HDInsight version 3.3 | No | No ||
 | HDInsight version 3.2 | No | Yes ||
@@ -48,7 +48,7 @@ As of now, only some of the HDInsight cluster types/versions support using Data
 > [!WARNING]
 > HDInsight HBase is not supported with Azure Data Lake Storage Gen1

-Using Data Lake Storage Gen1 as an additional storage account doesn't affect performance or the ability to read or write to Azure storage from the cluster.
+Using Data Lake Storage Gen1 as an additional storage account doesn't affect performance. Or the ability to read or write to Azure storage from the cluster.

 ## Use Data Lake Storage Gen1 as default storage
@@ -57,9 +57,9 @@ When HDInsight is deployed with Data Lake Storage Gen1 as default storage, the c
 * Cluster1 can use the path `adl://mydatalakestore/cluster1storage`
 * Cluster2 can use the path `adl://mydatalakestore/cluster2storage`

-Notice that both the clusters use the same Data Lake Storage Gen1 account **mydatalakestore**. Each cluster has access to its own root filesystem in Data Lake Storage. The Azure portal deployment experience in particular prompts you to use a folder name such as **/clusters/\<clustername>** for the root path.
+Notice that both the clusters use the same Data Lake Storage Gen1 account **mydatalakestore**. Each cluster has access to its own root filesystem in Data Lake Storage. The Azure portal deployment experience prompts you to use a folder name such as **/clusters/\<clustername>** for the root path.

-To be able to use Data Lake Storage Gen1 as default storage, you must grant the service principal access to the following paths:
+To use Data Lake Storage Gen1 as default storage, you must grant the service principal access to the following paths:

 * The Data Lake Storage Gen1 account root. For example: adl://mydatalakestore/.
 * The folder for all cluster folders. For example: adl://mydatalakestore/clusters.
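
Reader's note on the hunk above: the path layout it describes can be sketched in a few lines of bash. This is only an illustration under stated assumptions. `adl_path` is a hypothetical helper, not part of the article or of any Azure tooling, and it reproduces the shortened `adl://<account>/<folder>` form used in the article's examples rather than anything a cluster requires.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration only): joins a
# Data Lake Storage Gen1 account name and a folder into the shortened
# adl:// form that the article's examples use.
adl_path() {
  local account="$1"
  local folder="${2#/}"   # drop one leading slash so the result has no "//"
  printf 'adl://%s/%s\n' "$account" "$folder"
}

# Default-storage roots for the two example clusters in the article.
adl_path mydatalakestore cluster1storage
adl_path mydatalakestore cluster2storage

# The two service principal paths visible in this hunk:
# the account root and the folder that holds all cluster folders.
adl_path mydatalakestore ""
adl_path mydatalakestore /clusters
```

Both clusters share one account name and differ only in the folder argument, which is the point the changed paragraph makes about each cluster having its own root.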
@@ -69,7 +69,7 @@ For more information for creating service principal and grant access, see Config

 ### Extracting a certificate from Azure Keyvault for use in cluster creation

-If you want to set up Azure Data Lake Storage Gen1 as your default storage for a new cluster and the certificate for your service principal is stored in Azure Key Vault, there are a few additional steps required to convert the certificate to the correct format. The following code snippets show how to perform the conversion.
+If the certificate for your service principal is stored in Azure Key Vault, you must convert the certificate to the correct format. The following code snippets show how to do the conversion.

 First, download the certificate from Key Vault and extract the `SecretValueText`.
 ## Use Data Lake Storage Gen1 as additional storage

-You can use Data Lake Storage Gen1 as additional storage for the cluster as well. In such cases, the cluster default storage can either be an Azure Storage Blob or a Data Lake Storage account. If you're running HDInsight jobs against the data stored in Data Lake Storage as additional storage, you must use the fully qualified path to the files. For example:
+You can use Data Lake Storage Gen1 as additional storage for the cluster as well. In such cases, the cluster default storage can either be an Azure Storage Blob or a Data Lake Storage account. When running HDInsight jobs against the data stored in Data Lake Storage as additional storage, use the fully qualified path. For example:
-Note that there's no **cluster_root_path** in the URL now. That's because Data Lake Storage isn't a default storage in this case so all you need to do is provide the path to the files.
+There's no **cluster_root_path** in the URL now. That's because Data Lake Storage isn't a default storage in this case. So all you need to do is provide the path to the files.

-To be able to use a Data Lake Storage Gen1 as additional storage, you only need to grant the service principal access to the paths where your files are stored. For example:
+To use a Data Lake Storage Gen1 as additional storage, grant the service principal access to the paths where your files are stored. For example:
 For more information for creating service principal and grant access, see Configure Data Lake Storage access.

 ## Use more than one Data Lake Storage accounts

-Adding a Data Lake Storage account as additional and adding more than one Data Lake Storage accounts are accomplished by giving the HDInsight cluster permission on data in one ore more Data Lake Storage accounts. See Configure Data Lake Storage access.
+Adding a Data Lake Storage account as additional and adding more than one Data Lake Storage accounts can be done. Give the HDInsight cluster permission on data in one or more Data Lake Storage accounts. See Configure Data Lake Storage access.

 ## Configure Data Lake Storage access
@@ -126,7 +126,7 @@ To configure Data Lake Storage access from your HDInsight cluster, you must have
 > [!NOTE]
 > If you are going to use Azure Data Lake Storage Gen1 as additional storage for HDInsight cluster, we strongly recommend that you do this while you create the cluster as described in this article. Adding Azure Data Lake Storage Gen1 as additional storage to an existing HDInsight cluster is not a supported scenario.

-For more information on the basics of the access control model for Data Lake Storage Gen1, see [Access control in Azure Data Lake Storage Gen1](../data-lake-store/data-lake-store-access-control.md).
+For more information on the access control model, see [Access control in Azure Data Lake Storage Gen1](../data-lake-store/data-lake-store-access-control.md).

 ## Access files from the cluster
@@ -156,7 +156,7 @@ Examples are based on an [ssh connection](./hdinsight-hadoop-linux-use-ssh-unix.

 #### A few hdfs commands

-1. Create a simple file on local storage.
+1. Create a file on local storage.

 ```bash
 touch testFile.txt
@@ -222,7 +222,7 @@ Use the following links for detailed instructions on how to create HDInsight clu

 ## Refresh the HDInsight certificate for Data Lake Storage Gen1 access

-The following example PowerShell code reads a certificate from a local file or Azure Key Vault, and updates your HDInsight cluster with the new certificate to access Azure Data Lake Storage Gen1. Provide your own HDInsight cluster name, resource group name, subscription ID, app ID, local path to the certificate. Type in the password when prompted.
+The following example PowerShell code reads a certificate from a local file or Azure Key Vault, and updates your HDInsight cluster with the new certificate to access Azure Data Lake Storage Gen1. Provide your own HDInsight cluster name, resource group name, subscription ID, `app ID`, local path to the certificate. Type in the password when prompted.

 ```powershell-interactive
 $clusterName = '<clustername>'
@@ -296,14 +296,11 @@ Invoke-AzResourceAction `

 ## Next steps

-In this article, you learned how to use HDFS-compatible Azure Data Lake Storage Gen1 with HDInsight. This allows you to build scalable, long-term, archiving data acquisition solutions and use HDInsight to unlock the information inside the stored structured and unstructured data.
+In this article, you learned how to use HDFS-compatible Azure Data Lake Storage Gen1 with HDInsight. This storage allows you to build adaptable, long-term, archiving data acquisition solutions. And use HDInsight to unlock the information inside the stored structured and unstructured data.

 For more information, see:

-*[Get started with Azure HDInsight](hadoop/apache-hadoop-linux-tutorial-get-started.md)
 *[Quickstart: Set up clusters in HDInsight](../storage/data-lake-storage/quickstart-create-connect-hdi-cluster.md)
 *[Create an HDInsight cluster to use Data Lake Storage Gen1 using the Azure PowerShell](../data-lake-store/data-lake-store-hdinsight-hadoop-use-powershell.md)
 *[Upload data to HDInsight](hdinsight-upload-data.md)
-*[Use Apache Hive with HDInsight](hadoop/hdinsight-use-hive.md)
 *[Use Azure Storage Shared Access Signatures to restrict access to data with HDInsight](hdinsight-storage-sharedaccesssignature-permissions.md)
-*[Tutorial: Extract, transform, and load data using Interactive Query in Azure HDInsight](./interactive-query/interactive-query-tutorial-analyze-flight-data.md)