Commit 3b05ec9

Merge pull request #100037 from dagiro/freshness161
freshness161
2 parents b5cb3f6 + 67c1fc6

1 file changed: 80 additions, 71 deletions

---
title: Azure storage solutions for ML Services on HDInsight - Azure
description: Learn about the different storage options available with ML Services on HDInsight
ms.service: hdinsight
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.custom: hdinsightactive
ms.topic: conceptual
ms.date: 01/02/2020
---

# Azure storage solutions for ML Services on Azure HDInsight

ML Services on HDInsight can use different storage solutions to persist data, code, or objects that contain results from analysis. These solutions include the following options:

- [Azure Blob](https://azure.microsoft.com/services/storage/blobs/)
- [Azure Data Lake Storage](https://azure.microsoft.com/services/storage/data-lake-storage/)
- [Azure File storage](https://azure.microsoft.com/services/storage/files/)

You also have the option of accessing multiple Azure storage accounts or containers with your HDInsight cluster. Azure File storage is a convenient data storage option for use on the edge node that enables you to mount an Azure storage file share to, for example, the Linux file system. But Azure File shares can be mounted and used by any system that has a supported operating system such as Windows or Linux.

When you create an Apache Hadoop cluster in HDInsight, you specify either an **Azure Storage** account or **Data Lake Storage**. A specific storage container from that account holds the file system for the cluster that you create (for example, the Hadoop Distributed File System). The URI forms that address these file systems are sketched after the following links. For more information and guidance, see:

- [Use Azure Storage with HDInsight](../hdinsight-hadoop-use-blob-storage.md)
- [Use Data Lake Storage with Azure HDInsight clusters](../hdinsight-hadoop-use-data-lake-store.md)
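
The two kinds of storage are addressed with different URI schemes, as the examples later in this article show. As a quick orientation, here is a minimal sketch with placeholder account, container, and path names:

```bash
# List cluster storage on an Azure Storage (Blob) account
hadoop fs -ls wasbs://<container>@<account>.blob.core.windows.net/<path>

# List cluster storage on a Data Lake Storage Gen1 account
hadoop fs -ls adl://<account>.azuredatalakestore.net/<path>
```
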
## Use Azure Blob storage accounts with ML Services cluster

If you specified more than one storage account when creating your ML Services cluster, the following steps show how to use the default storage account:

1. Using an SSH client, connect to the edge node of your cluster. For information on using SSH with HDInsight clusters, see [Use SSH with HDInsight](../hdinsight-hadoop-linux-use-ssh-unix.md).
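
    A minimal sketch of the connection, assuming the cluster is named `CLUSTERNAME` and the SSH user is `sshuser` (both placeholders); note that the edge node has its own SSH endpoint, distinct from the head node:

    ```bash
    # Connect to the edge node of an ML Services cluster (hypothetical names)
    ssh sshuser@CLUSTERNAME-ed-ssh.azurehdinsight.net
    ```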

2. Copy a sample file, mysamplefile.csv, to the /share directory.

    ```bash
    hadoop fs -mkdir /share
    hadoop fs -copyFromLocal mysamplefile.csv /share
    ```

3. Switch to R Studio or another R console, and write R code to set the name node to **default** and the location of the file you want to access.

    ```R
    myNameNode <- "default"
    myPort <- 0

    # Location of the data:
    bigDataDirRoot <- "/share"

    # Define Spark compute context:
    mySparkCluster <- RxSpark(nameNode=myNameNode, consoleOutput=TRUE)

    # Set compute context:
    rxSetComputeContext(mySparkCluster)

    # Define the Hadoop Distributed File System (HDFS) file system:
    hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)

    # Specify the input file to analyze in HDFS:
    inputFile <- file.path(bigDataDirRoot, "mysamplefile.csv")
    ```

All the directory and file references point to the storage account `wasbs://container1@storage1.blob.core.windows.net`. This is the **default storage account** that's associated with the HDInsight cluster.
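
From here you can run RevoScaleR functions against the file. The following is a minimal sketch, with the assumption (not stated above) that `mysamplefile.csv` has a header row:

```R
# Wrap the HDFS file in a RevoScaleR text data source and summarize all columns
myData <- RxTextData(inputFile, fileSystem = hdfsFS)
rxSummary(~., data = myData)
```
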
### Use the additional storage with ML Services on HDInsight
Now, suppose you want to process a file called mysamplefile1.csv that's located in the /private directory of **container2** in **storage2**.
In your R code, point the name node reference to the **storage2** storage account.

```R
myNameNode <- "wasbs://container2@storage2.blob.core.windows.net"
myPort <- 0

# Location of the data:
bigDataDirRoot <- "/private"

# Define Spark compute context:
mySparkCluster <- RxSpark(consoleOutput=TRUE, nameNode=myNameNode, port=myPort)

# Set compute context:
rxSetComputeContext(mySparkCluster)

# Define HDFS file system:
hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)

# Specify the input file to analyze in HDFS:
inputFile <- file.path(bigDataDirRoot, "mysamplefile1.csv")
```

All of the directory and file references now point to the storage account `wasbs://container2@storage2.blob.core.windows.net`. This is the **Name Node** that you've specified.

Configure the `/user/RevoShare/<SSH username>` directory on **storage2** as follows:

```bash
hadoop fs -mkdir wasbs://container2@storage2.blob.core.windows.net/user
hadoop fs -mkdir wasbs://container2@storage2.blob.core.windows.net/user/RevoShare
hadoop fs -mkdir wasbs://container2@storage2.blob.core.windows.net/user/RevoShare/<SSH username>
```

## Use Azure Data Lake Storage with ML Services cluster

To use Data Lake Storage with your HDInsight cluster, you need to give your cluster access to each Azure Data Lake Storage account that you want to use. For instructions on how to use the Azure portal to create an HDInsight cluster with an Azure Data Lake Storage account as the default storage or as additional storage, see [Create an HDInsight cluster with Data Lake Storage using Azure portal](../../data-lake-store/data-lake-store-hdinsight-hadoop-use-portal.md).

You then use the storage in your R script much as you did the secondary Azure storage account described in the previous procedure.

### Add cluster access to your Azure Data Lake Storage

You access Data Lake Storage by using an Azure Active Directory (Azure AD) Service Principal that's associated with your HDInsight cluster.

1. When you create your HDInsight cluster, select **Cluster AAD Identity** from the **Data Source** tab.
After you give the Service Principal a name and create a password for it, click **Manage ADLS Access** to associate the Service Principal with your Data Lake Storage.

It's also possible to add cluster access to one or more Data Lake Storage accounts after cluster creation. Open the Azure portal entry for a Data Lake Storage account and go to **Data Explorer > Access > Add**.

### How to access Data Lake Storage Gen1 from ML Services on HDInsight

Once you've given access to Data Lake Storage Gen1, you can use the storage in the ML Services cluster on HDInsight the way you would a secondary Azure storage account. The only difference is that the prefix **wasbs://** changes to **adl://** as follows:

```R
# Point to the ADL Storage (e.g. ADLtest)
myNameNode <- "adl://rkadl1.azuredatalakestore.net"
myPort <- 0

# Location of the data (assumes a /share directory on the ADL account)
bigDataDirRoot <- "/share"

# Define Spark compute context
mySparkCluster <- RxSpark(consoleOutput=TRUE, nameNode=myNameNode, port=myPort)

# Set compute context
rxSetComputeContext(mySparkCluster)

# Define HDFS file system
hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)

# Specify the input file in HDFS to analyze
inputFile <- file.path(bigDataDirRoot, "mysamplefile.csv")
```
The following commands are used to configure the Data Lake Storage Gen1 account with the RevoShare directory and add the sample .csv file from the previous example:

```bash
hadoop fs -mkdir adl://rkadl1.azuredatalakestore.net/user
hadoop fs -mkdir adl://rkadl1.azuredatalakestore.net/user/RevoShare
hadoop fs -mkdir adl://rkadl1.azuredatalakestore.net/user/RevoShare/<user>

hadoop fs -mkdir adl://rkadl1.azuredatalakestore.net/share

hadoop fs -copyFromLocal "/usr/lib64/R Server-7.4.1/library/RevoScaleR/SampleData/mysamplefile.csv" adl://rkadl1.azuredatalakestore.net/share

hadoop fs -ls adl://rkadl1.azuredatalakestore.net/share
```
## Use Azure File storage with ML Services on HDInsight

There's also a convenient data storage option for use on the edge node called [Azure Files](https://azure.microsoft.com/services/storage/files/). It enables you to mount an Azure Storage file share to the Linux file system. This option can be handy for storing data files, R scripts, and result objects that might be needed later, especially when it makes sense to use the native file system on the edge node rather than HDFS.
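
As a minimal sketch, mounting a share on the edge node might look like the following; the storage account `mystorage`, share `myshare`, and mount point are placeholders, and the account key comes from the Azure portal:

```bash
# Create a mount point and mount the Azure file share over SMB 3.0 (requires cifs-utils)
sudo mkdir -p /mnt/myshare
sudo mount -t cifs //mystorage.file.core.windows.net/myshare /mnt/myshare \
    -o vers=3.0,username=mystorage,password=<storage-account-key>,dir_mode=0777,file_mode=0777
```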
A major benefit of Azure Files is that the file shares can be mounted and used by any system that has a supported OS such as Windows or Linux. For example, it can be used by another HDInsight cluster that you or someone on your team has, by an Azure VM, or even by an on-premises system. For more information, see:
- [How to use Azure File storage with Linux](../../storage/files/storage-how-to-use-files-linux.md)
- [How to use Azure File storage on Windows](../../storage/files/storage-dotnet-how-to-use-files.md)
## Next steps
- [Overview of ML Services cluster on HDInsight](r-server-overview.md)
- [Compute context options for ML Services cluster on HDInsight](r-server-compute-contexts.md)
- [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](../hdinsight-hadoop-use-data-lake-storage-gen2.md)
