articles/storage/blobs/data-lake-storage-use-distcp.md
DistCp provides a variety of command-line parameters, and we strongly encourage you to read this article in order to optimize your usage of it.
## Prerequisites
* An Azure subscription. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
* An existing Azure Storage account without Data Lake Storage Gen2 capabilities (hierarchical namespace) enabled.
* An Azure Storage account with Data Lake Storage Gen2 capabilities (hierarchical namespace) enabled. For instructions on how to create one, see [Create an Azure Storage account](../common/storage-account-create.md).
* A container that has been created in the storage account with hierarchical namespace enabled.
* An Azure HDInsight cluster with access to a storage account with the hierarchical namespace feature enabled. See [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2?toc=%2fazure%2fstorage%2fblobs%2ftoc.json). Make sure you enable Remote Desktop for the cluster.
## Use DistCp from an HDInsight Linux cluster
An HDInsight cluster comes with the DistCp utility, which can be used to copy data from different sources into an HDInsight cluster.
2. Verify whether you can access your existing general-purpose v2 account (without hierarchical namespace enabled).
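One way to check is to list the account's contents from the cluster's SSH session. This is only a sketch: `<container>` and `<account>` are placeholders for your own container and storage account names, and the command assumes a live HDInsight cluster.

```
hdfs dfs -ls wasbs://<container>@<account>.blob.core.windows.net/
```

If the listing succeeds, the cluster can reach the account over the WASB driver.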
The command copies the contents of the **/myfolder** folder in the Data Lake Storage account to the **/example/data/gutenberg/** folder in WASB.
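The invocation would look roughly like the following sketch. The URIs are illustrative assumptions, not names from the original article: `<container>` and `<account>` are placeholders, the source uses the `abfss://` scheme for the hierarchical-namespace account, and the destination uses `wasbs://` for Blob storage.

```
hadoop distcp abfss://<container>@<account>.dfs.core.windows.net/myfolder wasbs://<container>@<account>.blob.core.windows.net/example/data/gutenberg/
```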
## Performance considerations while using DistCp
Because DistCp's lowest granularity is a single file, setting the maximum number of simultaneous copies is the most important parameter for optimizing it against Data Lake Storage. The number of simultaneous copies is equal to the value of the mappers (**m**) parameter on the command line. This parameter specifies the maximum number of mappers that are used to copy data. The default value is 20.
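As a sketch of how the mapper count is set on the command line (the URIs are placeholders, and `-m 40` is an arbitrary illustrative value, not a recommendation from the article):

```
hadoop distcp -m 40 abfss://<container>@<account>.dfs.core.windows.net/sourcefolder wasbs://<container>@<account>.blob.core.windows.net/destfolder
```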
### How do I determine the number of mappers to use?
Here's some guidance that you can use.
**Example**
Let's assume that you have a 4x D14v2s cluster and you are trying to transfer 10 TB of data from 10 different folders. Each of the folders contains varying amounts of data, and the file sizes within each folder are different.
* **Total YARN memory**: From the Ambari portal, you determine that the YARN memory is 96 GB for a D14 node. So, the total YARN memory for a four-node cluster is:
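The product can be sanity-checked with a quick shell calculation (4 nodes at 96 GB of YARN memory each, per the figures above):

```shell
# Total YARN memory = number of nodes * YARN memory per node
nodes=4
yarn_memory_per_node_gb=96
echo "$((nodes * yarn_memory_per_node_gb)) GB"   # prints "384 GB"
```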