Commit 166bd5c

Adding DistCp tool back to the TOC
1 parent 434255f commit 166bd5c

File tree

- articles/storage/blobs/TOC.yml
- articles/storage/blobs/data-lake-storage-use-distcp.md

2 files changed: +16 -14 lines changed

articles/storage/blobs/TOC.yml

Lines changed: 2 additions & 0 deletions

@@ -752,6 +752,8 @@
       href: ../common/storage-use-azcopy-configure.md?toc=%2fazure%2fstorage%2fblobs%2ftoc.json
     - name: Azure Data Factory
       href: ../../data-factory/connector-azure-blob-storage.md?toc=%2fazure%2fstorage%2fblobs%2ftoc.json
+    - name: Transfer data with the DistCp tool
+      href: data-lake-storage-use-distcp.md
     - name: Transfer data with the Data Movement library
       href: ../common/storage-use-data-movement-library.md?toc=%2fazure%2fstorage%2fblobs%2ftoc.json
     - name: Optimize

articles/storage/blobs/data-lake-storage-use-distcp.md

Lines changed: 14 additions & 14 deletions
@@ -18,11 +18,11 @@ DistCp provides a variety of command-line parameters and we strongly encourage y
 
 ## Prerequisites
 
-* **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
-* **An existing Azure Storage account without Data Lake Storage Gen2 capabilities (hierarchical namespace) enabled**.
-* **An Azure Storage account with Data Lake Storage Gen2 feature enabled**. For instructions on how to create one, see [Create an Azure Data Lake Storage Gen2 storage account](data-lake-storage-quickstart-create-account.md)
-* **A filesystem** that has been created in the storage account with hierarchical namespace enabled.
-* **Azure HDInsight cluster** with access to a storage account with Data Lake Storage Gen2 enabled. See [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2?toc=%2fazure%2fstorage%2fblobs%2ftoc.json). Make sure you enable Remote Desktop for the cluster.
+* An Azure subscription. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
+* An existing Azure Storage account without Data Lake Storage Gen2 capabilities (hierarchical namespace) enabled.
+* An Azure Storage account with Data Lake Storage Gen2 capabilities (hierarchical namespace) enabled. For instructions on how to create one, see [Create an Azure Storage account](../common/storage-account-create.md).
+* A container that has been created in the storage account with hierarchical namespace enabled.
+* An Azure HDInsight cluster with access to a storage account with the hierarchical namespace feature enabled. See [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2?toc=%2fazure%2fstorage%2fblobs%2ftoc.json). Make sure you enable Remote Desktop for the cluster.
 
 ## Use DistCp from an HDInsight Linux cluster
 
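Step 1 of this walkthrough, connecting to the cluster head node, sits in lines the diff view hides between these hunks. For orientation, a minimal sketch of that connection, assuming the default HDInsight SSH endpoint and a placeholder user name:

    # HDInsight Linux clusters expose SSH at <cluster-name>-ssh.azurehdinsight.net.
    # <ssh-user-name> is whatever you chose when you created the cluster.
    ssh <ssh-user-name>@<cluster-name>-ssh.azurehdinsight.net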
@@ -32,35 +32,35 @@ An HDInsight cluster comes with the DistCp utility, which can be used to copy da
 
 2. Verify whether you can access your existing general purpose V2 account (without hierarchical namespace enabled).
 
-       hdfs dfs -ls wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/
+       hdfs dfs -ls wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/
 
-   The output should provide a list of contents in the container.
+   The output should provide a list of contents in the container.
 
 3. Similarly, verify whether you can access the storage account with hierarchical namespace enabled from the cluster. Run the following command:
 
-       hdfs dfs -ls abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/
+       hdfs dfs -ls abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/
 
-   The output should provide a list of files/folders in the Data Lake Storage account.
+   The output should provide a list of files/folders in the Data Lake storage account.
 
 4. Use DistCp to copy data from WASB to a Data Lake Storage account.
 
-       hadoop distcp wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/example/data/gutenberg abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/myfolder
+       hadoop distcp wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example/data/gutenberg abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/myfolder
 
    The command copies the contents of the **/example/data/gutenberg/** folder in Blob storage to **/myfolder** in the Data Lake Storage account.
 
 5. Similarly, use DistCp to copy data from a Data Lake Storage account to Blob Storage (WASB).
 
-       hadoop distcp abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/myfolder wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/example/data/gutenberg
+       hadoop distcp abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/myfolder wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example/data/gutenberg
 
    The command copies the contents of **/myfolder** in the Data Lake Storage account to the **/example/data/gutenberg/** folder in WASB.
 
 ## Performance considerations while using DistCp
 
-Because DistCps lowest granularity is a single file, setting the maximum number of simultaneous copies is the most important parameter to optimize it against Data Lake Storage. Number of simultaneous copies is equal to the number of mappers (**m**) parameter on the command line. This parameter specifies the maximum number of mappers that are used to copy data. Default value is 20.
+Because DistCp's lowest granularity is a single file, setting the maximum number of simultaneous copies is the most important parameter for optimizing it against Data Lake Storage. The number of simultaneous copies is equal to the number of mappers (**m**) parameter on the command line. This parameter specifies the maximum number of mappers that are used to copy data. The default value is 20.
 
 **Example**
 
-    hadoop distcp -m 100 wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/example/data/gutenberg abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/myfolder
+    hadoop distcp -m 100 wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example/data/gutenberg abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/myfolder

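The **m** flag shown above is only one of stock Hadoop DistCp's tuning options. As a sketch that goes beyond what this commit touches, and assuming a reasonably recent Hadoop version (flag availability varies), a copy that also balances work across mappers and skips unchanged files might look like this, with the same placeholders as the steps above:

    # Stock Hadoop DistCp flags:
    #   -m 100             cap the number of simultaneous map tasks (copies)
    #   -strategy dynamic  let faster mappers pick up the remaining work
    #   -update            copy only files that are missing or have changed
    hadoop distcp -m 100 -strategy dynamic -update \
        wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example/data/gutenberg \
        abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/myfolder

With `-update`, re-running a large transfer after a failure is cheap, because files that already arrived intact are not copied again.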
 ### How do I determine the number of mappers to use?
 
@@ -74,7 +74,7 @@ Here's some guidance that you can use.
 
 **Example**
 
-Lets assume that you have a 4x D14v2s cluster and you are trying to transfer 10 TB of data from 10 different folders. Each of the folders contains varying amounts of data and the file sizes within each folder are different.
+Let's assume that you have a 4x D14v2s cluster and you are trying to transfer 10 TB of data from 10 different folders. Each of the folders contains varying amounts of data, and the file sizes within each folder are different.
 
 * **Total YARN memory**: From the Ambari portal you determine that the YARN memory is 96 GB for a D14 node. So, the total YARN memory for a four-node cluster is:

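The hunk ends mid-example, but the first step of the arithmetic follows directly from the figures above. A sketch of the full mapper calculation, assuming a hypothetical 3 GB YARN container size; the real container size for your cluster is shown in the Ambari portal:

    # Total YARN memory: 4 nodes x 96 GB per D14 node = 384 GB.
    # A rough upper bound on useful mappers is
    # total YARN memory / YARN container size.
    total_yarn_memory_gb=$((4 * 96))                     # 384
    container_size_gb=3                                  # assumed; check Ambari
    echo $((total_yarn_memory_gb / container_size_gb))   # prints 128

Passing a larger value to **m** does not add concurrency; YARN simply queues the extra map tasks until containers free up.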