Commit f01eaef (parent ccb186a): Adding back file. 1 file changed, 99 additions, 0 deletions.
---
title: Copy data into Azure Data Lake Storage Gen2 using DistCp | Microsoft Docs
description: Use DistCp tool to copy data to and from Data Lake Storage Gen2
author: normesta
ms.subservice: data-lake-storage-gen2
ms.service: storage
ms.topic: conceptual
ms.date: 12/06/2018
ms.author: normesta
ms.reviewer: stewu
---

# Use DistCp to copy data between Azure Storage Blobs and Azure Data Lake Storage Gen2

You can use [DistCp](https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html) to copy data between a general-purpose v2 storage account and a general-purpose v2 storage account that has the hierarchical namespace enabled. This article provides instructions on how to use the DistCp tool.

DistCp provides a variety of command-line parameters, and we strongly encourage you to read this article to optimize your use of it. This article shows basic functionality while focusing on copying data to an account that has the hierarchical namespace enabled.

## Prerequisites

* **An Azure subscription**. See [Get Azure free trial](https://azure.microsoft.com/pricing/free-trial/).
* **An existing Azure Storage account without Data Lake Storage Gen2 capabilities (hierarchical namespace) enabled**.
* **An Azure Storage account with the Data Lake Storage Gen2 feature enabled**. For instructions on how to create one, see [Create an Azure Data Lake Storage Gen2 storage account](data-lake-storage-quickstart-create-account.md).
* **A filesystem** that has been created in the storage account that has the hierarchical namespace enabled.
* **An Azure HDInsight cluster** with access to a storage account that has Data Lake Storage Gen2 enabled. See [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2?toc=%2fazure%2fstorage%2fblobs%2ftoc.json). Make sure you enable Remote Desktop for the cluster.

## Use DistCp from an HDInsight Linux cluster

An HDInsight cluster comes with the DistCp utility, which you can use to copy data from different sources into the cluster. If you have configured the HDInsight cluster to use Azure Blob storage and Azure Data Lake Storage together, you can use DistCp out of the box to copy data between them as well. This section shows how to use the DistCp utility.

1. Create an SSH session to your HDInsight cluster. See [Connect to a Linux-based HDInsight cluster](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).

2. Verify whether you can access your existing general-purpose v2 account (without the hierarchical namespace enabled):

   ```bash
   hdfs dfs -ls wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/
   ```

   The output should list the contents of the container.

3. Similarly, verify whether you can access the storage account with the hierarchical namespace enabled from the cluster:

   ```bash
   hdfs dfs -ls abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/
   ```

   The output should list the files and folders in the Data Lake Storage account.

4. Use DistCp to copy data from WASB to the Data Lake Storage account:

   ```bash
   hadoop distcp wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/example/data/gutenberg abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/myfolder
   ```

   The command copies the contents of the **/example/data/gutenberg/** folder in Blob storage to **/myfolder** in the Data Lake Storage account.

5. Similarly, use DistCp to copy data from the Data Lake Storage account back to Blob storage (WASB):

   ```bash
   hadoop distcp abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/myfolder wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/example/data/gutenberg
   ```

   The command copies the contents of **/myfolder** in the Data Lake Storage account to the **/example/data/gutenberg/** folder in WASB.
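The copy in step 4 can also be made incremental with DistCp's `-update` option, which skips files that already exist at the target with the same size and checksum. The sketch below only assembles and prints the command; the container, account, and file system names are hypothetical placeholders, and you would drop the `echo` to run it on the cluster:

```shell
#!/usr/bin/env bash
# Sketch: assemble an incremental WASB -> ABFS copy command.
# All three names below are hypothetical placeholders for your own resources.
CONTAINER="mycontainer"
ACCOUNT="mystorageaccount"
FILE_SYSTEM="myfilesystem"

SRC="wasbs://${CONTAINER}@${ACCOUNT}.blob.core.windows.net/example/data/gutenberg"
DST="abfss://${FILE_SYSTEM}@${ACCOUNT}.dfs.core.windows.net/myfolder"

# -update skips files already present at the target with the same size/checksum.
echo hadoop distcp -update "$SRC" "$DST"   # drop 'echo' to actually run it
```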
## Performance considerations while using DistCp

Because DistCp's lowest granularity is a single file, setting the maximum number of simultaneous copies is the most important parameter for optimizing it against Data Lake Storage. The number of simultaneous copies is set by the mappers (`-m`) parameter on the command line, which specifies the maximum number of mappers used to copy data. The default value is 20.

**Example**

```bash
hadoop distcp -m 100 wasbs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.blob.core.windows.net/example/data/gutenberg abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/myfolder
```

### How do I determine the number of mappers to use?

Here's some guidance that you can use.

* **Step 1: Determine the total memory available to the 'default' YARN app queue** - This information is available in the Ambari portal associated with the cluster. Navigate to YARN and view the **Configs** tab to see the YARN memory available to the 'default' app queue. This is the total memory available for your DistCp job (which is actually a MapReduce job).

* **Step 2: Calculate the number of mappers** - The value of **m** is the quotient of the total YARN memory divided by the YARN container size. The YARN container size is also shown on the **Configs** tab of the YARN section in the Ambari portal. The equation for the number of mappers (**m**) is:

   ```
   m = (number of nodes * YARN memory for each node) / YARN container size
   ```

**Example**

Let's assume that you have a 4x D14v2s cluster and you're trying to transfer 10 TB of data from 10 different folders. Each folder contains a varying amount of data, and the file sizes within each folder differ.

* **Total YARN memory**: From the Ambari portal, you determine that the YARN memory is 96 GB for a D14 node. So the total YARN memory for a four-node cluster is:

   ```
   YARN memory = 4 * 96 GB = 384 GB
   ```

* **Number of mappers**: From the Ambari portal, you determine that the YARN container size is 3,072 MB for a D14 cluster node. So the number of mappers is:

   ```
   m = (4 nodes * 96 GB) / 3,072 MB = 128 mappers
   ```

If other applications are using memory, you can choose to use only a portion of your cluster's YARN memory for DistCp.

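The arithmetic above can be sketched as a small shell calculation; the figures are the D14v2 example values from this section, and you would substitute the numbers read off your own Ambari portal:

```shell
#!/usr/bin/env bash
# Sketch: derive the -m value from the cluster figures read off the Ambari portal.
NODES=4                    # number of worker nodes in the cluster
YARN_MEM_PER_NODE_GB=96    # YARN memory per D14 node, from Ambari
CONTAINER_SIZE_MB=3072     # YARN container size, from Ambari

# Convert total YARN memory to MB, then divide by the container size.
TOTAL_YARN_MEM_MB=$(( NODES * YARN_MEM_PER_NODE_GB * 1024 ))
MAPPERS=$(( TOTAL_YARN_MEM_MB / CONTAINER_SIZE_MB ))
echo "$MAPPERS"   # prints 128
```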
### Copying large datasets

When the dataset to be moved is large (for example, more than 1 TB) or spread across many different folders, consider using multiple DistCp jobs. There is likely no performance gain, but it spreads out the work so that if a job fails, you only need to restart that specific job rather than the entire copy.

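As a dry-run sketch, the loop below prints one DistCp command per source folder; the folder, container, account, and file system names are hypothetical placeholders, and each printed command would be submitted as its own job:

```shell
#!/usr/bin/env bash
# Sketch: one DistCp job per top-level folder, so a failed job can be
# re-run on its own. All names below are hypothetical placeholders.
ACCOUNT="mystorageaccount"
CONTAINER="mycontainer"
FILE_SYSTEM="myfilesystem"

for FOLDER in folder1 folder2 folder3; do
  SRC="wasbs://${CONTAINER}@${ACCOUNT}.blob.core.windows.net/data/${FOLDER}"
  DST="abfss://${FILE_SYSTEM}@${ACCOUNT}.dfs.core.windows.net/data/${FOLDER}"
  echo hadoop distcp "$SRC" "$DST"   # drop 'echo' to submit the job
done
```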
### Limitations

* DistCp tries to create mappers that are similar in size to optimize performance. Increasing the number of mappers may not always increase performance.

* DistCp is limited to one mapper per file, so you should not have more mappers than you have files. Because DistCp can assign only one mapper to a file, this limits the amount of concurrency that can be used to copy large files.

* If you have a small number of large files, split them into 256-MB chunks to give yourself more potential concurrency.
