Commit ad1585b

Merge pull request #110553 from dagiro/freshness42
freshness42
2 parents 34a2975 + 987dc09 commit ad1585b

File tree

1 file changed: +19 -21 lines


articles/hdinsight/hdinsight-upload-data.md

Lines changed: 19 additions & 21 deletions
@@ -1,18 +1,18 @@
 ---
 title: Upload data for Apache Hadoop jobs in HDInsight
-description: Learn how to upload and access data for Apache Hadoop jobs in HDInsight using the Azure classic CLI, Azure Storage Explorer, Azure PowerShell, the Hadoop command line, or Sqoop.
+description: Learn how to upload and access data for Apache Hadoop jobs in HDInsight. Use Azure classic CLI, Azure Storage Explorer, Azure PowerShell, the Hadoop command line, or Sqoop.
 author: hrasheed-msft
 ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
-ms.custom: hdiseo17may2017
 ms.topic: conceptual
-ms.date: 10/29/2019
+ms.custom: hdiseo17may2017
+ms.date: 04/07/2020
 ---

 # Upload data for Apache Hadoop jobs in HDInsight

-Azure HDInsight provides a full-featured Hadoop distributed file system (HDFS) over Azure Storage and Azure Data Lake Storage (Gen1 and Gen2). Azure Storage and Data Lake Storage Gen1 and Gen2 are designed as HDFS extensions to provide a seamless experience to customers. They enable the full set of components in the Hadoop ecosystem to operate directly on the data it manages. Azure Storage, Data Lake Storage Gen1, and Gen2 are distinct file systems that are optimized for storage of data and computations on that data. For information about the benefits of using Azure Storage, see [Use Azure Storage with HDInsight](hdinsight-hadoop-use-blob-storage.md), [Use Data Lake Storage Gen1 with HDInsight](hdinsight-hadoop-use-data-lake-store.md), and [Use Data Lake Storage Gen2 with HDInsight](hdinsight-hadoop-use-data-lake-storage-gen2.md).
+HDInsight provides a Hadoop distributed file system (HDFS) over Azure Storage and Azure Data Lake Storage (Gen1 and Gen2). Azure Storage and Data Lake Storage Gen1 and Gen2 are designed as HDFS extensions. They enable the full set of components in the Hadoop environment to operate directly on the data they manage. Azure Storage, Data Lake Storage Gen1, and Data Lake Storage Gen2 are distinct file systems that are optimized for storage of data and computations on that data. For information about the benefits of using Azure Storage, see [Use Azure Storage with HDInsight](hdinsight-hadoop-use-blob-storage.md). See also [Use Data Lake Storage Gen1 with HDInsight](hdinsight-hadoop-use-data-lake-store.md) and [Use Data Lake Storage Gen2 with HDInsight](hdinsight-hadoop-use-data-lake-storage-gen2.md).

 ## Prerequisites

@@ -36,16 +36,16 @@ Microsoft provides the following utilities to work with Azure Storage:
 | [Azure CLI](../storage/blobs/storage-quickstart-blobs-cli.md) ||||
 | [Azure PowerShell](../storage/blobs/storage-quickstart-blobs-powershell.md) | | ||
 | [AzCopy](../storage/common/storage-use-azcopy-v10.md) || ||
-| [Hadoop command](#commandline) ||||
+| [Hadoop command](#hadoop-command-line) ||||

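As a quick illustration of one utility from the table above, uploading a local file with the Azure CLI might look like the following minimal sketch; the storage account, container, blob, and file names are placeholders, and it assumes you've already signed in with `az login`:

```bash
# Upload a local file to a blob container (all names below are placeholders).
az storage blob upload \
    --account-name mystorageaccount \
    --container-name mycontainer \
    --name example/data/sample.log \
    --file ./sample.log \
    --auth-mode login
```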
 > [!NOTE]
 > The Hadoop command is only available on the HDInsight cluster. The command only allows loading data from the local file system into Azure Storage.

-## <a id="commandline"></a>Hadoop command line
+## Hadoop command line

 The Hadoop command line is only useful for storing data into Azure storage blob when the data is already present on the cluster head node.

-In order to use the Hadoop command, you must first connect to the headnode using [SSH or PuTTY](hdinsight-hadoop-linux-use-ssh-unix.md).
+To use the Hadoop command, you must first connect to the headnode using [SSH or PuTTY](hdinsight-hadoop-linux-use-ssh-unix.md).

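For example, the SSH connection usually takes a form like this sketch, where `sshuser` and `CLUSTERNAME` are placeholders for your SSH account and cluster name:

```bash
# Connect to the cluster head node (replace sshuser and CLUSTERNAME).
ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
```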
 Once connected, you can use the following syntax to upload a file to storage.

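A typical upload with the Hadoop shell looks something like the following sketch; the local file name and target directory are placeholders, and the cluster's default storage account is assumed:

```bash
# Copy a file from the head node's local file system into the default storage.
hdfs dfs -copyFromLocal data.txt /example/data/
```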
@@ -66,7 +66,7 @@ or
 For a list of other Hadoop commands that work with files, see [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html)

 > [!WARNING]
-> On Apache HBase clusters, the default block size used when writing data is 256 KB. While this works fine when using HBase APIs or REST APIs, using the `hadoop` or `hdfs dfs` commands to write data larger than ~12 GB results in an error. For more information, see the [storage exception for write on blob](#storageexception) section in this article.
+> On Apache HBase clusters, the default block size used when writing data is 256 KB. While this works fine when using HBase APIs or REST APIs, using the `hadoop` or `hdfs dfs` commands to write data larger than ~12 GB results in an error. For more information, see the [storage exception for write on blob](#storage-exception-for-write-on-blob) section in this article.

 ## Graphical clients

@@ -76,7 +76,7 @@ There are also several applications that provide a graphical interface for worki
 | --- |:---:|:---:|:---:|
 | [Microsoft Visual Studio Tools for HDInsight](hadoop/apache-hadoop-visual-studio-tools-get-started.md#explore-linked-resources) ||||
 | [Azure Storage Explorer](../storage/blobs/storage-quickstart-blobs-storage-explorer.md) ||||
-| [Cerulea](https://www.cerebrata.com/products/cerulean/features/azure-storage) | | ||
+| [`Cerulea`](https://www.cerebrata.com/products/cerulean/features/azure-storage) | | ||
 | [CloudXplorer](https://clumsyleaf.com/products/cloudxplorer) | | ||
 | [CloudBerry Explorer for Microsoft Azure](https://www.cloudberrylab.com/free-microsoft-azure-explorer.aspx) | | ||
 | [Cyberduck](https://cyberduck.io/) | |||
@@ -89,17 +89,17 @@ See [Mount Azure Storage as Local Drive](https://blogs.msdn.com/b/bigdatasupport

 ### Azure Data Factory

-The Azure Data Factory service is a fully managed service for composing data storage, data processing, and data movement services into streamlined, scalable, and reliable data production pipelines.
+The Azure Data Factory service is a fully managed service for composing data storage, data processing, and data movement services into streamlined, adaptable, and reliable data production pipelines.

 |Storage type|Documentation|
 |----|----|
 |Azure Blob storage|[Copy data to or from Azure Blob storage by using Azure Data Factory](../data-factory/connector-azure-blob-storage.md)|
 |Azure Data Lake Storage Gen1|[Copy data to or from Azure Data Lake Storage Gen1 by using Azure Data Factory](../data-factory/connector-azure-data-lake-store.md)|
 |Azure Data Lake Storage Gen2 |[Load data into Azure Data Lake Storage Gen2 with Azure Data Factory](../data-factory/load-azure-data-lake-storage-gen2.md)|

-### <a id="sqoop"></a>Apache Sqoop
+### Apache Sqoop

-Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS.
+Sqoop is a tool designed to transfer data between Hadoop and relational databases. Use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle, into the Hadoop distributed file system (HDFS). Transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS.

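As an illustration only, a Sqoop import from SQL Server into HDFS typically looks like the sketch below; the server, database, credentials, table, and target directory are all placeholders:

```bash
# Import a table from SQL Server into HDFS (placeholder connection details).
sqoop import \
    --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" \
    --username sqladmin \
    --password 'MyS3cret!' \
    --table Customers \
    --target-dir /example/data/customers \
    --num-mappers 1
```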
 For more information, see [Use Sqoop with HDInsight](hadoop/hdinsight-use-sqoop.md).

@@ -118,9 +118,9 @@ For more information on installing the Azure SDKs, see [Azure downloads](https:/

 ## Troubleshooting

-### <a id="storageexception"></a>Storage exception for write on blob
+### Storage exception for write on blob

-**Symptoms**: When using the `hadoop` or `hdfs dfs` commands to write files that are ~12 GB or larger on an HBase cluster, you may encounter the following error:
+**Symptoms**: When using the `hadoop` or `hdfs dfs` commands to write files that are ~12 GB or larger on an HBase cluster, you may come across the following error:

     ERROR azure.NativeAzureFileSystem: Encountered Storage Exception for write on Blob : example/test_large_file.bin._COPYING_ Exception details: null Error Code : RequestBodyTooLarge
     copyFromLocal: java.io.IOException
@@ -144,19 +144,17 @@ For more information on installing the Azure SDKs, see [Azure downloads](https:/

 **Cause**: HBase on HDInsight clusters default to a block size of 256 KB when writing to Azure storage. While it works for HBase APIs or REST APIs, it results in an error when using the `hadoop` or `hdfs dfs` command-line utilities.

-**Resolution**: Use `fs.azure.write.request.size` to specify a larger block size. You can do this on a per-use basis by using the `-D` parameter. The following command is an example using this parameter with the `hadoop` command:
+**Resolution**: Use `fs.azure.write.request.size` to specify a larger block size. You can make this change on a per-use basis by using the `-D` parameter. The following command is an example using this parameter with the `hadoop` command:

 ```bash
 hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal test_large_file.bin /example/data
 ```

 You can also increase the value of `fs.azure.write.request.size` globally by using Apache Ambari. The following steps can be used to change the value in the Ambari Web UI:

-1. In your browser, go to the Ambari Web UI for your cluster. This is `https://CLUSTERNAME.azurehdinsight.net`, where `CLUSTERNAME` is the name of your cluster.
-
-   When prompted, enter the admin name and password for the cluster.
+1. In your browser, go to the Ambari Web UI for your cluster. The URL is `https://CLUSTERNAME.azurehdinsight.net`, where `CLUSTERNAME` is the name of your cluster. When prompted, enter the admin name and password for the cluster.
 2. From the left side of the screen, select **HDFS**, and then select the **Configs** tab.
-3. In the **Filter...** field, enter `fs.azure.write.request.size`. This displays the field and current value in the middle of the page.
+3. In the **Filter...** field, enter `fs.azure.write.request.size`.
 4. Change the value from 262144 (256 KB) to the new value. For example, 4194304 (4 MB).

 ![Image of changing the value through Ambari Web UI](./media/hdinsight-upload-data/hbase-change-block-write-size.png)
@@ -165,9 +163,9 @@ For more information on using Ambari, see [Manage HDInsight clusters using the A

 ## Next steps

-Now that you understand how to get data into HDInsight, read the following articles to learn how to perform analysis:
+Now that you understand how to get data into HDInsight, read the following articles to learn how to analyze that data:

 * [Get started with Azure HDInsight](hadoop/apache-hadoop-linux-tutorial-get-started.md)
 * [Submit Apache Hadoop jobs programmatically](hadoop/submit-apache-hadoop-jobs-programmatically.md)
 * [Use Apache Hive with HDInsight](hadoop/hdinsight-use-hive.md)
-* [Use Apache Pig with HDInsight](hadoop/hdinsight-use-pig.md)
+* [Use Apache Pig with HDInsight](./use-pig.md)
