`articles/storage/blobs/data-lake-storage-introduction.md` — 3 additions, 3 deletions
```diff
@@ -6,7 +6,7 @@ author: normesta
 ms.service: storage
 ms.topic: overview
-ms.date: 02/23/2022
+ms.date: 03/01/2023
 ms.author: normesta
 ms.reviewer: jamesbak
 ms.subservice: data-lake-storage-gen2
```
```diff
@@ -36,9 +36,9 @@ Also, Data Lake Storage Gen2 is very cost effective because it's built on top of
 
 ## Key features of Data Lake Storage Gen2
 
-- **Hadoop compatible access:** Data Lake Storage Gen2 allows you to manage and access data just as you would with a [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html). The new [ABFS driver](data-lake-storage-abfs-driver.md) (used to access data) is available within all Apache Hadoop environments. These environments include [Azure HDInsight](../../hdinsight/index.yml), [Azure Databricks](/azure/databricks/), and [Azure Synapse Analytics](../../synapse-analytics/index.yml).
+- **Hadoop compatible access:** Data Lake Storage Gen2 allows you to manage and access data just as you would with a [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html). The [ABFS driver](data-lake-storage-abfs-driver.md) (used to access data) is available within all Apache Hadoop environments. These environments include [Azure HDInsight](../../hdinsight/index.yml), [Azure Databricks](/azure/databricks/), and [Azure Synapse Analytics](../../synapse-analytics/index.yml).
 
-- **A superset of POSIX permissions:** The security model for Data Lake Gen2 supports ACL and POSIX permissions along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through Storage Explorer or through frameworks like Hive and Spark.
+- **A superset of POSIX permissions:** The security model for Data Lake Gen2 supports ACL and POSIX permissions along with some extra granularity specific to Data Lake Storage Gen2. Settings can be configured by using Storage Explorer, the Azure portal, PowerShell, Azure CLI, REST APIs, Azure Storage SDKs, or by using frameworks like Hive and Spark.
 
 - **Cost-effective:** Data Lake Storage Gen2 offers low-cost storage capacity and transactions. Features such as [Azure Blob Storage lifecycle](./lifecycle-management-overview.md) optimize costs as data transitions through its lifecycle.
```
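The "superset of POSIX permissions" bullet above can be illustrated with a toy evaluation model. This is a sketch of the general idea only, not the Azure Storage SDK: each ACL entry grants read/write/execute bits to the owning user, the owning group, a named user, or everyone else, with named-user entries being the "extra granularity" beyond plain POSIX bits. All names here are hypothetical.

```python
# Toy model of POSIX permission bits plus named-user ACL entries.
# Illustrative only; this is not an Azure API.
from dataclasses import dataclass

@dataclass(frozen=True)
class AclEntry:
    scope: str   # "user", "group", "other", or "user:<name>"
    perms: str   # subset of "rwx", e.g. "r-x" or "---"

def is_allowed(entries, principal, owner, group_members, want):
    """Return True if `principal` has permission `want` ('r', 'w', or 'x')."""
    # Named-user entries take precedence over the generic classes,
    # mirroring POSIX ACL evaluation order.
    for e in entries:
        if e.scope == f"user:{principal}":
            return want in e.perms
    for e in entries:
        if e.scope == "user" and principal == owner:
            return want in e.perms
        if e.scope == "group" and principal in group_members:
            return want in e.perms
    for e in entries:
        if e.scope == "other":
            return want in e.perms
    return False

acl = [
    AclEntry("user", "rwx"),          # owning user
    AclEntry("user:analyst", "r-x"),  # named-user entry (the ACL extension)
    AclEntry("group", "r--"),
    AclEntry("other", "---"),
]
```

In this toy model, `analyst` can read but not write, while principals matching no entry are denied.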
`articles/storage/blobs/data-lake-storage-tutorial-extract-transform-load-hive.md` — 85 additions, 82 deletions
```diff
@@ -7,7 +7,7 @@ author: normesta
 ms.subservice: data-lake-storage-gen2
 ms.service: storage
 ms.topic: tutorial
-ms.date: 11/19/2019
+ms.date: 03/07/2023
 ms.author: normesta
 ms.reviewer: jamesbak
 #Customer intent: As an analytics user, I want to perform an ETL operation so that I can work with my data in my preferred environment.
```
```diff
@@ -28,38 +28,41 @@ If you don't have an Azure subscription, [create a free account](https://azure.m
 
 ## Prerequisites
 
-- **An Azure Data Lake Storage Gen2 storage account that is configured for HDInsight**
+- A storage account that has a hierarchical namespace (Azure Data Lake Storage Gen2) that is configured for HDInsight
 
   See [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](../../hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2.md).
 
-- **A Linux-based Hadoop cluster on HDInsight**
+- A Linux-based Hadoop cluster on HDInsight
+
+  See [Quickstart: Get started with Apache Hadoop and Apache Hive in Azure HDInsight using the Azure portal](../../hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal.md).
 
-  See [Quickstart: Get started with Apache Hadoop and Apache Hive in Azure HDInsight using the Azure portal](../../hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal.md).
+- Azure SQL Database
 
-- **Azure SQL Database**: You use Azure SQL Database as a destination data store. If you don't have a database in SQL Database, see [Create a database in Azure SQL Database in the Azure portal](/azure/azure-sql/database/single-database-create-quickstart).
+  You use Azure SQL Database as a destination data store. If you don't have a database in SQL Database, see [Create a database in Azure SQL Database in the Azure portal](/azure/azure-sql/database/single-database-create-quickstart).
 
-- **Azure CLI**: If you haven't installed the Azure CLI, see [Install the Azure CLI](/cli/azure/install-azure-cli).
+- Azure CLI
+
+  If you haven't installed the Azure CLI, see [Install the Azure CLI](/cli/azure/install-azure-cli).
 
-- **A Secure Shell (SSH) client**: For more information, see [Connect to HDInsight (Hadoop) by using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).
+- A Secure Shell (SSH) client
+
+  For more information, see [Connect to HDInsight (Hadoop) by using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).
 
 ## Download, extract and then upload the data
 
-In this section, you'll download sample flight data. Then, you'll upload that data to your HDInsight cluster and then copy that data to your Data Lake Storage Gen2 account.
+In this section, you download sample flight data. Then, you upload that data to your HDInsight cluster and then copy that data to your Data Lake Storage Gen2 account.
 
 1. Download the [On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2016_1.zip](https://github.com/Azure-Samples/AzureStorageSnippets/blob/master/blobs/tutorials/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2016_1.zip) file. This file contains the flight data.
 
 2. Open a command prompt and use the following Secure Copy (Scp) command to upload the .zip file to the HDInsight cluster head node:
 
    - Replace the `<file-name>` placeholder with the name of the .zip file.
-   - Replace the `<ssh-user-name>` placeholder with the SSH login for the HDInsight cluster.
+   - Replace the `<ssh-user-name>` placeholder with the SSH username for the HDInsight cluster.
    - Replace the `<cluster-name>` placeholder with the name of the HDInsight cluster.
 
-   If you use a password to authenticate your SSH login, you're prompted for the password.
+   If you use a password to authenticate your SSH username, you're prompted for the password.
 
    If you use a public key, you might need to use the `-i` parameter and specify the path to the matching private key. For example, `scp -i ~/.ssh/id_rsa <file_name>.zip <user-name>@<cluster-name>-ssh.azurehdinsight.net:`.
```
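The scp upload described in step 2 is just the three placeholders assembled into one command. A minimal sketch of that assembly (the sample values are hypothetical; substitute your own SSH username and cluster name):

```python
# Assemble the scp command from step 2 of the tutorial from its placeholders.
# The example values below are hypothetical, not real resources.
def build_scp_command(file_name: str, ssh_user_name: str, cluster_name: str) -> str:
    # The trailing ':' means "copy into the remote user's home directory".
    return (
        f"scp {file_name} "
        f"{ssh_user_name}@{cluster_name}-ssh.azurehdinsight.net:"
    )

cmd = build_scp_command(
    "On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2016_1.zip",
    "sshuser",
    "mycluster",
)
print(cmd)
```

Running the assembled command (e.g. via a shell) then prompts for the SSH password, or uses the `-i` key path mentioned above.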
```diff
@@ -96,7 +99,7 @@ In this section, you'll download sample flight data. Then, you'll upload that da
 7. Use the following command to copy the *.csv* file to the directory:
 
    Use quotes around the file name if the file name contains spaces or special characters.
```
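The quoting advice in step 7 can be illustrated with Python's standard-library `shlex.quote`, which adds shell quoting only when a file name actually needs it:

```python
import shlex

# shlex.quote leaves shell-safe names untouched and quotes the rest.
plain = shlex.quote("flights.csv")
spaced = shlex.quote("flight data 2016.csv")
print(plain)
print(spaced)
```

A plain name like `flights.csv` passes through unchanged, while a name with spaces comes back wrapped in single quotes, safe to paste into the copy command.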
````diff
@@ -113,71 +116,71 @@ As part of the Apache Hive job, you import the data from the .csv file into an A
    nano flightdelays.hql
    ```
 
-2. Modify the following text by replace the `<container-name>` and `<storage-account-name>` placeholders with your container and storage account name. Then copy and paste the text into the nano console by using pressing the SHIFT key along with the right-mouse click button.
+2. Modify the following text by replacing the `<container-name>` and `<storage-account-name>` placeholders with your container and storage account name. Then copy and paste the text into the nano console by pressing the SHIFT key together with the right mouse button.
 
    ```hiveql
-   DROP TABLE delays_raw;
-   -- Creates an external table over the csv file
-   CREATE EXTERNAL TABLE delays_raw (
-      YEAR string,
-      FL_DATE string,
-      UNIQUE_CARRIER string,
-      CARRIER string,
-      FL_NUM string,
-      ORIGIN_AIRPORT_ID string,
-      ORIGIN string,
-      ORIGIN_CITY_NAME string,
-      ORIGIN_CITY_NAME_TEMP string,
-      ORIGIN_STATE_ABR string,
-      DEST_AIRPORT_ID string,
-      DEST string,
-      DEST_CITY_NAME string,
-      DEST_CITY_NAME_TEMP string,
-      DEST_STATE_ABR string,
-      DEP_DELAY_NEW float,
-      ARR_DELAY_NEW float,
-      CARRIER_DELAY float,
-      WEATHER_DELAY float,
-      NAS_DELAY float,
-      SECURITY_DELAY float,
-      LATE_AIRCRAFT_DELAY float)
-   -- The following lines describe the format and location of the file
 
 This query retrieves a list of cities that experienced weather delays, along with the average delay time, and saves it to `abfs://<container-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/output`. Later, Sqoop reads the data from this location and exports it to Azure SQL Database.
````
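The final step of the Hive job described above (cities that experienced weather delays, with the average delay per city) can be mirrored by a small local sketch. This is not the tutorial's Hive query, just an assumed-equivalent aggregation over a few hypothetical rows shaped like the `delays` table:

```python
from collections import defaultdict

# Hypothetical rows shaped like (ORIGIN_CITY_NAME, WEATHER_DELAY).
# A None delay stands in for a NULL in the Hive table.
rows = [
    ("Chicago, IL", 30.0),
    ("Chicago, IL", 0.0),
    ("Denver, CO", 45.0),
    ("Seattle, WA", None),
    ("Denver, CO", 15.0),
]

def avg_weather_delay(rows):
    """Average WEATHER_DELAY per city, skipping rows with no recorded delay."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum, count]
    for city, delay in rows:
        if delay is not None:
            totals[city][0] += delay
            totals[city][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}

print(avg_weather_delay(rows))
```

Sqoop then plays the role of the final `print`: it picks up the aggregated output from the `abfs://` location and exports it to SQL Database.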
```diff
@@ -237,11 +240,11 @@ You need the server name from SQL Database for this operation. Complete these st
    - Replace the `<server-name>` placeholder with the logical SQL server name.
 
-   - Replace the `<admin-login>` placeholder with the admin login for SQL Database.
+   - Replace the `<admin-login>` placeholder with the admin username for SQL Database.
 
    - Replace the `<database-name>` placeholder with the database name
 
-   When you're prompted, enter the password for the SQL Database admin login.
+   When you're prompted, enter the password for the SQL Database admin username.
```
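The `<server-name>` and `<database-name>` placeholders in this last hunk typically end up in a JDBC connection string for the Sqoop export. The tutorial's exact string is not shown in this diff, so treat this sketch as an assumption that follows the standard Microsoft JDBC driver URL format:

```python
# Build a SQL Server JDBC URL from the tutorial's placeholders.
# The URL shape follows the Microsoft JDBC driver convention
# (jdbc:sqlserver://<host>:<port>;databaseName=<db>); the tutorial's
# exact connection string is not visible in this diff.
def build_jdbc_url(server_name: str, database_name: str) -> str:
    return (
        f"jdbc:sqlserver://{server_name}.database.windows.net:1433;"
        f"databaseName={database_name}"
    )

print(build_jdbc_url("myserver", "mydb"))
```

The admin username and password from the placeholders above are then supplied separately to Sqoop rather than embedded in the URL.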