Commit 4a31e5f

Merge pull request #229214 from normesta/gen2
Refreshing Gen2 articles
2 parents 6a48678 + 4429b10 commit 4a31e5f

15 files changed: +248 −309 lines

articles/storage/blobs/data-lake-storage-events.md

Lines changed: 116 additions & 143 deletions
Large diffs are not rendered by default.

articles/storage/blobs/data-lake-storage-integrate-with-services-tutorials.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ author: normesta

 ms.topic: conceptual
 ms.author: normesta
-ms.date: 10/06/2021
+ms.date: 03/07/2023
 ms.service: storage
 ms.subservice: data-lake-storage-gen2
 ---

articles/storage/blobs/data-lake-storage-introduction.md

Lines changed: 3 additions & 3 deletions
@@ -6,7 +6,7 @@ author: normesta

 ms.service: storage
 ms.topic: overview
-ms.date: 02/23/2022
+ms.date: 03/01/2023
 ms.author: normesta
 ms.reviewer: jamesbak
 ms.subservice: data-lake-storage-gen2
@@ -36,9 +36,9 @@ Also, Data Lake Storage Gen2 is very cost effective because it's built on top of

 ## Key features of Data Lake Storage Gen2

-- **Hadoop compatible access:** Data Lake Storage Gen2 allows you to manage and access data just as you would with a [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html). The new [ABFS driver](data-lake-storage-abfs-driver.md) (used to access data) is available within all Apache Hadoop environments. These environments include [Azure HDInsight](../../hdinsight/index.yml), [Azure Databricks](/azure/databricks/), and [Azure Synapse Analytics](../../synapse-analytics/index.yml).
+- **Hadoop compatible access:** Data Lake Storage Gen2 allows you to manage and access data just as you would with a [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html). The [ABFS driver](data-lake-storage-abfs-driver.md) (used to access data) is available within all Apache Hadoop environments. These environments include [Azure HDInsight](../../hdinsight/index.yml), [Azure Databricks](/azure/databricks/), and [Azure Synapse Analytics](../../synapse-analytics/index.yml).

-- **A superset of POSIX permissions:** The security model for Data Lake Gen2 supports ACL and POSIX permissions along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through Storage Explorer or through frameworks like Hive and Spark.
+- **A superset of POSIX permissions:** The security model for Data Lake Gen2 supports ACL and POSIX permissions along with some extra granularity specific to Data Lake Storage Gen2. Settings can be configured by using Storage Explorer, the Azure portal, PowerShell, Azure CLI, REST APIs, Azure Storage SDKs, or by using frameworks like Hive and Spark.

 - **Cost-effective:** Data Lake Storage Gen2 offers low-cost storage capacity and transactions. Features such as [Azure Blob Storage lifecycle](./lifecycle-management-overview.md) optimize costs as data transitions through its lifecycle.

articles/storage/blobs/data-lake-storage-tutorial-extract-transform-load-hive.md

Lines changed: 85 additions & 82 deletions
@@ -7,7 +7,7 @@ author: normesta
 ms.subservice: data-lake-storage-gen2
 ms.service: storage
 ms.topic: tutorial
-ms.date: 11/19/2019
+ms.date: 03/07/2023
 ms.author: normesta
 ms.reviewer: jamesbak
 #Customer intent: As an analytics user, I want to perform an ETL operation so that I can work with my data in my preferred environment.
@@ -28,38 +28,41 @@ If you don't have an Azure subscription, [create a free account](https://azure.m

 ## Prerequisites

-- **An Azure Data Lake Storage Gen2 storage account that is configured for HDInsight**
+- A storage account that has a hierarchical namespace (Azure Data Lake Storage Gen2) that is configured for HDInsight

-  See [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](../../hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2.md).
+  See [Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters](../../hdinsight/hdinsight-hadoop-use-data-lake-storage-gen2.md).

-- **A Linux-based Hadoop cluster on HDInsight**
+- A Linux-based Hadoop cluster on HDInsight
+
+  See [Quickstart: Get started with Apache Hadoop and Apache Hive in Azure HDInsight using the Azure portal](../../hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal.md).

-  See [Quickstart: Get started with Apache Hadoop and Apache Hive in Azure HDInsight using the Azure portal](../../hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal.md).
+- Azure SQL Database

-- **Azure SQL Database**: You use Azure SQL Database as a destination data store. If you don't have a database in SQL Database, see [Create a database in Azure SQL Database in the Azure portal](/azure/azure-sql/database/single-database-create-quickstart).
+  You use Azure SQL Database as a destination data store. If you don't have a database in SQL Database, see [Create a database in Azure SQL Database in the Azure portal](/azure/azure-sql/database/single-database-create-quickstart).

-- **Azure CLI**: If you haven't installed the Azure CLI, see [Install the Azure CLI](/cli/azure/install-azure-cli).
+- Azure CLI

-- **A Secure Shell (SSH) client**: For more information, see [Connect to HDInsight (Hadoop) by using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).
+  If you haven't installed the Azure CLI, see [Install the Azure CLI](/cli/azure/install-azure-cli).

+- A Secure Shell (SSH) client
+
+  For more information, see [Connect to HDInsight (Hadoop) by using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).

 ## Download, extract and then upload the data

-In this section, you'll download sample flight data. Then, you'll upload that data to your HDInsight cluster and then copy that data to your Data Lake Storage Gen2 account.
+In this section, you download sample flight data. Then, you upload that data to your HDInsight cluster and then copy that data to your Data Lake Storage Gen2 account.

 1. Download the [On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2016_1.zip](https://github.com/Azure-Samples/AzureStorageSnippets/blob/master/blobs/tutorials/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2016_1.zip) file. This file contains the flight data.

 2. Open a command prompt and use the following Secure Copy (scp) command to upload the .zip file to the HDInsight cluster head node:

    ```bash
-   scp <file-name>.zip <ssh-user-name>@<cluster-name>-ssh.azurehdinsight.net:<file-name.zip>
+   scp On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2016_1.zip <ssh-user-name>@<cluster-name>-ssh.azurehdinsight.net:
    ```

-   - Replace the `<file-name>` placeholder with the name of the .zip file.
-   - Replace the `<ssh-user-name>` placeholder with the SSH login for the HDInsight cluster.
+   - Replace the `<ssh-user-name>` placeholder with the SSH username for the HDInsight cluster.
    - Replace the `<cluster-name>` placeholder with the name of the HDInsight cluster.

-   If you use a password to authenticate your SSH login, you're prompted for the password.
+   If you use a password to authenticate your SSH username, you're prompted for the password.

    If you use a public key, you might need to use the `-i` parameter and specify the path to the matching private key. For example, `scp -i ~/.ssh/id_rsa <file_name>.zip <user-name>@<cluster-name>-ssh.azurehdinsight.net:`.
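As a local sketch of how the placeholders in the new `scp` command expand (the username and cluster name here are hypothetical, not from the tutorial):

```shell
# Hypothetical values -- substitute your own SSH username and cluster name.
ssh_user="sshuser"
cluster="mycluster"
zipfile="On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2016_1.zip"

# The trailing ':' copies the file into the remote user's home directory.
cmd="scp $zipfile $ssh_user@$cluster-ssh.azurehdinsight.net:"
echo "$cmd"
```

Running the echoed command still requires network access to a real cluster; the sketch only shows how the placeholders expand.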

@@ -96,7 +99,7 @@ In this section, you'll download sample flight data. Then, you'll upload that da
 7. Use the following command to copy the *.csv* file to the directory:

    ```bash
-   hdfs dfs -put "<file-name>.csv" abfs://<container-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/data/
+   hdfs dfs -put "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2016_1.csv" abfs://<container-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/data/
    ```

    Use quotes around the file name if the file name contains spaces or special characters.
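A minimal local illustration of that quoting note (the file is created as an empty placeholder in the current directory, not downloaded):

```shell
# The renamed .csv contains parentheses; quoting the name passes it to the
# command literally instead of letting the shell interpret those characters.
fname="On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2016_1.csv"

touch "$fname"   # quoted: creates one file with the literal name
ls "$fname"      # succeeds because the quoted name matches exactly
```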
@@ -113,71 +116,71 @@ As part of the Apache Hive job, you import the data from the .csv file into an A
    nano flightdelays.hql
    ```

-2. Modify the following text by replace the `<container-name>` and `<storage-account-name>` placeholders with your container and storage account name. Then copy and paste the text into the nano console by using pressing the SHIFT key along with the right-mouse click button.
+2. Modify the following text by replacing the `<container-name>` and `<storage-account-name>` placeholders with your container and storage account name. Then copy and paste the text into the nano console by pressing the SHIFT key along with the right mouse button.

    ```hiveql
-   DROP TABLE delays_raw;
-   -- Creates an external table over the csv file
-   CREATE EXTERNAL TABLE delays_raw (
-      YEAR string,
-      FL_DATE string,
-      UNIQUE_CARRIER string,
-      CARRIER string,
-      FL_NUM string,
-      ORIGIN_AIRPORT_ID string,
-      ORIGIN string,
-      ORIGIN_CITY_NAME string,
-      ORIGIN_CITY_NAME_TEMP string,
-      ORIGIN_STATE_ABR string,
-      DEST_AIRPORT_ID string,
-      DEST string,
-      DEST_CITY_NAME string,
-      DEST_CITY_NAME_TEMP string,
-      DEST_STATE_ABR string,
-      DEP_DELAY_NEW float,
-      ARR_DELAY_NEW float,
-      CARRIER_DELAY float,
-      WEATHER_DELAY float,
-      NAS_DELAY float,
-      SECURITY_DELAY float,
-      LATE_AIRCRAFT_DELAY float)
-   -- The following lines describe the format and location of the file
-   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-   LINES TERMINATED BY '\n'
-   STORED AS TEXTFILE
-   LOCATION 'abfs://<container-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/data';
-
-   -- Drop the delays table if it exists
-   DROP TABLE delays;
-   -- Create the delays table and populate it with data
-   -- pulled in from the CSV file (via the external table defined previously)
-   CREATE TABLE delays
-   LOCATION 'abfs://<container-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/processed'
-   AS
-   SELECT YEAR AS year,
-      FL_DATE AS flight_date,
-      substring(UNIQUE_CARRIER, 2, length(UNIQUE_CARRIER) -1) AS unique_carrier,
-      substring(CARRIER, 2, length(CARRIER) -1) AS carrier,
-      substring(FL_NUM, 2, length(FL_NUM) -1) AS flight_num,
-      ORIGIN_AIRPORT_ID AS origin_airport_id,
-      substring(ORIGIN, 2, length(ORIGIN) -1) AS origin_airport_code,
-      substring(ORIGIN_CITY_NAME, 2) AS origin_city_name,
-      substring(ORIGIN_STATE_ABR, 2, length(ORIGIN_STATE_ABR) -1) AS origin_state_abr,
-      DEST_AIRPORT_ID AS dest_airport_id,
-      substring(DEST, 2, length(DEST) -1) AS dest_airport_code,
-      substring(DEST_CITY_NAME,2) AS dest_city_name,
-      substring(DEST_STATE_ABR, 2, length(DEST_STATE_ABR) -1) AS dest_state_abr,
-      DEP_DELAY_NEW AS dep_delay_new,
-      ARR_DELAY_NEW AS arr_delay_new,
-      CARRIER_DELAY AS carrier_delay,
-      WEATHER_DELAY AS weather_delay,
-      NAS_DELAY AS nas_delay,
-      SECURITY_DELAY AS security_delay,
-      LATE_AIRCRAFT_DELAY AS late_aircraft_delay
-   FROM delays_raw;
+   DROP TABLE delays_raw;
+   -- Creates an external table over the csv file
+   CREATE EXTERNAL TABLE delays_raw (
+      YEAR string,
+      FL_DATE string,
+      UNIQUE_CARRIER string,
+      CARRIER string,
+      FL_NUM string,
+      ORIGIN_AIRPORT_ID string,
+      ORIGIN string,
+      ORIGIN_CITY_NAME string,
+      ORIGIN_CITY_NAME_TEMP string,
+      ORIGIN_STATE_ABR string,
+      DEST_AIRPORT_ID string,
+      DEST string,
+      DEST_CITY_NAME string,
+      DEST_CITY_NAME_TEMP string,
+      DEST_STATE_ABR string,
+      DEP_DELAY_NEW float,
+      ARR_DELAY_NEW float,
+      CARRIER_DELAY float,
+      WEATHER_DELAY float,
+      NAS_DELAY float,
+      SECURITY_DELAY float,
+      LATE_AIRCRAFT_DELAY float)
+   -- The following lines describe the format and location of the file
+   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
+   LINES TERMINATED BY '\n'
+   STORED AS TEXTFILE
+   LOCATION 'abfs://<container-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/data';
+
+   -- Drop the delays table if it exists
+   DROP TABLE delays;
+   -- Create the delays table and populate it with data
+   -- pulled in from the CSV file (via the external table defined previously)
+   CREATE TABLE delays
+   LOCATION 'abfs://<container-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/processed'
+   AS
+   SELECT YEAR AS year,
+      FL_DATE AS FlightDate,
+      substring(UNIQUE_CARRIER, 2, length(UNIQUE_CARRIER) -1) AS IATA_CODE_Reporting_Airline,
+      substring(CARRIER, 2, length(CARRIER) -1) AS Reporting_Airline,
+      substring(FL_NUM, 2, length(FL_NUM) -1) AS Flight_Number_Reporting_Airline,
+      ORIGIN_AIRPORT_ID AS OriginAirportID,
+      substring(ORIGIN, 2, length(ORIGIN) -1) AS OriginAirportSeqID,
+      substring(ORIGIN_CITY_NAME, 2) AS OriginCityName,
+      substring(ORIGIN_STATE_ABR, 2, length(ORIGIN_STATE_ABR) -1) AS OriginState,
+      DEST_AIRPORT_ID AS DestAirportID,
+      substring(DEST, 2, length(DEST) -1) AS DestAirportSeqID,
+      substring(DEST_CITY_NAME,2) AS DestCityName,
+      substring(DEST_STATE_ABR, 2, length(DEST_STATE_ABR) -1) AS DestState,
+      DEP_DELAY_NEW AS DepDelay,
+      ARR_DELAY_NEW AS ArrDelay,
+      CARRIER_DELAY AS CarrierDelay,
+      WEATHER_DELAY AS WeatherDelay,
+      NAS_DELAY AS NASDelay,
+      SECURITY_DELAY AS SecurityDelay,
+      LATE_AIRCRAFT_DELAY AS LateAircraftDelay
+   FROM delays_raw;
    ```

-3. Save the file by using use CTRL+X and then type `Y` when prompted.
+3. Save the file by typing CTRL+X and then typing `Y` when prompted.

 4. To start Hive and run the `flightdelays.hql` file, use the following command:
@@ -196,11 +199,11 @@ As part of the Apache Hive job, you import the data from the .csv file into an A
    ```hiveql
    INSERT OVERWRITE DIRECTORY '/tutorials/flightdelays/output'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
-   SELECT regexp_replace(origin_city_name, '''', ''),
-      avg(weather_delay)
+   SELECT regexp_replace(OriginCityName, '''', ''),
+      avg(WeatherDelay)
    FROM delays
-   WHERE weather_delay IS NOT NULL
-   GROUP BY origin_city_name;
+   WHERE WeatherDelay IS NOT NULL
+   GROUP BY OriginCityName;
    ```

 This query retrieves a list of cities that experienced weather delays, along with the average delay time, and saves it to `abfs://<container-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/output`. Later, Sqoop reads the data from this location and exports it to Azure SQL Database.
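The rows that Sqoop later reads from that output directory are tab-delimited city and average-delay pairs. A rough local sketch of that shape, using invented sample values rather than real query output:

```shell
# Invented sample rows in the same tab-delimited layout that the Hive
# query writes to the /tutorials/flightdelays/output directory.
printf 'Chicago IL\t28.5\nDenver CO\t12.0\n' > sample_output.tsv

# Each line: origin city, then average weather delay in minutes.
awk -F'\t' '{ print $1 " -> " $2 " min" }' sample_output.tsv
```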
@@ -237,11 +240,11 @@ You need the server name from SQL Database for this operation. Complete these st

 - Replace the `<server-name>` placeholder with the logical SQL server name.

-- Replace the `<admin-login>` placeholder with the admin login for SQL Database.
+- Replace the `<admin-login>` placeholder with the admin username for SQL Database.

 - Replace the `<database-name>` placeholder with the database name

-When you're prompted, enter the password for the SQL Database admin login.
+When you're prompted, enter the password for the SQL Database admin username.

 You receive output similar to the following text:
