Commit cce6e5d

authored
Merge pull request #230119 from normesta/gen2
Gen2 conceptual refresh
2 parents 87a6644 + c983134 commit cce6e5d

10 files changed (+36 −38 lines)

articles/storage/blobs/data-lake-storage-abfs-driver.md

Lines changed: 5 additions & 5 deletions

@@ -7,30 +7,30 @@ author: normesta
 ms.topic: conceptual
 ms.author: normesta
 ms.reviewer: jamesbak
-ms.date: 12/06/2018
+ms.date: 03/09/2023
 ms.service: storage
 ms.subservice: data-lake-storage-gen2
 ---

 # The Azure Blob Filesystem driver (ABFS): A dedicated Azure Storage driver for Hadoop

-One of the primary access methods for data in Azure Data Lake Storage Gen2 is via the [Hadoop FileSystem](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html). Data Lake Storage Gen2 allows users of Azure Blob Storage access to a new driver, the Azure Blob File System driver or `ABFS`. ABFS is part of Apache Hadoop and is included in many of the commercial distributions of Hadoop. Using this driver, many applications and frameworks can access data in Azure Blob Storage without any code explicitly referencing Data Lake Storage Gen2.
+One of the primary access methods for data in Azure Data Lake Storage Gen2 is via the [Hadoop FileSystem](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html). Data Lake Storage Gen2 allows users of Azure Blob Storage access to a new driver, the Azure Blob File System driver or `ABFS`. ABFS is part of Apache Hadoop and is included in many of the commercial distributions of Hadoop. By using the ABFS driver, many applications and frameworks can access data in Azure Blob Storage without any code explicitly referencing Data Lake Storage Gen2.

 ## Prior capability: The Windows Azure Storage Blob driver

 The Windows Azure Storage Blob driver or [WASB driver](https://hadoop.apache.org/docs/current/hadoop-azure/index.html) provided the original support for Azure Blob Storage. This driver performed the complex task of mapping file system semantics (as required by the Hadoop FileSystem interface) to that of the object store style interface exposed by Azure Blob Storage. This driver continues to support this model, providing high performance access to data stored in blobs, but contains a significant amount of code performing this mapping, making it difficult to maintain. Additionally, some operations such as [FileSystem.rename()](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d) and [FileSystem.delete()](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_deletePath_p_boolean_recursive) when applied to directories require the driver to perform a vast number of operations (due to object stores' lack of support for directories), which often leads to degraded performance. The ABFS driver was designed to overcome the inherent deficiencies of WASB.

 ## The Azure Blob File System driver

-The [Azure Data Lake Storage REST interface](/rest/api/storageservices/data-lake-storage-gen2) is designed to support file system semantics over Azure Blob Storage. Given that the Hadoop FileSystem is also designed to support the same semantics there is no requirement for a complex mapping in the driver. Thus, the Azure Blob File System driver (or ABFS) is a mere client shim for the REST API.
+The [Azure Data Lake Storage REST interface](/rest/api/storageservices/data-lake-storage-gen2) is designed to support file system semantics over Azure Blob Storage. Given that the Hadoop file system is also designed to support the same semantics, there's no requirement for a complex mapping in the driver. Thus, the Azure Blob File System driver (or ABFS) is a mere client shim for the REST API.

 However, there are some functions that the driver must still perform:

 ### URI scheme to reference data

-Consistent with other FileSystem implementations within Hadoop, the ABFS driver defines its own URI scheme so that resources (directories and files) may be distinctly addressed. The URI scheme is documented in [Use the Azure Data Lake Storage Gen2 URI](./data-lake-storage-introduction-abfs-uri.md). The structure of the URI is: `abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>`
+Consistent with other file system implementations within Hadoop, the ABFS driver defines its own URI scheme so that resources (directories and files) may be distinctly addressed. The URI scheme is documented in [Use the Azure Data Lake Storage Gen2 URI](./data-lake-storage-introduction-abfs-uri.md). The structure of the URI is: `abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>`

-Using the above URI format, standard Hadoop tools and frameworks can be used to reference these resources:
+By using this URI format, standard Hadoop tools and frameworks can be used to reference these resources:

 ```bash
 hdfs dfs -mkdir -p abfs://[email protected]/tutorials/flightdelays/data
 ```
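The URI structure described in this article can also be decomposed with plain shell parameter expansion. The sketch below is illustrative only; the container name `mycontainer` and account name `myaccount` are hypothetical placeholders, not values from the commit:

```shell
# Hypothetical ABFS URI: <file_system>@<account_name>.dfs.core.windows.net/<path>
uri="abfs://mycontainer@myaccount.dfs.core.windows.net/tutorials/flightdelays/data"

rest="${uri#abfs://}"                  # strip the scheme identifier
file_system="${rest%%@*}"              # container name, before the "@"
host="${rest#*@}"; host="${host%%/*}"  # account host, after the "@"
path="${rest#*/}"                      # directory path, after the host

echo "$file_system $host"
# → mycontainer myaccount.dfs.core.windows.net
```

The same expansions work in any POSIX shell, so the decomposition needs no external tools.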

articles/storage/blobs/data-lake-storage-best-practices.md

Lines changed: 4 additions & 4 deletions

@@ -7,7 +7,7 @@ author: normesta
 ms.subservice: data-lake-storage-gen2
 ms.service: storage
 ms.topic: conceptual
-ms.date: 09/29/2022
+ms.date: 03/09/2023
 ms.author: normesta
 ms.reviewer: sachins
 ---
@@ -37,7 +37,7 @@ Use the following pattern as you configure your account to use Blob storage feat

 #### Understand the terms used in documentation

-As you move between content sets, you'll notice some slight terminology differences. For example, content featured in the [Blob storage documentation](storage-blobs-introduction.md), will use the term *blob* instead of *file*. Technically, the files that you ingest to your storage account become blobs in your account. Therefore, the term is correct. However, the term *blob* can cause confusion if you're used to the term *file*. You'll also see the term *container* used to refer to a *file system*. Consider these terms as synonymous.
+As you move between content sets, you notice some slight terminology differences. For example, content featured in the [Blob storage documentation](storage-blobs-introduction.md) will use the term *blob* instead of *file*. Technically, the files that you ingest to your storage account become blobs in your account. Therefore, the term is correct. However, the term *blob* can cause confusion if you're used to the term *file*. You'll also see the term *container* used to refer to a *file system*. Consider these terms as synonymous.

 ## Consider premium

@@ -84,7 +84,7 @@ Consider pre-planning the structure of your data. File format, file size, and di

 ### File formats

-Data can be ingested in various formats. Data can be appear in human readable formats such as JSON, CSV, or XML or as compressed binary formats such as `.tar.gz`. Data can come in various sizes as well. Data can be composed of large files (a few terabytes) such as data from an export of a SQL table from your on-premises systems. Data can also come in the form of a large number of tiny files (a few kilobytes) such as data from real-time events from an Internet of things (IoT) solution. You can optimize efficiency and costs by choosing an appropriate file format and file size.
+Data can be ingested in various formats. Data can appear in human-readable formats such as JSON, CSV, or XML, or as compressed binary formats such as `.tar.gz`. Data can come in various sizes as well. Data can be composed of large files (a few terabytes) such as data from an export of a SQL table from your on-premises systems. Data can also come in the form of a large number of tiny files (a few kilobytes) such as data from real-time events from an Internet of things (IoT) solution. You can optimize efficiency and costs by choosing an appropriate file format and file size.

 Hadoop supports a set of file formats that are optimized for storing and processing structured data. Some common formats are Avro, Parquet, and Optimized Row Columnar (ORC) format. All of these formats are machine-readable binary file formats. They're compressed to help you manage file size. They have a schema embedded in each file, which makes them self-describing. The difference between these formats is in how data is stored. Avro stores data in a row-based format and the Parquet and ORC formats store data in a columnar format.

@@ -100,7 +100,7 @@ Larger files lead to better performance and reduced costs.

 Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size). Some engines and applications might have trouble efficiently processing files that are greater than 100 GB in size.

-Increasing file size can also reduce transaction costs. Read and write operations are billed in 4-megabyte increments so you're charged for operation whether or not the file contains 4 megabytes or only a few kilobytes. For pricing information, see [Azure Data Lake Storage pricing](https://azure.microsoft.com/pricing/details/storage/data-lake/).
+Increasing file size can also reduce transaction costs. Read and write operations are billed in 4-megabyte increments, so you're charged for an operation whether the file contains 4 megabytes of data or only a few kilobytes. For pricing information, see [Azure Data Lake Storage pricing](https://azure.microsoft.com/pricing/details/storage/data-lake/).

 Sometimes, data pipelines have limited control over the raw data, which has lots of small files. In general, we recommend that your system have some sort of process to aggregate small files into larger ones for use by downstream applications. If you're processing data in real time, you can use a real time streaming engine (such as [Azure Stream Analytics](../../stream-analytics/stream-analytics-introduction.md) or [Spark Streaming](https://databricks.com/glossary/what-is-spark-streaming)) together with a message broker (such as [Event Hubs](../../event-hubs/event-hubs-about.md) or [Apache Kafka](https://kafka.apache.org/)) to store your data as larger files. As you aggregate small files into larger ones, consider saving them in a read-optimized format such as [Apache Parquet](https://parquet.apache.org/) for downstream processing.
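The effect of the 4-megabyte billing increment can be sketched with some shell arithmetic. This is an illustration of the rounding rule only, with 4 MiB used as the increment size and made-up file sizes; consult the pricing page for actual rates:

```shell
# Each read/write operation covers up to one 4 MiB increment, so a transfer
# is billed as ceil(bytes / increment) operations.
increment=$((4 * 1024 * 1024))

billed_ops() {
  echo $(( ($1 + increment - 1) / increment ))
}

billed_ops $((256 * 1024 * 1024))   # one 256 MiB file
# → 64
billed_ops 4096                     # one 4 KiB file still costs a full operation
# → 1
```

This is why 1,000 tiny files cost 1,000 operations, while the same data merged into one larger file costs far fewer.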

articles/storage/blobs/data-lake-storage-introduction-abfs-uri.md

Lines changed: 6 additions & 6 deletions

@@ -1,12 +1,12 @@
 ---
 title: Use the Azure Data Lake Storage Gen2 URI
 titleSuffix: Azure Storage
-description: Learn URI syntax for the abfs scheme identifier, which represents the Azure Blob File System driver (Hadoop Filesystem driver for Azure Data Lake Storage Gen2).
+description: Learn URI syntax for the ABFS scheme identifier, which represents the Azure Blob File System driver (Hadoop Filesystem driver for Azure Data Lake Storage Gen2).
 author: normesta

 ms.topic: conceptual
 ms.author: normesta
-ms.date: 12/06/2018
+ms.date: 03/09/2023
 ms.service: storage
 ms.subservice: data-lake-storage-gen2
 ms.reviewer: jamesbak
@@ -24,17 +24,17 @@ If the Data Lake Storage Gen2 capable account you wish to address **is not** set

 <pre>abfs[s]<sup>1</sup>://&lt;file_system&gt;<sup>2</sup>@&lt;account_name&gt;<sup>3</sup>.dfs.core.windows.net/&lt;path&gt;<sup>4</sup>/&lt;file_name&gt;<sup>5</sup></pre>

-1. **Scheme identifier**: The `abfs` protocol is used as the scheme identifier. If you add an 's' at the end (abfs<b><i>s</i></b>) then the ABFS Hadoop client driver will <i>ALWAYS</i> use Transport Layer Security (TLS) irrespective of the authentication method chosen. If you choose OAuth as your authentication then the client driver will always use TLS even if you specify 'abfs' instead of 'abfss' because OAuth solely relies on the TLS layer. Finally, if you choose to use the older method of storage account key, then the client driver will interpret 'abfs' to mean that you do not want to use TLS.
+1. **Scheme identifier**: The `abfs` protocol is used as the scheme identifier. If you add an `s` at the end (abfs<b><i>s</i></b>), then the ABFS Hadoop client driver will always use Transport Layer Security (TLS) irrespective of the authentication method chosen. If you choose OAuth as your authentication, then the client driver will always use TLS even if you specify `abfs` instead of `abfss` because OAuth solely relies on the TLS layer. Finally, if you choose to use the older method of storage account key, then the client driver interprets `abfs` to mean that you don't want to use TLS.

-2. **File system**: The parent location that holds the files and folders. This is the same as Containers in the Azure Storage Blobs service.
+2. **File system**: The parent location that holds the files and folders. This is the same as containers in the Azure Storage Blob service.

 3. **Account name**: The name given to your storage account during creation.

 4. **Paths**: A forward slash delimited (`/`) representation of the directory structure.

-5. **File name**: The name of the individual file. This parameter is optional if you are addressing a directory.
+5. **File name**: The name of the individual file. This parameter is optional if you're addressing a directory.

-However, if the account you wish to address is set as the default file system during account creation, then the shorthand URI syntax is:
+However, if the account you want to address is set as the default file system during account creation, then the shorthand URI syntax is:

 <pre>/&lt;path&gt;<sup>1</sup>/&lt;file_name&gt;<sup>2</sup></pre>
articles/storage/blobs/data-lake-storage-introduction.md

Lines changed: 3 additions & 2 deletions

@@ -6,7 +6,7 @@ author: normesta

 ms.service: storage
 ms.topic: overview
-ms.date: 03/01/2023
+ms.date: 03/09/2023
 ms.author: normesta
 ms.reviewer: jamesbak
 ms.subservice: data-lake-storage-gen2
@@ -16,7 +16,7 @@ ms.subservice: data-lake-storage-gen2

 Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on [Azure Blob Storage](storage-blobs-introduction.md).

-Data Lake Storage Gen2 converges the capabilities of [Azure Data Lake Storage Gen1](../../data-lake-store/index.yml) with Azure Blob Storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you'll also get low-cost, tiered storage, with high availability/disaster recovery capabilities.
+Data Lake Storage Gen2 converges the capabilities of [Azure Data Lake Storage Gen1](../../data-lake-store/index.yml) with Azure Blob Storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you also get low-cost, tiered storage, with high availability/disaster recovery capabilities.

 ## Designed for enterprise big data analytics

@@ -81,6 +81,7 @@ Several open source platforms support Data Lake Storage Gen2. For a complete lis

 ## See also

+- [Introduction to Azure Data Lake Storage Gen2 (Training module)](/training/modules/introduction-to-azure-data-lake-storage/)
 - [Best practices for using Azure Data Lake Storage Gen2](data-lake-storage-best-practices.md)
 - [Known issues with Azure Data Lake Storage Gen2](data-lake-storage-known-issues.md)
 - [Multi-protocol access on Azure Data Lake Storage](data-lake-storage-multi-protocol-access.md)

articles/storage/blobs/data-lake-storage-multi-protocol-access.md

Lines changed: 3 additions & 3 deletions

@@ -7,14 +7,14 @@ author: normesta
 ms.subservice: data-lake-storage-gen2
 ms.service: storage
 ms.topic: conceptual
-ms.date: 02/25/2020
+ms.date: 03/09/2023
 ms.author: normesta
 ms.reviewer: stewu
 ---

 # Multi-protocol access on Azure Data Lake Storage

-Blob APIs now work with accounts that have a hierarchical namespace. This unlocks the ecosystem of tools, applications, and services, as well as several Blob storage features to accounts that have a hierarchical namespace.
+Blob APIs work with accounts that have a hierarchical namespace. This unlocks the ecosystem of tools, applications, and services, as well as several Blob storage features, for accounts that have a hierarchical namespace.

 Until recently, you might have had to maintain separate storage solutions for object storage and analytics storage. That's because Azure Data Lake Storage Gen2 had limited ecosystem support. It also had limited access to Blob service features such as diagnostic logging. A fragmented storage solution is hard to maintain because you have to move data between accounts to accomplish various scenarios. You no longer have to do that.

@@ -23,7 +23,7 @@ With multi-protocol access on Data Lake Storage, you can work with your data by

 Blob storage features such as [diagnostic logging](../common/storage-analytics-logging.md), [access tiers](access-tiers-overview.md), and [Blob storage lifecycle management policies](./lifecycle-management-overview.md) now work with accounts that have a hierarchical namespace. Therefore, you can enable hierarchical namespaces on your Blob storage accounts without losing access to these important features.

 > [!NOTE]
-> Multi-protocol access on Data Lake Storage is generally available and is available in all regions. Some Azure services or Blob storage features enabled by multi-protocol access remain in preview. These articles summarize the current support for Blob storage features and Azure service integrations.
+> Some Azure services or Blob storage features enabled by multi-protocol access remain in preview. These articles summarize the current support for Blob storage features and Azure service integrations.
 >
 > [Blob Storage feature support in Azure Storage accounts](storage-feature-support-in-storage-accounts.md)
 >
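One way to picture multi-protocol access: the same object in one account is reachable through both the Blob endpoint and the Data Lake Storage (DFS) endpoint, and the two URLs differ only in the host name. A small sketch, where the account name `contoso` and the object path are hypothetical:

```shell
# A hypothetical object addressed through the Blob endpoint.
blob_url="https://contoso.blob.core.windows.net/data/raw/events.json"

# The same account and object through the DFS endpoint: only ".blob." changes
# to ".dfs." in the host name.
dfs_url=$(printf '%s\n' "$blob_url" | sed 's/\.blob\./.dfs./')
echo "$dfs_url"
# → https://contoso.dfs.core.windows.net/data/raw/events.json
```

Tools that speak the Blob API and analytics frameworks that use the ABFS driver therefore operate on the same underlying data without copying it between accounts.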

articles/storage/blobs/data-lake-storage-namespace.md

Lines changed: 3 additions & 3 deletions

@@ -1,12 +1,12 @@
 ---
-title: Azure Data Lake Storage Gen2 Hierarchical Namespace
+title: Azure Data Lake Storage Gen2 hierarchical namespace
 titleSuffix: Azure Storage
 description: Describes the concept of a hierarchical namespace for Azure Data Lake Storage Gen2
 author: normesta

 ms.service: storage
 ms.topic: conceptual
-ms.date: 10/22/2021
+ms.date: 03/09/2023
 ms.author: normesta
 ms.reviewer: jamesbak
 ms.subservice: data-lake-storage-gen2
@@ -44,5 +44,5 @@ To analyze differences in data storage prices, transaction prices, and storage c

 ## Next steps

-- Enable a hierarchical namespace when you create a new storage account. See [Create a Storage account](../common/storage-account-create.md).
+- Enable a hierarchical namespace when you create a new storage account. See [Create a storage account to use with Azure Data Lake Storage Gen2](create-data-lake-storage-account.md).
 - Enable a hierarchical namespace on an existing storage account. See [Upgrade Azure Blob Storage with Azure Data Lake Storage Gen2 capabilities](upgrade-to-data-lake-storage-gen2-how-to.md).
