articles/storage/blobs/data-lake-storage-abfs-driver.md (+5 −5)

@@ -7,30 +7,30 @@ author: normesta
 ms.topic: conceptual
 ms.author: normesta
 ms.reviewer: jamesbak
-ms.date: 12/06/2018
+ms.date: 03/09/2023
 ms.service: storage
 ms.subservice: data-lake-storage-gen2
 ---
 
 # The Azure Blob Filesystem driver (ABFS): A dedicated Azure Storage driver for Hadoop
 
-One of the primary access methods for data in Azure Data Lake Storage Gen2 is via the [Hadoop FileSystem](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html). Data Lake Storage Gen2 allows users of Azure Blob Storage access to a new driver, the Azure Blob File System driver or `ABFS`. ABFS is part of Apache Hadoop and is included in many of the commercial distributions of Hadoop. Using this driver, many applications and frameworks can access data in Azure Blob Storage without any code explicitly referencing Data Lake Storage Gen2.
+One of the primary access methods for data in Azure Data Lake Storage Gen2 is via the [Hadoop FileSystem](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html). Data Lake Storage Gen2 allows users of Azure Blob Storage access to a new driver, the Azure Blob File System driver or `ABFS`. ABFS is part of Apache Hadoop and is included in many of the commercial distributions of Hadoop. By using the ABFS driver, many applications and frameworks can access data in Azure Blob Storage without any code explicitly referencing Data Lake Storage Gen2.
 
 ## Prior capability: The Windows Azure Storage Blob driver
 
 The Windows Azure Storage Blob driver or [WASB driver](https://hadoop.apache.org/docs/current/hadoop-azure/index.html) provided the original support for Azure Blob Storage. This driver performed the complex task of mapping file system semantics (as required by the Hadoop FileSystem interface) to those of the object-store-style interface exposed by Azure Blob Storage. This driver continues to support this model, providing high-performance access to data stored in blobs, but contains a significant amount of code performing this mapping, making it difficult to maintain. Additionally, some operations such as [FileSystem.rename()](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d) and [FileSystem.delete()](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_deletePath_p_boolean_recursive), when applied to directories, require the driver to perform a vast number of operations (due to object stores' lack of support for directories), which often leads to degraded performance. The ABFS driver was designed to overcome the inherent deficiencies of WASB.
 
 ## The Azure Blob File System driver
 
-The [Azure Data Lake Storage REST interface](/rest/api/storageservices/data-lake-storage-gen2) is designed to support file system semantics over Azure Blob Storage. Given that the Hadoop FileSystem is also designed to support the same semantics there is no requirement for a complex mapping in the driver. Thus, the Azure Blob File System driver (or ABFS) is a mere client shim for the REST API.
+The [Azure Data Lake Storage REST interface](/rest/api/storageservices/data-lake-storage-gen2) is designed to support file system semantics over Azure Blob Storage. Given that the Hadoop file system is also designed to support the same semantics, there's no requirement for a complex mapping in the driver. Thus, the Azure Blob File System driver (or ABFS) is a mere client shim for the REST API.
 
 However, there are some functions that the driver must still perform:
 
 ### URI scheme to reference data
 
-Consistent with other FileSystem implementations within Hadoop, the ABFS driver defines its own URI scheme so that resources (directories and files) may be distinctly addressed. The URI scheme is documented in [Use the Azure Data Lake Storage Gen2 URI](./data-lake-storage-introduction-abfs-uri.md). The structure of the URI is: `abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>`
+Consistent with other file system implementations within Hadoop, the ABFS driver defines its own URI scheme so that resources (directories and files) may be distinctly addressed. The URI scheme is documented in [Use the Azure Data Lake Storage Gen2 URI](./data-lake-storage-introduction-abfs-uri.md). The structure of the URI is: `abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>`
 
-Using the above URI format, standard Hadoop tools and frameworks can be used to reference these resources:
+By using this URI format, standard Hadoop tools and frameworks can be used to reference these resources:
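The URI structure described in this article can be illustrated with a small parser. This is a minimal sketch for illustration only; the account name, file system, and path below are made up, and the real ABFS driver does not expose this function:

```python
from urllib.parse import urlparse

def parse_abfs_uri(uri: str) -> dict:
    """Split an abfs[s]:// URI into the components described above:
    scheme, file system, account name, and path."""
    parsed = urlparse(uri)
    file_system, _, account_host = parsed.netloc.partition("@")
    account_name = account_host.split(".")[0]
    return {
        "secure": parsed.scheme == "abfss",  # abfss always uses TLS
        "file_system": file_system,
        "account_name": account_name,
        "path": parsed.path.lstrip("/"),
    }

# Hypothetical account ("myaccount") and file system ("myfs"):
parts = parse_abfs_uri(
    "abfss://myfs@myaccount.dfs.core.windows.net/raw/2023/events.parquet"
)
print(parts)
```

The same components map directly onto the numbered list in the linked URI article: scheme identifier, file system, account name, paths, and file name.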
articles/storage/blobs/data-lake-storage-best-practices.md (+4 −4)

@@ -7,7 +7,7 @@ author: normesta
 ms.subservice: data-lake-storage-gen2
 ms.service: storage
 ms.topic: conceptual
-ms.date: 09/29/2022
+ms.date: 03/09/2023
 ms.author: normesta
 ms.reviewer: sachins
 ---
@@ -37,7 +37,7 @@ Use the following pattern as you configure your account to use Blob storage feat
 
 #### Understand the terms used in documentation
 
-As you move between content sets, you'll notice some slight terminology differences. For example, content featured in the [Blob storage documentation](storage-blobs-introduction.md), will use the term *blob* instead of *file*. Technically, the files that you ingest to your storage account become blobs in your account. Therefore, the term is correct. However, the term *blob* can cause confusion if you're used to the term *file*. You'll also see the term *container* used to refer to a *file system*. Consider these terms as synonymous.
+As you move between content sets, you notice some slight terminology differences. For example, content featured in the [Blob storage documentation](storage-blobs-introduction.md) will use the term *blob* instead of *file*. Technically, the files that you ingest to your storage account become blobs in your account. Therefore, the term is correct. However, the term *blob* can cause confusion if you're used to the term *file*. You'll also see the term *container* used to refer to a *file system*. Consider these terms as synonymous.
 
 ## Consider premium
@@ -84,7 +84,7 @@ Consider pre-planning the structure of your data. File format, file size, and di
 
 ### File formats
 
-Data can be ingested in various formats. Data can be appear in human readable formats such as JSON, CSV, or XML or as compressed binary formats such as `.tar.gz`. Data can come in various sizes as well. Data can be composed of large files (a few terabytes) such as data from an export of a SQL table from your on-premises systems. Data can also come in the form of a large number of tiny files (a few kilobytes) such as data from real-time events from an Internet of things (IoT) solution. You can optimize efficiency and costs by choosing an appropriate file format and file size.
+Data can be ingested in various formats. Data can appear in human-readable formats such as JSON, CSV, or XML, or as compressed binary formats such as `.tar.gz`. Data can come in various sizes as well. Data can be composed of large files (a few terabytes) such as data from an export of a SQL table from your on-premises systems. Data can also come in the form of a large number of tiny files (a few kilobytes) such as data from real-time events from an Internet of Things (IoT) solution. You can optimize efficiency and costs by choosing an appropriate file format and file size.
 
 Hadoop supports a set of file formats that are optimized for storing and processing structured data. Some common formats are Avro, Parquet, and Optimized Row Columnar (ORC) format. All of these formats are machine-readable binary file formats. They're compressed to help you manage file size. They have a schema embedded in each file, which makes them self-describing. The difference between these formats is in how data is stored. Avro stores data in a row-based format and the Parquet and ORC formats store data in a columnar format.
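The row-based versus columnar distinction mentioned for Avro and Parquet/ORC can be sketched with plain Python structures. This is purely illustrative (toy records, not the actual on-disk encodings of either format):

```python
# The same three records, laid out two ways.
records = [
    {"id": 1, "city": "Oslo",  "temp": 3.1},
    {"id": 2, "city": "Cairo", "temp": 30.4},
    {"id": 3, "city": "Lima",  "temp": 18.2},
]

# Row-based layout (Avro-style): all values of one record stored together.
row_store = [tuple(r.values()) for r in records]

# Columnar layout (Parquet/ORC-style): all values of one column stored together.
column_store = {key: [r[key] for r in records] for key in records[0]}

# An analytic scan (e.g., average temperature) touches only one column here,
# which is why columnar formats suit column-oriented analytic queries.
avg_temp = sum(column_store["temp"]) / len(column_store["temp"])
print(round(avg_temp, 2))
```

Row-based layouts favor writing and reading whole records; columnar layouts favor scanning a few columns across many records, which matches typical analytics workloads.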
@@ -100,7 +100,7 @@ Larger files lead to better performance and reduced costs.
 
 Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size). Some engines and applications might have trouble efficiently processing files that are greater than 100 GB in size.
 
-Increasing file size can also reduce transaction costs. Read and write operations are billed in 4-megabyte increments so you're charged for operation whether or not the file contains 4 megabytes or only a few kilobytes. For pricing information, see [Azure Data Lake Storage pricing](https://azure.microsoft.com/pricing/details/storage/data-lake/).
+Increasing file size can also reduce transaction costs. Read and write operations are billed in 4-megabyte increments, so you're charged for an operation whether the file contains 4 megabytes or only a few kilobytes. For pricing information, see [Azure Data Lake Storage pricing](https://azure.microsoft.com/pricing/details/storage/data-lake/).
 
 Sometimes, data pipelines have limited control over the raw data, which has lots of small files. In general, we recommend that your system have some sort of process to aggregate small files into larger ones for use by downstream applications. If you're processing data in real time, you can use a real-time streaming engine (such as [Azure Stream Analytics](../../stream-analytics/stream-analytics-introduction.md) or [Spark Streaming](https://databricks.com/glossary/what-is-spark-streaming)) together with a message broker (such as [Event Hubs](../../event-hubs/event-hubs-about.md) or [Apache Kafka](https://kafka.apache.org/)) to store your data as larger files. As you aggregate small files into larger ones, consider saving them in a read-optimized format such as [Apache Parquet](https://parquet.apache.org/) for downstream processing.
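The effect of the 4-megabyte billing increment on file size can be illustrated with a little arithmetic. This is a sketch of the increment rule only; consult the linked pricing page for actual rates and billing details:

```python
import math

INCREMENT = 4 * 1024 * 1024  # read/write operations billed in 4-MiB increments

def billed_increments(file_size_bytes: int) -> int:
    """Billable units to read one file sequentially.
    Even a few-kilobyte file costs a full increment."""
    return max(1, math.ceil(file_size_bytes / INCREMENT))

one_gib = 1024 ** 3
# One 1-GiB file: 256 increments.
large = billed_increments(one_gib)
# The same gigabyte split into 10,000 small files (~105 KB each):
# one increment per file, so ~39x the billed operations.
small = 10_000 * billed_increments(one_gib // 10_000)
print(large, small)  # 256 vs 10000 billed increments
```

This is why aggregating many tiny files into fewer large ones reduces transaction costs as well as per-file metadata overhead.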
articles/storage/blobs/data-lake-storage-introduction-abfs-uri.md (+6 −6)

@@ -1,12 +1,12 @@
 ---
 title: Use the Azure Data Lake Storage Gen2 URI
 titleSuffix: Azure Storage
-description: Learn URI syntax for the abfs scheme identifier, which represents the Azure Blob File System driver (Hadoop Filesystem driver for Azure Data Lake Storage Gen2).
+description: Learn URI syntax for the ABFS scheme identifier, which represents the Azure Blob File System driver (Hadoop Filesystem driver for Azure Data Lake Storage Gen2).
 author: normesta
 
 ms.topic: conceptual
 ms.author: normesta
-ms.date: 12/06/2018
+ms.date: 03/09/2023
 ms.service: storage
 ms.subservice: data-lake-storage-gen2
 ms.reviewer: jamesbak
@@ -24,17 +24,17 @@ If the Data Lake Storage Gen2 capable account you wish to address **is not** set
-1. **Scheme identifier**: The `abfs` protocol is used as the scheme identifier. If you add an 's' at the end (abfs<b><i>s</i></b>) then the ABFS Hadoop client driver will <i>ALWAYS</i> use Transport Layer Security (TLS) irrespective of the authentication method chosen. If you choose OAuth as your authentication then the client driver will always use TLS even if you specify 'abfs' instead of 'abfss' because OAuth solely relies on the TLS layer. Finally, if you choose to use the older method of storage account key, then the client driver will interpret 'abfs' to mean that you do not want to use TLS.
+1. **Scheme identifier**: The `abfs` protocol is used as the scheme identifier. If you add an `s` at the end (abfs<b><i>s</i></b>), then the ABFS Hadoop client driver will always use Transport Layer Security (TLS) irrespective of the authentication method chosen. If you choose OAuth as your authentication, then the client driver will always use TLS even if you specify `abfs` instead of `abfss`, because OAuth solely relies on the TLS layer. Finally, if you choose to use the older method of storage account key, then the client driver interprets `abfs` to mean that you don't want to use TLS.
 
-2. **File system**: The parent location that holds the files and folders. This is the same as Containers in the Azure Storage Blobs service.
+2. **File system**: The parent location that holds the files and folders. This is the same as containers in the Azure Storage Blob service.
 
 3. **Account name**: The name given to your storage account during creation.
 
 4. **Paths**: A forward slash delimited (`/`) representation of the directory structure.
 
-5. **File name**: The name of the individual file. This parameter is optional if you are addressing a directory.
+5. **File name**: The name of the individual file. This parameter is optional if you're addressing a directory.
 
-However, if the account you wish to address is set as the default file system during account creation, then the shorthand URI syntax is:
+However, if the account you want to address is set as the default file system during account creation, then the shorthand URI syntax is:
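The TLS rules described for the scheme identifier (item 1) can be condensed into a small decision function. This is a sketch of the behavior as documented, not the driver's actual implementation, and the `auth_method` labels are made up for illustration:

```python
def uses_tls(scheme: str, auth_method: str) -> bool:
    """Whether the ABFS client uses TLS, per the documented rules."""
    if scheme == "abfss":
        return True   # abfss always uses TLS, regardless of auth method
    if auth_method == "oauth":
        return True   # OAuth relies solely on the TLS layer
    return False      # storage account key over plain abfs: no TLS

# OAuth forces TLS even when the URI says plain abfs.
print(uses_tls("abfs", "oauth"))
```

The only combination without TLS is the older storage-account-key method paired with a plain `abfs` URI.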
 Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on [Azure Blob Storage](storage-blobs-introduction.md).
 
-Data Lake Storage Gen2 converges the capabilities of [Azure Data Lake Storage Gen1](../../data-lake-store/index.yml) with Azure Blob Storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you'll also get low-cost, tiered storage, with high availability/disaster recovery capabilities.
+Data Lake Storage Gen2 converges the capabilities of [Azure Data Lake Storage Gen1](../../data-lake-store/index.yml) with Azure Blob Storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you also get low-cost, tiered storage, with high availability/disaster recovery capabilities.
 
 ## Designed for enterprise big data analytics
@@ -81,6 +81,7 @@ Several open source platforms support Data Lake Storage Gen2. For a complete lis
 
 ## See also
 
+- [Introduction to Azure Data Lake Storage Gen2 (Training module)](/training/modules/introduction-to-azure-data-lake-storage/)
 - [Best practices for using Azure Data Lake Storage Gen2](data-lake-storage-best-practices.md)
 - [Known issues with Azure Data Lake Storage Gen2](data-lake-storage-known-issues.md)
 - [Multi-protocol access on Azure Data Lake Storage](data-lake-storage-multi-protocol-access.md)
articles/storage/blobs/data-lake-storage-multi-protocol-access.md (+3 −3)

@@ -7,14 +7,14 @@ author: normesta
 ms.subservice: data-lake-storage-gen2
 ms.service: storage
 ms.topic: conceptual
-ms.date: 02/25/2020
+ms.date: 03/09/2023
 ms.author: normesta
 ms.reviewer: stewu
 ---
 
 # Multi-protocol access on Azure Data Lake Storage
 
-Blob APIs now work with accounts that have a hierarchical namespace. This unlocks the ecosystem of tools, applications, and services, as well as several Blob storage features to accounts that have a hierarchical namespace.
+Blob APIs work with accounts that have a hierarchical namespace. This unlocks the ecosystem of tools, applications, and services, as well as several Blob storage features, for accounts that have a hierarchical namespace.
 
 Until recently, you might have had to maintain separate storage solutions for object storage and analytics storage. That's because Azure Data Lake Storage Gen2 had limited ecosystem support. It also had limited access to Blob service features such as diagnostic logging. A fragmented storage solution is hard to maintain because you have to move data between accounts to accomplish various scenarios. You no longer have to do that.
 
@@ -23,7 +23,7 @@ With multi-protocol access on Data Lake Storage, you can work with your data by
 Blob storage features such as [diagnostic logging](../common/storage-analytics-logging.md), [access tiers](access-tiers-overview.md), and [Blob storage lifecycle management policies](./lifecycle-management-overview.md) now work with accounts that have a hierarchical namespace. Therefore, you can enable hierarchical namespaces on your Blob storage accounts without losing access to these important features.
 
 > [!NOTE]
-> Multi-protocol access on Data Lake Storage is generally available and is available in all regions. Some Azure services or Blob storage features enabled by multi-protocol access remain in preview. These articles summarize the current support for Blob storage features and Azure service integrations.
+> Some Azure services or Blob storage features enabled by multi-protocol access remain in preview. These articles summarize the current support for Blob storage features and Azure service integrations.
 >
 > [Blob Storage feature support in Azure Storage accounts](storage-feature-support-in-storage-accounts.md)
articles/storage/blobs/data-lake-storage-namespace.md (+3 −3)

@@ -1,12 +1,12 @@
 ---
-title: Azure Data Lake Storage Gen2 Hierarchical Namespace
+title: Azure Data Lake Storage Gen2 hierarchical namespace
 titleSuffix: Azure Storage
 description: Describes the concept of a hierarchical namespace for Azure Data Lake Storage Gen2
 author: normesta
 
 ms.service: storage
 ms.topic: conceptual
-ms.date: 10/22/2021
+ms.date: 03/09/2023
 ms.author: normesta
 ms.reviewer: jamesbak
 ms.subservice: data-lake-storage-gen2
@@ -44,5 +44,5 @@ To analyze differences in data storage prices, transaction prices, and storage c
 
 ## Next steps
 
-- Enable a hierarchical namespace when you create a new storage account. See [Create a Storage account](../common/storage-account-create.md).
+- Enable a hierarchical namespace when you create a new storage account. See [Create a storage account to use with Azure Data Lake Storage Gen2](create-data-lake-storage-account.md).
 - Enable a hierarchical namespace on an existing storage account. See [Upgrade Azure Blob Storage with Azure Data Lake Storage Gen2 capabilities](upgrade-to-data-lake-storage-gen2-how-to.md).