Commit 24c1554

Merge pull request #53893 from roygara/serviceScrub
Reworking all articles to make it explicitly clear that this is *not* a service.
2 parents 9babd3a + 4ea5d85 commit 24c1554

12 files changed, 72 insertions(+), 71 deletions(-)

articles/storage/data-lake-storage/abfs-driver.md

Lines changed: 2 additions & 2 deletions
@@ -13,11 +13,11 @@ ms.component: data-lake-storage-gen2
1313

1414
# The Azure Blob Filesystem driver (ABFS): A dedicated Azure Storage driver for Hadoop
1515

16-
One of the primary access methods for data in Azure Data Lake Storage Gen2 Preview is via the [Hadoop FileSystem](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html). Azure Data Lake Storage Gen2 features an associated driver, the Azure Blob File System driver or `ABFS`. ABFS is part of Apache Hadoop and is included in many of the commercial distributions of Hadoop. Using this driver, many applications and frameworks can access data in Data Lake Storage Gen2 without any code explicitly referencing the Data Lake Storage Gen2 service.
16+
One of the primary access methods for data in Azure Data Lake Storage Gen2 Preview is via the [Hadoop FileSystem](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html). Data Lake Storage Gen2 allows users of Azure Blob Storage access to a new driver, the Azure Blob File System driver or `ABFS`. ABFS is part of Apache Hadoop and is included in many of the commercial distributions of Hadoop. Using this driver, many applications and frameworks can access data in Azure Blob Storage without any code explicitly referencing Data Lake Storage Gen2.
1717

1818
## Prior capability: The Windows Azure Storage Blob driver
1919

20-
The Windows Azure Storage Blob driver or [WASB driver](https://hadoop.apache.org/docs/current/hadoop-azure/index.html) provided the original support for Azure Storage Blobs. This driver performed the complex task of mapping file system semantics (as required by the Hadoop FileSystem interface) to that of the object store style interface exposed by Azure Blob Storage. This driver continues to support this model, providing high performance access to data stored in Blobs, but contains a significant amount of code performing this mapping, making it difficult to maintain. Additionally, some operations such as [FileSystem.rename()](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d) and [FileSystem.delete()](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_deletePath_p_boolean_recursive) when applied to directories require the driver to perform a vast number of operations (due to object stores lack of support for directories) which often leads to degraded performance. The new Azure Data Lake Storage service was designed to overcome the inherent deficiencies of WASB.
20+
The Windows Azure Storage Blob driver or [WASB driver](https://hadoop.apache.org/docs/current/hadoop-azure/index.html) provided the original support for Azure Blob Storage. This driver performed the complex task of mapping file system semantics (as required by the Hadoop FileSystem interface) to that of the object store style interface exposed by Azure Blob Storage. This driver continues to support this model, providing high performance access to data stored in Blobs, but contains a significant amount of code performing this mapping, making it difficult to maintain. Additionally, some operations such as [FileSystem.rename()](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_renamePath_src_Path_d) and [FileSystem.delete()](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_deletePath_p_boolean_recursive) when applied to directories require the driver to perform a vast number of operations (due to object stores lack of support for directories) which often leads to degraded performance. The ABFS driver was designed to overcome the inherent deficiencies of WASB.
2121

2222
## The Azure Blob File System driver
2323

articles/storage/data-lake-storage/handle-data-using-databricks.md

Lines changed: 3 additions & 3 deletions
@@ -12,7 +12,7 @@ ms.date: 06/27/2018
1212
---
1313
# Tutorial: Extract, transform, and load data using Azure Databricks
1414

15-
In this tutorial, you perform an ETL (extract, transform, and load data) operation to move data from Azure Data Lake Storage Gen2 Preview to Azure SQL Data Warehouse, using Azure Databricks.
15+
In this tutorial, you perform an ETL (extract, transform, and load data) operation to move data from an Azure Storage account with Azure Data Lake Storage Gen2 enabled, to Azure SQL Data Warehouse, using Azure Databricks.
1616

1717
The following illustration shows the application flow:
1818

@@ -46,7 +46,7 @@ Sign in to the [Azure portal](https://portal.azure.com/).
4646

4747
## Create an Azure Databricks workspace
4848

49-
In this section, you create an Azure Databricks workspace using the Azure portal.
49+
In this section, you create an Azure Databricks workspace using the Azure portal.
5050

5151
1. In the Azure portal, select **Create a resource** > **Analytics** > **Azure Databricks**.
5252

@@ -132,7 +132,7 @@ The next step is to upload a sample data file to the storage account to later tr
132132

133133
2. Next, you upload the sample data into your storage account. The method you use to upload data into your storage account differs depending on whether you have the hierarchical namespace enabled.
134134

135-
If the hierarchical namespace is enabled on your Azure Storage account created for Gen2 account, you can use Azure Data Factory, distp, or AzCopy (version 10) to handle the upload. AzCopy version 10 is only available to preview customers. To use AzCopy pase in the following code into a command window:
135+
If the hierarchical namespace is enabled on your Azure Storage account, you can use Azure Data Factory, DistCp, or AzCopy (version 10) to handle the upload. AzCopy version 10 is only available in preview at this time. To use AzCopy, paste the following code into a command window:
136136

137137
```bash
138138
set ACCOUNT_NAME=<ACCOUNT_NAME>

articles/storage/data-lake-storage/introduction.md

Lines changed: 7 additions & 7 deletions
@@ -12,15 +12,15 @@ ms.component: data-lake-storage-gen2
1212

1313
# Introduction to Azure Data Lake Storage Gen2 Preview
1414

15-
Azure Data Lake Storage Gen2 Preview is a set of capabilities dedicated to big data analytics, built on top of [Azure Blob storage](../blobs/storage-blobs-introduction.md). It allows you to interface with your data using both file system and object storage paradigms. This makes Data Lake Storage Gen2 the only cloud-based multi-modal storage service, allowing you to extract analytics value from all of your data.
15+
Azure Data Lake Storage Gen2 Preview is a set of capabilities dedicated to big data analytics, built into [Azure Blob storage](../blobs/storage-blobs-introduction.md). It allows you to interface with your data using both file system and object storage paradigms. The addition of Data Lake Storage Gen2 makes Azure Storage the only cloud-based multi-modal platform, allowing you to extract analytics value from all of your data.
1616

17-
Data Lake Storage Gen2 features all qualities that are required for the full lifecycle of analytics data. This results from converging the capabilities of our two existing storage services. Features from [Azure Data Lake Storage Gen1](../../data-lake-store/index.md), such as file system semantics, file-level security and scale are combined with low-cost, tiered storage, high availability/disaster recovery capabilities and a large SDK/tooling ecosystem from [Azure Blob storage](../blobs/storage-blobs-introduction.md). In Data Lake Storage Gen2, all the qualities of object storage remain while adding the advantages of a file system interface optimized for analytics workloads.
17+
Data Lake Storage Gen2 brings all the qualities that are required for the full lifecycle of analytics data to Azure Storage. It is the result of converging the capabilities of our two existing storage services, Azure Blob Storage and Azure Data Lake Storage Gen1. Features from [Azure Data Lake Storage Gen1](../../data-lake-store/index.md), such as file system semantics, file-level security and scale are combined with low-cost, tiered storage, high availability/disaster recovery capabilities from [Azure Blob storage](../blobs/storage-blobs-introduction.md).
1818

1919
## Designed for enterprise big data analytics
2020

21-
Data Lake Storage Gen2 is the foundational storage service for building enterprise data lakes (EDL) on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 gives you an easy way to manage massive amounts of data.
21+
Data Lake Storage Gen2 makes Azure storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.
2222

23-
A fundamental feature of Data Lake Storage Gen2 is the addition of a [hierarchical namespace](./namespace.md) to the Blob storage service which organizes objects/files into a hierarchy of directories for performant data access. The hierarchical namespace also enables Data Lake Storage Gen2 to support both object store and file system paradigms at the same time. For instance, a common object store naming convention uses slashes in the name to mimic a hierarchical folder structure. This structure becomes real with Data Lake Storage Gen2. Operations such as renaming or deleting a directory become single atomic metadata operations on the directory rather than enumerating and processing all objects that share the name prefix of the directory.
23+
A fundamental part of Data Lake Storage Gen2 is the addition of a [hierarchical namespace](./namespace.md) to the Blob storage service which organizes objects/files into a hierarchy of directories for efficient data access. The hierarchical namespace also enables Data Lake Storage Gen2 to support both object store and file system paradigms at the same time. For instance, a common object store naming convention uses slashes in the name to mimic a hierarchical folder structure. This structure becomes real with Data Lake Storage Gen2. Operations such as renaming or deleting a directory become single atomic metadata operations on the directory rather than enumerating and processing all objects that share the name prefix of the directory.
2424

2525
In the past, cloud-based analytics had to compromise in areas of performance, management, and security. Data Lake Storage Gen2 addresses each of these aspects in the following ways:
2626

@@ -40,9 +40,9 @@ In the past, cloud-based analytics had to compromise in areas of performance, ma
4040
4141
- **Hadoop compatible access**: Data Lake Storage Gen2 allows you to manage and access data just as you would with a [Hadoop Distributed File System (HDFS)](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html). The new [ABFS driver](./abfs-driver.md) is available within all Apache Hadoop environments, including [Azure HDInsight](../../hdinsight/index.yml) and [Azure Databricks](../../azure-databricks/index.yml) to access data stored in Data Lake Storage Gen2.
4242

43-
- **A superset of POSIX permissions**: The security model for Data Lake Gen2 fully supports ACL and POSIX permissions along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through admin tools or through frameworks like Hive and Spark.
43+
- **A superset of POSIX permissions**: The security model for Data Lake Gen2 supports ACL and POSIX permissions along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through admin tools or through frameworks like Hive and Spark.
4444

45-
- **Cost effective**: Data Lake Storage Gen2 features low-cost storage capacity and transactions. As data transitions through its complete lifecycle, billing rates change keeping costs to a minimum via built-in features such as [Azure Blob storage lifecycle](../common/storage-lifecycle-managment-concepts.md).
45+
- **Cost effective**: Data Lake Storage Gen2 offers low-cost storage capacity and transactions. As data transitions through its complete lifecycle, billing rates change keeping costs to a minimum via built-in features such as [Azure Blob storage lifecycle](../common/storage-lifecycle-managment-concepts.md).
4646

4747
- **Works with Blob storage tools, frameworks, and apps**: Data Lake Storage Gen2 continues to work with a wide array of tools, frameworks, and applications that exist today for Blob storage.
4848

@@ -65,4 +65,4 @@ The following articles describe some of the main concepts of Data Lake Storage G
6565
* [Hierarchical namespace](./namespace.md)
6666
* [Create a storage account](./quickstart-create-account.md)
6767
* [Create an HDInsight cluster with Azure Data Lake Storage Gen2](./quickstart-create-connect-hdi-cluster.md)
68-
* [Use an Azure Data Lake Storage Gen2 account in Azure Databricks](./quickstart-create-databricks-account.md)
68+
* [Use an Azure Data Lake Storage Gen2 account in Azure Databricks](./quickstart-create-databricks-account.md)

articles/storage/data-lake-storage/namespace.md

Lines changed: 2 additions & 2 deletions
@@ -12,7 +12,7 @@ ms.component: data-lake-storage-gen2
1212

1313
# Azure Data Lake Storage Gen2 Preview hierarchical namespace
1414

15-
A key mechanism that allows Azure Data Lake Storage Gen2 Preview to provide file system performance at object storage scale and prices is the addition of a **hierarchical namespace**. This allows the collection of objects/files within an account to be organized into a hierarchy of directories and nested subdirectories in the same way that the file system on your computer is organized. With the hierarchical namespace enabled, Data Lake Storage Gen2 provides the scalability and cost-effectiveness of object storage, with file system semantics that are familiar to analytics engines and frameworks.
15+
A key mechanism that allows Azure Data Lake Storage Gen2 Preview to provide file system performance at object storage scale and prices is the addition of a **hierarchical namespace**. This allows the collection of objects/files within an account to be organized into a hierarchy of directories and nested subdirectories in the same way that the file system on your computer is organized. With the hierarchical namespace enabled, a storage account becomes capable of providing the scalability and cost-effectiveness of object storage, with file system semantics that are familiar to analytics engines and frameworks.
1616

1717
## The benefits of the hierarchical namespace
1818

@@ -21,7 +21,7 @@ A key mechanism that allows Azure Data Lake Storage Gen2 Preview to provide file
2121
2222
The following benefits are associated with file systems that implement a hierarchical namespace over blob data:
2323

24-
- **Atomic Directory Manipulation:** Object stores approximate a directory hierarchy by adopting a convention of embedding slashes (/) in the object name to denote path segments. While this convention works for organizing objects, the convention provides no assistance for actions like moving, renaming or deleting directories. Without real directories, applications must process potentially millions of individual blobs to achieve directory-level tasks. By contrast, the hierarchical namespace processes these tasks by updating a single entry (the parent directory).
24+
- **Atomic directory manipulation:** Object stores approximate a directory hierarchy by adopting a convention of embedding slashes (/) in the object name to denote path segments. While this convention works for organizing objects, the convention provides no assistance for actions like moving, renaming or deleting directories. Without real directories, applications must process potentially millions of individual blobs to achieve directory-level tasks. By contrast, the hierarchical namespace processes these tasks by updating a single entry (the parent directory).
2525

2626
This dramatic optimization is especially significant for many big data analytics frameworks. Tools like Hive, Spark, etc. often write output to temporary locations and then rename the location at the conclusion of the job. Without the hierarchical namespace, this rename can often take longer than the analytics process itself. Lower job latency equals lower total cost of ownership (TCO) for analytics workloads.
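
The difference between prefix-based and real directories can be illustrated with a minimal Python sketch. This is a toy in-memory model, not the Azure Storage implementation: the class names `FlatStore` and `HierarchicalStore`, the `rename_dir` method, and the sample paths are all hypothetical. It only counts how many entries each approach has to touch when a job renames its temporary output directory.

```python
class FlatStore:
    """Toy object store: directories exist only as '/'-separated name prefixes."""

    def __init__(self, names):
        self.blobs = {name: b"" for name in names}

    def rename_dir(self, src, dst):
        """Rename by rewriting every blob whose name starts with the prefix."""
        prefix = src.rstrip("/") + "/"
        new_prefix = dst.rstrip("/") + "/"
        touched = 0
        # Materialize the matching names first so the dict can be mutated safely.
        for name in [n for n in self.blobs if n.startswith(prefix)]:
            self.blobs[new_prefix + name[len(prefix):]] = self.blobs.pop(name)
            touched += 1
        return touched


class HierarchicalStore:
    """Toy hierarchical namespace: each directory is a dict held by its parent."""

    def __init__(self, names):
        self.root = {}
        for name in names:
            *dirs, leaf = name.split("/")
            node = self.root
            for d in dirs:
                node = node.setdefault(d, {})
            node[leaf] = b""

    def rename_dir(self, src, dst):
        """Rename by updating one entry in the parent directory.

        Simplifying assumption: src and dst share the same parent directory.
        """
        *parents, old = src.strip("/").split("/")
        node = self.root
        for d in parents:
            node = node[d]
        new = dst.strip("/").split("/")[-1]
        node[new] = node.pop(old)  # the children move with their directory
        return 1


# A job writes 1000 part files to a temporary location, then renames it.
names = [f"jobs/_temporary/part-{i:05d}" for i in range(1000)]
flat = FlatStore(names)
tree = HierarchicalStore(names)

flat_ops = flat.rename_dir("jobs/_temporary", "jobs/output")
tree_ops = tree.rename_dir("jobs/_temporary", "jobs/output")
print(f"flat namespace touched {flat_ops} entries; hierarchical touched {tree_ops}")
```

In this model the flat store does work proportional to the number of objects under the prefix, while the hierarchical store updates one parent entry regardless of directory size, which is the effect the paragraph above describes for job-commit renames.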
2727
