Commit 1aa3e92

Merging Training module content into the Data Lake Storage Gen2 overview
1 parent 9c514f8 commit 1aa3e92

1 file changed

+26
-33
lines changed

articles/storage/blobs/data-lake-storage-introduction.md

Lines changed: 26 additions & 33 deletions
@@ -18,64 +18,57 @@ Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data anal
1818

1919
Data Lake Storage Gen2 converges the capabilities of [Azure Data Lake Storage Gen1](../../data-lake-store/index.yml) with Azure Blob Storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you also get low-cost, tiered storage, with high availability/disaster recovery capabilities.
2020

21-
## What is a data lake?
21+
## Data lakes, Data Lake Storage, and Gen2
2222

23-
A *data lake* is a single, centralized repository where you can store all your data, both structured and unstructured. A data lake enables your organization to quickly and more easily store, access, and analyze a wide variety of data in a single location. With a data lake, you don't need to conform your data to fit an existing structure. Instead, you can store your data in its raw or native format, usually as files or as binary large objects (blobs).
23+
A _data lake_ is a single, centralized repository where you can store all your data, both structured and unstructured. A data lake enables your organization to quickly and more easily store, access, and analyze a wide variety of data in a single location. With a data lake, you don't need to conform your data to fit an existing structure. Instead, you can store your data in its raw or native format, usually as files or as binary large objects (blobs).
2424

25-
When evaluating whether a data lake is the correct solution for your company, you should consider several elements as described in the following table.
25+
_Azure Data Lake Storage_ is a cloud-based, enterprise data lake solution. It's engineered to store massive amounts of data in any format, and to facilitate big data analytical workloads. You use it to capture data of any type and ingestion speed in a single location for easy access and analysis using various frameworks.
2626

27-
| **Element** | **Description** |
28-
| --- | --- |
29-
| **Data speed** | A data lake must be able to ingest data at any speed: from the occasional file, to large relational data imports, to real-time data generated by web server logs or IoT devices. |
30-
| **Data scalability** | A data lake might be required to store massive amounts of data that arrive in real time. Thus, the storage must be highly scalable to keep up with demand. |
31-
| **Data availability** | After the data is stored in a data lake, it must be readily available via browsing, searching, and indexing. |
32-
| **Data security** | Most data lakes store crucial data assets, including line-of-business (LOB) data, company-developed apps, and productivity output. The data lake requires robust security to protect these assets. |
33-
| **Data analytics** | A data lake must store data in a way that enables users to use their preferred tools to analyze the data in place. Business analysts, data scientists, and AI modelers need to use their own tools to derive business intelligence, insights, trends, and forecasts. |
34-
| | |
27+
_Azure Data Lake Storage Gen2_ refers to the current implementation of Azure's Data Lake Storage solution. The previous implementation, _Azure Data Lake Storage Gen1_, is scheduled to be retired on February 29, 2024. Unlike Data Lake Storage Gen1, Data Lake Storage Gen2 isn't a dedicated service or account type. Instead, it's implemented as a set of capabilities that you use with the Blob Storage service of your Azure Storage account.
3528

36-
## Azure Data Lake Storage definition
29+
Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.
3730

38-
*Azure Data Lake Storage* is a cloud-based, enterprise data lake solution. It's engineered to store massive amounts of data in any format, and to facilitate big data analytical workloads. You use it to capture data of any type and ingestion speed in a single location for easy access and analysis using various frameworks. The current implementation of Azure Data Lake Storage is Azure Data Lake Storage Gen2 and it is not a dedicated service. Data Lake Storage Gen2 is implemented as a set of capabilities that are built on top of the Azure Blob Storage service. The previous implementation, named Azure Data Lake Storage Gen1, is a dedicated service separate from Azure Storage. This service is scheduled to be retired on February 29, 2024.
3931

40-
## Designed for enterprise big data analytics
4132

42-
Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.
33+
## Data Lake Storage Gen2 capabilities
4334

44-
A fundamental part of Data Lake Storage Gen2 is the addition of a [hierarchical namespace](data-lake-storage-namespace.md) to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access. A common object store naming convention uses slashes in the name to mimic a hierarchical directory structure. This structure becomes real with Data Lake Storage Gen2. Operations such as renaming or deleting a directory, become single atomic metadata operations on the directory. There's no need to enumerate and process all objects that share the name prefix of the directory.
35+
This section describes Data Lake Storage Gen2 capabilities. You can unlock these capabilities in your Azure Storage account by enabling the hierarchical namespace setting.
4536

46-
Data Lake Storage Gen2 builds on Blob storage and enhances performance, management, and security in the following ways:
37+
> [!NOTE]
38+
> The hierarchical namespace setting is *not* enabled by default. When you create a storage account, you can select the **Enable Hierarchical Namespace** checkbox. You can also enable the hierarchical namespace for an existing account by selecting the **Data Lake Gen2 Migration** setting available in the Azure portal.
4739
48-
- **Performance** is optimized because you don't need to copy or transform data as a prerequisite for analysis. Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.
40+
#### Hadoop-compatible access
4941

50-
- **Management** is easier because you can organize and manipulate files through directories and subdirectories.
42+
Azure Data Lake Storage Gen2 is primarily designed to work with Hadoop and all frameworks that use the Apache [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) as their data access layer. Hadoop distributions include the [Azure Blob File System (ABFS)](data-lake-storage-abfs-driver.md) driver, which enables many applications and frameworks to access Azure Blob Storage data directly. The ABFS driver is [optimized specifically](data-lake-storage-abfs-driver.md) for big data analytics. The corresponding REST APIs are surfaced through the endpoint `dfs.core.windows.net`.
5143

52-
- **Security** is enforceable because you can define POSIX permissions on directories or individual files.
44+
Data analysis frameworks that use HDFS as their data access layer can directly access Azure Data Lake Storage Gen2 data through ABFS. The Apache Spark analytics engine and the Presto SQL query engine are examples of such frameworks. See [Azure services that support Azure Data Lake Storage Gen2](data-lake-storage-supported-azure-services.md).
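As a rough illustration of how frameworks address data through the `dfs.core.windows.net` endpoint, the following sketch builds and parses an ABFS-style URI of the form `abfss://<container>@<account>.dfs.core.windows.net/<path>`. The helper names `build_abfs_uri` and `parse_abfs_uri`, and the account and container values, are hypothetical and exist only for this example. The code doesn't connect to Azure.

```python
from urllib.parse import urlparse

# Endpoint suffix named in the article; everything else here is illustrative.
DFS_ENDPOINT_SUFFIX = "dfs.core.windows.net"


def build_abfs_uri(account: str, container: str, path: str = "", secure: bool = True) -> str:
    """Build an ABFS URI (hypothetical helper, not part of any Azure SDK)."""
    scheme = "abfss" if secure else "abfs"  # abfss is the TLS variant of the scheme
    clean_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.{DFS_ENDPOINT_SUFFIX}/{clean_path}"


def parse_abfs_uri(uri: str) -> dict:
    """Split an ABFS URI into scheme, account, container, and path."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri}")
    container, sep, host = parsed.netloc.partition("@")
    suffix = "." + DFS_ENDPOINT_SUFFIX
    if not sep or not host.endswith(suffix):
        raise ValueError(f"unexpected authority in URI: {uri}")
    return {
        "scheme": parsed.scheme,
        "account": host[: -len(suffix)],
        "container": container,
        "path": parsed.path.lstrip("/"),
    }


uri = build_abfs_uri("myaccount", "raw", "/logs/2024/app.json")
print(uri)
print(parse_abfs_uri(uri))
```

A framework such as Apache Spark would typically receive a URI of this shape as the location of its input or output data, provided that the ABFS driver and credentials are configured in that environment.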
5345

54-
Also, Data Lake Storage Gen2 is very cost effective because it's built on top of the low-cost [Azure Blob Storage](storage-blobs-introduction.md). The extra features further lower the total cost of ownership for running big data analytics on Azure.
46+
#### Hierarchical directory structure
5547

56-
## Data Lake Storage Gen2 capabilities
48+
The [hierarchical namespace](data-lake-storage-namespace.md) is a key feature that enables Azure Data Lake Storage Gen2 to provide high-performance data access at object storage scale and price. You can use this feature to organize all the objects and files within your storage account into a hierarchy of directories and nested subdirectories. In other words, your Azure Data Lake Storage Gen2 data is organized in much the same way that files are organized on your computer.
49+
50+
Operations such as renaming or deleting a directory become single atomic metadata operations on the directory. There's no need to enumerate and process all objects that share the name prefix of the directory.
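To make the difference concrete, here's a toy in-memory model, not the actual service implementation. `FlatStore` treats directories as name prefixes, so a rename must touch every object under the prefix. `HierarchicalStore` keeps real directory nodes, so a rename moves a single entry in the parent directory. Both class names and the returned operation counts are assumptions made for illustration.

```python
class FlatStore:
    """Flat namespace: each object is a full key; directories are only prefixes."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def rename_directory(self, old_prefix, new_prefix):
        """Rewrite every key under the prefix; return the number of keys touched."""
        old = old_prefix.rstrip("/") + "/"
        new = new_prefix.rstrip("/") + "/"
        moved = [key for key in self.objects if key.startswith(old)]
        for key in moved:
            self.objects[new + key[len(old):]] = self.objects.pop(key)
        return len(moved)


class Directory:
    def __init__(self):
        self.children = {}  # name -> Directory or file contents (bytes)


class HierarchicalStore:
    """Hierarchical namespace: directories are real nodes in a tree."""

    def __init__(self):
        self.root = Directory()

    def _walk(self, parts, create=False):
        node = self.root
        for part in parts:
            if part not in node.children:
                if not create:
                    raise KeyError(part)
                node.children[part] = Directory()
            node = node.children[part]
        return node

    def put(self, path, data):
        *dirs, name = path.strip("/").split("/")
        self._walk(dirs, create=True).children[name] = data

    def get(self, path):
        *dirs, name = path.strip("/").split("/")
        return self._walk(dirs).children[name]

    def rename_directory(self, old_path, new_path):
        """Move one entry between parent directories; return entries touched."""
        *old_dirs, old_name = old_path.strip("/").split("/")
        *new_dirs, new_name = new_path.strip("/").split("/")
        old_parent = self._walk(old_dirs)
        new_parent = self._walk(new_dirs, create=True)
        new_parent.children[new_name] = old_parent.children.pop(old_name)
        return 1


flat, hier = FlatStore(), HierarchicalStore()
for i in range(1000):
    flat.put(f"logs/2023/file{i}.json", b"x")
    hier.put(f"logs/2023/file{i}.json", b"x")

print(flat.rename_directory("logs/2023", "archive/2023"))  # keys rewritten
print(hier.rename_directory("logs/2023", "archive/2023"))  # directory entries moved
```

In the flat model, the cost of the rename grows with the number of objects under the prefix. In the hierarchical model, it stays constant regardless of how many files the directory holds.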
5751

58-
Add those paragraphs from the learn module
52+
#### Optimized cost and performance
5953

60-
- **Hadoop compatible access:** Data Lake Storage Gen2 allows you to manage and access data just as you would with a [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html). The [ABFS driver](data-lake-storage-abfs-driver.md) (used to access data) is available within all Apache Hadoop environments. These environments include [Azure HDInsight](../../hdinsight/index.yml)*,* [Azure Databricks](/azure/databricks/), and [Azure Synapse Analytics](../../synapse-analytics/index.yml).
54+
Azure Data Lake Storage Gen2 is priced at Azure Blob Storage levels. It builds on Azure Blob Storage capabilities such as automated lifecycle policy management and object-level tiering to manage big data storage costs. A hierarchical namespace provides the scalability and cost-effectiveness of object storage.
6155

62-
- **A superset of POSIX permissions:** The security model for Data Lake Gen2 supports ACL and POSIX permissions along with some extra granularity specific to Data Lake Storage Gen2. Settings can be configured by using Storage Explorer, the Azure portal, PowerShell, Azure CLI, REST APIs, Azure Storage SDKs, or by using frameworks like Hive and Spark.
56+
Performance is optimized because you don't need to copy or transform data as a prerequisite for analysis. The hierarchical namespace capability of Azure Data Lake Storage allows for efficient access and navigation. This architecture means that data processing requires fewer computational resources, reducing both the time and cost of accessing data.
6357

64-
- **Cost-effective:** Data Lake Storage Gen2 offers low-cost storage capacity and transactions. Features such as [Azure Blob Storage lifecycle](./lifecycle-management-overview.md) optimize costs as data transitions through its lifecycle.
58+
#### Finer-grained security model
6559

66-
- **Optimized driver:** The ABFS driver is [optimized specifically](data-lake-storage-abfs-driver.md) for big data analytics. The corresponding REST APIs are surfaced through the endpoint `dfs.core.windows.net`.
60+
The Azure Data Lake Storage Gen2 access control model supports both Azure role-based access control (RBAC) and Portable Operating System Interface for UNIX (POSIX) access control lists (ACLs). There are also a few extra security settings that are specific to Azure Data Lake Storage Gen2. You can set permissions either at the directory level or at the file level. All stored data is encrypted at rest by using either Microsoft-managed or customer-managed encryption keys.
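The following sketch shows how a POSIX-style ACL in the short text form (for example, `user::rwx,group::r-x,other::---`) might be evaluated for a file or directory. It's a simplified model under stated assumptions: the `mask` handling follows general POSIX ACL behavior, the names `parse_acl` and `is_allowed` are hypothetical, and Azure RBAC, which the service evaluates alongside ACLs, is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    scope: str      # "user", "group", "mask", or "other"
    qualifier: str  # empty for owner, owning group, mask, and other
    perms: str      # three characters, for example "r-x"


def parse_acl(text):
    """Parse a short-form ACL string into entries (illustrative parser only)."""
    entries = []
    for raw in text.split(","):
        scope, qualifier, perms = raw.strip().split(":")
        if scope not in {"user", "group", "mask", "other"} or len(perms) != 3:
            raise ValueError(f"bad ACL entry: {raw}")
        entries.append(AclEntry(scope, qualifier, perms))
    return entries


def _masked(perms, mask):
    """Keep a permission only where the mask also grants it."""
    return "".join(p if m != "-" else "-" for p, m in zip(perms, mask))


def is_allowed(entries, user, groups, owner, owning_group, wanted):
    """Return True if `user` gets the `wanted` permission ("r", "w", or "x")."""
    lookup = {(e.scope, e.qualifier): e.perms for e in entries}
    mask = lookup.get(("mask", ""), "rwx")
    if user == owner:
        # The owner entry isn't limited by the mask.
        return wanted in lookup.get(("user", ""), "---")
    if ("user", user) in lookup:
        return wanted in _masked(lookup[("user", user)], mask)
    group_perms = []
    if owning_group in groups and ("group", "") in lookup:
        group_perms.append(lookup[("group", "")])
    group_perms += [lookup[("group", g)] for g in groups if ("group", g) in lookup]
    if group_perms:
        return any(wanted in _masked(p, mask) for p in group_perms)
    return wanted in lookup.get(("other", ""), "---")


acl = parse_acl("user::rwx,user:alice:rw-,group::r-x,mask::r--,other::---")
print(is_allowed(acl, "bob", set(), "bob", "analysts", "w"))
print(is_allowed(acl, "alice", set(), "bob", "analysts", "w"))
print(is_allowed(acl, "carol", {"analysts"}, "bob", "analysts", "r"))
```

Because the same entry format applies to directories and to individual files, a check like this can run at either level, which matches the directory-level and file-level permissions that the preceding paragraph describes.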
6761

68-
### Scalability
62+
#### Massive scalability
6963

70-
Azure Storage is scalable by design whether you access via Data Lake Storage Gen2 or Blob storage interfaces. It's able to store and serve *many exabytes of data*. This amount of storage is available with throughput measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS). Processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.
64+
Azure Storage can store and serve many exabytes of data. This amount of storage is available with throughput measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS).
7165

72-
### Cost effectiveness
66+
Azure Data Lake Storage Gen2 offers massive storage and accepts numerous data types for analytics. It doesn't impose any limits on account sizes, file sizes, or the amount of data that can be stored in the data lake. Individual files can have sizes that range from a few kilobytes (KBs) to a few petabytes (PBs). Processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.
7367

74-
Because Data Lake Storage Gen2 is built on top of Azure Blob Storage, storage capacity and transaction costs are lower. Unlike other cloud storage services, you don't have to move or transform your data before you can analyze it. For more information about pricing, see [Azure Storage pricing](https://azure.microsoft.com/pricing/details/storage).
68+
This design means that Azure Data Lake Storage Gen2 can easily and quickly scale up to meet the most demanding workloads. It can also just as easily scale back down when demand drops.
7569

76-
Additionally, features such as the [hierarchical namespace](data-lake-storage-namespace.md) significantly improve the overall performance of many analytics jobs. This improvement in performance means that you require less compute power to process the same amount of data, resulting in a lower total cost of ownership (TCO) for the end-to-end analytics job.
7770

78-
### One service, multiple concepts
71+
## One service, multiple concepts
7972

8073
Because Data Lake Storage Gen2 is built on top of Azure Blob Storage, multiple concepts can describe the same, shared things.
8174
