Commit 3453088

committed
Update data lake
1 parent 28c0735 commit 3453088

1 file changed

+32
-2
lines changed

Concepts/Data Lake.md

Lines changed: 32 additions & 2 deletions
@@ -4,8 +4,38 @@ Tags: [seedling]
44
publish: true
55
---
66

7-
> A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics - from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
8-
> - [AWS, What is a data lake?](https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/)
7+
A data lake is a storage pattern typically used to hold massive amounts of raw data in its native format.[^1] Data lakes are flexible in that they can store practically any type of data: structured (tabular data), semi-structured (JSON, XML), and unstructured (videos, images, audio). Data lakes pair blob storage, which is cheap and abundant, with a compute engine of the user's choice.
8+
9+
```mermaid
10+
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
11+
graph LR
12+
    A1((Structured Data)) --> B[(Blob Storage)]
13+
    A2((Semi-structured Data))-->B
14+
    A3((Unstructured Data))-->B
15+
    B -->D[Data Engineer]
16+
    B -->E[Data Scientist]
17+
    B -->F[Machine Learning Engineer]
18+
```
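
The flow in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary local directory stands in for blob storage, and `put_object`, `read_structured`, `read_semi_structured`, and `demo` are hypothetical helper names, not part of any cloud SDK. The point it tries to show is that objects are stored as-is and a schema is applied only when a consumer reads them.

```python
import csv
import json
import tempfile
from pathlib import Path


def put_object(lake_root: Path, key: str, payload: bytes) -> Path:
    """Store raw bytes under a key, as-is, with no schema applied."""
    target = lake_root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_structured(lake_root: Path, key: str) -> list:
    """Interpret an object as CSV; the schema is applied at read time."""
    with (lake_root / key).open(newline="") as handle:
        return list(csv.DictReader(handle))


def read_semi_structured(lake_root: Path, key: str) -> dict:
    """Interpret an object as JSON."""
    return json.loads((lake_root / key).read_text())


def demo():
    # A temporary directory is an assumed stand-in for blob storage.
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        put_object(lake, "raw/orders/orders.csv", b"order_id,amount\n1,9.50\n2,20.00\n")
        put_object(lake, "raw/events/click.json", b'{"user": "a1", "action": "click"}')
        put_object(lake, "raw/images/logo.bin", bytes([0x89, 0x50, 0x4E, 0x47]))

        # Each consumer picks its own way to read the same store.
        orders = read_structured(lake, "raw/orders/orders.csv")
        event = read_semi_structured(lake, "raw/events/click.json")
        image_size = (lake / "raw/images/logo.bin").stat().st_size
        return orders, event, image_size


if __name__ == "__main__":
    print(demo())
```

In a real deployment the local directory would be replaced by an object store and the readers by whichever compute engine suits the workload; the sketch only mirrors the store-first, interpret-later idea.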
19+
20+
## Data Lake Advantages
21+
22+
- Cheaply store large amounts of data.
23+
- Flexibly store any type of data (future-proof).
24+
- Optimize compute by mixing and matching compute options for different workloads.
25+
- All data is stored in one place where all stakeholders can work.
26+
27+
## Data Lake Disadvantages
28+
29+
- Data governance is more challenging and relies on robust cataloging and metadata to make the data useful. Cloud providers often offer additional services to address these issues.
30+
- Because storage is cheap, there's a tendency to store more data regardless of its business value.
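
The cataloging point above can be made concrete with a small sketch. It assumes an in-memory dictionary as the catalog and uses hypothetical names (`CatalogEntry`, `find_uncataloged`); managed catalog services work differently, so treat this as an illustration of the idea only: objects without metadata are the ones that become hard to govern or use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Minimal metadata for one object in the lake (hypothetical shape)."""
    key: str
    data_format: str
    owner: str
    description: str


def find_uncataloged(object_keys, catalog):
    """Return the keys that exist in storage but have no catalog entry."""
    return sorted(key for key in object_keys if key not in catalog)


# Hypothetical example data: three stored objects, one of them undocumented.
stored_keys = [
    "raw/orders/orders.csv",
    "raw/events/click.json",
    "raw/tmp/export_final_v2.bin",
]
catalog = {
    "raw/orders/orders.csv": CatalogEntry(
        "raw/orders/orders.csv", "csv", "sales-team", "Daily order export"
    ),
    "raw/events/click.json": CatalogEntry(
        "raw/events/click.json", "json", "web-team", "Clickstream events"
    ),
}

uncataloged = find_uncataloged(stored_keys, catalog)
```

Running a check like this on a schedule is one possible way to notice data that was stored because storage is cheap but that nobody has described or claimed.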
31+
32+
## Data Lake Reference Architectures
33+
34+
- AWS: [Deploy and manage a serverless data lake on the AWS Cloud by using infrastructure as code](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/deploy-and-manage-a-serverless-data-lake-on-the-aws-cloud-by-using-infrastructure-as-code.html?did=pg_card&trk=pg_card)
35+
- Azure: [Introduction to Azure Data Lake Storage Gen2](https://learn.microsoft.com/en-us/training/modules/introduction-to-azure-data-lake-storage/)
36+
- GCP: [Data lakes in cloud with Kafka and Confluent](https://cloud.google.com/blog/products/data-analytics/data-lakes-in-cloud-with-kafka-and-confluent)
37+
38+
[^1]: [AWS, What is a data lake?](https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/)
939

1040
%% wiki footer: Please don't edit anything below this line %%
1141
