Commit b83032f

committed
closes #98, closes #99, closes #101
1 parent acc9ce0 commit b83032f

4 files changed

+174
-18
lines changed

Concepts/Data Ingestion/Data Ingestion.md

Lines changed: 39 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -28,6 +28,45 @@ Common data sources include:
2828

2929
### 2. Ingestion Patterns
3030

31+
#### Extract, Transform, Load (ETL)
32+
33+
ETL is a traditional ingestion pattern in which data is extracted from a source, transformed during the ingestion process, and then loaded into the destination.
34+
```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    A[Data Source]
    B[Extract]
    C[Transform<br/>Data Validation<br/>Business Rules<br/>Cleaning]
    D[Load]
    E[(Data Warehouse)]

    A -->|Raw data| B
    B -->|Extracted data| C
    C -->|Clean data| D
    D -->|Structured data| E
```
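
The ETL flow in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`extract`, `transform`, `load`, `run_etl`), the record shape, and the in-memory list standing in for the warehouse are all hypothetical and not part of the wiki. The point it shows is that validation and cleaning happen before anything reaches the destination.

```python
from typing import Iterable


def extract(source: Iterable[dict]) -> list[dict]:
    """Pull raw records from the source (here, any iterable of dicts)."""
    return list(source)


def transform(records: list[dict]) -> list[dict]:
    """Validate and clean records before they are loaded."""
    cleaned = []
    for record in records:
        # Data validation: drop records that have no id.
        if record.get("id") is None:
            continue
        # Cleaning: trim whitespace and normalise capitalisation.
        name = record.get("name", "").strip().title()
        cleaned.append({"id": record["id"], "name": name})
    return cleaned


def load(records: list[dict], warehouse: list[dict]) -> None:
    """Append the already-clean records to the destination."""
    warehouse.extend(records)


def run_etl(source: Iterable[dict], warehouse: list[dict]) -> None:
    load(transform(extract(source)), warehouse)


# Hypothetical sample data; the list stands in for a data warehouse table.
source_rows = [{"id": 1, "name": "  ada lovelace "}, {"id": None, "name": "ghost"}]
warehouse: list[dict] = []
run_etl(source_rows, warehouse)
print(warehouse)
```

Because the invalid record is dropped during `transform`, the destination never sees it; recovering it later would mean going back to the source.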
49+
50+
#### Extract, Load, Transform (ELT)
51+
52+
ELT is a more recent ingestion pattern in which raw data is extracted and loaded directly into the destination, then transformed within the destination system. ELT has become the more popular pattern because storage is cheap and keeping the raw data leaves more flexibility for future data use cases.
53+
```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    A[Data Source]
    B[Extract]
    C[Load]
    D[Transform<br/>In destination]
    E[(Data Warehouse/Lake)]

    A -->|Raw data| B
    B -->|Extracted data| C
    C -->|Raw data| E
    E -->|Stored data| D
    D -->|Transformed data| E
```
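
For contrast, here is a minimal ELT sketch in Python. It assumes an in-memory SQLite database as a stand-in for the warehouse or lake; the table names (`raw_users`, `users`) and the sample rows are hypothetical. Raw rows are loaded untouched, and the transformation then runs as SQL inside the destination.

```python
import sqlite3


def extract() -> list[tuple]:
    """Return raw rows from a hypothetical source, unmodified."""
    return [(1, "  ada lovelace "), (None, "ghost"), (2, "GRACE HOPPER")]


def load_raw(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Load rows as-is, including ones that would fail validation."""
    conn.execute("CREATE TABLE raw_users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO raw_users VALUES (?, ?)", rows)


def transform_in_destination(conn: sqlite3.Connection) -> None:
    """Build a cleaned table with SQL that runs inside the destination."""
    conn.execute(
        """
        CREATE TABLE users AS
        SELECT id, lower(trim(name)) AS name
        FROM raw_users
        WHERE id IS NOT NULL
        """
    )


conn = sqlite3.connect(":memory:")
load_raw(conn, extract())
transform_in_destination(conn)

raw_count = conn.execute("SELECT count(*) FROM raw_users").fetchone()[0]
users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(raw_count, users)
```

The raw table still holds every source row after the transform, so a different cleaning rule can be applied later without re-extracting.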
69+
3170
#### [[Batch Data Processing|Batch Ingestion]]
3271

3372
Data is collected and processed in discrete chunks at scheduled intervals.
@@ -195,25 +234,7 @@ graph LR
195234
D -->|Load processed data| E
196235
```
197236

198-
## Common Data Ingestion Challenges
199-
200-
### Scalability
201-
202-
- Volume Growth: Handling increasing data volumes
203-
- Source System Impact: Minimizing load on operational systems
204-
- Resource Management: Efficiently using compute and storage resources
205-
206-
### Reliability
207-
208-
- Source System Downtime: Handling unavailable data sources
209-
- Network Issues: Managing connectivity problems
210-
- Data Consistency: Ensuring data integrity across systems
211-
212-
### Complexity
213237

214-
- Schema Evolution: Handling changes in source data structures
215-
- Multiple Sources: Managing diverse data sources and formats
216-
- Dependency Management: Coordinating ingestion across related datasets
217238

218239
%% wiki footer: Please don't edit anything below this line %%
219240

Lines changed: 40 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,40 @@
1+
---
2+
Aliases:
3+
- Concepts/Data Management
4+
Tags:
5+
- seedling
6+
publish: true
7+
---
8+
9+
Data Management is the practice of collecting, organizing, protecting, and storing data in a way that enables efficient access, analysis, and decision-making throughout its entire lifecycle. It encompasses the policies, procedures, and technologies used to ensure data is accurate, available, secure, and compliant with regulations while meeting business requirements.
10+
11+
## Data Management Components
12+
13+
### 1. [[Data Governance]]
14+
15+
Data Governance establishes the policies, procedures, and standards for managing data across an organization.
16+
17+
### 2. [[Data Quality Management]]
18+
19+
Data quality management ensures that data is accurate, complete, consistent, and fit for its intended use.
20+
21+
#placeholder
22+
23+
### 3. [[Data Catalog]]
24+
25+
Data cataloging creates a centralized inventory of data assets with metadata to improve discoverability and understanding.
26+
27+
### 4. [[Data Security]]
28+
29+
Data security protects data from unauthorized access, corruption, and theft throughout its lifecycle.
30+
31+
#placeholder
32+
33+
%% wiki footer: Please don't edit anything below this line %%
34+
35+
## This note in GitHub
36+
37+
<span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Concepts/Data%20Management/Data%20Management.md "git-hub-edit-note") | [Copy this note](https://raw.githubusercontent.com/data-engineering-community/data-engineering-wiki/main/Concepts/Data%20Management/Data%20Management.md "git-hub-copy-note")</span>
38+
39+
<span class="git-footer">Was this page helpful?
40+
[👍](https://tally.so/r/mOaxjk?rating=Yes&url=https://dataengineering.wiki/Concepts/Data%20Management/Data%20Management) or [👎](https://tally.so/r/mOaxjk?rating=No&url=https://dataengineering.wiki/Concepts/Data%20Management/Data%20Management)</span>
Lines changed: 40 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,40 @@
1+
---
2+
Aliases:
3+
- Concepts/Data Processing
4+
Tags:
5+
- seedling
6+
publish: true
7+
---
8+
9+
Data Processing is the act of transforming raw data into meaningful, actionable information. It involves collecting, manipulating, filtering, sorting, and analyzing data to extract insights, support decision-making, and enable business operations. Data processing focuses on what happens to data after it has been ingested into your systems.
10+
11+
## Data Processing Components
12+
13+
### 1. Processing Systems
14+
15+
- [[Online Transaction Processing|OLTP (Online Transaction Processing)]]
16+
- [[Online Analytical Processing|OLAP (Online Analytical Processing)]]
17+
- [[Hybrid Transactional Analytical Processing|HTAP (Hybrid Transactional Analytical Processing)]]
18+
19+
### 2. Processing Execution Models
20+
21+
- [[Batch Data Processing|Batch Processing]]
22+
- [[Stream Data Processing|Stream Processing]]
23+
- [[Micro-batch Processing]]
24+
25+
### 3. [[Workflow Orchestration]]
26+
27+
Workflow orchestration schedules and coordinates processing jobs.
28+
29+
### 4. Processing Architectures
30+
31+
![[Data Architecture#Popular Data Architecture Patterns]]
32+
33+
%% wiki footer: Please don't edit anything below this line %%
34+
35+
## This note in GitHub
36+
37+
<span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Concepts/Data%20Processing/Data%20Processing.md "git-hub-edit-note") | [Copy this note](https://raw.githubusercontent.com/data-engineering-community/data-engineering-wiki/main/Concepts/Data%20Processing/Data%20Processing.md "git-hub-copy-note")</span>
38+
39+
<span class="git-footer">Was this page helpful?
40+
[👍](https://tally.so/r/mOaxjk?rating=Yes&url=https://dataengineering.wiki/Concepts/Data%20Processing/Data%20Processing) or [👎](https://tally.so/r/mOaxjk?rating=No&url=https://dataengineering.wiki/Concepts/Data%20Processing/Data%20Processing)</span>
Lines changed: 55 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,55 @@
1+
---
2+
Aliases: [Concepts/Data Storage]
3+
Tags: [incubating]
4+
publish: true
5+
---
6+
7+
This page contains an overview of the technologies and systems used to store and retrieve data in various formats and structures. Modern data storage can be fundamentally divided into two categories: **Databases** (managed storage with built-in compute) and **Object Storage** (raw storage that requires external compute).
8+
9+
## 1. Databases (Storage + Compute)
10+
11+
[[Database|Databases]] provide both storage and built-in compute capabilities with structured query interfaces.
12+
13+
### [[Relational Database]]
14+
15+
A relational database is a traditional structured data store that organizes data into tables of rows and columns and provides ACID properties.
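
As a small illustration of the atomicity part of ACID, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `accounts` table, the `transfer` helper, and the balances are hypothetical. A transfer that would overdraw an account violates a `CHECK` constraint, and the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money between accounts; either both updates apply or neither."""
    try:
        # The connection context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 150)  # alice only has 100
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```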
16+
17+
### Non-Relational (NoSQL) Databases
18+
19+
NoSQL databases store data in flexible formats such as documents, key-value pairs, graphs, or columns, enabling scalability and schema-less design for diverse data types.
20+
21+
![[Non-relational Database#Types of Non-relational Databases]]
22+
23+
## 2. [[Object/Blob Storage]]
24+
25+
Object storage provides raw data persistence without built-in compute, so external processing engines are required to query or transform the stored data.
26+
```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph TB
    A[Applications]
    B[Object Storage API]

    subgraph "Object Storage"
        C[Bucket/Container]
        D[Objects/Files]
        E[Metadata]
    end

    A -->|PUT/GET/DELETE| B
    B --> C
    C --> D
    C --> E

    F[External Compute] -->|Process files| D
```
46+
See the **data stores** category for examples and popular tools.
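
The object storage model in the diagram can be mimicked with a toy in-memory bucket. This is a sketch only: the `Bucket` class, its `put`/`get`/`delete` methods, and the `count_lines` function are hypothetical stand-ins, not the API of any real object store. It shows that the store itself only persists bytes and metadata under keys, while any processing is done by code that lives outside it.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Toy bucket: a flat key space of objects plus per-object metadata."""

    objects: dict[str, bytes] = field(default_factory=dict)
    metadata: dict[str, dict[str, str]] = field(default_factory=dict)

    def put(self, key: str, data: bytes, **meta: str) -> None:
        self.objects[key] = data
        self.metadata[key] = dict(meta)

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def delete(self, key: str) -> None:
        self.objects.pop(key, None)
        self.metadata.pop(key, None)


def count_lines(bucket: Bucket, prefix: str) -> int:
    """External compute: reads objects through get() and processes them."""
    return sum(
        len(bucket.get(key).splitlines())
        for key in bucket.objects
        if key.startswith(prefix)
    )


bucket = Bucket()
bucket.put("logs/2024-01-01.txt", b"a\nb\n", content_type="text/plain")
bucket.put("logs/2024-01-02.txt", b"c\n", content_type="text/plain")
bucket.put("images/cat.png", b"\x89PNG", content_type="image/png")

total = count_lines(bucket, "logs/")
bucket.delete("images/cat.png")
print(total, sorted(bucket.objects))
```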
47+
48+
%% wiki footer: Please don't edit anything below this line %%
49+
50+
## This note in GitHub
51+
52+
<span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Concepts/Data%20Storage/Data%20Storage.md "git-hub-edit-note") | [Copy this note](https://raw.githubusercontent.com/data-engineering-community/data-engineering-wiki/main/Concepts/Data%20Storage/Data%20Storage.md "git-hub-copy-note")</span>
53+
54+
<span class="git-footer">Was this page helpful?
55+
[👍](https://tally.so/r/mOaxjk?rating=Yes&url=https://dataengineering.wiki/Concepts/Data%20Storage/Data%20Storage) or [👎](https://tally.so/r/mOaxjk?rating=No&url=https://dataengineering.wiki/Concepts/Data%20Storage/Data%20Storage)</span>
