
Commit 40c8e5f

Refactor data ingestion
1 parent a745a2b commit 40c8e5f


4 files changed: +302 -13 lines changed

Concepts/Data Ingestion/Change Data Capture.md

Lines changed: 33 additions & 5 deletions
@@ -1,14 +1,42 @@
 ---
-Aliases: [CDC, log-based CDC, Concepts/Change Data Capture]
-Tags: [incubating]
+Aliases:
+  - CDC
+  - log-based CDC
+  - Concepts/Change Data Capture
+Tags:
+  - evergreen
 publish: true
 ---
 
-Change data capture describes the process of recording the change of data in a database. Typically, this means tracking when records are inserted, updated, and deleted along with the data itself.
+Change data capture (CDC) is a specialized incremental ingestion technique that captures changes from database transaction logs using CDC software. This means tracking when records are inserted, updated, and deleted, along with the data itself and optionally events such as schema changes. It is a widely used technique because of its efficiency and minimal impact on source systems. ^overview-cdc
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    subgraph Source
+        A[(Database)]
+        B[Transaction Log]
+        A --> B
+    end
+
+    subgraph CDC Process
+        C[CDC Tool]
+        D[Change Events]
+        B -->|Read log| C
+        C --> D
+    end
+
+    subgraph Target
+        E[(Data Warehouse)]
+        D -->|Apply changes| E
+    end
+```
+^overview-cdc-diagram
 
 ## Change Data Capture Advantages
-- Better use of bandwidth
-- Can keep historical data changes
+- Real-time or near real-time data replication
+- Minimal impact on source systems
+- Captures all types of changes (INSERT, UPDATE, DELETE) and often schema changes as well
 
 ## Change Data Capture Disadvantages
 - More complex to set up than [[Full Load|full loads]] or [[Delta Load|delta loads]]
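
To make the "Apply changes" step above concrete, here is a minimal illustrative sketch in Python (not part of the committed file): it replays hypothetical insert/update/delete events against a target table held as a dict keyed by primary key. The event format is invented for illustration and does not correspond to any particular CDC tool.

```python
from typing import Any

# Hypothetical change events, shaped roughly like what a CDC tool emits after reading the log.
change_events = [
    {"op": "insert", "id": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "id": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "id": 2, "row": None},
]


def apply_changes(target: dict[int, dict[str, Any]], events: list[dict]) -> None:
    """Replay insert/update/delete events against a target table (a dict keyed by primary key)."""
    for event in events:
        if event["op"] in ("insert", "update"):
            target[event["id"]] = event["row"]
        elif event["op"] == "delete":
            target.pop(event["id"], None)


if __name__ == "__main__":
    warehouse_table = {2: {"id": 2, "status": "new"}}  # state already in the warehouse
    apply_changes(warehouse_table, change_events)
    print(warehouse_table)  # {1: {'id': 1, 'status': 'shipped'}}
```
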
Concepts/Data Ingestion/Data Ingestion.md

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
+---
+Aliases:
+  - Concepts/Data Ingestion
+Tags:
+  - evergreen
+publish: true
+---
+
+Data ingestion is the process of extracting and importing data from various sources into a destination system where it can be stored, transformed, and analyzed. It commonly involves moving data from operational systems, external sources, or real-time streams into data storage systems like data warehouses and data lakes.
+
+Data ingestion can be categorized into two main approaches: [[Batch Data Processing|batch ingestion]] (processing data at scheduled intervals) and [[Stream Data Processing|real-time/streaming ingestion]] (processing data continuously as it arrives).
+
+## Data Ingestion Components
+
+Data ingestion consists of a few key components that work together to reliably move data from sources to destinations:
+
+### 1. Data Sources
+
+Common data sources include:
+
+- **Databases**: Operational databases (PostgreSQL, MySQL, SQL Server)
+- **Applications**: SaaS platforms, CRM systems (HubSpot, Salesforce), ERP systems
+- **Files**: CSV, JSON, XML, Parquet files from SFTP/FTP servers or cloud storage
+- **APIs**: REST APIs, GraphQL endpoints, webhooks
+- **Message Queues**: Kafka, RabbitMQ, Amazon SQS
+- **Streaming Sources**: IoT devices, clickstreams, social media feeds
+- **Cloud Services**: AWS S3, Google Cloud Storage, Azure Blob Storage
+
+### 2. Ingestion Patterns
+
+#### [[Batch Data Processing|Batch Ingestion]]
+
+Data is collected and processed in discrete chunks at scheduled intervals.
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[Source System]
+    B["Scheduler (Airflow/Cron)"]
+    C[Batch Extraction Script]
+    D[(Staging Area)]
+    E[(Data Warehouse/Data Lake)]
+
+    B -->|Trigger at interval| C
+    C -->|Extract data| A
+    A -->|Data files| C
+    C -->|Write batch| D
+    D -->|Load batch| E
+```
+
+Characteristics:
+
+- Higher latency (minutes to hours)
+- More efficient for large volumes
+- Easiest to implement and debug
+- Lower infrastructure costs
+
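
As an illustration of the batch pattern described above (a sketch, not part of the committed file), the following script plays the role of the batch extraction script in the diagram; SQLite files and a local `staging/` directory stand in for the source system, staging area, and warehouse, and the source is assumed to contain an `orders` table:

```python
import csv
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

STAGING_DIR = Path("staging")   # stands in for a staging area / object store
SOURCE_DB = "source.db"         # stands in for the operational source system
WAREHOUSE_DB = "warehouse.db"   # stands in for the data warehouse


def extract_batch() -> Path:
    """Extract the full orders table from the source and write it as a staged CSV."""
    STAGING_DIR.mkdir(exist_ok=True)
    batch_file = STAGING_DIR / f"orders_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.csv"
    with sqlite3.connect(SOURCE_DB) as conn:
        rows = conn.execute("SELECT id, customer_id, amount, modified_at FROM orders")
        with batch_file.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "customer_id", "amount", "modified_at"])
            writer.writerows(rows)
    return batch_file


def load_batch(batch_file: Path) -> None:
    """Load the staged CSV into the warehouse table."""
    with batch_file.open(newline="") as f, sqlite3.connect(WAREHOUSE_DB) as conn:
        reader = csv.DictReader(f)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, modified_at TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO orders VALUES (:id, :customer_id, :amount, :modified_at)",
            reader,
        )


if __name__ == "__main__":
    # A scheduler such as Airflow or cron would invoke this script at each interval.
    load_batch(extract_batch())
```
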
+#### [[Stream Data Processing|Streaming Ingestion]]
+
+Data is processed continuously in real time as it arrives.
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[Source System]
+    B[Event Producer]
+    C["Message Broker<br/>(Kafka/Kinesis/PubSub)"]
+    D["Stream Processor<br/>(Flink/Spark Streaming)"]
+    E[(Data Warehouse/Data Lake)]
+
+    A -->|Generate events| B
+    B -->|Send events| C
+    C -->|Stream events| D
+    D -->|Process & transform| E
+```
+
+Characteristics:
+
+- Low latency (seconds to milliseconds)
+- More complex to implement
+- Higher infrastructure costs
+- Enables real-time analytics
+
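
To sketch the streaming flow without depending on a real broker (illustrative only, not part of the committed file), the program below uses an in-process `queue.Queue` as a stand-in for the message broker and a consumer loop as a stand-in for the stream processor; the event shape and the `load_to_warehouse` sink are hypothetical:

```python
import json
import queue
import threading
import time
from datetime import datetime, timezone

# A thread-safe queue stands in for a message broker (Kafka/Kinesis/PubSub).
broker: "queue.Queue[str]" = queue.Queue()


def producer() -> None:
    """Emit one clickstream-style event per second, as a source system would."""
    for i in range(5):
        event = {"event_id": i, "page": "/home", "ts": datetime.now(timezone.utc).isoformat()}
        broker.put(json.dumps(event))
        time.sleep(1)


def load_to_warehouse(event: dict) -> None:
    # In a real pipeline this would write to a warehouse or lake sink; here we just print.
    print("loaded", event)


def consumer() -> None:
    """Continuously read events, transform them, and hand them to the sink."""
    while True:
        try:
            raw = broker.get(timeout=3)   # block until an event arrives
        except queue.Empty:
            break                         # no events for a while; stop the demo
        event = json.loads(raw)
        event["processed_at"] = datetime.now(timezone.utc).isoformat()
        load_to_warehouse(event)


threading.Thread(target=producer, daemon=True).start()
consumer()
```
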
+#### Micro-batch Ingestion
+
+A hybrid approach that processes small batches of data at frequent intervals.
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[Source System]
+    B["Scheduler (Airflow/Cron)"]
+    C[Micro-batch Extraction Script]
+    D[(Staging Area)]
+    E[(Data Warehouse/Data Lake)]
+
+    B -->|Trigger every few minutes| C
+    C -->|Extract recent data| A
+    A -->|New/changed data| C
+    C -->|Write micro-batch| D
+    D -->|Load micro-batch| E
+```
+
+Characteristics:
+
+- **Near** real-time processing (typically 5-15 minutes)
+- Balances latency and efficiency
+- Easier to implement than true streaming
+- Good for most use cases
+
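
A minimal sketch of the micro-batch loop described above (illustrative only, not part of the committed file): a recurring trigger pulls only records newer than a watermark and advances the watermark after each load; the in-memory `source_records` list stands in for a real source:

```python
import time
from datetime import datetime, timezone

# Hypothetical in-memory "source" of (modified_at, payload) records for illustration.
source_records: list[tuple[str, str]] = []


def extract_since(watermark: str) -> list[tuple[str, str]]:
    """Return only records modified after the last successfully loaded timestamp."""
    return [r for r in source_records if r[0] > watermark]


def run_micro_batches(interval_seconds: int = 10, cycles: int = 3) -> None:
    watermark = "1970-01-01T00:00:00+00:00"
    for _ in range(cycles):
        batch = extract_since(watermark)
        if batch:
            print(f"loading {len(batch)} new/changed records")  # stand-in for the load step
            watermark = max(ts for ts, _ in batch)               # advance the watermark
        time.sleep(interval_seconds)                             # stand-in for the scheduler


if __name__ == "__main__":
    source_records.append((datetime.now(timezone.utc).isoformat(), "order created"))
    run_micro_batches(interval_seconds=1)
```
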
+### 3. Data Ingestion Strategies
+
+#### [[Full Load|Full Load]]
+
+![[Full Load#^overview-full-load]]
+![[Full Load#^overview-full-load-diagram]]
+
+#### [[Delta Load|Incremental Load]]
+
+![[Delta Load#^overview-delta-load]]
+![[Delta Load#^overview-delta-load-diagram]]
+
+#### [[Change Data Capture|Change Data Capture (CDC)]]
+
+![[Change Data Capture#^overview-cdc]]
+![[Change Data Capture#^overview-cdc-diagram]]
+
+## Data Ingestion Examples
+
+Common examples of data ingestion patterns.
+
+### API Data Ingestion Example
+
+Ingesting data from a REST API on a scheduled basis:
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[External API]
+    B[Scheduler: Airflow/Cron]
+    C[Python Script]
+    D[(Data Lake)]
+
+    B -->|Trigger every hour| C
+    C -->|HTTP Request| A
+    A -->|JSON Response| C
+    C -->|Store data| D
+```
+
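
An illustrative version of the Python script shown in the diagram (not part of the committed file), using only the standard library; the endpoint URL and the local `datalake/` directory are hypothetical stand-ins:

```python
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

API_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint
DATA_LAKE = Path("datalake/raw/orders")         # local folder standing in for the data lake


def ingest_once() -> Path:
    """Fetch the API payload and land it, unchanged, in a date-partitioned raw zone."""
    with urllib.request.urlopen(API_URL, timeout=30) as response:
        payload = json.load(response)

    now = datetime.now(timezone.utc)
    target = DATA_LAKE / f"dt={now:%Y-%m-%d}" / f"orders_{now:%H%M%S}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))
    return target


if __name__ == "__main__":
    # An hourly Airflow task or cron entry would call this function on schedule.
    print("landed", ingest_once())
```
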
+### Database Replication Example
+
+Real-time replication from an operational database to an analytics database:
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    subgraph Production
+        A[(PostgreSQL)]
+        B[WAL Logs]
+        A --> B
+    end
+
+    subgraph Ingestion
+        C[Debezium Connector]
+        D[Apache Kafka]
+        B --> C
+        C --> D
+    end
+
+    subgraph Analytics
+        E[Kafka Connect]
+        F[(Data Warehouse)]
+        D --> E
+        E --> F
+    end
+```
+
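
As a rough illustration of what a sink does with the change events flowing through Kafka above (not part of the committed file), the code below translates a simplified Debezium-style event envelope into a parameterized SQL statement; the sample event is trimmed to a few fields and the upsert uses SQLite syntax for brevity:

```python
import json

# A simplified change event in the style of a Debezium envelope (fields trimmed for illustration).
SAMPLE_EVENT = json.dumps({
    "payload": {
        "op": "u",                                        # c = insert, u = update, d = delete
        "before": {"id": 42, "email": "old@example.com"},
        "after": {"id": 42, "email": "new@example.com"},
        "source": {"table": "customers"},
    }
})


def to_sql(raw_event: str) -> tuple[str, tuple]:
    """Translate one change event into a parameterized SQL statement for the warehouse."""
    payload = json.loads(raw_event)["payload"]
    table = payload["source"]["table"]   # table names come from trusted connector config
    if payload["op"] in ("c", "u"):
        row = payload["after"]
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        return (f"INSERT OR REPLACE INTO {table} ({cols}) VALUES ({marks})", tuple(row.values()))
    if payload["op"] == "d":
        return (f"DELETE FROM {table} WHERE id = ?", (payload["before"]["id"],))
    raise ValueError(f"unsupported operation: {payload['op']}")


if __name__ == "__main__":
    # A Kafka consumer (e.g. Kafka Connect or a custom sink) would feed events in here.
    print(to_sql(SAMPLE_EVENT))
```
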
+### File-Based Ingestion Example
+
+Processing files dropped into cloud storage:
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[External System]
+    B[(Cloud Storage<br/>S3/GCS)]
+    C[Event Trigger]
+    D[Processing Function]
+    E[(Data Warehouse)]
+
+    A -->|Upload files| B
+    B -->|File arrival event| C
+    C -->|Trigger| D
+    D -->|Read and process| B
+    D -->|Load processed data| E
+```
+
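
A minimal sketch of the processing function from the diagram (illustrative only, not part of the committed file): a handler that a storage event trigger would invoke when a new file lands; a local CSV path and a SQLite database stand in for cloud storage and the warehouse, and the `uploads` table is hypothetical:

```python
import csv
import sqlite3
from pathlib import Path

WAREHOUSE_DB = "warehouse.db"   # stands in for the analytical destination


def handle_file_arrival(file_path: str) -> int:
    """Handler invoked by a storage event trigger when a new CSV lands.

    A cloud deployment would receive a bucket/key from the event payload and read
    from object storage; here a local path stands in for that.
    """
    with Path(file_path).open(newline="") as f:
        rows = list(csv.DictReader(f))
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS uploads (id TEXT PRIMARY KEY, amount REAL)")
        conn.executemany("INSERT OR REPLACE INTO uploads VALUES (:id, :amount)", rows)
    return len(rows)


if __name__ == "__main__":
    Path("incoming.csv").write_text("id,amount\na1,10.5\nb2,3.0\n")
    print("loaded rows:", handle_file_arrival("incoming.csv"))
```
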
+## Common Data Ingestion Challenges
+
+### Scalability
+
+- Volume Growth: Handling increasing data volumes
+- Source System Impact: Minimizing load on operational systems
+- Resource Management: Efficiently using compute and storage resources
+
+### Reliability
+
+- Source System Downtime: Handling unavailable data sources
+- Network Issues: Managing connectivity problems
+- Data Consistency: Ensuring data integrity across systems
+
+### Complexity
+
+- Schema Evolution: Handling changes in source data structures
+- Multiple Sources: Managing diverse data sources and formats
+- Dependency Management: Coordinating ingestion across related datasets
+
+%% wiki footer: Please don't edit anything below this line %%
+
+## This note in GitHub
+
+<span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Concepts/Data%20Ingestion/Data%20Ingestion.md "git-hub-edit-note") | [Copy this note](https://raw.githubusercontent.com/data-engineering-community/data-engineering-wiki/main/Concepts/Data%20Ingestion/Data%20Ingestion.md "git-hub-copy-note")</span>
+
+<span class="git-footer">Was this page helpful?
+[👍](https://tally.so/r/mOaxjk?rating=Yes&url=https://dataengineering.wiki/Concepts/Data%20Ingestion/Data%20Ingestion) or [👎](https://tally.so/r/mOaxjk?rating=No&url=https://dataengineering.wiki/Concepts/Data%20Ingestion/Data%20Ingestion)</span>

Concepts/Data Ingestion/Delta Load.md

Lines changed: 30 additions & 5 deletions
@@ -1,17 +1,42 @@
 ---
-Aliases: [incremental load, query-based CDC, Concepts/Delta Load]
-Tags: [seedling]
+Aliases:
+  - incremental load
+  - query-based CDC
+  - Concepts/Delta Load
+Tags:
+  - evergreen
 publish: true
 ---
 
-A delta load refers to extracting only the data that has changed since the last time the extract process has run. The most commonly used steps to perform a delta load are:
-
+A delta load (or incremental load) refers to extracting only the data that has changed since the last time the extract process ran. This process is typically query-based and requires an incrementing id or timestamp column that can be used to determine new records. ^overview-delta-load
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph TD
+    subgraph S1 [Initial Load]
+        direction LR
+        A1[(Source<br/>100,000 records)] -->|Extract all records| B1[Ingestion Process]
+        B1 -->|Load all records| C1[(Destination<br/>100,000 records)]
+    end
+
+    subgraph S2 [Subsequent Runs]
+        direction LR
+        A2[(Destination<br/>100,000 records)] -->|"Query for MAX(modified_at)<br/>from Destination"| D[Latest Timestamp]
+        D -->|"Query source using timestamp to filter"| A[(Source<br/>100,500 records)]
+        A -->|Load 500 new/changed records| B[Ingestion Process]
+        B --> C[(Destination<br/>100,500 records)]
+    end
+
+    S1 --> S2
+```
+^overview-delta-load-diagram
+
+The most commonly used steps to perform a delta load are:
 1. Ensure there is a `modified_at` timestamp or incremental id column such as a primary key on the data source.
 2. On the initial run of the pipeline, do a full load of the dataset.
 3. On following runs of the pipeline, query the target dataset using `MAX(column_name)`.
 4. Query the source dataset and filter records where values are greater than the value from step 3.
 
-
 ## Delta Load Advantages
 
 - More resource efficient
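
The four steps above map directly onto a small script; the sketch below (not part of the committed file) uses SQLite databases as stand-ins for the source and destination and assumes a hypothetical `orders(id, amount, modified_at)` table:

```python
import sqlite3

SOURCE_DB = "source.db"    # operational system stand-in; assumed to have an orders table
DEST_DB = "warehouse.db"   # destination stand-in

DDL = "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)"


def delta_load() -> int:
    with sqlite3.connect(DEST_DB) as dest, sqlite3.connect(SOURCE_DB) as source:
        dest.execute(DDL)

        # Step 3: find the high-water mark already present in the destination.
        (watermark,) = dest.execute("SELECT MAX(modified_at) FROM orders").fetchone()

        # Steps 2 and 4: on the first run the watermark is NULL, so everything is extracted;
        # afterwards only rows modified since the last run are pulled.
        rows = source.execute(
            "SELECT id, amount, modified_at FROM orders WHERE modified_at > COALESCE(?, '')",
            (watermark,),
        ).fetchall()

        dest.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
        return len(rows)


if __name__ == "__main__":
    print("new/changed rows loaded:", delta_load())
```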

Concepts/Data Ingestion/Full Load.md

Lines changed: 14 additions & 3 deletions
@@ -1,10 +1,21 @@
 ---
-Aliases: [Destructive Load, Concepts/Full Load]
-Tags: [seedling]
+Aliases:
+  - Destructive Load
+  - Concepts/Full Load
+Tags:
+  - evergreen
 publish: true
 ---
+With a full load, the entire dataset is dumped, or loaded, and is then completely replaced (i.e., deleted and replaced) with the new, updated dataset. No additional information, such as timestamps, is required. ^overview-full-load
 
-With a full load, the entire dataset is dumped, or loaded, and is then completely replaced (i.e., deleted and replaced) with the new, updated dataset. No additional information, such as timestamps, is required.
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[(Source<br/>100,000 records)] -->|Extract all records| B[Ingestion Process]
+    B -->|Load/overwrite all records| C[(Destination<br/>100,000 records)]
+```
+
+^overview-full-load-diagram
 
 ## Full Load Advantages
 
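
For comparison with the delta load sketch above, a truncate-and-reload version of the same idea (not part of the committed file); SQLite databases stand in for the source and destination and the `customers` table is hypothetical:

```python
import sqlite3

SOURCE_DB = "source.db"    # stand-in for the source; assumed to expose a customers table
DEST_DB = "warehouse.db"   # stand-in for the destination


def full_load() -> None:
    with sqlite3.connect(SOURCE_DB) as source, sqlite3.connect(DEST_DB) as dest:
        rows = source.execute("SELECT id, name FROM customers").fetchall()

        # Destructive load: drop whatever is there and replace it with the fresh extract.
        dest.execute("DROP TABLE IF EXISTS customers")
        dest.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
        dest.executemany("INSERT INTO customers VALUES (?, ?)", rows)


if __name__ == "__main__":
    full_load()
```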

0 commit comments
