
Commit 40c8e5f

Refactor data ingestion
1 parent a745a2b commit 40c8e5f


4 files changed: +302 -13 lines changed

Concepts/Data Ingestion/Change Data Capture.md

Lines changed: 33 additions & 5 deletions
@@ -1,14 +1,42 @@
 ---
-Aliases: [CDC, log-based CDC, Concepts/Change Data Capture]
-Tags: [incubating]
+Aliases:
+  - CDC
+  - log-based CDC
+  - Concepts/Change Data Capture
+Tags:
+  - evergreen
 publish: true
 ---
 
-Change data capture describes the process of recording the change of data in a database. Typically, this means tracking when records are inserted, updated, and deleted along with the data itself.
+Change data capture (CDC) is a specialized incremental ingestion technique that captures changes from database transaction logs using CDC software. This means tracking when records are inserted, updated, and deleted, along with the data itself and optionally events such as schema changes. It is a widely used technique because of its efficiency and minimal impact on source systems. ^overview-cdc
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    subgraph Source
+        A[(Database)]
+        B[Transaction Log]
+        A --> B
+    end
+
+    subgraph CDC Process
+        C[CDC Tool]
+        D[Change Events]
+        B -->|Read log| C
+        C --> D
+    end
+
+    subgraph Target
+        E[(Data Warehouse)]
+        D -->|Apply changes| E
+    end
+```
+^overview-cdc-diagram
 
 ## Change Data Capture Advantages
-- Better use of bandwidth
-- Can keep historical data changes
+- Real-time or near real-time data replication
+- Minimal impact on source systems
+- Captures all types of changes (INSERT, UPDATE, DELETE) and often schema changes as well
 
 ## Change Data Capture Disadvantages
 - More complex to set up than [[Full Load|full loads]] or [[Delta Load|delta loads]]
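
To make the "Apply changes" step above concrete, here is a minimal illustrative sketch in Python (not part of the committed file): it replays hypothetical insert/update/delete events against a target table held as a dict keyed by primary key. The event format is invented for illustration and does not correspond to any particular CDC tool.

```python
from typing import Any

# Hypothetical change events, shaped roughly like what a CDC tool emits after reading the log.
change_events = [
    {"op": "insert", "id": 1, "row": {"id": 1, "status": "new"}},
    {"op": "update", "id": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "id": 2, "row": None},
]


def apply_changes(target: dict[int, dict[str, Any]], events: list[dict]) -> None:
    """Replay insert/update/delete events against a target table (a dict keyed by primary key)."""
    for event in events:
        if event["op"] in ("insert", "update"):
            target[event["id"]] = event["row"]
        elif event["op"] == "delete":
            target.pop(event["id"], None)


if __name__ == "__main__":
    warehouse_table = {2: {"id": 2, "status": "new"}}  # state already in the warehouse
    apply_changes(warehouse_table, change_events)
    print(warehouse_table)  # {1: {'id': 1, 'status': 'shipped'}}
```
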
Concepts/Data Ingestion/Data Ingestion.md

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
+---
+Aliases:
+  - Concepts/Data Ingestion
+Tags:
+  - evergreen
+publish: true
+---
+
+Data ingestion is the process of extracting and importing data from various sources into a destination system where it can be stored, transformed, and analyzed. It commonly involves moving data from operational systems, external sources, or real-time streams into data storage systems like data warehouses and data lakes.
+
+Data ingestion can be categorized into two main approaches: [[Batch Data Processing|batch ingestion]] (processing data at scheduled intervals) and [[Stream Data Processing|real-time/streaming ingestion]] (processing data continuously as it arrives).
+
+## Data Ingestion Components
+
+Data ingestion consists of a few key components that work together to reliably move data from sources to destinations:
+
+### 1. Data Sources
+
+Common data sources include:
+
+- **Databases**: Operational databases (PostgreSQL, MySQL, SQL Server)
+- **Applications**: SaaS platforms, CRM systems (HubSpot, Salesforce), ERP systems
+- **Files**: CSV, JSON, XML, Parquet files from SFTP/FTP servers or cloud storage
+- **APIs**: REST APIs, GraphQL endpoints, webhooks
+- **Message Queues**: Kafka, RabbitMQ, Amazon SQS
+- **Streaming Sources**: IoT devices, clickstreams, social media feeds
+- **Cloud Services**: AWS S3, Google Cloud Storage, Azure Blob Storage
+
+### 2. Ingestion Patterns
+
+#### [[Batch Data Processing|Batch Ingestion]]
+
+Data is collected and processed in discrete chunks at scheduled intervals.
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[Source System]
+    B["Scheduler (Airflow/Cron)"]
+    C[Batch Extraction Script]
+    D[(Staging Area)]
+    E[(Data Warehouse/Data Lake)]
+
+    B -->|Trigger at interval| C
+    C -->|Extract data| A
+    A -->|Data files| C
+    C -->|Write batch| D
+    D -->|Load batch| E
+```
+
+Characteristics:
+
+- Higher latency (minutes to hours)
+- More efficient for large volumes
+- Easiest to implement and debug
+- Lower infrastructure costs
+
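
As an illustration of the batch pattern described above (a sketch, not part of the committed file), the following script plays the role of the batch extraction script in the diagram; SQLite files and a local `staging/` directory stand in for the source system, staging area, and warehouse, and the source is assumed to contain an `orders` table:

```python
import csv
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

STAGING_DIR = Path("staging")   # stands in for a staging area / object store
SOURCE_DB = "source.db"         # stands in for the operational source system
WAREHOUSE_DB = "warehouse.db"   # stands in for the data warehouse


def extract_batch() -> Path:
    """Extract the full orders table from the source and write it as a staged CSV."""
    STAGING_DIR.mkdir(exist_ok=True)
    batch_file = STAGING_DIR / f"orders_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.csv"
    with sqlite3.connect(SOURCE_DB) as conn:
        rows = conn.execute("SELECT id, customer_id, amount, modified_at FROM orders")
        with batch_file.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "customer_id", "amount", "modified_at"])
            writer.writerows(rows)
    return batch_file


def load_batch(batch_file: Path) -> None:
    """Load the staged CSV into the warehouse table."""
    with batch_file.open(newline="") as f, sqlite3.connect(WAREHOUSE_DB) as conn:
        reader = csv.DictReader(f)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, modified_at TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO orders VALUES (:id, :customer_id, :amount, :modified_at)",
            reader,
        )


if __name__ == "__main__":
    # A scheduler such as Airflow or cron would invoke this script at each interval.
    load_batch(extract_batch())
```
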
+#### [[Stream Data Processing|Streaming Ingestion]]
+
+Data is processed continuously in real time as it arrives.
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[Source System]
+    B[Event Producer]
+    C["Message Broker<br/>(Kafka/Kinesis/PubSub)"]
+    D["Stream Processor<br/>(Flink/Spark Streaming)"]
+    E[(Data Warehouse/Data Lake)]
+
+    A -->|Generate events| B
+    B -->|Send events| C
+    C -->|Stream events| D
+    D -->|Process & transform| E
+```
+
+Characteristics:
+
+- Low latency (seconds to milliseconds)
+- More complex to implement
+- Higher infrastructure costs
+- Enables real-time analytics
+
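
To sketch the streaming flow without depending on a real broker (illustrative only, not part of the committed file), the program below uses an in-process `queue.Queue` as a stand-in for the message broker and a consumer loop as a stand-in for the stream processor; the event shape and the `load_to_warehouse` sink are hypothetical:

```python
import json
import queue
import threading
import time
from datetime import datetime, timezone

# A thread-safe queue stands in for a message broker (Kafka/Kinesis/PubSub).
broker: "queue.Queue[str]" = queue.Queue()


def producer() -> None:
    """Emit one clickstream-style event per second, as a source system would."""
    for i in range(5):
        event = {"event_id": i, "page": "/home", "ts": datetime.now(timezone.utc).isoformat()}
        broker.put(json.dumps(event))
        time.sleep(1)


def load_to_warehouse(event: dict) -> None:
    # In a real pipeline this would write to a warehouse or lake sink; here we just print.
    print("loaded", event)


def consumer() -> None:
    """Continuously read events, transform them, and hand them to the sink."""
    while True:
        try:
            raw = broker.get(timeout=3)   # block until an event arrives
        except queue.Empty:
            break                         # no events for a while; stop the demo
        event = json.loads(raw)
        event["processed_at"] = datetime.now(timezone.utc).isoformat()
        load_to_warehouse(event)


threading.Thread(target=producer, daemon=True).start()
consumer()
```
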
+#### Micro-batch Ingestion
+
+A hybrid approach that processes small batches of data at frequent intervals.
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[Source System]
+    B["Scheduler (Airflow/Cron)"]
+    C[Micro-batch Extraction Script]
+    D[(Staging Area)]
+    E[(Data Warehouse/Data Lake)]
+
+    B -->|Trigger every few minutes| C
+    C -->|Extract recent data| A
+    A -->|New/changed data| C
+    C -->|Write micro-batch| D
+    D -->|Load micro-batch| E
+```
+
+Characteristics:
+
+- **Near** real-time processing (typically 5-15 minutes)
+- Balances latency and efficiency
+- Easier to implement than true streaming
+- Good for most use cases
+
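
A minimal sketch of the micro-batch loop described above (illustrative only, not part of the committed file): a recurring trigger pulls only records newer than a watermark and advances the watermark after each load; the in-memory `source_records` list stands in for a real source:

```python
import time
from datetime import datetime, timezone

# Hypothetical in-memory "source" of (modified_at, payload) records for illustration.
source_records: list[tuple[str, str]] = []


def extract_since(watermark: str) -> list[tuple[str, str]]:
    """Return only records modified after the last successfully loaded timestamp."""
    return [r for r in source_records if r[0] > watermark]


def run_micro_batches(interval_seconds: int = 10, cycles: int = 3) -> None:
    watermark = "1970-01-01T00:00:00+00:00"
    for _ in range(cycles):
        batch = extract_since(watermark)
        if batch:
            print(f"loading {len(batch)} new/changed records")  # stand-in for the load step
            watermark = max(ts for ts, _ in batch)               # advance the watermark
        time.sleep(interval_seconds)                             # stand-in for the scheduler


if __name__ == "__main__":
    source_records.append((datetime.now(timezone.utc).isoformat(), "order created"))
    run_micro_batches(interval_seconds=1)
```
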
+### 3. Data Ingestion Strategies
+
+#### [[Full Load|Full Load]]
+
+![[Full Load#^overview-full-load]]
+![[Full Load#^overview-full-load-diagram]]
+
+#### [[Delta Load|Incremental Load]]
+
+![[Delta Load#^overview-delta-load]]
+![[Delta Load#^overview-delta-load-diagram]]
+
+#### [[Change Data Capture|Change Data Capture (CDC)]]
+
+![[Change Data Capture#^overview-cdc]]
+![[Change Data Capture#^overview-cdc-diagram]]
+
+## Data Ingestion Examples
+
+Common examples of data ingestion patterns.
+
+### API Data Ingestion Example
+
+Ingesting data from a REST API on a scheduled basis:
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[External API]
+    B[Scheduler: Airflow/Cron]
+    C[Python Script]
+    D[(Data Lake)]
+
+    B -->|Trigger every hour| C
+    C -->|HTTP Request| A
+    A -->|JSON Response| C
+    C -->|Store data| D
+```
+
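
An illustrative version of the Python script shown in the diagram (not part of the committed file), using only the standard library; the endpoint URL and the local `datalake/` directory are hypothetical stand-ins:

```python
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

API_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint
DATA_LAKE = Path("datalake/raw/orders")         # local folder standing in for the data lake


def ingest_once() -> Path:
    """Fetch the API payload and land it, unchanged, in a date-partitioned raw zone."""
    with urllib.request.urlopen(API_URL, timeout=30) as response:
        payload = json.load(response)

    now = datetime.now(timezone.utc)
    target = DATA_LAKE / f"dt={now:%Y-%m-%d}" / f"orders_{now:%H%M%S}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))
    return target


if __name__ == "__main__":
    # An hourly Airflow task or cron entry would call this function on schedule.
    print("landed", ingest_once())
```
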
+### Database Replication Example
+
+Real-time replication from an operational database to an analytics database:
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    subgraph Production
+        A[(PostgreSQL)]
+        B[WAL Logs]
+        A --> B
+    end
+
+    subgraph Ingestion
+        C[Debezium Connector]
+        D[Apache Kafka]
+        B --> C
+        C --> D
+    end
+
+    subgraph Analytics
+        E[Kafka Connect]
+        F[(Data Warehouse)]
+        D --> E
+        E --> F
+    end
+```
+
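
As a rough illustration of what a sink does with the change events flowing through Kafka above (not part of the committed file), the code below translates a simplified Debezium-style event envelope into a parameterized SQL statement; the sample event is trimmed to a few fields and the upsert uses SQLite syntax for brevity:

```python
import json

# A simplified change event in the style of a Debezium envelope (fields trimmed for illustration).
SAMPLE_EVENT = json.dumps({
    "payload": {
        "op": "u",                                        # c = insert, u = update, d = delete
        "before": {"id": 42, "email": "old@example.com"},
        "after": {"id": 42, "email": "new@example.com"},
        "source": {"table": "customers"},
    }
})


def to_sql(raw_event: str) -> tuple[str, tuple]:
    """Translate one change event into a parameterized SQL statement for the warehouse."""
    payload = json.loads(raw_event)["payload"]
    table = payload["source"]["table"]   # table names come from trusted connector config
    if payload["op"] in ("c", "u"):
        row = payload["after"]
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        return (f"INSERT OR REPLACE INTO {table} ({cols}) VALUES ({marks})", tuple(row.values()))
    if payload["op"] == "d":
        return (f"DELETE FROM {table} WHERE id = ?", (payload["before"]["id"],))
    raise ValueError(f"unsupported operation: {payload['op']}")


if __name__ == "__main__":
    # A Kafka consumer (e.g. Kafka Connect or a custom sink) would feed events in here.
    print(to_sql(SAMPLE_EVENT))
```
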
+### File-Based Ingestion Example
+
+Processing files dropped into cloud storage:
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[External System]
+    B[(Cloud Storage<br/>S3/GCS)]
+    C[Event Trigger]
+    D[Processing Function]
+    E[(Data Warehouse)]
+
+    A -->|Upload files| B
+    B -->|File arrival event| C
+    C -->|Trigger| D
+    D -->|Read and process| B
+    D -->|Load processed data| E
+```
+
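
A minimal sketch of the processing function from the diagram (illustrative only, not part of the committed file): a handler that a storage event trigger would invoke when a new file lands; a local CSV path and a SQLite database stand in for cloud storage and the warehouse, and the `uploads` table is hypothetical:

```python
import csv
import sqlite3
from pathlib import Path

WAREHOUSE_DB = "warehouse.db"   # stands in for the analytical destination


def handle_file_arrival(file_path: str) -> int:
    """Handler invoked by a storage event trigger when a new CSV lands.

    A cloud deployment would receive a bucket/key from the event payload and read
    from object storage; here a local path stands in for that.
    """
    with Path(file_path).open(newline="") as f:
        rows = list(csv.DictReader(f))
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS uploads (id TEXT PRIMARY KEY, amount REAL)")
        conn.executemany("INSERT OR REPLACE INTO uploads VALUES (:id, :amount)", rows)
    return len(rows)


if __name__ == "__main__":
    Path("incoming.csv").write_text("id,amount\na1,10.5\nb2,3.0\n")
    print("loaded rows:", handle_file_arrival("incoming.csv"))
```
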
+## Common Data Ingestion Challenges
+
+### Scalability
+
+- Volume Growth: Handling increasing data volumes
+- Source System Impact: Minimizing load on operational systems
+- Resource Management: Efficiently using compute and storage resources
+
+### Reliability
+
+- Source System Downtime: Handling unavailable data sources
+- Network Issues: Managing connectivity problems
+- Data Consistency: Ensuring data integrity across systems
+
+### Complexity
+
+- Schema Evolution: Handling changes in source data structures
+- Multiple Sources: Managing diverse data sources and formats
+- Dependency Management: Coordinating ingestion across related datasets
+
+%% wiki footer: Please don't edit anything below this line %%
+
+## This note in GitHub
+
+<span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Concepts/Data%20Ingestion/Data%20Ingestion.md "git-hub-edit-note") | [Copy this note](https://raw.githubusercontent.com/data-engineering-community/data-engineering-wiki/main/Concepts/Data%20Ingestion/Data%20Ingestion.md "git-hub-copy-note")</span>
+
+<span class="git-footer">Was this page helpful?
+[👍](https://tally.so/r/mOaxjk?rating=Yes&url=https://dataengineering.wiki/Concepts/Data%20Ingestion/Data%20Ingestion) or [👎](https://tally.so/r/mOaxjk?rating=No&url=https://dataengineering.wiki/Concepts/Data%20Ingestion/Data%20Ingestion)</span>

Concepts/Data Ingestion/Delta Load.md

Lines changed: 30 additions & 5 deletions
@@ -1,17 +1,42 @@
 ---
-Aliases: [incremental load, query-based CDC, Concepts/Delta Load]
-Tags: [seedling]
+Aliases:
+  - incremental load
+  - query-based CDC
+  - Concepts/Delta Load
+Tags:
+  - evergreen
 publish: true
 ---
 
-A delta load refers to extracting only the data that has changed since the last time the extract process has run. The most commonly used steps to perform a delta load are:
-
+A delta load (or incremental load) refers to extracting only the data that has changed since the last time the extract process ran. This process is typically query-based and requires an incrementing id or timestamp column that can be used to determine new records. ^overview-delta-load
+
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph TD
+    subgraph S1 [Initial Load]
+        direction LR
+        A1[(Source<br/>100,000 records)] -->|Extract all records| B1[Ingestion Process]
+        B1 -->|Load all records| C1[(Destination<br/>100,000 records)]
+    end
+
+    subgraph S2 [Subsequent Runs]
+        direction LR
+        A2[(Destination<br/>100,000 records)] -->|"Query for MAX(modified_at)<br/>from Destination"| D[Latest Timestamp]
+        D -->|"Query source using timestamp to filter"| A[(Source<br/>100,500 records)]
+        A -->|Load 500 new/changed records| B[Ingestion Process]
+        B --> C[(Destination<br/>100,500 records)]
+    end
+
+    S1 --> S2
+```
+^overview-delta-load-diagram
+
+The most commonly used steps to perform a delta load are:
 1. Ensure there is a `modified_at` timestamp or incremental id column such as a primary key on the data source.
 2. On the initial run of the pipeline, do a full load of the dataset.
 3. On following runs of the pipeline, query the target dataset using `MAX(column_name)`.
 4. Query the source dataset and filter records where values are greater than the value from step 3.
 
-
 ## Delta Load Advantages
 
 - More resource efficient
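
The four steps above map directly onto a small script; the sketch below (not part of the committed file) uses SQLite databases as stand-ins for the source and destination and assumes a hypothetical `orders(id, amount, modified_at)` table:

```python
import sqlite3

SOURCE_DB = "source.db"    # operational system stand-in; assumed to have an orders table
DEST_DB = "warehouse.db"   # destination stand-in

DDL = "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)"


def delta_load() -> int:
    with sqlite3.connect(DEST_DB) as dest, sqlite3.connect(SOURCE_DB) as source:
        dest.execute(DDL)

        # Step 3: find the high-water mark already present in the destination.
        (watermark,) = dest.execute("SELECT MAX(modified_at) FROM orders").fetchone()

        # Steps 2 and 4: on the first run the watermark is NULL, so everything is extracted;
        # afterwards only rows modified since the last run are pulled.
        rows = source.execute(
            "SELECT id, amount, modified_at FROM orders WHERE modified_at > COALESCE(?, '')",
            (watermark,),
        ).fetchall()

        dest.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
        return len(rows)


if __name__ == "__main__":
    print("new/changed rows loaded:", delta_load())
```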

Concepts/Data Ingestion/Full Load.md

Lines changed: 14 additions & 3 deletions
@@ -1,10 +1,21 @@
 ---
-Aliases: [Destructive Load, Concepts/Full Load]
-Tags: [seedling]
+Aliases:
+  - Destructive Load
+  - Concepts/Full Load
+Tags:
+  - evergreen
 publish: true
 ---
+With a full load, the entire dataset is dumped, or loaded, and is then completely replaced (i.e., deleted and replaced) with the new, updated dataset. No additional information, such as timestamps, is required. ^overview-full-load
 
-With a full load, the entire dataset is dumped, or loaded, and is then completely replaced (i.e., deleted and replaced) with the new, updated dataset. No additional information, such as timestamps, is required.
+```mermaid
+%%{init: { "flowchart": { "useMaxWidth": true } } }%%
+graph LR
+    A[(Source<br/>100,000 records)] -->|Extract all records| B[Ingestion Process]
+    B -->|Load/overwrite all records| C[(Destination<br/>100,000 records)]
+```
+
+^overview-full-load-diagram
 
 ## Full Load Advantages
 
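
For comparison with the delta load sketch above, a truncate-and-reload version of the same idea (not part of the committed file); SQLite databases stand in for the source and destination and the `customers` table is hypothetical:

```python
import sqlite3

SOURCE_DB = "source.db"    # stand-in for the source; assumed to expose a customers table
DEST_DB = "warehouse.db"   # stand-in for the destination


def full_load() -> None:
    with sqlite3.connect(SOURCE_DB) as source, sqlite3.connect(DEST_DB) as dest:
        rows = source.execute("SELECT id, name FROM customers").fetchall()

        # Destructive load: drop whatever is there and replace it with the fresh extract.
        dest.execute("DROP TABLE IF EXISTS customers")
        dest.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
        dest.executemany("INSERT INTO customers VALUES (?, ?)", rows)


if __name__ == "__main__":
    full_load()
```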

0 commit comments
