---
Aliases:
 - Concepts/Data Ingestion
Tags:
 - evergreen
publish: true
---

Data ingestion is the process of extracting and importing data from various sources into a destination system where it can be stored, transformed, and analyzed. It commonly involves moving data from operational systems, external sources, or real-time streams into data storage systems like data warehouses and data lakes.

Data ingestion can be categorized into two main approaches: [[Batch Data Processing|batch ingestion]] (processing data at scheduled intervals) and [[Stream Data Processing|real-time/streaming ingestion]] (processing data continuously as it arrives).

## Data Ingestion Components

Data ingestion consists of a few key components that work together to reliably move data from sources to destinations:

### 1. Data Sources

Common data sources include:

- **Databases**: Operational databases (PostgreSQL, MySQL, SQL Server)
- **Applications**: SaaS platforms, CRM systems (HubSpot, Salesforce), ERP systems
- **Files**: CSV, JSON, XML, Parquet files from SFTP/FTP servers or cloud storage
- **APIs**: REST APIs, GraphQL endpoints, webhooks
- **Message Queues**: Kafka, RabbitMQ, Amazon SQS
- **Streaming Sources**: IoT devices, clickstreams, social media feeds
- **Cloud Services**: AWS S3, Google Cloud Storage, Azure Blob Storage

### 2. Ingestion Patterns

#### [[Batch Data Processing|Batch Ingestion]]

Data is collected and processed in discrete chunks at scheduled intervals.

```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    A[Source System]
    B["Scheduler (Airflow/Cron)"]
    C[Batch Extraction Script]
    D[(Staging Area)]
    E[(Data Warehouse/Data Lake)]

    B -->|Trigger at interval| C
    C -->|Extract data| A
    A -->|Data files| C
    C -->|Write batch| D
    D -->|Load batch| E
```

Characteristics:

- Higher latency (minutes to hours)
- More efficient for large volumes
- Easier to implement and debug
- Lower infrastructure costs

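The extract-and-stage step in the diagram above can be sketched in Python. This is a minimal illustration, not a production implementation: SQLite stands in for the source system, a local directory for the staging area, and the table and column names are invented for the example.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def extract_batch(conn: sqlite3.Connection, staging_dir: Path) -> Path:
    """Extract the full source table and write it to a staging file."""
    rows = conn.execute("SELECT id, name, amount FROM orders").fetchall()
    out = staging_dir / "orders_batch.csv"
    with out.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "amount"])
        writer.writerows(rows)
    return out

# Simulated source system: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "widget", 9.99), (2, "gadget", 19.99)])

staging = Path(tempfile.mkdtemp())
batch_file = extract_batch(conn, staging)
print(batch_file.read_text())
```

In a real pipeline, a scheduler such as Airflow or cron would invoke this script at each interval, and the staged file would then be bulk-loaded into the warehouse.
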
#### [[Stream Data Processing|Streaming Ingestion]]

Data is processed continuously in real-time as it arrives.

```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    A[Source System]
    B[Event Producer]
    C["Message Broker<br/>(Kafka/Kinesis/PubSub)"]
    D["Stream Processor<br/>(Flink/Spark Streaming)"]
    E[(Data Warehouse/Data Lake)]

    A -->|Generate events| B
    B -->|Send events| C
    C -->|Stream events| D
    D -->|Process & transform| E
```

Characteristics:

- Low latency (milliseconds to seconds)
- More complex to implement
- Higher infrastructure costs
- Enables real-time analytics

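The produce-stream-process loop above can be sketched with Python's standard library, using `queue.Queue` as a stand-in for the message broker (Kafka, Kinesis, or Pub/Sub would fill that role in practice). The event shape and the transform are invented for the example.

```python
import json
import queue

broker: "queue.Queue[str]" = queue.Queue()   # stand-in for Kafka/Kinesis
warehouse: list = []                         # stand-in for the warehouse sink

# Producer side: the source system emits events as JSON messages.
for i in range(3):
    broker.put(json.dumps({"event_id": i, "value": i * 10}))

# Consumer side: process each event as it arrives and load it downstream.
while not broker.empty():
    event = json.loads(broker.get())
    event["value_doubled"] = event["value"] * 2   # trivial transform step
    warehouse.append(event)

print(warehouse)
```

A real stream processor would additionally handle partitioning, checkpointing, and backpressure, which is where much of the extra complexity and cost comes from.
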
#### Micro-batch Ingestion

A hybrid approach that processes small batches of data at frequent intervals.

```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    A[Source System]
    B["Scheduler (Airflow/Cron)"]
    C[Micro-batch Extraction Script]
    D[(Staging Area)]
    E[(Data Warehouse/Data Lake)]

    B -->|Trigger every few minutes| C
    C -->|Extract recent data| A
    A -->|New/changed data| C
    C -->|Write micro-batch| D
    D -->|Load micro-batch| E
```

Characteristics:

- Near real-time processing (typically 5-15 minutes)
- Balances latency and efficiency
- Easier to implement than true streaming
- A good fit for most use cases

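The "extract recent data" step is usually driven by a watermark: the highest id or timestamp already loaded, so each run picks up only newer rows. A minimal sketch using SQLite, where the table and column names are illustrative:

```python
import sqlite3

def extract_micro_batch(conn, last_seen_id: int):
    """Pull only rows newer than the watermark; return rows and the new watermark."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_seen_id,)
    ).fetchall()
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# First run picks up everything; the next run sees only new rows.
batch1, wm = extract_micro_batch(conn, 0)
conn.execute("INSERT INTO events VALUES (4, 'd')")
batch2, wm = extract_micro_batch(conn, wm)
print(len(batch1), len(batch2), wm)
```

The scheduler persists the watermark between runs (for example in a state table), so a failed run can simply be retried from the last committed watermark.
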
### 3. Data Ingestion Strategies

#### [[Full Load|Full Load]]

![[Full Load#^overview-full-load]]
![[Full Load#^overview-full-load-diagram]]

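To make the transcluded overview concrete: a full load replaces the entire target with a fresh snapshot of the source on every run. A minimal sketch using SQLite (table and column names are invented for the example):

```python
import sqlite3

def full_load(src, dest):
    """Replace the destination table entirely with the current source snapshot."""
    rows = src.execute("SELECT id, name FROM customers").fetchall()
    dest.execute("DELETE FROM customers")          # truncate the target first
    dest.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    dest.commit()

src = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
dest.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

full_load(src, dest)
full_load(src, dest)   # idempotent: re-running yields the same result, no duplicates
print(dest.execute("SELECT COUNT(*) FROM customers").fetchone()[0])
```

The truncate-and-reload pattern is simple and self-healing but rescans the whole source, which is why it is usually reserved for small tables.
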
#### [[Delta Load|Incremental Load]]

![[Delta Load#^overview-delta-load]]
![[Delta Load#^overview-delta-load-diagram]]

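As a concrete sketch of the incremental pattern: only rows changed since the last sync are copied, typically keyed off an `updated_at` column and upserted into the target. The schema and column names here are invented for the example.

```python
import sqlite3

def delta_load(src, dest, last_sync: int) -> int:
    """Copy only rows changed since last_sync, upserting into the target."""
    changed = src.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_sync,),
    ).fetchall()
    dest.executemany(
        "INSERT INTO customers VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
        "updated_at = excluded.updated_at",
        changed,
    )
    dest.commit()
    return max((row[2] for row in changed), default=last_sync)

schema = "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
src = sqlite3.connect(":memory:")
src.execute(schema)
dest = sqlite3.connect(":memory:")
dest.execute(schema)

src.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Ada", 100), (2, "Grace", 200)])
sync = delta_load(src, dest, 0)     # initial sync picks up both rows
src.execute("UPDATE customers SET name = 'Ada L.', updated_at = 300 WHERE id = 1")
sync = delta_load(src, dest, sync)  # only the changed row moves
print(sync)
```

Note one known limitation of this approach: hard deletes in the source are invisible to an `updated_at` filter, which is one reason teams move to change data capture.
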
#### [[Change Data Capture|Change Data Capture (CDC)]]

![[Change Data Capture#^overview-cdc]]
![[Change Data Capture#^overview-cdc-diagram]]

## Data Ingestion Examples

The following are common examples of data ingestion patterns.

### API Data Ingestion Example

Ingesting data from a REST API on a scheduled basis:

```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    A[External API]
    B[Scheduler: Airflow/Cron]
    C[Python Script]
    D[(Data Lake)]

    B -->|Trigger every hour| C
    C -->|HTTP Request| A
    A -->|JSON Response| C
    C -->|Store data| D
```

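The Python script step might look roughly like this: fetch a JSON payload and land it as a raw, timestamped file in the lake. The endpoint and response shape are hypothetical, and the HTTP call is passed in as a function so the example runs without a network (in practice `requests.get` or `urllib` would fill that role).

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def ingest_api(fetch, lake_dir: Path) -> Path:
    """Fetch one API payload and write the raw JSON to the data lake."""
    payload = fetch()  # in production: an HTTP GET returning parsed JSON
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    out = lake_dir / f"api_extract_{stamp}.json"
    out.write_text(json.dumps(payload))
    return out

# Stub standing in for the external API response.
def fake_fetch():
    return {"results": [{"id": 1}, {"id": 2}], "next_page": None}

lake = Path(tempfile.mkdtemp())
landed = ingest_api(fake_fetch, lake)
print(landed.name)
```

Landing the raw response before any transformation preserves an audit trail and lets the pipeline be replayed if downstream logic changes.
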
### Database Replication Example

Real-time replication from an operational database to an analytics database:

```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    subgraph Production
        A[(PostgreSQL)]
        B[WAL Logs]
        A --> B
    end

    subgraph Ingestion
        C[Debezium Connector]
        D[Apache Kafka]
        B --> C
        C --> D
    end

    subgraph Analytics
        E[Kafka Connect]
        F[(Data Warehouse)]
        D --> E
        E --> F
    end
```

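Debezium emits change events whose envelope carries before/after row images and an operation code. A simplified sketch of what a downstream consumer does with such events to maintain a replica table (the event shape below follows Debezium's general envelope but is heavily reduced; real events also carry source metadata, transaction info, and timestamps):

```python
# Simplified Debezium-style change events: "op" is c(reate)/u(pdate)/d(elete),
# with "before"/"after" row images keyed by the primary key "id".
events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@x.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@x.com"},
     "after": {"id": 1, "email": "a@y.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@x.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@x.com"}, "after": None},
]

replica = {}
for ev in events:
    if ev["op"] in ("c", "u"):       # insert or update: apply the after-image
        row = ev["after"]
        replica[row["id"]] = row
    elif ev["op"] == "d":            # delete: remove by primary key
        del replica[ev["before"]["id"]]

print(replica)
```

Because events are replayed in log order, the replica converges to the source state, including deletes, which a timestamp-based incremental load would miss.
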
### File-Based Ingestion Example

Processing files dropped into cloud storage:

```mermaid
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
    A[External System]
    B[(Cloud Storage<br/>S3/GCS)]
    C[Event Trigger]
    D[Processing Function]
    E[(Data Warehouse)]

    A -->|Upload files| B
    B -->|File arrival event| C
    C -->|Trigger| D
    D -->|Read and process| B
    E
    D -->|Load processed data| E
```

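On AWS, the processing function would typically be a Lambda invoked by an S3 event notification. A local sketch of just the handler logic, with a temporary directory standing in for the bucket and an in-memory list for the warehouse (file layout and names are invented for the example):

```python
import csv
import tempfile
from pathlib import Path

warehouse: list = []  # stand-in for the warehouse load target

def handle_file_event(path: Path) -> int:
    """Triggered on file arrival: read the CSV and load its rows downstream."""
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    warehouse.extend(rows)
    return len(rows)

# Simulate an external system dropping a file into storage.
bucket = Path(tempfile.mkdtemp())
dropped = bucket / "orders.csv"
dropped.write_text("id,amount\n1,9.99\n2,19.99\n")

loaded = handle_file_event(dropped)   # in the cloud, the storage event fires this
print(loaded)
```

Event-driven triggers remove the need for polling: files are processed within seconds of arrival instead of waiting for the next scheduled run.
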
## Common Data Ingestion Challenges

### Scalability

- **Volume Growth**: Handling increasing data volumes
- **Source System Impact**: Minimizing load on operational systems
- **Resource Management**: Efficiently using compute and storage resources

### Reliability

- **Source System Downtime**: Handling unavailable data sources
- **Network Issues**: Managing connectivity problems
- **Data Consistency**: Ensuring data integrity across systems

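Source downtime and transient network issues are commonly handled with retries and exponential backoff. A minimal sketch; the flaky source here is simulated, and the only exception type retried is `ConnectionError` (a real pipeline would tune the retryable error set, delays, and jitter):

```python
import time

def with_retries(fn, attempts: int = 4, base_delay: float = 0.01):
    """Call fn, retrying failed attempts with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                    # retries exhausted: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Simulated source that fails twice before recovering.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return ["row1", "row2"]

rows = with_retries(flaky_extract)
print(rows, calls["n"])
```

Pairing retries with idempotent loads (so a re-run cannot duplicate data) is what makes this pattern safe end to end.
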
### Complexity

- **Schema Evolution**: Handling changes in source data structures
- **Multiple Sources**: Managing diverse data sources and formats
- **Dependency Management**: Coordinating ingestion across related datasets

%% wiki footer: Please don't edit anything below this line %%

## This note in GitHub

<span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Concepts/Data%20Ingestion/Data%20Ingestion.md "git-hub-edit-note") | [Copy this note](https://raw.githubusercontent.com/data-engineering-community/data-engineering-wiki/main/Concepts/Data%20Ingestion/Data%20Ingestion.md "git-hub-copy-note")</span>

<span class="git-footer">Was this page helpful?
[👍](https://tally.so/r/mOaxjk?rating=Yes&url=https://dataengineering.wiki/Concepts/Data%20Ingestion/Data%20Ingestion) or [👎](https://tally.so/r/mOaxjk?rating=No&url=https://dataengineering.wiki/Concepts/Data%20Ingestion/Data%20Ingestion)</span>