Commit 3db6a74

Added best practices around optimization, security, and error handling from recent discussions

1 parent f0f8e3d commit 3db6a74
File tree: 1 file changed (+13, -0)

Guides/Data Pipeline Best Practices.md (13 additions & 0 deletions)
@@ -33,12 +33,17 @@ A best practice guide for data pipelines compiled from data engineers in the community

- Don't let file sizes become too large or too small. Large files (>1 GB) can require more resources to process, while many small files create significant per-file overhead. ~250 MB is a good target size that allows for better parallel processing.
- Use the [[Claim Check Pattern|claim check pattern]] to pass large amounts of data between tasks in your pipeline.
+ - Compress data when possible before storing or transmitting; gzip, snappy, and lz4 are common compression algorithms (see the sketch after this list).
+ - Partition/cluster data based on common query patterns for efficient querying.
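
To make the compression bullet concrete, here is a minimal sketch using Python's standard-library gzip module; the file names and compression level are illustrative assumptions, not part of the guide.

```python
import gzip
import shutil

def compress_file(src_path: str, dest_path: str, level: int = 6) -> None:
    """Gzip-compress src_path into dest_path before storing or transmitting it.

    compresslevel trades speed for size: 1 is fastest, 9 yields the smallest output.
    """
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks so large files never sit fully in memory

# Hypothetical usage: shrink a daily extract before shipping it to object storage.
compress_file("daily_extract.csv", "daily_extract.csv.gz")
```
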
## Security

- Save credentials in a secrets manager and access them in your pipeline programmatically (see the sketch after this list).
- Ideally, have secrets rotated automatically.
- Avoid logging any sensitive information such as credentials or PII.
+ - Encrypt data in transit and at rest.
+ - Implement proper access control (IAM) policies.
+ - Create audit trails/logs of data access and changes.
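
As a sketch of the secrets-manager bullet, the snippet below assumes AWS Secrets Manager via boto3; the secret name and key layout are hypothetical, and any managed store (Vault, GCP Secret Manager, etc.) follows the same pattern of fetching credentials at runtime instead of hard-coding them.

```python
import json
import boto3

def get_secret(secret_id: str) -> dict:
    """Fetch a secret at runtime so nothing sensitive lives in code or config files."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])  # secrets are commonly stored as JSON key/value pairs

# Hypothetical secret name and keys; never log the values fetched here.
creds = get_secret("prod/warehouse/db")
dsn = f"postgresql://{creds['user']}:{creds['password']}@{creds['host']}/analytics"
```
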
## Testing

@@ -47,6 +52,14 @@ A best practice guide for data pipelines compiled from data engineers in the community
- Set up a local environment to test pipelines locally first (see Docker above).
- Re-define what counts as a pipeline failure: if a pipeline fails x times but the data is still delivered on time, it was successful.

+ ## Error Handling and Monitoring
+
+ - Implement exponential backoff for transient failures (see the sketch after this list).
+ - Streaming: add dead letter queues (DLQ) to store failed messages. This allows you to inspect and reprocess them later without losing data.
+ - Set up notifications for pipeline failures.
+ - Use comprehensive logging for debugging and auditing.
+ - Track metrics such as ingestion rates, latency, CPU/memory usage, and error rates.
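
A minimal sketch of the exponential-backoff bullet, assuming an arbitrary callable and illustrative retry settings; doubling the delay and adding jitter keeps retries from hammering a struggling dependency in lockstep.

```python
import random
import time

def with_backoff(task, max_attempts: int = 5, base_delay: float = 1.0):
    """Run task(), retrying transient failures with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; let the failure surface so alerting can fire
            delay = base_delay * 2 ** attempt + random.uniform(0, 1)  # 1s, 2s, 4s, ... plus jitter
            time.sleep(delay)

# Hypothetical usage: wrap a flaky extract call that occasionally times out.
# records = with_backoff(lambda: fetch_records("https://api.example.com/v1/records"))
```
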
%% wiki footer: Please don't edit anything below this line %%

## This note in GitHub
