- Don't let file sizes become too large or too small. Large files (>1 GB) can require more resources to process, while many small files add significant per-file overhead. ~250 MB is a good target size that allows for better parallel processing.
- Use the [[Claim Check Pattern|claim check pattern]] to pass large amounts of data between tasks in your pipeline (a minimal sketch follows this list).
- Compress data when possible before storing or transmitting. gzip, snappy, and lz4 are common compression algorithms.
- Partition/cluster data based on common query patterns for efficient querying (see the partitioning sketch below).
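
A minimal sketch of the claim check pattern, assuming S3 via boto3 as the object store; the bucket name and helper functions are illustrative, not a fixed API. The producer uploads the payload and only the object key travels between tasks:

```python
import json
import uuid

import boto3  # assumes AWS S3 as the object store

s3 = boto3.client("s3")
BUCKET = "pipeline-staging"  # hypothetical staging bucket


def stash_payload(records: list[dict]) -> str:
    """Upload the large payload and return a small claim check (the object key)."""
    key = f"claims/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    return key  # pass this key between tasks instead of the data itself


def redeem_payload(key: str) -> list[dict]:
    """Downstream task fetches the payload using the claim check."""
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return json.loads(obj["Body"].read())
```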
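
A minimal partitioning/compression sketch, assuming a PySpark job writing Parquet; the paths and the `event_date` partition column are placeholders and should match your own query patterns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

df = spark.read.json("s3://raw-bucket/events/")  # hypothetical source path

# Write snappy-compressed Parquet, partitioned by the column queries filter on most.
# Repartitioning first also helps keep the number and size of output files reasonable.
(
    df.repartition("event_date")
      .write.mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3://curated-bucket/events/", compression="snappy")
)
```
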
## Security
- Save credentials in a secrets manager and access them in your pipeline programmatically (see the sketch after this list).
- Ideally, have secrets rotated automatically.
- Avoid logging any sensitive information like credentials or PII data.
- Encrypt data in transit and at rest.
- Implement proper access control (IAM) policies.
- Create audit trails/logs of data access and changes.
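
A minimal sketch of fetching credentials at runtime, assuming AWS Secrets Manager via boto3 (other clouds have equivalent services); the secret name is a placeholder:

```python
import json

import boto3  # assumes AWS Secrets Manager

secrets = boto3.client("secretsmanager")

# Fetch credentials at runtime instead of hard-coding them or committing config files.
response = secrets.get_secret_value(SecretId="warehouse/credentials")  # hypothetical secret name
creds = json.loads(response["SecretString"])

# Use the values without ever logging them.
conn_params = {"user": creds["username"], "password": creds["password"]}
```
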
## Testing
- Set up a local environment to test pipelines locally first (see Docker above).
- Re-define pipeline failures. If a pipeline fails x times but the data is still delivered on time, it was successful.

## Error Handling and Monitoring
- Implement exponential backoff for transient failures (a combined backoff/DLQ sketch follows this list).
- Streaming: Add dead letter queues (DLQ) to store failed messages. This allows you to inspect and reprocess them later without losing data.
- Set up notifications for pipeline failures.
- Use comprehensive logging for debugging and auditing.
- Track metrics such as ingestion rates, latency, CPU/memory usage, and error rates (see the metrics sketch below).
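
A minimal sketch combining exponential backoff with a dead letter queue, assuming kafka-python for the dead-letter topic; the topic name, the `process` callable, and `TransientError` are placeholders for your own processing logic and retryable errors:

```python
import json
import random
import time

from kafka import KafkaProducer  # assumes kafka-python is available


class TransientError(Exception):
    """Placeholder for errors worth retrying (timeouts, throttling, etc.)."""


producer = KafkaProducer(bootstrap_servers="localhost:9092")
DLQ_TOPIC = "events.dead-letter"  # hypothetical dead-letter topic


def handle(message: dict, process, max_retries: int = 5) -> None:
    """Retry process(message) with exponential backoff; park it on the DLQ if retries run out."""
    for attempt in range(max_retries):
        try:
            process(message)
            return
        except TransientError:
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, ...
            time.sleep(2 ** attempt + random.random())
    # Retries exhausted: keep the message for later inspection/reprocessing instead of dropping it.
    producer.send(DLQ_TOPIC, json.dumps(message).encode("utf-8"))
```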
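
A minimal metrics sketch, assuming the `prometheus_client` library; `load_batch` is a placeholder for a real ingestion step:

```python
from prometheus_client import Counter, Histogram, start_http_server

ROWS_INGESTED = Counter("rows_ingested_total", "Rows ingested by the pipeline")
BATCH_SECONDS = Histogram("batch_duration_seconds", "End-to-end batch processing time")


def load_batch():
    """Placeholder for the real ingestion step."""
    return [{"id": 1}, {"id": 2}]


start_http_server(8000)  # exposes /metrics for a Prometheus scraper to pull

with BATCH_SECONDS.time():         # records how long the batch took
    rows = load_batch()
    ROWS_INGESTED.inc(len(rows))   # ingestion rate can be derived from this counter
```
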
%% wiki footer: Please don't edit anything below this line %%