---
title: Batch job
permalink: /docs/use-cases/batch-job
---

## Batch job

A lot of batch jobs are not pure data manipulation programs. For those, the existing big data frameworks are the best fit. Cadence is a more general orchestration mechanism and doesn't provide native SQL or worker data-shuffle functionality out of the box, so engineers who want to rely on these features would need to implement them themselves.

But if processing a record requires external API calls that might fail and potentially take a long time, Cadence might be preferable.

### Use Case:

One of our internal Uber customers uses Cadence for end-of-month statement generation. Each statement requires calls to multiple microservices, and some statements can be very large. Cadence was chosen because it provides hard guarantees around the durability of the financial data and seamlessly deals with long-running operations, retries, and intermittent failures.

## Batch jobs with heartbeating

Cadence is able to coordinate, restart, and track large batch jobs by recording their incremental progress and allowing them to resume if they are stopped for any reason. This relies predominantly on the `heartbeat` feature and activity retries.

This approach is used in production by customers who need to work through large batch workloads.
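
As a rough sketch (not part of the original page), the workflow side of this pattern usually just schedules one long-running activity with a heartbeat timeout and a retry policy. The names `BatchWorkflow` and `ProcessRecordsActivity`, the `recordIDs` argument, and the timeout values below are illustrative assumptions, not a prescribed implementation:

```go
package batch

import (
	"time"

	"go.uber.org/cadence"
	"go.uber.org/cadence/workflow"
)

// BatchWorkflow schedules a single long-running, heartbeating activity.
// "ProcessRecordsActivity" is a hypothetical registered activity name.
func BatchWorkflow(ctx workflow.Context, recordIDs []string) error {
	ao := workflow.ActivityOptions{
		ScheduleToStartTimeout: time.Minute,
		// The activity itself is allowed to run for a long time.
		StartToCloseTimeout: 24 * time.Hour,
		// If no heartbeat arrives within this window, the activity is
		// considered failed and the retry policy takes over.
		HeartbeatTimeout: 30 * time.Second,
		RetryPolicy: &cadence.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    time.Minute,
			ExpirationInterval: 24 * time.Hour,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// On each retry the activity can resume from its last heartbeat details.
	return workflow.ExecuteActivity(ctx, "ProcessRecordsActivity", recordIDs).Get(ctx, nil)
}
```
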
### Considerations before starting

Heartbeating Cadence activities are activities that emit their progress at an appropriate interval (usually every few seconds), indicating how far they have gotten. Optionally, they may use that progress information (such as an offset or an iterator) to resume from where they left off; a sketch of this pattern follows the list below. However, this necessarily implies that:

- If activities get restarted, they may redo some work, so this is not suitable for non-idempotent operations.
- The activity handles all of the progress tracking itself, so apart from the heartbeat information, the granular operations being performed are not as visible for debugging as they would be if each operation were done in a distinct activity.
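
To make the resume behaviour concrete, here is a minimal, hedged sketch of the offset-based pattern described above, using the Cadence Go client's heartbeat-details API; the helper name `resumeOffset` is an assumption, not part of the library:

```go
package batch

import (
	"context"

	"go.uber.org/cadence/activity"
)

// resumeOffset is a hypothetical helper: if the activity is being retried,
// Cadence exposes the details from its last recorded heartbeat, so work that
// was already acknowledged can be skipped. Anything done after the last
// heartbeat may be redone, which is why idempotency matters.
func resumeOffset(ctx context.Context) int {
	offset := 0
	if activity.HasHeartbeatDetails(ctx) {
		// If the details can't be decoded, just start from the beginning.
		_ = activity.GetHeartbeatDetails(ctx, &offset)
	}
	return offset
}
```
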
### What problems this solves

- This is for high-throughput operations where the work may be able to fit into a single long-running activity, or be partitioned across multiple activities that each run for a longer duration.
- This addresses problems customers may have with workflows that return large blocks of data and run up against Cadence activity limits.
- This is a good way to avoid hitting Cadence workflow history limits, since a single long-running activity produces far fewer history entries than many small, short-lived activities.

### High level concept:

The idea is to create an activity that handles a large number of records and records its progress as it goes:
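
A minimal sketch of what such an activity might look like with the Cadence Go client is shown below. `ProcessRecordsActivity` and `processRecord` are hypothetical names standing in for the real batch work (for example, calls to downstream microservices):

```go
package batch

import (
	"context"

	"go.uber.org/cadence/activity"
)

// ProcessRecordsActivity is a hypothetical long-running activity that works
// through a batch of records and heartbeats its position so that a retry can
// resume close to where it left off.
func ProcessRecordsActivity(ctx context.Context, recordIDs []string) error {
	// Resume from the last heartbeated offset if this execution is a retry.
	offset := 0
	if activity.HasHeartbeatDetails(ctx) {
		_ = activity.GetHeartbeatDetails(ctx, &offset)
	}

	for i := offset; i < len(recordIDs); i++ {
		if err := processRecord(ctx, recordIDs[i]); err != nil {
			// The retry policy re-runs the activity, which then resumes from
			// the last recorded heartbeat rather than from the beginning.
			return err
		}
		// Report progress; the client throttles heartbeats, so recording on
		// every record is fine, and heartbeating every few seconds also works.
		activity.RecordHeartbeat(ctx, i+1)
	}
	return nil
}

// processRecord stands in for the real per-record work, e.g. calls to
// external services that may fail or take a long time.
func processRecord(ctx context.Context, id string) error {
	// ... external API calls would go here ...
	return nil
}
```

If the activity fails or times out between heartbeats, the retry replays only the unacknowledged tail of the batch, which is why the per-record operations should be idempotent.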
0 commit comments