docs/geneva/jobs/backfilling.mdx

Triggering backfill creates a distributed job to run the UDF and populate the column.
**Checkpoints**: Each batch of UDF execution is checkpointed so that partial results are not lost if the job fails. A resumed job picks up from its checkpoints and avoids most of the cost of recomputing values.
## Adaptive checkpoint sizing
Geneva can automatically adjust checkpoint sizes during a backfill. It starts with small checkpoints (faster proof-of-life) and grows them as it observes stable throughput, while staying within safe bounds. Planning still uses your configured checkpoint size (`checkpoint_size`), but the actual checkpoint chunks can be smaller when adaptive sizing is enabled.
Adaptive sizing is always clamped to bounds:
- `max_checkpoint_size`: Upper bound. Defaults to the job's checkpoint size (`checkpoint_size`) and is capped at that value if you set a larger max.
- `min_checkpoint_size`: Lower bound. Defaults to 1.
When `min_checkpoint_size == max_checkpoint_size`, adaptive sizing is disabled and checkpoints are fixed-size.
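As a rough illustration (not Geneva's actual heuristics), the grow-and-clamp behavior described above can be sketched in plain Python; the `growth_factor` and the `throughput_stable` signal are illustrative assumptions:

```python
def next_checkpoint_size(current: int, min_size: int, max_size: int,
                         throughput_stable: bool, growth_factor: int = 2) -> int:
    """Hypothetical sketch of adaptive checkpoint sizing: grow the checkpoint
    while throughput looks stable, always clamping to the configured bounds.
    This is a toy model, not Geneva's real implementation."""
    if min_size == max_size:
        # Equal bounds: adaptive sizing is disabled, checkpoints are fixed-size.
        return min_size
    size = current * growth_factor if throughput_stable else current
    return max(min_size, min(size, max_size))

# Start small for fast proof-of-life, then grow toward the cap.
size = 1
sizes = []
for _ in range(12):
    sizes.append(size)
    size = next_checkpoint_size(size, min_size=1, max_size=100, throughput_stable=True)
print(sizes)  # grows 1, 2, 4, ... then stays clamped at 100
```

Note how the size can never escape `[min_size, max_size]`, matching the clamping rule above.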
You can set adaptive bounds in two places:
- On the UDF definition via `@udf(..., min_checkpoint_size=..., max_checkpoint_size=...)`
- On the backfill call via `table.backfill(..., min_checkpoint_size=..., max_checkpoint_size=...)`
Backfill-level values take precedence over UDF defaults.
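A minimal sketch of that precedence rule, using stand-in stubs in place of the real `@udf` decorator and `Table.backfill(...)` so it runs anywhere (only the `min_checkpoint_size`/`max_checkpoint_size` parameter names come from the docs above; everything else is illustrative):

```python
# Stand-in stub so this sketch runs without Geneva installed; real code
# would use Geneva's @udf decorator and pass overrides to table.backfill(...).
def udf(min_checkpoint_size=None, max_checkpoint_size=None):
    def wrap(fn):
        fn.min_checkpoint_size = min_checkpoint_size
        fn.max_checkpoint_size = max_checkpoint_size
        return fn
    return wrap

def resolve_bounds(fn, min_checkpoint_size=None, max_checkpoint_size=None):
    """Backfill-level values take precedence over the UDF's defaults."""
    return (
        min_checkpoint_size if min_checkpoint_size is not None else fn.min_checkpoint_size,
        max_checkpoint_size if max_checkpoint_size is not None else fn.max_checkpoint_size,
    )

@udf(min_checkpoint_size=1, max_checkpoint_size=500)
def embed(text):
    return len(text)  # placeholder for a real embedding UDF

# UDF defaults apply when the backfill call sets nothing.
print(resolve_bounds(embed))                           # (1, 500)
# A backfill-level override wins over the UDF default.
print(resolve_bounds(embed, max_checkpoint_size=100))  # (1, 100)
```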
One way to speed up a job is to give it more resources and have it work in parallel. There are a few settings you can use on the backfill command to tune this.
docs/geneva/jobs/performance.mdx
The `Table.backfill(...)` method has several optional arguments to tune performance.
`commit_granularity` controls how frequently fragments are committed so that partial results can become visible to table readers.
Setting `checkpoint_size` smaller introduces finer-grained checkpoints and can help provide more frequent proof of life as a job is being executed. This is useful if the computation on your data is expensive.
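For intuition, under a simple illustrative model where each checkpoint covers up to `checkpoint_size` rows, the checkpoint count scales inversely with `checkpoint_size`:

```python
import math

def num_checkpoints(total_rows: int, checkpoint_size: int) -> int:
    # Illustrative model: each checkpoint covers up to checkpoint_size rows.
    return math.ceil(total_rows / checkpoint_size)

# Shrinking checkpoint_size 10x records progress 10x more often, which
# matters most when each row is expensive to compute.
print(num_checkpoints(1_000_000, 10_000))  # 100 checkpoints
print(num_checkpoints(1_000_000, 1_000))   # 1000 checkpoints
```

The trade-off is checkpointing overhead: more checkpoints mean more frequent proof of life and less lost work on failure, at the cost of more bookkeeping.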
Certain jobs that take a small data set and expand it may appear as if the write is the bottleneck.
An example is a table that contains a list of URLs pointing to large media files. The list itself is relatively small (< 100 MB) and fits into a single fragment. A UDF that downloads these files will fetch all the data and then attempt to write all of it out through a single writer. That single writer can end up responsible for serially writing 500+ GB of data to a single file!
To mitigate this, you can load your initial table so that there will be multiple fragments. Each fragment's new outputs can then be written in parallel with higher write throughput.
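One way to sketch this, assuming (as is typical for Lance-backed tables) that each separate write produces its own fragment(s) — the chunking helper below is illustrative, and the commented-out `tbl.add(...)` loop is a hypothetical call, not a confirmed Geneva API:

```python
from typing import Iterator

def chunked(rows: list, chunk_size: int) -> Iterator[list]:
    """Split the small seed data (e.g. a URL list) into chunks so each
    write can land in its own fragment instead of one giant fragment."""
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

# Hypothetical seed data: a small list of pointers to large media files.
urls = [{"url": f"s3://bucket/media/{i}.mp4"} for i in range(10_000)]

batches = list(chunked(urls, 1_000))
print(len(batches))  # 10 chunks -> roughly 10 fragments -> up to 10 parallel writers

# In real code, each batch would be appended in a separate write, e.g.:
# for batch in batches:
#     tbl.add(batch)  # hypothetical: each separate write creates new fragment(s)
```

With the seed rows spread across fragments, the expensive UDF outputs for each fragment can be written by independent writers instead of one serial writer.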