
Commit ea6055b

docs: add Geneva adaptive batching doc (#95)
* Add Geneva checkpointing doc
* docs: document adaptive checkpoint sizing
* docs
* docs: rename batch_size to checkpoint_size
* refine

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
1 parent 69e8448 commit ea6055b

File tree: 2 files changed (+32, −3 lines changed)

docs/geneva/jobs/backfilling.mdx

Lines changed: 30 additions & 1 deletion
````diff
@@ -11,6 +11,35 @@ Triggering backfill creates a distributed job to run the UDF and populate the co
 **Checkpoints**: Each batch of UDF execution is checkpointed so that partial results are not lost in case of job failures. Jobs can resume and avoid most of the expense of having to recalculate values.
 
+## Adaptive checkpoint sizing
+
+Geneva can automatically adjust checkpoint sizes during a backfill. It starts with small checkpoints (faster proof-of-life) and grows them as it observes stable throughput, while staying within safe bounds. Planning still uses your configured checkpoint size (`checkpoint_size`), but the actual checkpoint chunks can be smaller when adaptive sizing is enabled.
+
+Adaptive sizing is always clamped to bounds:
+
+- `max_checkpoint_size`: Upper bound. Defaults to the job's checkpoint size (`checkpoint_size`); a larger value is capped at `checkpoint_size`.
+- `min_checkpoint_size`: Lower bound. Defaults to 1.
+
+When `min_checkpoint_size == max_checkpoint_size`, adaptive sizing is disabled and checkpoints are fixed-size.
+
+You can set adaptive bounds in two places:
+
+- On the UDF definition via `@udf(..., min_checkpoint_size=..., max_checkpoint_size=...)`
+- On the backfill call via `table.backfill(..., min_checkpoint_size=..., max_checkpoint_size=...)`
+
+Backfill-level values take precedence over UDF defaults.
+
+<CodeGroup>
+```python Python icon="python"
+@udf(min_checkpoint_size=25, max_checkpoint_size=200)
+def embed_udf(text):
+    ...
+
+# Override the UDF defaults for this run
+tbl.backfill("embedding", min_checkpoint_size=10, max_checkpoint_size=100)
+```
+</CodeGroup>
+
 ## Managing concurrency
 
 One way to speed up the execution of a job is to give it more resources and to have it work in parallel. There are a few settings you can use on the backfill command to tune this.
````
```diff
@@ -95,4 +124,4 @@ tbl.backfill("embedding", where="content is not null and embeddding is not null"
 
 Reference:
 * [`backfill` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill)
-* [`backfill_async` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill_async)
+* [`backfill_async` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill_async)
```
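The clamping behaviour described in the new section can be modelled in a few lines of plain Python. This is an illustrative sketch only: the function name `next_checkpoint_size` and the doubling-on-stable-throughput heuristic are assumptions for demonstration, not Geneva's actual algorithm. It shows the two properties the doc states: sizes are always clamped to `[min_checkpoint_size, max_checkpoint_size]`, and equal bounds disable adaptation.

```python
# Illustrative model of adaptive checkpoint sizing (NOT Geneva's real
# implementation): start small for fast proof of life, grow while
# throughput is stable, and always clamp to the configured bounds.
def next_checkpoint_size(current, min_size, max_size, throughput_stable):
    """Return the checkpoint size to use for the next batch."""
    if min_size == max_size:
        # Equal bounds disable adaptive sizing: fixed-size checkpoints.
        return min_size
    # Hypothetical growth rule: double while throughput looks stable.
    proposed = current * 2 if throughput_stable else current
    return max(min_size, min(proposed, max_size))

size = 10  # small initial checkpoint for fast proof of life
history = []
for stable in [True, True, True, True, False]:
    size = next_checkpoint_size(size, min_size=10, max_size=100,
                                throughput_stable=stable)
    history.append(size)

# history grows 20 -> 40 -> 80, then clamps at the max of 100
```

With equal bounds, any starting size is forced to the fixed value, matching the "adaptive sizing is disabled" case above.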

docs/geneva/jobs/performance.mdx

Lines changed: 2 additions & 2 deletions
```diff
@@ -26,7 +26,7 @@ The `Table.backfill(..)` method has several optional arguments to tune performance.
 
 `commit_granularity` controls how frequently fragments are committed so that partial results can become visible to table readers.
 
-Setting `batch_size` smaller introduces finer-grained checkpoints and can help provide more frequent proof of life as a job is being executed. This is useful if the computation on your data is expensive.
+Setting `checkpoint_size` smaller introduces finer-grained checkpoints and can help provide more frequent proof of life as a job is being executed. This is useful if the computation on your data is expensive.
 
 Reference:
 * [`backfill` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill)
```
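The effect of a smaller `checkpoint_size` is simple arithmetic: the same row count is split into more, finer-grained checkpoints, so a restarted job loses at most one small batch of work. A sketch with hypothetical numbers (the row count and sizes here are made up for illustration, not Geneva defaults):

```python
import math

# Hypothetical backfill over 100,000 rows. A checkpoint is written per
# chunk of `checkpoint_size` rows, so smaller sizes mean more frequent
# checkpoints (more proof of life, less work lost on failure).
total_rows = 100_000

def checkpoint_count(checkpoint_size):
    # Number of checkpoints written over the whole backfill.
    return math.ceil(total_rows / checkpoint_size)

coarse = checkpoint_count(10_000)  # few, large checkpoints
fine = checkpoint_count(500)       # many, small checkpoints
```

The trade-off is checkpointing overhead: each checkpoint costs a write, so very small sizes only pay off when per-row computation is expensive.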
@@ -44,4 +44,4 @@ Certain jobs that take a small data set and expand it may appear as if the write
4444

4545
An example is table that contains a list of URLs pointing to large media files. This list is relatively small (&lt; 100MB) and can fit into a single fragment. A UDF that downloads will fetch all the data and then attempt to write all of it out through the single writer. This single writer then can be responsible for serially writing out 500+GB of data to a single file!
4646

47-
To mitigate this, you can load your initial table so that there will be multipe fragments. Each fragment with new outputs can be written in parallel with higher write throughput.
47+
To mitigate this, you can load your initial table so that there will be multipe fragments. Each fragment with new outputs can be written in parallel with higher write throughput.
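The single-writer bottleneck in this section is easy to see with back-of-the-envelope numbers. A hedged sketch (the 500 GB figure comes from the doc; the even-split assumption and fragment count are illustrative, and this is not a Geneva API):

```python
# With one fragment, a single writer serially flushes every output byte.
# With N fragments, each fragment's output is written by its own writer
# in parallel, so wall-clock write time is governed by the busiest
# writer's share (assuming an even split, for illustration).
total_output_gb = 500

def serial_share_gb(num_fragments):
    # GB the busiest single writer must flush.
    return total_output_gb / num_fragments

one_fragment = serial_share_gb(1)    # everything through one writer
many_fragments = serial_share_gb(50) # 50x less per writer, in parallel
```

This is why pre-splitting the small input table into multiple fragments matters even though the input itself easily fits in one.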
