
Commit ea6055b

docs: add Geneva adaptive batching doc (#95)
* Add Geneva checkpointing doc
* docs: document adaptive checkpoint sizing
* docs
* docs: rename batch_size to checkpoint_size
* refine

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
1 parent 69e8448 commit ea6055b

File tree: 2 files changed (+32, −3 lines changed)

docs/geneva/jobs/backfilling.mdx

Lines changed: 30 additions & 1 deletion
````diff
@@ -11,6 +11,35 @@ Triggering backfill creates a distributed job to run the UDF and populate the co
 **Checkpoints**: Each batch of UDF execution is checkpointed so that partial results are not lost in case of job failures. Jobs can resume and avoid most of the expense of having to recalculate values.
 
+## Adaptive checkpoint sizing
+
+Geneva can automatically adjust checkpoint sizes during a backfill. It starts with small checkpoints (faster proof-of-life) and grows them as it observes stable throughput, while staying within safe bounds. Planning still uses your configured checkpoint size (`checkpoint_size`), but the actual checkpoint chunks can be smaller when adaptive sizing is enabled.
+
+Adaptive sizing is always clamped to bounds:
+
+- `max_checkpoint_size`: Upper bound. Defaults to the job's checkpoint size (`checkpoint_size`); a larger value is capped at `checkpoint_size`.
+- `min_checkpoint_size`: Lower bound. Defaults to 1.
+
+When `min_checkpoint_size == max_checkpoint_size`, adaptive sizing is disabled and checkpoints are fixed-size.
+
+You can set adaptive bounds in two places:
+
+- On the UDF definition via `@udf(..., min_checkpoint_size=..., max_checkpoint_size=...)`
+- On the backfill call via `table.backfill(..., min_checkpoint_size=..., max_checkpoint_size=...)`
+
+Backfill-level values take precedence over UDF defaults.
+
+<CodeGroup>
+```python Python icon="python"
+@udf(min_checkpoint_size=25, max_checkpoint_size=200)
+def embed_udf(text):
+    ...
+
+# Override the UDF defaults for this run
+tbl.backfill("embedding", min_checkpoint_size=10, max_checkpoint_size=100)
+```
+</CodeGroup>
+
 ## Managing concurrency
 
 One way to speed up the execution of a job is to give it more resources and to have it work in parallel. There are a few settings you can use on the backfill command to tune this.
````
```diff
@@ -95,4 +124,4 @@ tbl.backfill("embedding", where="content is not null and embeddding is not null"
 
 Reference:
 * [`backfill` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill)
-* [`backfill_async` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill_async)
+* [`backfill_async` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill_async)
```
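The clamping behaviour described in the new section can be modelled in a few lines of plain Python. This is an illustrative sketch only: the function name `next_checkpoint_size` and the doubling-on-stable-throughput heuristic are assumptions for demonstration, not Geneva's actual algorithm. It shows the two properties the doc states: sizes are always clamped to `[min_checkpoint_size, max_checkpoint_size]`, and equal bounds disable adaptation.

```python
# Illustrative model of adaptive checkpoint sizing (NOT Geneva's real
# implementation): start small for fast proof of life, grow while
# throughput is stable, and always clamp to the configured bounds.
def next_checkpoint_size(current, min_size, max_size, throughput_stable):
    """Return the checkpoint size to use for the next batch."""
    if min_size == max_size:
        # Equal bounds disable adaptive sizing: fixed-size checkpoints.
        return min_size
    # Hypothetical growth rule: double while throughput looks stable.
    proposed = current * 2 if throughput_stable else current
    return max(min_size, min(proposed, max_size))

size = 10  # small initial checkpoint for fast proof of life
history = []
for stable in [True, True, True, True, False]:
    size = next_checkpoint_size(size, min_size=10, max_size=100,
                                throughput_stable=stable)
    history.append(size)

# history grows 20 -> 40 -> 80, then clamps at the max of 100
```

With equal bounds, any starting size is forced to the fixed value, matching the "adaptive sizing is disabled" case above.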

docs/geneva/jobs/performance.mdx

Lines changed: 2 additions & 2 deletions
```diff
@@ -26,7 +26,7 @@ The `Table.backfill(..)` method has several optional arguments to tune performance.
 
 `commit_granularity` controls how frequently fragments are committed so that partial results can become visible to table readers.
 
-Setting `batch_size` smaller introduces finer-grained checkpoints and can help provide more frequent proof of life as a job is being executed. This is useful if the computation on your data is expensive.
+Setting `checkpoint_size` smaller introduces finer-grained checkpoints and can help provide more frequent proof of life as a job is being executed. This is useful if the computation on your data is expensive.
 
 Reference:
 * [`backfill` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.backfill)
```
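The effect of a smaller `checkpoint_size` is simple arithmetic: the same row count is split into more, finer-grained checkpoints, so a restarted job loses at most one small batch of work. A sketch with hypothetical numbers (the row count and sizes here are made up for illustration, not Geneva defaults):

```python
import math

# Hypothetical backfill over 100,000 rows. A checkpoint is written per
# chunk of `checkpoint_size` rows, so smaller sizes mean more frequent
# checkpoints (more proof of life, less work lost on failure).
total_rows = 100_000

def checkpoint_count(checkpoint_size):
    # Number of checkpoints written over the whole backfill.
    return math.ceil(total_rows / checkpoint_size)

coarse = checkpoint_count(10_000)  # few, large checkpoints
fine = checkpoint_count(500)       # many, small checkpoints
```

The trade-off is checkpointing overhead: each checkpoint costs a write, so very small sizes only pay off when per-row computation is expensive.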
@@ -44,4 +44,4 @@ Certain jobs that take a small data set and expand it may appear as if the write
4444

4545
An example is table that contains a list of URLs pointing to large media files. This list is relatively small (&lt; 100MB) and can fit into a single fragment. A UDF that downloads will fetch all the data and then attempt to write all of it out through the single writer. This single writer then can be responsible for serially writing out 500+GB of data to a single file!
4646

47-
To mitigate this, you can load your initial table so that there will be multipe fragments. Each fragment with new outputs can be written in parallel with higher write throughput.
47+
To mitigate this, you can load your initial table so that there will be multipe fragments. Each fragment with new outputs can be written in parallel with higher write throughput.
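The single-writer bottleneck in this section is easy to see with back-of-the-envelope numbers. A hedged sketch (the 500 GB figure comes from the doc; the even-split assumption and fragment count are illustrative, and this is not a Geneva API):

```python
# With one fragment, a single writer serially flushes every output byte.
# With N fragments, each fragment's output is written by its own writer
# in parallel, so wall-clock write time is governed by the busiest
# writer's share (assuming an even split, for illustration).
total_output_gb = 500

def serial_share_gb(num_fragments):
    # GB the busiest single writer must flush.
    return total_output_gb / num_fragments

one_fragment = serial_share_gb(1)    # everything through one writer
many_fragments = serial_share_gb(50) # 50x less per writer, in parallel
```

This is why pre-splitting the small input table into multiple fragments matters even though the input itself easily fits in one.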
