| 1 | +--- |
| 2 | +title: Job Metrics (Diagnostics) |
| 3 | +sidebarTitle: Job Metrics |
| 4 | +description: Use metrics from Geneva to diagnose why a backfill/refresh job is slow. |
| 5 | +icon: chart-simple |
| 6 | +--- |
| 7 | + |
| 8 | +## How to find metrics |
| 9 | + |
| 10 | +Job metrics appear in the [Geneva Console UI](https://docs.lancedb.com/geneva/jobs/console): click a job's ID to open its "Job details" page. |
| 11 | + |
| 12 | +## Core diagnostic metrics |
| 13 | + |
| 14 | +| Metric | What it means | Common signal | |
| 15 | +| --- | --- | --- | |
| 16 | +| `rows_checkpointed` | Rows that have completed the read/UDF/checkpoint stage. | High value means upstream compute is progressing. | |
| 17 | +| `rows_ready_for_commit` | Rows ready for atomic commit (becoming visible to other DB connections). | If much lower than `rows_checkpointed`, writer path is likely bottlenecked. | |
| 18 | +| `rows_committed` | Rows already visible to other DB connections. | If lagging far behind `rows_ready_for_commit`, commit stage may be bottlenecked. | |
| 19 | +| `cnt_geneva_workers_active` | Current parallel UDF executors. | Lower than expected means reduced effective parallelism. | |
| 20 | +| `cnt_geneva_workers_pending` | Deficit from desired parallelism. | Persistently high value usually means scheduling/resource pressure. | |
| 21 | +| `read_io_time_ms` | Cumulative read IO time. | Dominant value suggests storage/read bottleneck. | |
| 22 | +| `udf_processing_time` | Cumulative UDF execution time. | Dominant value suggests compute/UDF bottleneck. | |
| 23 | +| `batch_checkpointing_time` | Cumulative batch checkpoint overhead. | High value suggests checkpoint overhead is expensive. | |
| 24 | +| `writer_write_time` | Cumulative writer output time. | High value often points to object storage throughput/throttling issues. | |
| 25 | +| `writer_queue_wait_time_ms` | Cumulative writer queue wait time. | High value can indicate writer starvation/backpressure. | |
| 26 | +| `commit_time_ms` | Cumulative commit time. | High value means commit itself is expensive. | |
| 27 | +| `commit_conflict_retries` | Commit retries due to version conflicts. | Non-trivial counts indicate commit contention. | |
| 28 | +| `commit_backoff_time_ms` | Time spent backing off during commit retries. | High value indicates contention/retry pressure. | |
| 29 | +| `commit_concurrent_writer_retries` | Retries from "Too many concurrent writers". | High value indicates writer concurrency contention. | |
| 30 | + |
| 31 | +## Quick diagnosis workflow |
| 32 | + |
| 33 | +1. Check `rows_checkpointed` vs `rows_ready_for_commit`. |
| 34 | + - If `rows_checkpointed` is high but `rows_ready_for_commit` is low, the fragment |
| 35 | + writer is usually the bottleneck. |
| 36 | + - This often indicates object storage read/write pressure (for example, on S3). |
| 37 | +2. Compare read, UDF, and checkpoint timing. |
| 38 | + - High `read_io_time_ms`: storage or scan bottleneck. |
| 39 | + - High `udf_processing_time`: UDF compute bottleneck. |
| 40 | + - High `batch_checkpointing_time`: checkpoint overhead bottleneck. |
| 41 | + - Typical mitigations: increase `checkpoint_size`, increase |
| 42 | + `max_checkpoint_size`, or compact the table to produce larger fragments. |
| 43 | +3. Check writer timing. |
| 44 | + - High `writer_write_time` commonly indicates object storage throttling or a |
| 45 | + throughput limit. |
| 46 | + - Typical mitigations: use higher network-bandwidth node types, and keep |
| 47 | + object storage and compute nodes in the same region. |
| 48 | +4. Check commit pressure. |
| 49 | + - High `commit_conflict_retries`, `commit_backoff_time_ms`, or |
| 50 | + `commit_concurrent_writer_retries` indicates commit contention. |
| 51 | +5. Check parallelism deficit. |
| 52 | + - If `cnt_geneva_workers_pending` stays high while |
| 53 | + `cnt_geneva_workers_active` stays low, the job is running below desired |
| 54 | + parallelism due to cluster/resource constraints. |
| 55 | + |
| 56 | +## Notes |
| 57 | + |
| 58 | +- Timing metrics are cumulative and may overlap; do not sum them as exact wall |
| 59 | + time. |
| 60 | +- For completed jobs, row counters should settle to stable final values. |
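
The quick diagnosis workflow can be approximated in a few lines of Python. This is a minimal sketch, not part of Geneva: the `diagnose` function, the three cutoffs, and the sample snapshot are all hypothetical, and it assumes you have copied the metric values from the "Job details" page into a plain dict. Step 3 (writer timing) is left out because "high" there needs a baseline that this page does not define, and step 5 uses a single snapshot, whereas the workflow looks at values that stay high or low over time.

```python
from typing import Mapping

# Hypothetical cutoffs, for illustration only; tune them for your workload.
LAG_RATIO = 0.5       # "much lower": under 50% of the upstream row counter
DOMINANT_SHARE = 0.5  # "dominant": over 50% of the three stage timings combined
MIN_RETRIES = 10      # "non-trivial" retry count

STAGE_TIMINGS = {
    "read_io_time_ms": "storage or scan",
    "udf_processing_time": "UDF compute",
    "batch_checkpointing_time": "checkpoint overhead",
}


def diagnose(metrics: Mapping[str, float]) -> list[str]:
    """Map one metrics snapshot to likely bottlenecks (missing metrics count as 0)."""
    def value(name: str) -> float:
        return float(metrics.get(name, 0))

    hints = []

    # Step 1: rows_ready_for_commit far behind rows_checkpointed -> writer path.
    if value("rows_ready_for_commit") < LAG_RATIO * value("rows_checkpointed"):
        hints.append("fragment writer")

    # Step 2: relative share of read, UDF, and checkpoint time. The sum is only a
    # normalizer (the timings overlap, so it is not wall time), and this assumes
    # all three values are reported in the same unit.
    total = sum(value(name) for name in STAGE_TIMINGS)
    for name, label in STAGE_TIMINGS.items():
        if total > 0 and value(name) > DOMINANT_SHARE * total:
            hints.append(label)

    # Step 4: commit contention from either kind of retry.
    retries = value("commit_conflict_retries") + value("commit_concurrent_writer_retries")
    if retries >= MIN_RETRIES:
        hints.append("commit contention")

    # Step 5: more workers pending than active -> running below desired parallelism.
    if value("cnt_geneva_workers_pending") > value("cnt_geneva_workers_active"):
        hints.append("parallelism deficit")

    return hints


# Made-up values, not taken from a real job.
snapshot = {
    "rows_checkpointed": 1_000_000,
    "rows_ready_for_commit": 200_000,
    "read_io_time_ms": 40_000,
    "udf_processing_time": 900_000,
    "batch_checkpointing_time": 60_000,
    "commit_conflict_retries": 2,
    "cnt_geneva_workers_active": 8,
    "cnt_geneva_workers_pending": 0,
}
print(diagnose(snapshot))  # ['fragment writer', 'UDF compute']
```

Returning a list rather than a single verdict reflects that several bottlenecks can be present at once; treat each hint as a pointer to the matching workflow step, not as a conclusion.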