chore: move Job Metrics docs from Geneva #184
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| --- | ||
| title: Job Metrics (Diagnostics) | ||
| sidebarTitle: Job Metrics | ||
| description: Use metrics from Geneva to diagnose why a backfill/refresh job is slow. | ||
| icon: chart-simple | ||
| --- | ||
|
|
||
| ## Where metrics come from | ||
|
|
||
| Job metrics are attached to each job record in the Console/API response under | ||
| `metrics`. | ||
|
|
||
| - Job list: | ||
| - `GET /api/v1/jobs?table_name=<table>&db_uri_encoded=<...>` | ||
| - Job detail: | ||
| - `GET /api/v1/jobs/<job_id>?db_uri_encoded=<...>` | ||
|
|
||
| ## Core diagnostic metrics | ||
|
|
||
| | Metric | What it means | Common signal | | ||
| | --- | --- | --- | | ||
| | `rows_checkpointed` | Rows finished by read/UDF/checkpoint stage. | High value means upstream compute is progressing. | | ||
| | `rows_ready_for_commit` | Rows ready for the atomic commit that makes them visible to other DB connections. | If much lower than `rows_checkpointed`, the writer path is likely bottlenecked. | | ||
| | `rows_committed` | Rows already visible to other DB connections. | If lagging far behind `rows_ready_for_commit`, the commit stage may be bottlenecked. | | ||
| | `cnt_geneva_workers_active` | Current parallel UDF executors. | Lower than expected means reduced effective parallelism. | | ||
| | `cnt_geneva_workers_pending` | Deficit from desired parallelism. | Persistently high value usually means scheduling/resource pressure. | | ||
| | `read_io_time_ms` | Cumulative read IO time. | Dominant value suggests storage/read bottleneck. | | ||
| | `udf_processing_time` | Cumulative UDF execution time. | Dominant value suggests compute/UDF bottleneck. | | ||
| | `batch_checkpointing_time` | Cumulative batch checkpoint overhead. | High value suggests checkpoint overhead is expensive. | | ||
| | `writer_write_time` | Cumulative writer output time. | High value often points to object storage throughput/throttling issues. | | ||
| | `writer_queue_wait_time_ms` | Cumulative writer queue wait time. | High value can indicate writer starvation/backpressure. | | ||
| | `commit_time_ms` | Cumulative commit time. | High value means commit itself is expensive. | | ||
| | `commit_conflict_retries` | Commit retries due to version conflicts. | Non-trivial counts indicate commit contention. | | ||
| | `commit_backoff_time_ms` | Time spent backing off during commit retries. | High value indicates contention/retry pressure. | | ||
| | `commit_concurrent_writer_retries` | Retries from "Too many concurrent writers". | High value indicates writer concurrency contention. | | ||
|
|
||
| ## Quick diagnosis workflow | ||
|
|
||
| 1. Check `rows_checkpointed` vs `rows_ready_for_commit`. | ||
| - If `rows_checkpointed` is high but `rows_ready_for_commit` is low, the fragment | ||
| writer is usually the bottleneck. | ||
| - This often indicates object storage read/write pressure (for example, on S3). | ||
| 2. Compare read, UDF, and checkpoint timing. | ||
| - High `read_io_time_ms`: storage or scan bottleneck. | ||
| - High `udf_processing_time`: UDF compute bottleneck. | ||
| - High `batch_checkpointing_time`: checkpoint overhead bottleneck. | ||
| - Typical mitigations: increase `checkpoint_size`, increase | ||
| `max_checkpoint_size`, or compact the table to produce larger fragments. | ||
| 3. Check writer timing. | ||
| - High `writer_write_time` is commonly caused by object storage throttling or a | ||
| throughput limit. | ||
| - Typical mitigations: use higher network-bandwidth node types, and keep | ||
| object storage and compute nodes in the same region. | ||
| 4. Check commit pressure. | ||
| - High `commit_conflict_retries`, `commit_backoff_time_ms`, or | ||
| `commit_concurrent_writer_retries` indicates commit contention. | ||
| 5. Check parallelism deficit. | ||
| - If `cnt_geneva_workers_pending` stays high while | ||
| `cnt_geneva_workers_active` stays low, the job is running below desired | ||
| parallelism due to cluster/resource constraints. | ||
|
|
||
| ## Notes | ||
|
|
||
| - Timing metrics are cumulative and may overlap; do not sum them to estimate | ||
| wall-clock time. | ||
| - For completed jobs, row counters should settle to stable final values. | ||
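
The quick diagnosis workflow in the diff above can be sketched as a small function over the `metrics` object of a job record. This is only an illustration, not Geneva code: the function name `diagnose`, the `LAG_RATIO` and `RETRY_THRESHOLD` cutoffs, and the decision to compare the three stage timings directly (which assumes they share a unit) are all assumptions; only the metric names come from the doc.

```python
from typing import Dict, List

# Hypothetical thresholds, not taken from the docs; tune them per workload.
LAG_RATIO = 0.5       # "much lower" is read here as "below 50%"
RETRY_THRESHOLD = 10  # "non-trivial" retry count

# Timing metrics from step 2, mapped to the signal the docs attach to each.
STAGE_SIGNALS = {
    "read_io_time_ms": "storage or scan bottleneck",
    "udf_processing_time": "UDF compute bottleneck",
    "batch_checkpointing_time": "checkpoint overhead bottleneck",
}

COMMIT_RETRY_METRICS = (
    "commit_conflict_retries",
    "commit_concurrent_writer_retries",
)


def diagnose(metrics: Dict[str, float]) -> List[str]:
    """Return human-readable findings for one snapshot of job metrics."""
    findings: List[str] = []
    checkpointed = metrics.get("rows_checkpointed", 0)
    ready = metrics.get("rows_ready_for_commit", 0)
    committed = metrics.get("rows_committed", 0)

    # Step 1: row counters falling behind the previous stage.
    if checkpointed > 0 and ready < LAG_RATIO * checkpointed:
        findings.append("writer path is likely the bottleneck")
    if ready > 0 and committed < LAG_RATIO * ready:
        findings.append("commit stage may be the bottleneck")

    # Step 2: report whichever stage timing is largest.
    # Assumes all three share a unit, which the metric names do not confirm.
    present = {k: metrics[k] for k in STAGE_SIGNALS if metrics.get(k, 0) > 0}
    if present:
        dominant = max(present, key=lambda k: present[k])
        findings.append(f"{dominant} dominates: {STAGE_SIGNALS[dominant]}")

    # Step 4: commit contention from retry counters.
    if any(metrics.get(k, 0) >= RETRY_THRESHOLD for k in COMMIT_RETRY_METRICS):
        findings.append("commit contention (retries above threshold)")

    # Step 5: a single snapshot can only compare pending with active workers.
    pending = metrics.get("cnt_geneva_workers_pending", 0)
    active = metrics.get("cnt_geneva_workers_active", 0)
    if pending > active:
        findings.append("running below desired parallelism")

    return findings
```

Step 3 (writer timing) is left out on purpose: judging whether `writer_write_time` is "high" needs a baseline that a single snapshot does not provide. The same caveat applies to step 5, where the doc talks about values that stay high or low over time.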
hm.. do we expose this anywhere currently?
How would a user get to these?
hmm yeah I guess we don't. So we'll have to:
I'll do these.
You know, I don't really want to open the can of worms of documenting the existence of the API now. Then we'd have to document it or say "it's experimental, um, don't use it" - and we'd lose flexibility to change it in the future. Plus, it doesn't really get us anything more than "look at it in the UI." So, I've just removed these API calls.
Do we show them in the job history browser? Can we point them at it?
Yep, in the Job Details page. That's what I did:
And I'm just now adding a link in that Job Details page to this doc.