Commit be54ab6
authored
Rework metrics: Prometheus-friendly counters, flush duration, diagnostics (#48)
* Translate METRICS.md to English, retain Portuguese as METRICS-pt.md
* Wire up unrecorded metrics: schema ops, batch processing, record counts
Sensors for createTable, evolveSchema, schema-operation-*, batch-size-*,
and records-processed-* were registered, but production code never
recorded any values to them. This change connects them:
- DucklakeTableManager: instrument createTable/evolveTableSchema with
schema operation timers
- DucklakeSinkTask: create DucklakeMetrics in start(), record
batch/record metrics in flushBatches(), close in stop()
- DucklakeWriter: pass metrics through to DucklakeTableManager
- MetricTimer.close(): narrow to no checked exception
* Remove Portuguese METRICS-pt.md and translate remaining PT comment
* Reformat METRICS.md to use tables instead of bulleted lists
* Add per-partition flush duration metric using cumulative counters
Tracks flush write time (excluding consolidation) per partition via the
flush-duration-total-ms and flush-count cumulative sums. These are
Prometheus-friendly: use rate() on both to derive averages. Unlike
windowed Avg/Max stats, they aggregate correctly across partitions/tasks.
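Why a total/count pair aggregates correctly while per-partition averages do not can be sketched as follows. `FlushCounters` is a hypothetical stand-in built on `LongAdder`, not the connector's sensor classes; only the idea of pairing a duration total with a count comes from the message above.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a (duration total, count) pair of cumulative counters.
final class FlushCounters {
    final LongAdder durationTotalMs = new LongAdder();
    final LongAdder flushCount = new LongAdder();

    void recordFlush(long durationMs) {
        durationTotalMs.add(durationMs);
        flushCount.increment();
    }

    // Average over several partitions: sum the totals first, then divide once.
    static double combinedAverageMs(FlushCounters... partitions) {
        long total = 0;
        long count = 0;
        for (FlushCounters p : partitions) {
            total += p.durationTotalMs.sum();
            count += p.flushCount.sum();
        }
        return count == 0 ? 0.0 : (double) total / count;
    }

    public static void main(String[] args) {
        FlushCounters p0 = new FlushCounters();
        FlushCounters p1 = new FlushCounters();
        p0.recordFlush(100);              // one slow flush on partition 0
        for (int i = 0; i < 9; i++) {
            p1.recordFlush(10);           // nine fast flushes on partition 1
        }
        // Averaging the two per-partition averages gives (100 + 10) / 2 = 55 ms,
        // which overweights partition 0. Summing the counters gives 190 / 10 = 19 ms.
        System.out.println(combinedAverageMs(p0, p1)); // prints 19.0
    }
}
```

Dividing the rate of the duration total by the rate of the count performs the same division over a time window instead of over the whole lifetime of the counters.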
* Convert all metrics from windowed Avg/Max/Rate to CumulativeSum counters
Windowed stats (Avg, Max, Rate) don't aggregate correctly through the
JMX Exporter → Prometheus pipeline. Replace them with cumulative
counters that work with Prometheus rate() for correct cross-partition
and cross-task aggregation.
Renamed metrics:
- jdbc-query-time-avg/max/rate → jdbc-query-duration-total-ms
- schema-operation-time-avg/max → schema-operation-duration-total-ms
- operation-time-avg/max, operation-rate → operation-duration-total-ms
- records-processed-rate → removed (use rate(records-processed-total))
- batch-size-avg/max → batch-records-total + batch-count
* Prefix all metric names with ducklake- to avoid collisions
Kafka Connect shares a single Metrics registry across all connectors
in a worker. Generic names like jdbc-query-count could collide with
other connectors. Prefixing with ducklake- makes them unambiguous
regardless of tag filtering.
* Add diagnostic metrics: schema mismatch, flush skips, DLQ, spill, consolidation
New metrics for diagnosing perf and correctness issues:
- schema-mismatch-count: detects the many-small-files problem
- flush-skip-count: lock contention precursor to rebalance
- errant-record-count: records sent to DLQ (silent data loss signal)
- spill-batch-count/spill-bytes-total: disk spill volume
- consolidation-duration-total-ms/count: batch consolidation overhead
* Fix thread-safety: use ConcurrentHashMap for per-partition flush sensors
Flush sensors are lazily created via computeIfAbsent and accessed from
the put thread, scheduled flush thread, and partition executor threads.
HashMap.computeIfAbsent is not thread-safe and can corrupt internal
state under concurrent access.
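A minimal sketch of the lazy, thread-safe creation pattern described above. `FlushSensorRegistry` and the `LongAdder` "sensor" are illustrative assumptions, not the connector's types. The relevant guarantee is that `ConcurrentHashMap.computeIfAbsent` applies the mapping function at most once per key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-partition sensor registry; a LongAdder stands in for a real sensor.
final class FlushSensorRegistry {
    private final Map<String, LongAdder> sensors = new ConcurrentHashMap<>();
    final AtomicInteger created = new AtomicInteger();

    // Lazily creates the sensor for a partition; safe to call from any thread.
    LongAdder sensorFor(String partitionKey) {
        return sensors.computeIfAbsent(partitionKey, k -> {
            created.incrementAndGet();
            return new LongAdder();
        });
    }

    long total() {
        long sum = 0;
        for (LongAdder s : sensors.values()) {
            sum += s.sum();
        }
        return sum;
    }

    // Calls sensorFor from several threads at once, cycling over four partition keys.
    static FlushSensorRegistry hammer(int threads, int callsPerThread) {
        FlushSensorRegistry registry = new FlushSensorRegistry();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    registry.sensorFor("topic-" + (i % 4)).increment();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return registry;
    }

    public static void main(String[] args) {
        FlushSensorRegistry r = hammer(8, 1000);
        System.out.println(r.created.get() + " sensors, " + r.total() + " increments");
    }
}
```

With a plain `HashMap` in place of the `ConcurrentHashMap`, the same workload has no such guarantee and may lose entries or corrupt the table.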
* Fix thread leak: close Metrics registry in DucklakeMetrics.close()
new Metrics() starts an internal MetricsReporter thread and creates JMX
registrations. Without closing it, each task rebalance leaks a thread
and stale JMX beans, leading to OOM over hours in busy clusters.
* Only record evolveSchema metric when actual DDL executes
Previously the timer wrapped the entire method including the PRAGMA
metadata check that runs on every batch. Now it only fires when
columns are actually added, making the metric meaningful.
* Standardize all counter metric names with -total suffix
OpenMetrics/Prometheus convention: counters end with _total. Without
it, metrics like ducklake_batch_count look like gauges after JMX
Exporter converts hyphens to underscores.
* Use DucklakeMetricsInterface instead of concrete class in Writer/Factory
Allows injecting mock or no-op metrics implementations for testing.
* Add NoopDucklakeMetrics to eliminate null-check duplication
Null object pattern replaces all if-metrics-not-null branches with
unconditional calls. Reduces code duplication at every instrumentation
site and prevents forgetting a null check when adding new metrics.
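The null object pattern mentioned above can be sketched as follows. The interface and class names here (`MetricsSink`, `NoopMetricsSink`, `CountingMetricsSink`, `BatchWriter`) are hypothetical; the two method names are borrowed from this commit message, but their bodies and the surrounding wiring are assumptions.

```java
// Entry point first so the file can be launched directly with `java`.
final class NullObjectDemo {
    public static void main(String[] args) {
        CountingMetricsSink sink = new CountingMetricsSink();
        new BatchWriter(sink).flush(500);
        System.out.println(sink.records + " records in " + sink.batches + " batch(es)");

        // No metrics configured: the writer still calls the sink unconditionally.
        new BatchWriter(null).flush(500);
    }
}

// Hypothetical metrics interface; the real one may expose more methods.
interface MetricsSink {
    void recordBatchProcessed(long recordCount);
    void recordFlushSkip();
}

// Null object: every method is a deliberate no-op.
final class NoopMetricsSink implements MetricsSink {
    @Override public void recordBatchProcessed(long recordCount) { }
    @Override public void recordFlushSkip() { }
}

// Simple recording implementation, e.g. for tests.
final class CountingMetricsSink implements MetricsSink {
    long records;
    long batches;
    long skips;

    @Override public void recordBatchProcessed(long recordCount) {
        records += recordCount;
        batches++;
    }

    @Override public void recordFlushSkip() {
        skips++;
    }
}

final class BatchWriter {
    private final MetricsSink metrics;

    BatchWriter(MetricsSink metrics) {
        // Normalize once here so no instrumentation site needs a null check.
        this.metrics = metrics != null ? metrics : new NoopMetricsSink();
    }

    void flush(long recordCount) {
        metrics.recordBatchProcessed(recordCount); // unconditional call
    }
}
```

The single null check lives in the constructor; every later call site stays a one-line, unconditional call.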
* Change recordBatchProcessed to accept long instead of int
Avoids silent overflow on the long-to-int cast, even though current
defaults make it unlikely to hit.
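For illustration, this is what the silent overflow looks like in plain Java. `NarrowingDemo` is a standalone sketch and not code from the connector.

```java
final class NarrowingDemo {
    // A narrowing cast keeps only the low 32 bits and never throws.
    static int narrow(long recordCount) {
        return (int) recordCount;
    }

    public static void main(String[] args) {
        long big = Integer.MAX_VALUE + 1L;   // 2147483648
        System.out.println(narrow(big));     // prints -2147483648

        try {
            Math.toIntExact(big);            // checked alternative: fails loudly
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

Accepting a `long` end to end removes the cast altogether, so neither the wrap-around nor the exception path can occur.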
* Remove reference to non-existent Grafana dashboard JSON
* Switch DucklakeMetrics from System.Logger to SLF4J
System.Logger uses JDK Platform Logging which may not bridge to the
same backend as SLF4J in Kafka Connect. Log output from DucklakeMetrics
(including unknown operation type warnings) could be lost or go to a
different handler.
* Add tests for new metrics: consolidation, spill, mismatch, DLQ, flush skip
Covers all metrics added in this branch plus concurrent flush sensor
creation and cleanup after close.
* Fix Metrics registry ownership: close at call site, not in DucklakeMetrics
DucklakeMetrics should not close a registry it doesn't own: if a
shared registry were ever passed in, close() would kill it for all
consumers. Move the registry lifecycle to DucklakeSinkTask.stop(), in
the class that created it. The task also stores the registry reference
so it is closed even if start() fails partway through.
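The ownership rule can be sketched with stand-in classes. `SketchRegistry`, `SketchMetrics` and `SketchTask` are hypothetical and only mirror the roles described above; the `failAfterRegistry` flag is an assumption used to simulate a partial `start()`.

```java
final class OwnershipDemo {
    public static void main(String[] args) {
        SketchTask task = new SketchTask();
        try {
            task.start(true);                 // simulate start() failing partway
        } catch (IllegalStateException expected) {
            // the registry reference was stored before the failure
        }
        task.stop();
        System.out.println("registry closed: " + task.registry.closed);
    }
}

// Stand-in for a metrics registry that holds threads and JMX registrations.
final class SketchRegistry implements AutoCloseable {
    boolean closed;
    @Override public void close() { closed = true; }
}

// Borrows the registry: closing the facade must not close the registry.
final class SketchMetrics implements AutoCloseable {
    private final SketchRegistry registry;   // borrowed, not owned
    boolean closed;

    SketchMetrics(SketchRegistry registry) { this.registry = registry; }

    @Override public void close() {
        closed = true;                        // releases only its own state
    }
}

// Creates the registry, so it is the one that closes it.
final class SketchTask {
    SketchRegistry registry;
    SketchMetrics metrics;

    void start(boolean failAfterRegistry) {
        registry = new SketchRegistry();      // store first, so stop() can always close it
        if (failAfterRegistry) {
            throw new IllegalStateException("simulated failure during start()");
        }
        metrics = new SketchMetrics(registry);
    }

    void stop() {
        try {
            if (metrics != null) {
                metrics.close();
            }
        } finally {
            if (registry != null) {
                registry.close();
            }
        }
    }
}
```

Because the facade never closes what it was handed, two facades could share one registry without either of them being able to shut it down for the other.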
* Remove duplicate batch-records-total metric
records-processed-total and batch-records-total recorded identical
values. Keep records-processed-total + batch-count-total; derive
average batch size via rate(records) / rate(batch_count).
* Fix schema mismatch metric: record 1 per event, not output batch count
The metric now counts flushes that had schema mismatches, rather than
recording the number of output batches. Simpler, matches the docs,
and works correctly as a counter with rate().
* Clean up nits: dead null checks, unmodifiable maps, sensor naming, comment
- Remove dead null checks on final fields in close()
- Make operationTimingSensors/operationCountSensors unmodifiable after
construction to prevent accidental mutation from other threads
- Normalize internal sensor names (remove inconsistent -sensor suffix)
- Add comment explaining load-bearing NoopDucklakeMetrics initializer
* Include type upgrades in evolveSchema metric timing
performTypeUpgrade() executes ALTER COLUMN DDL but was called inside
the field loop before the timer started. If only type upgrades occurred
(no new columns, _inserted_at exists), the method returned early
without recording the schema operation. Defer upgrades to execute
inside the timed block alongside column additions.
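One way to picture the deferred-upgrade change: collect all pending DDL during the metadata check, return early if there is none, and only then start the timer. `SchemaEvolutionSketch`, its fields and the DDL strings are placeholders, not the real DucklakeTableManager logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the metric is recorded only when DDL actually runs.
final class SchemaEvolutionSketch {
    long timedOperations;      // stand-in for a schema-operation count
    long durationTotalNanos;   // stand-in for a schema-operation duration total
    final List<String> executed = new ArrayList<>();

    void evolve(List<String> missingColumns, List<String> typeUpgrades) {
        // Metadata check phase: collect work, execute nothing, record nothing.
        List<String> pendingDdl = new ArrayList<>();
        for (String column : missingColumns) {
            pendingDdl.add("ADD COLUMN " + column);      // placeholder DDL text
        }
        for (String upgrade : typeUpgrades) {
            pendingDdl.add("ALTER COLUMN " + upgrade);   // deferred, not run in the loop
        }
        if (pendingDdl.isEmpty()) {
            return;   // nothing to do, so the metric stays untouched
        }

        // Timed block: additions and type upgrades execute together.
        long start = System.nanoTime();
        try {
            executed.addAll(pendingDdl);   // stand-in for executing the statements
        } finally {
            durationTotalNanos += System.nanoTime() - start;
            timedOperations++;
        }
    }

    public static void main(String[] args) {
        SchemaEvolutionSketch s = new SchemaEvolutionSketch();
        s.evolve(List.of(), List.of());                  // plain metadata check
        s.evolve(List.of(), List.of("amount BIGINT"));   // upgrade only
        System.out.println(s.timedOperations);           // prints 1
    }
}
```

An upgrade-only call is counted here because the early return happens after the upgrades have been collected, not before.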
* Drop unused partition parameter from recordFlushSkip
The partition key was computed at every call site but ignored by the
implementation. Overall skip rate is sufficient for alerting; correlate
with per-partition flush duration to identify the hot partition.
* Move metrics close to finally block in stop() to prevent leak on error
connectionFactory.close() can throw, which would skip metrics cleanup
in the same try block. Moving to finally ensures the Metrics registry
thread and JMX registrations are always cleaned up.
1 parent e61e884 commit be54ab6
11 files changed, +828 -620 lines changed
File tree
- src
  - main/java/com/inyo/ducklake
    - connect
    - ingestor
  - test/java/com/inyo/ducklake/connect
Large diffs are not rendered by default.
Lines changed: 232 additions & 251 deletions
Large diffs are not rendered by default.
Lines changed: 21 additions & 2 deletions
Lines changed: 44 additions & 6 deletions
Lines changed: 3 additions & 3 deletions
Lines changed: 92 additions & 0 deletions
Lines changed: 10 additions & 0 deletions