Commit 3eedc1f

Update signals/panels descriptions
1 parent 1a97203 commit 3eedc1f

8 files changed: +154 -40 lines changed

kafka-observ-lib/signals/broker.libsonnet

Lines changed: 12 additions & 3 deletions
@@ -15,7 +15,10 @@ function(this)
     signals: {
       brokerMessagesInPerSec: {
         name: 'Broker messages in',
-        description: 'Broker messages in.',
+        description: |||
+          Rate of incoming messages published to this broker across all topics.
+          Tracks producer throughput and write workload.
+        |||,
         type: 'counter',
         unit: 'mps',
         sources: {
@@ -32,7 +35,10 @@ function(this)
       },
       brokerBytesInPerSec: {
         name: 'Broker bytes in',
-        description: 'Broker bytes in rate.',
+        description: |||
+          Rate of incoming data in bytes published to this broker from producers.
+          Measures network and storage write load.
+        |||,
         type: 'counter',
         unit: 'Bps',
         sources: {
@@ -49,7 +55,10 @@ function(this)
       },
       brokerBytesOutPerSec: {
         name: 'Broker bytes out',
-        description: 'Broker bytes out rate.',
+        description: |||
+          Rate of outgoing data in bytes sent from this broker to consumers and followers.
+          Measures network read load and consumer throughput.
+        |||,
         type: 'counter',
         unit: 'Bps',
         sources: {
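
Note on the string change in this file: the one-line quoted descriptions become Jsonnet `|||` text blocks. A text block strips the indentation shared by its lines and keeps the final newline, so the resulting string ends with `\n`, unlike the quoted originals. A minimal sketch of that behaviour follows; the local names `quoted` and `block` are illustrative and not part of the library.

```jsonnet
// Illustrative only: compares a quoted string with a ||| text block.
local quoted = 'Broker messages in.';
local block = |||
  Rate of incoming messages published to this broker across all topics.
  Tracks producer throughput and write workload.
|||;

{
  // The quoted form has no trailing newline; the text block keeps one.
  quotedEndsWithNewline: std.endsWith(quoted, '\n'),
  blockEndsWithNewline: std.endsWith(block, '\n'),
  // Trimming the final newline and splitting yields the two description lines.
  blockLines: std.split(std.rstripChars(block, '\n'), '\n'),
}
```

Whether the code that renders these descriptions trims the trailing newline is not visible in this commit.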

kafka-observ-lib/signals/brokerReplicaManager.libsonnet

Lines changed: 34 additions & 8 deletions
@@ -80,7 +80,10 @@ function(this)
       onlinePartitions: {
         name: 'Online partitions',
         description: |||
-          Online partitions.
+          Number of partitions that are currently online and available on this broker. This includes
+          partitions where this broker is either the leader or a follower replica. The total count
+          reflects the broker's share of the topic partitions across the cluster. A sudden drop in
+          online partitions may indicate broker issues, partition reassignments, or cluster rebalancing.
         |||,
         type: 'gauge',
         unit: 'short',
@@ -103,7 +106,11 @@ function(this)
       offlinePartitions: {
         name: 'Offline partitions',
         description: |||
-          Number of partitions that dont have an active leader and are hence not writable or readable.
+          Number of partitions that don't have an active leader and are hence not writable or readable.
+          Offline partitions indicate a critical availability issue as producers cannot write to these
+          partitions and consumers cannot read from them. This typically occurs when all replicas for
+          a partition are down or when there are not enough in-sync replicas to elect a new leader.
+          Any non-zero value requires immediate investigation and remediation to restore service availability.
         |||,
         type: 'gauge',
         unit: 'short',
@@ -125,7 +132,11 @@ function(this)
       underReplicatedPartitions: {
         name: 'Under replicated partitions',
         description: |||
-          Number of under replicated partitions (| ISR | < | all replicas |).
+          Number of partitions that have fewer in-sync replicas (ISR) than the configured replication factor.
+          Under-replicated partitions indicate potential data availability issues, as there are fewer copies
+          of the data than desired. This could be caused by broker failures, network issues, or brokers
+          falling behind in replication. A high number of under-replicated partitions poses a risk to
+          data durability and availability, as the loss of additional brokers could result in data loss.
         |||,
         type: 'gauge',
         unit: 'short',
@@ -145,7 +156,11 @@ function(this)
       underMinISRPartitions: {
         name: 'Under min ISR partitions',
         description: |||
-          Under min ISR(In-Sync replicas) partitions.
+          Number of partitions that have fewer in-sync replicas (ISR) than the configured minimum ISR threshold.
+          When the number of ISRs for a partition falls below the min.insync.replicas setting, the partition
+          becomes unavailable for writes (if acks=all is configured), which helps prevent data loss but impacts
+          availability. This metric indicates potential issues with broker availability, network connectivity,
+          or replication lag that need immediate attention to restore write availability.
         |||,
         type: 'gauge',
         unit: 'short',
@@ -165,7 +180,12 @@ function(this)
       uncleanLeaderElection: {
         name: 'Unclean leader election',
         description: |||
-          Unclean leader election rate.
+          Rate of unclean leader elections occurring in the cluster. An unclean leader election happens
+          when a partition leader fails and a replica that was not fully in-sync (not in the ISR) is
+          elected as the new leader. This results in potential data loss as the new leader may be missing
+          messages that were committed to the previous leader. Unclean elections occur when unclean.leader.election.enable
+          is set to true and there are no in-sync replicas available. Any occurrence of unclean elections
+          indicates a serious problem with cluster availability and replication health that risks data integrity.
         |||,
         type: 'raw',
         unit: 'short',
@@ -182,10 +202,16 @@ function(this)
           },
         },
       },
-      preferredReplicaInbalance: {
-        name: 'Preferred replica inbalance',
+      preferredReplicaImbalance: {
+        name: 'Preferred replica imbalance',
         description: |||
-          The count of topic partitions for which the leader is not the preferred leader.
+          The count of topic partitions for which the leader is not the preferred leader. In Kafka,
+          each partition has a preferred leader replica (typically the first replica in the replica list).
+          When leadership is not on the preferred replica, the cluster may experience uneven load distribution
+          across brokers, leading to performance imbalances. This can occur after broker failures and restarts,
+          or during cluster maintenance. Running the preferred replica election can help rebalance leadership
+          and optimize cluster performance. A consistently high imbalance may indicate issues with automatic
+          leader rebalancing or the need for manual intervention.
         |||,
         type: 'gauge',
         unit: 'short',
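
The conditions described in these signals can be written as simple predicates over one partition's state. The sketch below is only an illustration of the definitions in the descriptions: `partitionHealth` and its parameters are hypothetical names, not part of kafka-observ-lib, and it assumes the preferred leader is the first replica in the list.

```jsonnet
// Hypothetical helper, not part of kafka-observ-lib: classifies one partition
// from its replica list, ISR list, min.insync.replicas value and current leader.
local partitionHealth(replicas, isr, minIsr, leader) = {
  // |ISR| < |all replicas|, as stated in the previous description.
  underReplicated: std.length(isr) < std.length(replicas),
  // Fewer in-sync replicas than min.insync.replicas.
  underMinIsr: std.length(isr) < minIsr,
  // Assumption: the first replica in the list is the preferred leader.
  leaderNotPreferred: leader != replicas[0],
};

{
  healthy: partitionHealth([1, 2, 3], [1, 2, 3], 2, 1),
  degraded: partitionHealth([1, 2, 3], [1, 2], 2, 2),
  belowMinIsr: partitionHealth([1, 2, 3], [3], 2, 3),
}
```

In this sketch `degraded` is under-replicated but still at or above the minimum ISR, while `belowMinIsr` would reject writes from producers using acks=all.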

kafka-observ-lib/signals/cluster.libsonnet

Lines changed: 23 additions & 7 deletions
@@ -17,7 +17,9 @@ function(this)
       activeControllers: {
         name: 'Active kafka controllers',
         description: |||
-          Active kafka controllers count.
+          Number of active controllers in the cluster. Should always be exactly 1.
+          Zero indicates no controller elected, preventing cluster operations.
+          More than one indicates split-brain requiring immediate attention.
         |||,
         type: 'gauge',
         unit: 'short',
@@ -41,7 +43,10 @@ function(this)
       role: {
         name: 'Current role',
         description: |||
-          0 - follower, 1 - controller.
+          Broker's current controller role: 0 indicates follower, 1 indicates active controller.
+          Only one broker should have value 1 at any time.
+          Used to identify which broker is managing cluster metadata and leadership.
+          Current controller role: 0 - follower, 1 - controller.
         |||,
         type: 'gauge',
         unit: 'short',
@@ -119,7 +124,9 @@ function(this)
       kraftBrokerRole: {
         name: 'Current role (kraft)',
         description: |||
-          Any value - broker in kraft.
+          Broker state in KRaft mode (Kafka without ZooKeeper).
+          Any value indicates the broker is running in KRaft mode.
+          Used to identify KRaft-enabled brokers in the cluster.
         |||,
         type: 'gauge',
         unit: 'short',
@@ -155,7 +162,7 @@ function(this)
       brokersCount: {
         name: 'Brokers count',
         description: |||
-          Active brokers count.
+          Total number of active brokers currently registered and reporting in the cluster.
         |||,
         type: 'gauge',
         unit: 'short',
@@ -178,7 +185,10 @@ function(this)

       clusterMessagesInPerSec: {
         name: 'Cluster messages in',
-        description: 'Cluster messages in.',
+        description: |||
+          Aggregate rate of incoming messages across all brokers and topics in the cluster.
+          Represents total producer throughput and write workload.
+        |||,
         type: 'counter',
         unit: 'mps',
         sources: {
@@ -195,7 +205,10 @@ function(this)
       },
       clusterBytesInPerSec: {
         name: 'Cluster bytes in',
-        description: 'Cluster bytes in rate.',
+        description: |||
+          Aggregate rate of incoming data in bytes across all brokers from producers.
+          Measures total network ingress and storage write load.
+        |||,
         type: 'counter',
         unit: 'Bps',
         sources: {
@@ -212,7 +225,10 @@ function(this)
       },
       clusterBytesOutPerSec: {
         name: 'Cluster bytes out',
-        description: 'Cluster bytes out rate.',
+        description: |||
+          Aggregate rate of outgoing data in bytes across all brokers to consumers and followers.
+          Measures total network egress load and consumer throughput.
+        |||,
         type: 'counter',
         unit: 'Bps',
         sources: {

kafka-observ-lib/signals/consumerGroup.libsonnet

Lines changed: 15 additions & 3 deletions
@@ -16,7 +16,11 @@ function(this)
     signals: {
       consumerGroupLag: {
         name: 'Consumer group lag',
-        description: 'Current approximate lag of a ConsumerGroup at Topic/Partition.',
+        description: |||
+          Number of messages a consumer group is behind the latest available offset for a topic partition.
+          High or growing lag indicates consumers can't keep up with producer throughput.
+          Critical metric for consumer health and real-time processing requirements.
+        |||,
         type: 'gauge',
         unit: 'short',
         aggFunction: 'sum',
@@ -35,7 +39,11 @@ function(this)

       consumerGroupLagTime: {
         name: 'Consumer group lag in ms',
-        description: 'Current approximate lag of a ConsumerGroup at Topic/Partition.',
+        description: |||
+          Time lag in milliseconds between message production and consumption for a consumer group.
+          Represents real-time delay in message processing.
+          More intuitive than message count lag for understanding business impact of delays.
+        |||,
         type: 'gauge',
         unit: 'ms',
         optional: true,
@@ -50,7 +58,11 @@ function(this)

       consumerGroupConsumeRate: {
         name: 'Consumer group consume rate',
-        description: 'Consumer group consume rate.',
+        description: |||
+          Rate at which a consumer group is consuming and committing offsets for a topic.
+          Measures consumer throughput and processing speed.
+          Should match or exceed producer rate to prevent growing lag.
+        |||,
         type: 'counter',
         unit: 'mps',
         sources: {
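
As a rough illustration of the lag definition above, lag per partition can be thought of as the latest offset minus the group's committed offset, summed across partitions in line with `aggFunction: 'sum'`. The sample data and the `lag` helper below are hypothetical; how the exporter actually derives the metric is not shown in this commit.

```jsonnet
// Hypothetical sample data: per-partition end offsets and committed offsets.
local partitions = [
  { partition: 0, endOffset: 1200, committed: 1150 },
  { partition: 1, endOffset: 980, committed: 980 },
  { partition: 2, endOffset: 1500, committed: 1310 },
];

// Lag for one partition, clamped at zero in case a reading is stale.
local lag(p) = std.max(0, p.endOffset - p.committed);

{
  // Lag keyed by partition number.
  perPartition: { [std.toString(p.partition)]: lag(p) for p in partitions },
  // Mirrors aggFunction: 'sum' on the consumerGroupLag signal.
  total: std.foldl(function(acc, p) acc + lag(p), partitions, 0),
}
```

A total that keeps growing between evaluations would correspond to the "consumers can't keep up" case named in the description.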

kafka-observ-lib/signals/conversion.libsonnet

Lines changed: 10 additions & 2 deletions
@@ -15,7 +15,11 @@ function(this)
     signals: {
       producerConversion: {
         name: 'Message conversion (producer)',
-        description: 'The number of messages produced converted to match the log.message.format.version.',
+        description: |||
+          Rate of producer messages requiring format conversion to match broker's log.message.format.version.
+          Conversions add CPU overhead and latency.
+          Non-zero values suggest producer and broker version mismatches requiring alignment.
+        |||,
         type: 'counter',
         unit: 'mps',
         sources: {
@@ -32,7 +36,11 @@ function(this)
       },
       consumerConversion: {
         name: 'Message conversion (consumer)',
-        description: 'The number of messages consumed converted at consumer to match the log.message.format.version.',
+        description: |||
+          Rate of messages requiring format conversion during consumer fetch to match log.message.format.version.
+          Conversions impact broker CPU and consumer latency.
+          Indicates version mismatch between stored messages and consumer expectations.
+        |||,
         type: 'counter',
         unit: 'mps',
         sources: {

kafka-observ-lib/signals/topic.libsonnet

Lines changed: 26 additions & 7 deletions
@@ -16,7 +16,11 @@ function(this)
     signals: {
       topicMessagesPerSec: {
         name: 'Messages in per second',
-        description: 'Messages in per second.',
+        description: |||
+          Rate of messages produced to this topic across all partitions.
+          Indicates topic write activity and producer throughput.
+          Use to identify hot topics and understand data flow patterns.
+        |||,
         type: 'counter',
         unit: 'mps',
         sources: {
@@ -31,7 +35,9 @@ function(this)
       // used in table:
       topicMessagesPerSecByPartition: {
         name: 'Messages in per second',
-        description: 'Messages in per second.',
+        description: |||
+          Rate of messages produced to each partition within this topic.
+        |||,
         type: 'counter',
         unit: 'mps',
         legendCustomTemplate: '{{ topic }}/{{ partition }}',
@@ -47,7 +53,9 @@ function(this)
       // JMX exporter extras
       topicBytesInPerSec: {
         name: 'Topic bytes in',
-        description: 'Topic bytes in rate.',
+        description: |||
+          Rate of incoming data in bytes written to this topic from producers.
+        |||,
         type: 'counter',
         unit: 'Bps',
         sources: {
@@ -67,7 +75,9 @@ function(this)
       },
       topicBytesOutPerSec: {
         name: 'Topic bytes out',
-        description: 'Topic bytes out rate.',
+        description: |||
+          Rate of outgoing data in bytes read from this topic by consumers.
+        |||,
         type: 'counter',
         unit: 'Bps',
         sources: {
@@ -87,7 +97,9 @@ function(this)
       },
       topicLogStartOffset: {
         name: 'Topic start offset',
-        description: 'Topic start offset.',
+        description: |||
+          Earliest available offset for each partition due to retention or deletion.
+        |||,
         type: 'gauge',
         unit: 'none',
         aggFunction: 'max',
@@ -109,7 +121,11 @@ function(this)
       },
       topicLogEndOffset: {
         name: 'Topic end offset',
-        description: 'Topic end offset.',
+        description: |||
+          Latest offset (high water mark) for each partition representing newest available message.
+          Continuously increases as new messages arrive.
+          Difference between end and start offsets indicates total messages available.
+        |||,
         type: 'gauge',
         unit: 'none',
         aggFunction: 'max',
@@ -126,7 +142,10 @@ function(this)
       },
       topicLogSize: {
         name: 'Topic log size',
-        description: 'Size in bytes of the current topic-partition.',
+        description: |||
+          Total size in bytes of data stored on disk for each topic partition.
+          Grows with incoming messages and shrinks with retention cleanup.
+        |||,
         type: 'gauge',
         unit: 'decbytes',
         aggFunction: 'max',

kafka-observ-lib/signals/totalTime.libsonnet

Lines changed: 9 additions & 5 deletions
@@ -16,23 +16,27 @@ function(this)
     signals: {

       local commonRequestQueueDescription = |||
-        A high value can imply there aren't enough IO threads or the CPU is a bottleneck,
-        or the request queue isnt large enough. The request queue size should match the number of connections.
+        High values indicate insufficient IO threads, CPU bottlenecks, or undersized request queue.
+        Queue size should match connection count.
       |||,

       local commonLocalDescription = |||
-        In most cases, a high value can imply slow local storage or the storage is a bottleneck. One should also investigate LogFlushRateAndTimeMs to know how long page flushes are taking, which will also indicate a slow disk. In the case of FetchFollower requests, time spent in LocalTimeMs can be the result of a ZooKeeper write to change the ISR.
+        High values often indicate slow storage or disk bottlenecks.
+        Check LogFlushRateAndTimeMs for disk performance issues.
       |||,

       local commonRemoteDescription = |||
+        For fetch requests, high values may indicate caught-up consumers with no new data (normal if near max wait time).
+        Configure via replica.fetch.wait.max.ms and fetch.max.wait.ms.
       |||,

       local commonResponseQueueDescription = |||
-        A high value can imply there aren't enough network threads or the network cant dequeue responses quickly enough, causing back pressure in the response queue.
+        High values indicate insufficient network threads or slow network dequeue causing backpressure.
       |||,

       local commonResponseDescription = |||
-        A high value can imply the zero-copy from disk to the network is slow, or the network is the bottleneck because the network cant dequeue responses of the TCP socket as quickly as theyre being created. If the network buffer gets full, Kafka will block.
+        High values indicate slow zero-copy operations or network saturation.
+        Network buffer fullness can cause Kafka to block.
       |||,

       fetchQueueTime: {