Kafka: Emit additional consumer metrics. (#17919)
Conversation
Adds more metrics to the ones originally added in apache#14582. These metrics help provide insight into the workings of the Kafka consumer. As a bonus, one of the metrics, "kafka/consumer/recordsLag", has a dimension "partition" which can be used to more easily determine which tasks are reading which partitions. It also provides a view of lag from the task perspective rather than the Overlord perspective, which is useful if the Overlord is ever unable to report metrics properly.
kfaraz left a comment:
Looks good, minor queries.
```java
final KafkaConsumerMetric kafkaConsumerMetric = METRICS.get(metricName.name());

if (kafkaConsumerMetric != null &&
    kafkaConsumerMetric.getDimensions().equals(metricName.tags().keySet())) {
```
Instead of requiring both the sets to be equal, should we just check for a contains relationship?

```java
kafkaConsumerMetric != null &&
metricName.tags().keySet().containsAll(kafkaConsumerMetric.getDimensions())
```
Please also add a comment on why we need to check for the dimensions. The older code had this comment:

```java
// Certain metrics are emitted both as grand totals and broken down by topic; we want to ignore the grand total and
// only look at the per-topic metrics. See https://kafka.apache.org/documentation/#consumer_fetch_monitoring.
```
Oops, yes, that comment is the reason why we need to look for equal sets. I have restored it.
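To make the equality-vs-containment trade-off concrete, here is a minimal standalone sketch of the dimension check discussed above. The real `KafkaConsumerMetric` and Kafka `MetricName` types are not reproduced here; they are stubbed with plain `Set` and `Map` arguments, and the method name `shouldEmit` is hypothetical.

```java
import java.util.Map;
import java.util.Set;

public class DimensionCheckSketch
{
  // Certain metrics are emitted both as grand totals and broken down by topic;
  // requiring the tag sets to be *equal* (rather than using containsAll) skips
  // the grand-total variant and keeps only the per-topic one.
  // See https://kafka.apache.org/documentation/#consumer_fetch_monitoring.
  public static boolean shouldEmit(Set<String> expectedDimensions, Map<String, String> actualTags)
  {
    return expectedDimensions.equals(actualTags.keySet());
  }

  public static void main(String[] args)
  {
    Set<String> dims = Set.of("client-id", "topic");
    // Per-topic variant: tags match the expected dimensions exactly, so it is emitted.
    System.out.println(shouldEmit(dims, Map.of("client-id", "c1", "topic", "t1")));
    // Grand-total variant carries only "client-id", so set equality rejects it.
    System.out.println(shouldEmit(dims, Map.of("client-id", "c1")));
  }
}
```

A `containsAll` check would have accepted any metric whose tags are a superset of the expected dimensions, which is why the restored comment matters: equality is what filters out the grand-total series.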
```java
new KafkaConsumerMetric(
    "outgoing-byte-total",
    "kafka/consumer/outgoingBytes",
    Set.of(CLIENT_ID_TAG, "node-id"),
```
We should also add constants for "topic", "partition" and "node-id".
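The suggestion above would look something like the following. The class and constant names here are hypothetical; only `CLIENT_ID_TAG` appears in the quoted snippet, and the actual names chosen in the patch may differ.

```java
// Hypothetical tag-name constants, mirroring the existing CLIENT_ID_TAG.
public class KafkaConsumerMonitorTags
{
  public static final String CLIENT_ID_TAG = "client-id";
  public static final String TOPIC_TAG = "topic";
  public static final String PARTITION_TAG = "partition";
  public static final String NODE_ID_TAG = "node-id";
}
```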
```java
      .getAndSet(newValueAsLong);
  emitValue = newValueAsLong - priorValue;
} else {
  throw DruidException.defensive("Unexpected metric type[%s]", kafkaConsumerMetric.getMetricType());
```
Maybe we should also log an error and then throw the exception, because I am not sure where this exception would be caught.
It would only happen if there is an enum value for KafkaConsumerMetric.MetricType that isn't handled here. Since KafkaConsumerMetric is only used by this monitor, it seems hard to believe it would ever happen. That is why I chose to use a defensive check.
Tracing through the code, it looks like exceptions here are ultimately caught in ScheduledExecutors.scheduleAtFixedRate and logged at ERROR level.
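For reference, a minimal sketch of the counter-delta logic under discussion: Kafka exposes its `-total` metrics as cumulative counters, so each monitor cycle emits the delta since the previous poll, with `AtomicLong.getAndSet` swapping in the new cumulative value and returning the prior one in a single step. This is a standalone stub, not the actual monitor class; `MetricType` and `emitValue` echo the snippet above, and `IllegalStateException` stands in for `DruidException.defensive`.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CounterDeltaSketch
{
  public enum MetricType { GAUGE, COUNTER }

  private final AtomicLong priorValue = new AtomicLong(0);

  public long emitValue(MetricType type, long newValueAsLong)
  {
    if (type == MetricType.GAUGE) {
      // Gauges (e.g. records-lag) are emitted as-is.
      return newValueAsLong;
    } else if (type == MetricType.COUNTER) {
      // Counters are cumulative: swap in the new total and emit the delta.
      long prior = priorValue.getAndSet(newValueAsLong);
      return newValueAsLong - prior;
    } else {
      // Unreachable unless a new enum value is added without handling it here;
      // in the monitor this would surface via ScheduledExecutors' ERROR logging.
      throw new IllegalStateException("Unexpected metric type: " + type);
    }
  }
}
```

With cumulative inputs of 10 and then 25, successive calls for a COUNTER emit 10 and then 15.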
…r config Adds prometheus emitter metrics for #17919