Skip to content

Conversation

@schmikei
Copy link
Contributor

This PR refactors the Redis-Enterprise mixin to use the signals API to create its dashboards and follow best practices using g.libsonnet and commonlib.

If you need access for testdata we can provide it at keithschmitty.grafana.net

Screenshots with minimal setup (no loadgen so some metrics will not be populated in these screenshots)

Screenshots Overview image image

Nodes
image
image

Database

image image

Logs
image
The way my lab was configured didn't allow for logs but it should work

@schmikei schmikei marked this pull request as ready for review November 10, 2025 16:02
@schmikei schmikei requested a review from a team as a code owner November 10, 2025 16:02
Copy link

@aalhour aalhour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, no linting errors locally, a few typos and an issue with alerts, everything else is minor.

redisEnterpriseNodes:
link.link.new('Redis Enterprise nodes', '/d/' + this.grafana.dashboards['redis-enterprise-node-overview.json'].uid),
redisEnterpriseDatabases:
link.link.new('Redis Enterprise databases', '/d/' + this.grafana.dashboards['redis-enterprise-database-overview.jsonn'].uid),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo? .jsonn --> .json

refresh,
period,
),
'redis-enterprise-database-overview.jsonn':
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo? .jsonn --> .json

clusterTotalRequests: {
name: 'Cluster total requests',
nameShort: 'Total requests',
type: 'raw',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be of type gauge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope at least as currently written I want this query with its aggregation to not be modified:

- counter: is wrapped into `<rangeFunction>(<base>[<interval>])`
- gauge: no transformation
- histogram: wrapped into `histogram_quantile(0.95, <aggFunction>(rate(<base>[<interval>])) by (le,<agg>))`
- info: no transformation
- raw: no transformation. You can write your own complex expression and make sure is kept as is.

Also, regardless of type (with the exception of `raw`), additional wrapper can be added when aggLevel `group` or `instance` is selected:

`<aggFunction> by (<agg>) (<base>)`

I don't want these to do further aggregations since all of the raw expressions are already doing aggregations

clusterTotalMemoryUsed: {
name: 'Cluster total memory used',
nameShort: 'Total memory used',
type: 'raw',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be of type gauge?

clusterTotalConnections: {
name: 'Cluster total connections',
nameShort: 'Total connections',
type: 'raw',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be of type gauge?

databaseReadRequests: {
name: 'Database read requests',
nameShort: 'Read requests',
type: 'raw',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be of type gauge?

databaseCacheHitRatio: {
name: 'Database cache hit ratio',
nameShort: 'Cache hit ratio',
type: 'raw',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be of type gauge?

databaseMemoryUtilization: {
name: 'Database memory utilization',
nameShort: 'Memory utilization',
type: 'raw',
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be of type gauge?

expr: |||
bdb_avg_latency / 1000 > %(alertsDatabaseHighLatencyMs)s
||| % $._config,
bdb_avg_latency{%(filteringSelector)s} / 1000 > %(alertsDatabaseHighLatencyMs)s
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we divide by 1000 that would convert the value to seconds, but the alert threshold is in milliseconds - alertsDatabaseHighLatencyMs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this seems like a bug with the original implementation. Think it should be bdb_avg_latency * 1000 based off the context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! 🚀

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thank you!

+ g.panel.timeSeries.standardOptions.withUnit('percent'),

nodesNodeCPUUtilizationPanel:
commonlib.panels.generic.timeSeries.base.new(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on how this would look versus the commonlib.panels.cpu.timeSeries.base.new? I see that the base is empty in the common lib but it has utilization by core and mode overrides.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

+ g.panel.barGauge.queryOptions.withTargets([
signals.overview.nodeUp.asTarget(),
])
+ g.panel.barGauge.panelOptions.withDescription('Up/down status for each node in the cluster.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I just realised and commented on in the couchDB PR, this is duplicating the descriptions unnecessarily. Should be imported by default I believe, but if not you can just call the signal's description directly, e.g. signals.overview.nodeUp.description

local signals = this.signals,

// Overview panels - Status panels
overviewNodesUpPanel:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's just me, but I find this info a little cluttered (for this, the db up , and shards up panels)

Image

Specifically here, the legends are a little cluttered with cluster - number, plus the value of 1 for each row, which doesn't really match the visualisation being somwhere not 100%

Could we drop the cluster prefix and make the instance name much clearer and consistent across?
Same thing for the absolute value, is it up or not, and if so the visualisation shouldn't be in-between :)

Comment on lines +72 to +74
clusterTotalMemoryUsed: {
name: 'Cluster total memory used',
nameShort: 'Total memory used',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is a total value panel, we need to remember to stack the values as this could grow very hard to read with multiple DBs

Additionally, and this is just my opinion at this point, but given that the cardinality is per-database, it may be worth doing this one as a 12-width right-aligned table legend with min/max/avg etc values?

Image

Comment on lines +86 to +87
clusterTotalConnections: {
name: 'Cluster total connections',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, both on stacking and table legend concerns

+ g.panel.timeSeries.panelOptions.withDescription('Ratio of database cache key hits against hits and misses.')
+ g.panel.timeSeries.standardOptions.withUnit('percent'),

overviewClusterEvictionsVsExpiredObjectsPanel:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on potential illegible legend with many databases defined

sources: {
prometheus: {
expr: 'node_avg_latency{%(queriesSelector)s}',
legendCustomTemplate: '{{ redis_cluster }} - node: {{ node }}',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another inconsistent handling of instance / node here with how the legend is written (as compared to the panels at the top row)

panels.overviewClusterEvictionsVsExpiredObjectsPanel { gridPos+: { w: 8, h: 6 } },
]),

overviewNodesKPIsRow:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I mentioned on the couchDB node overview there's the opportunity to do a table-view of nodes here, which I think makes a lot more sense for Redis EE with only having four panels.

Specifically thinking similar to how the topic overview works in the new kafka observ lib

Image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually thinking about it some more, for the overview page I think it's OK to sum it, and we could perhaps do a table on the nodes dashboard?

+ g.panel.timeSeries.panelOptions.withDescription('Memory utilization % for each node in the cluster.')
+ g.panel.timeSeries.standardOptions.withUnit('percent'),

overviewNodeCPUUtilizationPanel:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one shows legend in yet another way (without cluster this time), which I think looks a fair bit cleaner personally, but we'd want to be a bit more consistent, at least in the KPIs row.

]),

nodesMetricsRow:
g.panel.row.new('Node metrics')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github is throwing a fit whenever I try to do multi-line select right now, but could you please give all the panels in this row a consistency pass on their legends (these ones just have the raw node label)

]),

databasesCRDBRow:
g.panel.row.new('Active-Active')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this name tell us?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants