Add instance filter to TiCDC Grafana dashboards for better metric isolation #3929
Summary of Changes
Hello @tenfyzhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly improves the monitoring experience for TiCDC by integrating an instance-level filter into its Grafana dashboards. This change addresses previous challenges with data accuracy and performance when operating multiple TiCDC instances, allowing users to view metrics specific to a chosen instance, thereby streamlining troubleshooting and operational oversight.
Code Review
This pull request is a great improvement for monitoring multi-instance TiCDC setups. It consistently adds an instance filter to Prometheus queries across three Grafana dashboard files, which will enhance metric isolation and dashboard performance. The changes are correct and align with the PR's goal. I've added a couple of minor suggestions to improve the readability of some query strings by removing unnecessary newline characters.
11ca7e4 to 53439e1
- Add `instance=~"$ticdc_instance"` selector to metrics queries for better instance-level filtering
- Update maintainer checkpoint/resolved TS lag metrics to include instance filtering
- Add instance filter to KV client pull event count metric
- Include instance selector in sink batch row count sum aggregation
- Update system metrics (CPU, goroutines, memory, file descriptors) to filter by TiCDC instance
- Add instance filter to owner ownership counter metrics
- Include instance selector in subscription client region count metrics
- Update resolve lock task drop count rate calculation with instance filter
- Add instance filter to sink MQ checkpoint TS message count metric
- Include instance selector in redo total rows count rate calculation

Signed-off-by: tenfyzhong <[email protected]>
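Each item in the commit message above amounts to appending one label matcher to an existing selector. A minimal Python sketch of that rewrite follows. `add_instance_matcher` is a hypothetical helper, not part of this PR or of any TiCDC tooling, and it assumes that label values contain no `}` character.

```python
import re

INSTANCE_MATCHER = 'instance=~"$ticdc_instance"'


def add_instance_matcher(expr: str, metric: str) -> str:
    """Append the instance matcher to every `metric{...}` selector in expr.

    Hypothetical helper for illustration only. Selectors that already
    filter on the `instance` label are returned unchanged.
    """
    # The lookbehind stops "foo_<metric>" from matching when we look for "<metric>".
    pattern = re.compile(r"(?<![A-Za-z0-9_:])" + re.escape(metric) + r"\{([^}]*)\}")

    def rewrite(match: re.Match) -> str:
        labels = match.group(1).strip()
        # \b does not fire inside "ticdc_instance", so only a real label name matches.
        if re.search(r"\binstance\s*(=~|!~|!=|=)", labels):
            return match.group(0)
        if not labels:
            return f"{metric}{{{INSTANCE_MATCHER}}}"
        return f"{metric}{{{labels.rstrip(',').rstrip()}, {INSTANCE_MATCHER}}}"

    return pattern.sub(rewrite, expr)


before = 'go_goroutines{k8s_cluster="$k8s_cluster", tidb_cluster="$tidb_cluster", job=~".*ticdc.*"}'
after = add_instance_matcher(before, "go_goroutines")
print(after)
```

Running the helper a second time on its own output leaves the expression unchanged, so it can be applied repeatedly without stacking duplicate matchers.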
53439e1 to 0f26332
/gemini Which panels have had the instance filter added by this PR?
Hello @tenfyzhong, I'd be happy to clarify which panels have had the instance filter added by this PR. The
/gemini Which panels have had the instance filter added by this PR? Please respond with the titles of the panels.
Hello @tenfyzhong, I understand you're looking for the panel titles. Based on the metric names identified in the changes, and without direct access to the
/check-issue-triage-complete
/gemini review
Code Review
This pull request correctly adds the instance=~\"$ticdc_instance\" filter to many Prometheus queries across the Grafana dashboards, which will improve metric isolation and performance as intended. The changes are consistent and well-executed.
However, I've noticed a few minor inconsistencies and potential omissions:
- Several queries have extra spaces or are missing spaces, which could be fixed for better consistency. I've added specific suggestions for these.
- I've also identified a few queries that appear to be missing the new `instance` filter. Since I cannot comment on lines not in the diff, I'm listing some examples here for your consideration. It would be beneficial to perform a quick check to ensure all relevant queries are updated for complete metric isolation.
  - In `metrics/grafana/ticdc_new_arch.json`: `ticdc_owner_checkpoint_ts_lag` (line 199), `ticdc_dispatchermanager_checkpoint_ts_lag` (line 1558), and `ticdc_dispatchermanager_resolved_ts_lag` (line 1656).
  - Similar omissions might exist in the other dashboard files.
Overall, this is a great improvement. Addressing these points will make the dashboards even more robust and consistent.
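The quick check suggested above could be scripted. The following Python sketch walks a dashboard's JSON and lists every `expr` string that has no `instance` matcher. The helper names are hypothetical. The check is deliberately coarse: an expression with several selectors passes as soon as one of them filters on `instance`, and some panels may aggregate across instances on purpose, so the output is a list of candidates to review rather than a list of bugs.

```python
import json
import re
from typing import Iterator, List, Tuple

# Matches the label name `instance` followed by any PromQL match operator,
# but not the tail of a longer identifier such as `$ticdc_instance`.
INSTANCE_RE = re.compile(r"(?<![A-Za-z0-9_])instance\s*(=~|!~|!=|=)")


def iter_exprs(node, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (json_path, expr) for every string-valued "expr" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "expr" and isinstance(value, str):
                yield child, value
            else:
                yield from iter_exprs(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_exprs(value, f"{path}[{index}]")


def exprs_missing_instance(dashboard: dict) -> List[Tuple[str, str]]:
    """Return the (json_path, expr) pairs that never mention an instance matcher."""
    return [(p, e) for p, e in iter_exprs(dashboard) if not INSTANCE_RE.search(e)]


# Tiny made-up dashboard fragment, for illustration only.
sample = json.loads("""
{"panels": [{"targets": [
  {"expr": "max(ticdc_owner_checkpoint_ts_lag{changefeed=~\\"$changefeed\\"})"},
  {"expr": "go_goroutines{job=~\\".*ticdc.*\\", instance=~\\"$ticdc_instance\\"}"}
]}]}
""")
missing = exprs_missing_instance(sample)
for path, expr in missing:
    print(path, expr)
```

To audit a real file, replace `sample` with the result of `json.load` on one of the dashboard files, for example `metrics/grafana/ticdc_new_arch.json`.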
  {
    "exemplar": true,
-   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\",  changefeed=~\"$changefeed\"}) by (namespace, changefeed)",
+   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\",  changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (namespace, changefeed)",
There's an extra space in the PromQL query before changefeed=~\"$changefeed\". While Prometheus is tolerant of this, removing it would improve consistency with other queries in this file. A similar issue exists on line 6497.
Suggested change:
-   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\",  changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (namespace, changefeed)",
+   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (namespace, changefeed)",
  {
    "exemplar": true,
-   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\"}) by (namespace,changefeed)",
+   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (namespace,changefeed)",
For consistency with other queries, consider adding a space after the comma in the by clause: by (namespace, changefeed). A similar issue exists on line 6600.
Suggested change:
-   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (namespace,changefeed)",
+   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", namespace=~\"$namespace\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (namespace, changefeed)",
  "targets": [
    {
-     "expr": " go_goroutines{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\"}",
+     "expr": " go_goroutines{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}",
There's a leading space in the PromQL query before go_goroutines. While Prometheus is tolerant of this, removing it would improve consistency.
Suggested change:
-     "expr": " go_goroutines{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}",
+     "expr": "go_goroutines{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}",
  {
    "exemplar": true,
-   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\",  changefeed=~\"$changefeed\"}) by (keyspace_name, changefeed)",
+   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\",  changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name, changefeed)",
There's an extra space in the PromQL query before changefeed=~\"$changefeed\". While Prometheus is tolerant of this, removing it would improve consistency with other queries in this file. A similar issue exists on line 6497.
Suggested change:
-   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\",  changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name, changefeed)",
+   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name, changefeed)",
  {
    "exemplar": true,
-   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"}) by (keyspace_name,changefeed)",
+   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name,changefeed)",
For consistency with other queries, consider adding a space after the comma in the by clause: by (keyspace_name, changefeed). A similar issue exists on line 6600.
Suggested change:
-   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name,changefeed)",
+   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name, changefeed)",
  "targets": [
    {
-     "expr": " go_goroutines{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\"}",
+     "expr": " go_goroutines{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}",
There's a leading space in the PromQL query before go_goroutines. While Prometheus is tolerant of this, removing it would improve consistency.
Suggested change:
-     "expr": " go_goroutines{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}",
+     "expr": "go_goroutines{k8s_cluster=\"$k8s_cluster\", sharedpool_id=\"$tidb_cluster\", job=~\".*ticdc.*\", instance=~\"$ticdc_instance\"}",
  {
    "exemplar": true,
-   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\",  changefeed=~\"$changefeed\"}) by (keyspace_name, changefeed)",
+   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\",  changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name, changefeed)",
There's an extra space in the PromQL query before changefeed=~\"$changefeed\". While Prometheus is tolerant of this, removing it would improve consistency with other queries in this file. A similar issue exists on line 2950.
Suggested change:
-   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\",  changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name, changefeed)",
+   "expr": "max(ticdc_maintainer_checkpoint_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name, changefeed)",
  {
    "exemplar": true,
-   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\"}) by (keyspace_name,changefeed)",
+   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name,changefeed)",
For consistency with other queries, consider adding a space after the comma in the by clause: by (keyspace_name, changefeed). A similar issue exists on line 3053.
Suggested change:
-   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name,changefeed)",
+   "expr": "max(ticdc_maintainer_resolved_ts_lag{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", keyspace_name=~\"$keyspace_name\", changefeed=~\"$changefeed\", instance=~\"$ticdc_instance\"}) by (keyspace_name, changefeed)",
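The review suggestions above come down to three whitespace fixes: a leading space before the metric name, a doubled space between matchers, and a missing space after the comma in a `by` clause. A minimal Python sketch of a normalizer that applies all three follows. `normalize_expr` is a hypothetical helper, not something from this PR, and it assumes that no label value contains a run of spaces.

```python
import re


def normalize_expr(expr: str) -> str:
    """Apply the three whitespace fixes suggested in the review comments."""
    expr = expr.strip()                  # drop leading and trailing spaces
    expr = re.sub(r" {2,}", " ", expr)   # collapse runs of spaces to one

    def fix_by(match: re.Match) -> str:
        # Re-join the grouping labels with exactly one space after each comma.
        labels = [label.strip() for label in match.group(1).split(",")]
        return "by (" + ", ".join(labels) + ")"

    return re.sub(r"\bby\s*\(([^)]*)\)", fix_by, expr)


print(normalize_expr(' go_goroutines{job=~".*ticdc.*"}'))
print(normalize_expr('max(m{a="x",  b="y"}) by (namespace,changefeed)'))
```

Because the function only touches whitespace, the normalized query selects the same series as the original.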
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: 3AceShowHand, flowbehappy
- Update Prometheus query in ticdc_new_arch.json to include instance filter
- Update Prometheus query in ticdc_new_arch_next_gen.json to include instance filter
- Allows filtering TiCDC process uptime metrics by specific instance in Grafana dashboards

Signed-off-by: tenfyzhong <[email protected]>
/retest
14 similar comments
/override pull-cdc-kafka-integration-light
@tenfyzhong: Overrode contexts on behalf of tenfyzhong: pull-cdc-kafka-integration-light
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/override pull-cdc-kafka-integration-heavy
@tenfyzhong: Overrode contexts on behalf of tenfyzhong: pull-cdc-kafka-integration-heavy
/override pull-cdc-kafka-integration-light
@tenfyzhong: Overrode contexts on behalf of tenfyzhong: pull-cdc-kafka-integration-light
/tide
What problem does this PR solve?
Issue Number: close #3925
What is changed and how it works?
This PR enhances the TiCDC Grafana dashboards by adding the `instance=~"$ticdc_instance"` filter to Prometheus queries across multiple panels. Previously, many dashboard panels were missing the `instance` label selector, which could lead to inaccurate data visualization and performance issues when monitoring TiCDC clusters with multiple instances. The `$ticdc_instance` variable allows users to filter metrics for specific TiCDC instances, improving dashboard responsiveness and data accuracy.
The changes are applied consistently across three dashboard files:
- ticdc_new_arch.json
- ticdc_new_arch_next_gen.json
- ticdc_new_arch_with_keyspace_name.json
Key improvements include:
- `instance` filter across all relevant dashboard panels
Check List
Tests
Manual Test Steps:
- `instance` filter work correctly when selecting different TiCDC instances
- `$ticdc_instance` variable properly filters metrics for the selected instance
Questions
Will it cause performance regression or break compatibility?
No, this change does not cause performance regression or break compatibility. The `$ticdc_instance` variable defaults to a wildcard (`.*`) if not specifically set by the user, maintaining backward compatibility. The added filters actually improve performance by reducing the amount of data Prometheus needs to process and return.
Do you need to update user documentation, design documentation or monitoring documentation?
No documentation updates are required. This is an internal improvement to the monitoring dashboards that enhances existing functionality without changing user-facing behavior or APIs.
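The compatibility answer above relies on how the `=~` matcher treats the wildcard default. Prometheus anchors regex matchers at both ends, so `.*` matches every `instance` value and the filtered query returns the same series as the unfiltered one. The Python sketch below mimics that behavior with `re.fullmatch`. The instance addresses are made up, and Python's `re` is not the RE2 engine Prometheus uses, although the two agree on patterns this simple.

```python
import re


def select_instances(instances, ticdc_instance):
    """Mimic the matcher instance=~"<ticdc_instance>" over a list of label values.

    re.fullmatch reproduces the both-ends anchoring that Prometheus applies
    to regex matchers.
    """
    pattern = re.compile(ticdc_instance)
    return [name for name in instances if pattern.fullmatch(name)]


# Hypothetical instance label values, for illustration only.
instances = ["ticdc-0:8300", "ticdc-1:8300", "ticdc-2:8300"]

print(select_instances(instances, ".*"))                         # wildcard default
print(select_instances(instances, "ticdc-1:8300"))               # one instance
print(select_instances(instances, "ticdc-0:8300|ticdc-2:8300"))  # alternation of two
```

With the wildcard, every series still matches, which is why dashboards behave as before until a specific instance is chosen.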
Release note