[Observability] Update metrics reference and add metrics-based observability guide #967

jabellard wants to merge 1 commit into karmada-io:main
Conversation
Summary of Changes

Hello @jabellard, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances Karmada's observability documentation by introducing a detailed guide on monitoring its multi-cluster environment using Prometheus metrics and Grafana dashboards. It provides users with structured information on critical metrics, alerting, and troubleshooting, enabling more effective management of Karmada deployments.
Code Review
This pull request significantly enhances Karmada's observability documentation by adding a comprehensive metrics-based observability guide and greatly expanding the metrics reference. The new guide is well-structured, detailed, and covers everything from critical metrics to Grafana dashboards and troubleshooting. The updated metrics reference is also a massive improvement, providing detailed information on each metric.
I've found a few minor issues in the documentation and one area for improvement in the new Grafana dashboards to ensure future compatibility. My detailed comments are below.
Overall, this is an excellent contribution that will be incredibly valuable for users operating Karmada.
| "targets": [ | ||
| { | ||
| "refId": "A", | ||
| "expr": "max by (cluster_name) (cluster_ready_state{cluster_name=~\"$cluster\"})", |
The metrics.md documentation in this same PR indicates that the cluster_name label is deprecated in favor of member_cluster. This dashboard should be updated to use the new label to avoid breaking in future versions. This includes updating all PromQL queries (like this one), legendFormat fields, and the cluster template variable definition in this file.
| "expr": "max by (cluster_name) (cluster_ready_state{cluster_name=~\"$cluster\"})", | |
| "expr": "max by (member_cluster) (cluster_ready_state{member_cluster=~\"$cluster\"})", |
@RainbowMango , I want to add a task for us to remember to make these updates once the old label name is deprecated. Can't remember where we're tracking that. Can you point me to it?
rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
```

#### cluster_cpu_allocated_number / cluster_cpu_allocatable_number
This section title is a PromQL expression. For better readability, consider using a more descriptive title like "Cluster CPU Utilization". The expression itself is correctly shown in the example query.
Suggested change:
- #### cluster_cpu_allocated_number / cluster_cpu_allocatable_number
+ #### Cluster CPU Utilization
I don't think we should change the name. Headings at this level are named after the metric, so they read more directly.
</details>

**Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) below. For example:
The text refers to recording rules "below", but the "Example Prometheus Recording Rules" section is located above this note. Please change "below" to "above" for clarity.
Suggested change:
- **Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) below. For example:
+ **Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) above. For example:
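As an illustration of the simplification the note refers to, here is a hedged sketch comparing a full alert expression with one based on a recording rule; the rule name `karmada:scheduler:error_ratio_rate5m` is hypothetical and would need to match whatever the recording-rules section actually defines:

```promql
# Full expression, as used in the alert examples
sum(rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m]))
  /
sum(rate(karmada_scheduler_schedule_attempts_total[5m]))
  > 0.05

# The same alert once the (hypothetical) recording rule is in place
karmada:scheduler:error_ratio_rate5m > 0.05
```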
rate(karmada_scheduler_schedule_attempts_total[5m])

# Scheduling error rate
rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
This query calculates the rate of errors, but to get the error rate (as a ratio or percentage), it should be divided by the total rate of all attempts. This would make it consistent with the success rate query example just above.
Suggested change:
- rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
+ rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
+ /
+ rate(karmada_scheduler_schedule_attempts_total[5m])
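A follow-up usage note: the same ratio can also be expressed as a percentage, which tends to read more naturally on dashboards. This is only a sketch of the percentage form; the sums avoid label-matching issues between the error and total series:

```promql
# Scheduling error rate as a percentage of all attempts (sketch)
100 *
  sum(rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m]))
  /
  sum(rate(karmada_scheduler_schedule_attempts_total[5m]))
```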
Force-pushed from 9b37632 to 24d4c45
Hey @RainbowMango. Please take a look. I have this running locally, and not sure why CI is failing since I can't see the logs.

OK, thanks, I will take a look.

It's strange, there are no logs available to check. Could this be caused by the absence of the corresponding zh document?
RainbowMango
left a comment
/assign
After a quick go through, it looks pretty good.
I will try to help fix the CI issue so that I can continue reviewing it with a preview.
Echo from https://app.netlify.com/projects/karmada/deploys/697d744975eeae00081ac3cd: it says the link cannot be resolved. @jabellard
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@RainbowMango thanks for the logs. Just fixed. Turns out ref resolution works differently for a production build when compared to local development (i.e., …).
Hey @RainbowMango. I can see the preview now. Please take a look and let me know if you have any comments. I'm in a docs party phase and want to send some docs enhancements once this one is completed. Party party party!!!🎉🎉🎉🥳🥳🥳

It is in my queue... will do ASAP
RainbowMango
left a comment
Looks great! Just some nits from me.
also cc some guys who might be interested here, to see if they have any comments.
@XiShanYongYe-Chang, who is working on monitoring-related documentation.
@CharlesQQ @zach593 @LivingCcj, who is running a monitor dashboard with their Karmada deployments.
@windsonsea, who is an expert in documentation
Signed-off-by: Joe Nathan Abellard <contact@jabellard.com>
@RainbowMango thanks for reviewing. Just addressed comments and squashed.

Also, please take a look at the note above regarding the deprecated label.
XiShanYongYe-Chang
left a comment
Thanks a lot~
I feel that the writing is very detailed and the content is also very interesting. I am still reviewing it, so I will submit some comments first.
- **Labels**: `result` (success/error), `schedule_type`
- **Description**: Count of scheduling attempts by result.
- **Why it matters**: High error rates mean workloads cannot be placed, blocking deployments.
- **Alert threshold**: Error rate > 5%
This threshold is for the example given below, right?
I would like to know if this is the community's recommended value, and whether this value can be directly used if similar indicators are added in the future.
#### karmada_scheduler_schedule_attempts_total

- **Type**: Counter
- **Labels**: `result` (success/error), `schedule_type`
I think it's a good idea to list the enumeration values behind the label. It would be even better if the other labels could also list their values where possible. If feasible, I think we could add subtasks after this PR to do so, wdyt?
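As a small illustration of how those enumerated values get used once they are documented, here is a sketch of a per-result breakdown query; it only assumes the `result` label shown above:

```promql
# Scheduling attempts per second, broken down by the enumerated result values (sketch)
sum by (result) (rate(karmada_scheduler_schedule_attempts_total[5m]))
```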
- **Labels**: `result`, `schedule_type`
- **Description**: End-to-end time to schedule a resource.
- **Why it matters**: High latency delays application deployments.
- **Alert threshold**: P95 > 5s, P99 > 10s
Should this threshold correspond to a certain scale? Would the threshold settings for Karmada differ depending on the scale?
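For reference when evaluating these thresholds at different scales, the percentiles themselves can be computed from the scheduler histogram. A minimal sketch, assuming the `karmada_scheduler_e2e_scheduling_duration_seconds_bucket` series shown elsewhere in this guide:

```promql
# P95 and P99 end-to-end scheduling latency over the last 5 minutes (sketch)
histogram_quantile(0.95,
  sum by (le) (rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m])))

histogram_quantile(0.99,
  sum by (le) (rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
```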
sum(rate(karmada_create_resource_to_cluster{result="error"}[5m]))
+
sum(rate(karmada_update_resource_to_cluster{result="error"}[5m]))
+
sum(rate(karmada_delete_resource_from_cluster{result="error"}[5m]))
These metrics currently don't seem to have the Karmada prefix. Is there already a PR to modify this? I apologize if I might have missed it.
- **Scheduling SLO**: 99% of scheduling attempts succeed within 5 seconds
- **Propagation SLO**: 99.5% of resource updates sync to member clusters within 10 seconds
- **Cluster Health SLO**: 99.9% uptime for member clusters
- **Failover SLO**: 95% of evictions complete within 30 seconds
Are these descriptions only for illustration, rather than reflecting the current capabilities of the Karmada system?
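Regardless of the answer, it may help to show how such an SLO would be measured. A hedged sketch of the scheduling SLO's SLI, where both the `le="5"` bucket boundary and the 30d window are assumptions rather than something the guide necessarily defines:

```promql
# Fraction of scheduling attempts completing within ~5s over the SLO window (sketch)
sum(rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket{le="5"}[30d]))
  /
sum(rate(karmada_scheduler_e2e_scheduling_duration_seconds_count[30d]))
```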
### Scenario 1: Workloads Not Deploying

**Symptoms**: Users report workloads not appearing in member clusters
Is there a large number of workloads here? If there are only a few occasional ones, would it be better to check the logs?
# Max scheduling latency in last 5 minutes
max_over_time(karmada_scheduler_e2e_scheduling_duration_seconds_sum[5m])
/
max_over_time(karmada_scheduler_e2e_scheduling_duration_seconds_count[5m])
It seems this comment doesn't match what the query actually computes. If I have misunderstood, please correct me.
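To make the possible mismatch concrete: for a Prometheus histogram, dividing `_sum` by `_count` yields an average, not a maximum, and `max_over_time` over counters (absent resets) just picks the latest cumulative value. A sketch of the two intents, assuming the standard histogram series:

```promql
# Average end-to-end scheduling latency over the last 5 minutes
rate(karmada_scheduler_e2e_scheduling_duration_seconds_sum[5m])
  /
rate(karmada_scheduler_e2e_scheduling_duration_seconds_count[5m])

# Upper bound on latency, limited by histogram bucket resolution
histogram_quantile(1,
  sum by (le) (rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
```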
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: