[Observability] Update metrics reference and add metrics-based observability guide #967

Open
jabellard wants to merge 1 commit into karmada-io:main from jabellard:obs-guide

Conversation

@jabellard
Member

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

@karmada-bot karmada-bot added kind/documentation Categorizes issue or PR as related to documentation. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jan 31, 2026
@karmada-bot karmada-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 31, 2026
@gemini-code-assist

Summary of Changes

Hello @jabellard, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Karmada's observability documentation by introducing a detailed guide on monitoring its multi-cluster environment using Prometheus metrics and Grafana dashboards. It provides users with structured information on critical metrics, alerting, and troubleshooting, enabling more effective management of Karmada deployments.

Highlights

  • New Observability Guide: Introduced a comprehensive metrics-based observability guide for Karmada, detailing how to monitor its components using Prometheus and Grafana.
  • Pre-built Grafana Dashboards: Added five production-ready Grafana dashboards for various Karmada components, including API Server, Controller Manager, Member Cluster, Scheduler, and Resource Propagation insights.
  • Enhanced Metrics Reference: Significantly expanded and restructured the Karmada metrics reference documentation, providing detailed information on metric conventions, types, stability levels, and categorization.
  • Documentation Integration: Integrated the new observability guide and dashboards into the project's documentation structure, making them easily discoverable via the sidebar and dedicated READMEs.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly enhances Karmada's observability documentation by adding a comprehensive metrics-based observability guide and greatly expanding the metrics reference. The new guide is well-structured, detailed, and covers everything from critical metrics to Grafana dashboards and troubleshooting. The updated metrics reference is also a massive improvement, providing detailed information on each metric.

I've found a few minor issues in the documentation and one area for improvement in the new Grafana dashboards to ensure future compatibility. My detailed comments are below.

Overall, this is an excellent contribution that will be incredibly valuable for users operating Karmada.

"targets": [
{
"refId": "A",
"expr": "max by (cluster_name) (cluster_ready_state{cluster_name=~\"$cluster\"})",

high

The metrics.md documentation in this same PR indicates that the cluster_name label is deprecated in favor of member_cluster. This dashboard should be updated to use the new label to avoid breaking in future versions. This includes updating all PromQL queries (like this one), legendFormat fields, and the cluster template variable definition in this file.

Suggested change
- "expr": "max by (cluster_name) (cluster_ready_state{cluster_name=~\"$cluster\"})",
+ "expr": "max by (member_cluster) (cluster_ready_state{member_cluster=~\"$cluster\"})",

Member Author

@RainbowMango, I want to add a task so we remember to make these updates once the old label name is deprecated. I can't remember where we're tracking that. Can you point me to it?
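
The rename requested in this thread is mechanical once it is due. A minimal Python sketch of the sweep the review describes (query expressions, `legendFormat` strings, and the template variable), assuming the dashboard is plain JSON loaded into a dict. The helper name `rename_label` and the sample dashboard are hypothetical, and the plain substring replace assumes `cluster_name` never appears inside a longer identifier:

```python
import json


def rename_label(node, old="cluster_name", new="member_cluster"):
    """Recursively replace `old` with `new` in every string value of a JSON-like structure."""
    if isinstance(node, dict):
        return {key: rename_label(value, old, new) for key, value in node.items()}
    if isinstance(node, list):
        return [rename_label(item, old, new) for item in node]
    if isinstance(node, str):
        return node.replace(old, new)
    return node


# Hypothetical dashboard fragment modelled on the snippet under review.
dashboard = {
    "templating": {
        "list": [
            {"name": "cluster", "query": "label_values(cluster_ready_state, cluster_name)"}
        ]
    },
    "panels": [
        {
            "targets": [
                {
                    "refId": "A",
                    "expr": 'max by (cluster_name) (cluster_ready_state{cluster_name=~"$cluster"})',
                    "legendFormat": "{{cluster_name}}",
                }
            ]
        }
    ],
}

migrated = rename_label(dashboard)
print(json.dumps(migrated["panels"][0]["targets"][0], indent=2))
```

Only values are rewritten, so JSON keys such as `legendFormat` stay untouched; a real migration would still need a manual review of the resulting queries.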

rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
```

#### cluster_cpu_allocated_number / cluster_cpu_allocatable_number

medium

This section title is a PromQL expression. For better readability, consider using a more descriptive title like "Cluster CPU Utilization". The expression itself is correctly shown in the example query.

Suggested change
- #### cluster_cpu_allocated_number / cluster_cpu_allocatable_number
+ #### Cluster CPU Utilization

Member

I don't think we should change the title. Headings at this level are named after the metric, which makes them more direct to read.


</details>

**Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) below. For example:

medium

The text refers to recording rules "below", but the "Example Prometheus Recording Rules" section is located above this note. Please change "below" to "above" for clarity.

Suggested change
- **Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) below. For example:
+ **Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) above. For example:

rate(karmada_scheduler_schedule_attempts_total[5m])

# Scheduling error rate
rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])

medium

This query calculates the rate of errors, but to get the error rate (as a ratio or percentage), it should be divided by the total rate of all attempts. This would make it consistent with the success rate query example just above.

Suggested change
- rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
+ rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
+ /
+ rate(karmada_scheduler_schedule_attempts_total[5m])
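
The distinction the review draws, errors per second versus errors as a share of all attempts, can be checked with plain arithmetic. A minimal Python sketch, assuming two counter samples taken 300 seconds apart; the helper `per_second_rate` and all sample values are hypothetical. In PromQL itself both sides of the division generally need aggregating (for example with `sum`) so that their label sets match; the sketch sidesteps that by working on totals that are already summed:

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a counter between two samples (no counter-reset handling)."""
    return (last - first) / window_seconds


# Hypothetical counter samples taken 300 s apart.
window = 300
errors_rate = per_second_rate(first=40, last=70, window_seconds=window)        # result="error" only
attempts_rate = per_second_rate(first=1000, last=1600, window_seconds=window)  # all results summed

# The ratio is unitless: errors as a share of all scheduling attempts.
error_ratio = errors_rate / attempts_rate

print(f"errors per second: {errors_rate:.2f}")
print(f"error ratio: {error_ratio:.1%}")
```

The first number depends on traffic volume, while the second is the value a percentage threshold such as "error rate > 5%" would be compared against.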

Member

@jabellard, please take a look at this.

Member Author

Fixed.

@jabellard jabellard force-pushed the obs-guide branch 2 times, most recently from 9b37632 to 24d4c45 on January 31, 2026 02:55
@jabellard
Member Author

Hey @RainbowMango. Please take a look. I have this running locally and am not sure why CI is failing, since I can't see the logs.

@jabellard jabellard marked this pull request as ready for review January 31, 2026 03:18
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2026
@karmada-bot karmada-bot requested a review from Tingtal January 31, 2026 03:18
@RainbowMango
Member

OK, thanks, I will take a look.

@XiShanYongYe-Chang
Member

It's strange, there are no logs available to check. Could this be caused by the absence of the corresponding zh document?

Member

@RainbowMango RainbowMango left a comment

/assign
After a quick read-through, it looks pretty good.
I will try to help fix the CI issue so that I can continue reviewing with a preview.

@RainbowMango
Member

Echo from https://app.netlify.com/projects/karmada/deploys/697d744975eeae00081ac3cd

11:20:07 AM: [success] [webpackbar] Server: Compiled with some errors in 9.36s
11:20:16 AM: [success] [webpackbar] Client: Compiled with some errors in 19.08s
11:20:16 AM: [ERROR] Client bundle compiled with errors therefore further build is impossible.
11:20:16 AM: Error: MDX compilation failed for file "/opt/build/repo/docs/administrator/monitoring/karmada-observability.md"
11:20:16 AM: Cause: Markdown link with URL `./working-with-prometheus-in-control-plane.md` in source file "docs/administrator/monitoring/karmada-observability.md" (22:44) couldn't be resolved.
Make sure it references a local Markdown file that exists within the current plugin.
11:20:16 AM: To ignore this error, use the `siteConfig.markdown.hooks.onBrokenMarkdownLinks` option, or apply the `pathname://` protocol to the broken link URLs.
11:20:16 AM: Details:
11:20:16 AM: Error: Markdown link with URL `./working-with-prometheus-in-control-plane.md` in source file "docs/administrator/monitoring/karmada-observability.md" (22:44) couldn't be resolved.
Make sure it references a local Markdown file that exists within the current plugin.
11:20:16 AM: To ignore this error, use the `siteConfig.markdown.hooks.onBrokenMarkdownLinks` option, or apply the `pathname://` protocol to the broken link URLs.
11:20:17 AM: Failed during stage 'building site': Build script returned non-zero exit code: 2 (https://ntl.fyi/exit-code-2)
11:20:17 AM: error Command failed with exit code 1. (https://ntl.fyi/exit-code-1)
11:20:17 AM: info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
11:20:17 AM: "build.command" failed

It says the link cannot be resolved. @jabellard

@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jabellard
Member Author

@RainbowMango thanks for the logs. Just fixed. It turns out that reference resolution works differently in a production build than in local development (i.e., npm run start).

@jabellard
Member Author

Hey @RainbowMango. I can see the preview now. Please take a look and let me know if you have any comments.

I'm in a docs party phase and want to send some docs enhancements once this one is completed. Party party party!!!🎉🎉🎉🥳🥳🥳

@RainbowMango
Member

It is in my queue... will do ASAP

Member

@RainbowMango RainbowMango left a comment

Looks great! Just some nits from me.

Also cc'ing some folks who might be interested, to see if they have any comments:
@XiShanYongYe-Chang, who is working on monitoring-related documentation.
@CharlesQQ @zach593 @LivingCcj, who are running monitoring dashboards with their Karmada deployments.
@windsonsea, who is an expert in documentation.

Signed-off-by: Joe Nathan Abellard <contact@jabellard.com>
@jabellard
Member Author

@RainbowMango thanks for reviewing. Just addressed comments and squashed.

@jabellard
Member Author

@RainbowMango thanks for reviewing. Just addressed comments and squashed.

Also, please take a look at the note above regarding the deprecated label.

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Thanks a lot~
I feel that the writing is very detailed and the content is also very interesting. I am still reviewing it, so I will submit some comments first.

- **Labels**: `result` (success/error), `schedule_type`
- **Description**: Count of scheduling attempts by result.
- **Why it matters**: High error rates mean workloads cannot be placed, blocking deployments.
- **Alert threshold**: Error rate > 5%

Member

This threshold is for the example given below, right?
I would like to know if this is the community's recommended value, and whether this value can be directly used if similar indicators are added in the future.

#### karmada_scheduler_schedule_attempts_total

- **Type**: Counter
- **Labels**: `result` (success/error), `schedule_type`

Member

I think it's a good idea to list the enumeration values after the label. It would be even better if the other labels listed theirs as well, where possible. If feasible, we could add subtasks after this PR to do so. WDYT?

- **Labels**: `result`, `schedule_type`
- **Description**: End-to-end time to schedule a resource.
- **Why it matters**: High latency delays application deployments.
- **Alert threshold**: P95 > 5s, P99 > 10s

Member

Should this threshold correspond to a certain scale? Would the threshold settings for Karmada differ depending on the scale?

Comment on lines +708 to +712
sum(rate(karmada_create_resource_to_cluster{result="error"}[5m]))
+
sum(rate(karmada_update_resource_to_cluster{result="error"}[5m]))
+
sum(rate(karmada_delete_resource_from_cluster{result="error"}[5m]))

Member

These metrics do not currently seem to have the Karmada prefix. Is there already a PR to change this? Apologies if I missed it.

Comment on lines +1040 to +1043
- **Scheduling SLO**: 99% of scheduling attempts succeed within 5 seconds
- **Propagation SLO**: 99.5% of resource updates sync to member clusters within 10 seconds
- **Cluster Health SLO**: 99.9% uptime for member clusters
- **Failover SLO**: 95% of evictions complete within 30 seconds

Member

Are these descriptions for illustration only, rather than a reflection of the current capabilities of the Karmada system?
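
Whether or not the targets are illustrative, an SLO of the shape "99% of attempts finish within 5 seconds" can be evaluated from cumulative histogram buckets. A minimal Python sketch under that assumption; the function `fraction_within` and the bucket counts are hypothetical and do not come from a real Karmada deployment:

```python
def fraction_within(buckets, threshold):
    """Share of observations at or below `threshold`, given cumulative Prometheus-style buckets."""
    total = buckets[float("inf")]
    if total == 0:
        return None  # no observations in the window, so the indicator is undefined
    return buckets[threshold] / total


# Hypothetical cumulative bucket counts (upper bound in seconds -> observations).
buckets = {1.0: 900, 5.0: 985, 10.0: 998, float("inf"): 1000}

sli = fraction_within(buckets, 5.0)
target = 0.99  # the illustrative scheduling SLO quoted above

print(f"within 5s: {sli:.1%}, target met: {sli >= target}")
```

The check only works when the threshold coincides with an actual bucket boundary, which is one reason SLO targets are usually chosen with the histogram's bucket layout in mind.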

### Scenario 1: Workloads Not Deploying

**Symptoms**: Users report workloads not appearing in member clusters

Member

Does this scenario assume a large number of affected workloads? If only a few occasional ones are affected, would it be better to check the logs?

Comment on lines +1402 to +1405
# Max scheduling latency in last 5 minutes
max_over_time(karmada_scheduler_e2e_scheduling_duration_seconds_sum[5m])
/
max_over_time(karmada_scheduler_e2e_scheduling_duration_seconds_count[5m])

Member

It looks like this comment does not match what the query actually computes. Please correct me if I have misunderstood.
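
The mismatch can be illustrated numerically. A minimal Python sketch, assuming the `_sum` and `_count` series are cumulative counters that only grow (no resets in the window); the scrape samples are hypothetical. Under that assumption, the maximum of each counter over the window is simply its latest value, so the quotient is the average latency since the counters started, not a maximum:

```python
# Hypothetical scrapes of the cumulative histogram counters within a 5-minute window.
# Each tuple is (duration_seconds_sum, duration_seconds_count) at scrape time.
samples = [(100.0, 1000), (100.5, 1005), (130.5, 1010), (131.0, 1015)]

# What the reviewed expression evaluates: max of _sum divided by max of _count.
# Both counters only grow, so each max is just the most recent sample.
max_sum = max(s for s, _ in samples)
max_count = max(c for _, c in samples)
lifetime_average = max_sum / max_count

# Mean latency per scrape interval, from the deltas between neighbouring samples.
interval_means = [
    (s2 - s1) / (c2 - c1)
    for (s1, c1), (s2, c2) in zip(samples, samples[1:])
    if c2 > c1
]

print(f"max(sum)/max(count): {lifetime_average:.3f}s")
print(f"worst interval mean: {max(interval_means):.3f}s")
```

Even the per-interval mean is not a true maximum. A Prometheus histogram does not retain individual observations, so the closest available figure is a high quantile or a bucket upper bound.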
