[Observability] Update metrics reference and add metrics-based observability guide #967

jabellard wants to merge 1 commit into karmada-io:main
Conversation
Summary of Changes

Hello @jabellard, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances Karmada's observability documentation by introducing a detailed guide on monitoring its multi-cluster environment using Prometheus metrics and Grafana dashboards. It provides users with structured information on critical metrics, alerting, and troubleshooting, enabling more effective management of Karmada deployments.
Code Review
This pull request significantly enhances Karmada's observability documentation by adding a comprehensive metrics-based observability guide and greatly expanding the metrics reference. The new guide is well-structured, detailed, and covers everything from critical metrics to Grafana dashboards and troubleshooting. The updated metrics reference is also a massive improvement, providing detailed information on each metric.
I've found a few minor issues in the documentation and one area for improvement in the new Grafana dashboards to ensure future compatibility. My detailed comments are below.
Overall, this is an excellent contribution that will be incredibly valuable for users operating Karmada.
| "targets": [ | ||
| { | ||
| "refId": "A", | ||
| "expr": "max by (cluster_name) (cluster_ready_state{cluster_name=~\"$cluster\"})", |
The metrics.md documentation in this same PR indicates that the cluster_name label is deprecated in favor of member_cluster. This dashboard should be updated to use the new label to avoid breaking in future versions. This includes updating all PromQL queries (like this one), legendFormat fields, and the cluster template variable definition in this file.
| "expr": "max by (cluster_name) (cluster_ready_state{cluster_name=~\"$cluster\"})", | |
| "expr": "max by (member_cluster) (cluster_ready_state{member_cluster=~\"$cluster\"})", |
@RainbowMango , I want to add a task for us to remember to make these updates once the old label name is deprecated. Can't remember where we're tracking that. Can you point me to it?
rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m]))
```

#### cluster_cpu_allocated_number / cluster_cpu_allocatable_number
This section title is a PromQL expression. For better readability, consider using a more descriptive title like "Cluster CPU Utilization". The expression itself is correctly shown in the example query.
Suggested change:
- #### cluster_cpu_allocated_number / cluster_cpu_allocatable_number
+ #### Cluster CPU Utilization
I don't think we should change the name. Headings at this level are named after the metric, so they read more directly.
</details>

**Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) below. For example:
The text refers to recording rules "below", but the "Example Prometheus Recording Rules" section is located above this note. Please change "below" to "above" for clarity.
Suggested change:
- **Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) below. For example:
+ **Note:** The alerts above use full PromQL expressions for clarity. For production deployments, you can simplify many of these alerts using the [recording rules](#example-prometheus-recording-rules) above. For example:
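As an illustration of the simplification the note refers to, here is a hedged sketch comparing a full alert expression with one based on a recording rule; the rule name `karmada:scheduler:error_ratio_rate5m` is hypothetical and would need to match whatever the recording-rules section actually defines:

```promql
# Full expression, as used in the alert examples
sum(rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m]))
  /
sum(rate(karmada_scheduler_schedule_attempts_total[5m]))
  > 0.05

# The same alert once the (hypothetical) recording rule is in place
karmada:scheduler:error_ratio_rate5m > 0.05
```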
rate(karmada_scheduler_schedule_attempts_total[5m])

# Scheduling error rate
rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
This query calculates the rate of errors, but to get the error rate (as a ratio or percentage), it should be divided by the total rate of all attempts. This would make it consistent with the success rate query example just above.
Suggested change:
- rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
+ rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m])
+ /
+ rate(karmada_scheduler_schedule_attempts_total[5m])
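A follow-up usage note: the same ratio can also be expressed as a percentage, which tends to read more naturally on dashboards. This is only a sketch of the percentage form; the sums avoid label-matching issues between the error and total series:

```promql
# Scheduling error rate as a percentage of all attempts (sketch)
100 *
  sum(rate(karmada_scheduler_schedule_attempts_total{result="error"}[5m]))
  /
  sum(rate(karmada_scheduler_schedule_attempts_total[5m]))
```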
Force-pushed from 9b37632 to 24d4c45
Hey @RainbowMango. Please take a look. I have this running locally, and not sure why CI is failing since I can't see the logs.

OK, thanks, I will take a look.

It's strange, there are no logs available to check. Could this be caused by the absence of the corresponding zh document?
RainbowMango
left a comment
/assign
After a quick go through, it looks pretty good.
I will try to help fix the CI issue so that I can continue reviewing it with a preview.
Echo from https://app.netlify.com/projects/karmada/deploys/697d744975eeae00081ac3cd: it says the link cannot be resolved. @jabellard
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@RainbowMango thanks for the logs. Just fixed. Turns out ref resolution works differently for a production build when compared to local development (i.e., …).
Hey @RainbowMango. I can see the preview now. Please take a look and let me know if you have any comments. I'm in a docs party phase and want to send some docs enhancements once this one is completed. Party party party!!!🎉🎉🎉🥳🥳🥳

It is in my queue... will do ASAP
RainbowMango
left a comment
Looks great! Just some nits from me.
also cc some guys who might be interested here, to see if they have any comments.
@XiShanYongYe-Chang, who is working on monitoring-related documentation.
@CharlesQQ @zach593 @LivingCcj, who is running a monitor dashboard with their Karmada deployments.
@windsonsea, who is an expert in documentation
Signed-off-by: Joe Nathan Abellard <contact@jabellard.com>
@RainbowMango thanks for reviewing. Just addressed comments and squashed.

Also, please take a look at the note above regarding the deprecated label.
XiShanYongYe-Chang
left a comment
Thanks a lot~
I feel that the writing is very detailed and the content is also very interesting. I am still reviewing it, so I will submit some comments first.
- **Labels**: `result` (success/error), `schedule_type`
- **Description**: Count of scheduling attempts by result.
- **Why it matters**: High error rates mean workloads cannot be placed, blocking deployments.
- **Alert threshold**: Error rate > 5%
This threshold is for the example given below, right?
I would like to know if this is the community's recommended value, and whether this value can be directly used if similar indicators are added in the future.
#### karmada_scheduler_schedule_attempts_total

- **Type**: Counter
- **Labels**: `result` (success/error), `schedule_type`
I think it's a good idea to list the enumeration values behind the label. It would be even better if the other labels could also list their values where possible. If feasible, I think we could add subtasks after this PR to do so, wdyt?
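As a small illustration of how those enumerated values get used once they are documented, here is a sketch of a per-result breakdown query; it only assumes the `result` label shown above:

```promql
# Scheduling attempts per second, broken down by the enumerated result values (sketch)
sum by (result) (rate(karmada_scheduler_schedule_attempts_total[5m]))
```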
- **Labels**: `result`, `schedule_type`
- **Description**: End-to-end time to schedule a resource.
- **Why it matters**: High latency delays application deployments.
- **Alert threshold**: P95 > 5s, P99 > 10s
Should this threshold correspond to a certain scale? Would the threshold settings for Karmada differ depending on the scale?
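For reference when evaluating these thresholds at different scales, the percentiles themselves can be computed from the scheduler histogram. A minimal sketch, assuming the `karmada_scheduler_e2e_scheduling_duration_seconds_bucket` series shown elsewhere in this guide:

```promql
# P95 and P99 end-to-end scheduling latency over the last 5 minutes (sketch)
histogram_quantile(0.95,
  sum by (le) (rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m])))

histogram_quantile(0.99,
  sum by (le) (rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
```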
sum(rate(karmada_create_resource_to_cluster{result="error"}[5m]))
+
sum(rate(karmada_update_resource_to_cluster{result="error"}[5m]))
+
sum(rate(karmada_delete_resource_from_cluster{result="error"}[5m]))
These metrics currently don't seem to have the Karmada prefix. Is there already a PR to modify this? I apologize if I might have missed it.
- **Scheduling SLO**: 99% of scheduling attempts succeed within 5 seconds
- **Propagation SLO**: 99.5% of resource updates sync to member clusters within 10 seconds
- **Cluster Health SLO**: 99.9% uptime for member clusters
- **Failover SLO**: 95% of evictions complete within 30 seconds
Are these descriptions only for illustration, rather than reflecting the current capabilities of the Karmada system?
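Regardless of the answer, it may help to show how such an SLO would be measured. A hedged sketch of the scheduling SLO's SLI, where both the `le="5"` bucket boundary and the 30d window are assumptions rather than something the guide necessarily defines:

```promql
# Fraction of scheduling attempts completing within ~5s over the SLO window (sketch)
sum(rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket{le="5"}[30d]))
  /
sum(rate(karmada_scheduler_e2e_scheduling_duration_seconds_count[30d]))
```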
### Scenario 1: Workloads Not Deploying

**Symptoms**: Users report workloads not appearing in member clusters
Is there a large number of workloads here? If there are only a few occasional ones, would it be better to check the logs?
# Max scheduling latency in last 5 minutes
max_over_time(karmada_scheduler_e2e_scheduling_duration_seconds_sum[5m])
/
max_over_time(karmada_scheduler_e2e_scheduling_duration_seconds_count[5m])
It seems this comment doesn't match what the query actually computes. If I have misunderstood, please correct me.
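To make the possible mismatch concrete: for a Prometheus histogram, dividing `_sum` by `_count` yields an average, not a maximum, and `max_over_time` over counters (absent resets) just picks the latest cumulative value. A sketch of the two intents, assuming the standard histogram series:

```promql
# Average end-to-end scheduling latency over the last 5 minutes
rate(karmada_scheduler_e2e_scheduling_duration_seconds_sum[5m])
  /
rate(karmada_scheduler_e2e_scheduling_duration_seconds_count[5m])

# Upper bound on latency, limited by histogram bucket resolution
histogram_quantile(1,
  sum by (le) (rate(karmada_scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
```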
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: