Skip to content

Conversation

@clcc2019
Copy link
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

- 为scheduler和device-plugin组件添加ServiceMonitor模板
- 在values.yaml中添加servicemonitor配置选项
- 支持自定义监控间隔、标签和注解
- 默认启用ServiceMonitor功能

Signed-off-by: clcc2019 <[email protected]>
- 为scheduler组件添加ServiceMonitor配置参数文档
- 为device-plugin组件添加ServiceMonitor配置参数文档
- 包含enabled、labels、annotations、interval和honorLabels参数说明
- 提供默认值和详细描述

Signed-off-by: clcc2019 <[email protected]>
- Updated values.yaml to set servicemonitor.enabled to false for both scheduler and devicePlugin components, disabling the ServiceMonitor feature by default.

Signed-off-by: clcc2019 <[email protected]>
@hami-robot hami-robot bot added the kind/feature new function label Nov 11, 2025
@hami-robot hami-robot bot requested a review from FouoF November 11, 2025 03:16
@hami-robot hami-robot bot requested a review from ouyangluwei163 November 11, 2025 03:16
@hami-robot
Copy link
Contributor

hami-robot bot commented Nov 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clcc2019
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions bot removed the kind/feature new function label Nov 11, 2025
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @clcc2019, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the observability of the Hami project by integrating Prometheus ServiceMonitors for its core components: the scheduler and the device plugin. It provides new configuration options within the Helm chart to manage these ServiceMonitors, allowing for flexible and customizable monitoring setups. The changes include adding the necessary Kubernetes manifest files and updating the documentation to reflect these new capabilities.

Highlights

  • Prometheus ServiceMonitor Integration: Introduced new ServiceMonitor configurations for both the hami-scheduler and hami-device-plugin components, enabling Prometheus to scrape metrics from these services.
  • Configurable Monitoring Options: Added new Helm chart values (scheduler.servicemonitor and devicePlugin.monitor.servicemonitor) to allow users to enable/disable ServiceMonitors, and customize labels, annotations, scrape intervals, and honorLabels behavior.
  • Documentation Update: Updated the charts/hami/README.md to clearly document all the new ServiceMonitor configuration parameters for both the scheduler and device plugin.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@hami-robot hami-robot bot added the size/L label Nov 11, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces ServiceMonitor configurations for both the scheduler and the device-plugin, which is a valuable addition for enabling Prometheus-based monitoring. The changes to values.yaml and the README.md are well-executed. However, I've identified a critical copy-paste error in the device-plugin's ServiceMonitor template that incorrectly links its creation to the scheduler's configuration. Additionally, there are a few minor areas for cleanup, such as removing a leftover code comment and ensuring files end with a newline.

@@ -0,0 +1,33 @@
{{- if .Values.scheduler.servicemonitor.enabled }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There appears to be a copy-paste error in this conditional check. The creation of the device-plugin ServiceMonitor should be controlled by its own configuration flag, .Values.devicePlugin.monitor.servicemonitor.enabled, not the scheduler's flag. As it is, this ServiceMonitor would be incorrectly created only when the scheduler's monitor is enabled.

{{- if .Values.devicePlugin.monitor.servicemonitor.enabled }}

matchLabels:
app.kubernetes.io/component: hami-device-plugin
{{- include "hami-vgpu.labels" . | nindent 6 }}
{{- if .Values.devicePlugin.service.labels }} # Use devicePlugin instead of scheduler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment appears to be a leftover from copying another template. It should be removed to improve code clarity.

      {{- if .Values.devicePlugin.service.labels }}

{{- if .Values.devicePlugin.service.labels }} # Use devicePlugin instead of scheduler
{{ toYaml .Values.devicePlugin.service.labels | indent 6 }}
{{- end }}
{{- end }} No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It's a best practice for files to end with a newline character. This ensures compatibility with various tools and follows POSIX standards.

{{- end }}

{{- if .Values.scheduler.service.labels }}
{{ toYaml .Values.scheduler.service.labels | indent 6 }}
{{- end }}
{{- end }} No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It's a best practice for files to end with a newline character. This ensures compatibility with various tools and follows POSIX standards.

{{- end }}

@codecov
Copy link

codecov bot commented Nov 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
unittests 64.34% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Fix device-plugin servicemonitor to use correct Values path
- Add missing newline at end of servicemonitor files

Signed-off-by: clcc2019 <[email protected]>
name: {{ include "hami-vgpu.device-plugin" . }}
namespace: {{ include "hami-vgpu.namespace" . }}
labels:
{{- include "hami-vgpu.labels" . | nindent 4 }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering add default label release: prometheus for prometheus to select this ServiceMonitor.

@FouoF
Copy link
Contributor

FouoF commented Nov 11, 2025

#1255 did the same work, can you try to merge it?

@clcc2019
Copy link
Contributor Author

#1255 did the same work, can you try to merge it?

The release tag in serviceMonitorSelector is not required, and I don't think it's necessary to configure it by default.

@FouoF
Copy link
Contributor

FouoF commented Nov 11, 2025

The release tag in serviceMonitorSelector is not required, and I don't think it's necessary to configure it by default.

I think the label is required as the servicemonitor can't work with no label. But It may work with the default label.

Copy link
Contributor

@clarifai-fmarceau clarifai-fmarceau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes sense to set interval in the ServiceMonitors by default. It should only be set if the user defines something. This chart should not be dictating what the default. Prometheus-operator prometheus CR already has a global default set which is what I'd expect to kick in if interval is not set.

HAMi's ServiceMonitors should also support setting other endpoints configs such as : scrapeTimeout & relabelings. As stated previously, if these are not defined by the user, they should not be set at all in favour of the Prometheus default for scrapeTimeout.

If you want to future proof this and don't necessarily want to support every possibility that the CRD offers, you should also add a generic servicemonitor.extraEndpointProperties (something like that, it could also be servicemonitor.extraEndpointConfig).

One last thing regarding the namespace where these get created, I think it makes sense by default to create them in the same namespace where HAMi is deployed however most of the community charts that I've seen over the past years generally offer the possibility to overwrite that and deploy them in a custom namespace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants