feat(health): add API support for NVL domain health records#1854

Open
jayzhudev wants to merge 2 commits into
NVIDIA:mainfrom
jayzhudev:feat/platform-and-nvl-records

@jayzhudev
Contributor

Description

Adds NVLink domain health report support across the API, DB, admin CLI, and admin web UI.

This PR:

  • Adds NVLink domain health-report RPCs for list, insert, and remove.
  • Adds a standalone nvlink_domain_health_reports table using the existing health report JSONB storage pattern.
  • Adds API handlers and RBAC permissions for NVLink domain health reports.
  • Adds carbide-admin-cli nvl-domain health-report show|remove|print-empty-template.
  • Adds admin web UI pages for NVLink domain health and links from NVLink-related views.
  • Adds pagination/search for the NVLink domain health index page.
  • Adds API, web, and CLI tests.

Type of Change

  • [x] Add - New feature or capability
  • [ ] Change - Changes in existing functionality
  • [ ] Fix - Bug fixes
  • [ ] Remove - Removed features or deprecated functionality
  • [ ] Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues

Closes #1832

Breaking Changes

  • [ ] This PR contains breaking changes

Testing

  • [x] Unit tests added/updated
  • [x] Integration tests added/updated
  • [x] Manual testing performed
  • [ ] No testing required (docs, internal refactor, etc.)

@jayzhudev jayzhudev requested a review from a team as a code owner May 21, 2026 03:34
@copy-pr-bot

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jayzhudev jayzhudev force-pushed the feat/platform-and-nvl-records branch from bf9c383 to 7aed280 Compare May 21, 2026 03:35
@jayzhudev jayzhudev requested a review from Matthias247 May 21, 2026 03:40
@jayzhudev
Contributor Author

Web UI additions:

[Three screenshots of the new admin web UI pages]

admin-cli show command output example:

carbide-admin-cli -c https://<url>:1079 nvl-domain health-report show 11111111-1111-1111-1111-111111111115

output:

Health report entries: 1
+-------------------+-------+-----------------------------+--------+
| Source            | Mode  | Observed At                 | Alerts |
+===================+=======+=============================+========+
| haas-log-analyzer | Merge | 2026-05-21T01:58:08.693498Z | 3      |
+-------------------+-------+-----------------------------+--------+

Alerts for source haas-log-analyzer (Merge)
+-------------------------------------+--------------------------------------+--------------------------------+--------------------------------------------------------------+------------------------------+
| Id                                  | Target                               | Since                          | Message                                                      | Classifications              |
+=====================================+======================================+================================+==============================================================+==============================+
| haas.nvlink_domain.component_health | node-001                             | 2026-05-21T01:58:08.737266716Z | {                                                            | component_type:compute_node  |
|                                     |                                      |                                |   "component": "node-001",                                   |                              |
|                                     |                                      |                                |   "component_type": "compute_node",                          |                              |
|                                     |                                      |                                |   "health_status": "unhealthy",                              |                              |
|                                     |                                      |                                |   "nvl_domain": "11111111-1111-1111-1111-111111111115"       |                              |
|                                     |                                      |                                | }                                                            |                              |
+-------------------------------------+--------------------------------------+--------------------------------+--------------------------------------------------------------+------------------------------+
| haas.nvlink_domain.component_health | node-002                             | 2026-05-21T01:58:08.737267300Z | {                                                            | component_type:power_shelf   |
|                                     |                                      |                                |   "component": "node-002",                                   |                              |
|                                     |                                      |                                |   "component_type": "power_shelf",                           |                              |
|                                     |                                      |                                |   "health_status": "unhealthy",                              |                              |
|                                     |                                      |                                |   "nvl_domain": "11111111-1111-1111-1111-111111111115"       |                              |
|                                     |                                      |                                | }                                                            |                              |
+-------------------------------------+--------------------------------------+--------------------------------+--------------------------------------------------------------+------------------------------+
| haas.nvlink_domain.component_health | sw100ns038bg3qsho433vkg684heguv282qa | 2026-05-21T01:58:08.737267466Z | {                                                            | component_type:nvlink_switch |
|                                     | ggmrsh2ugn1qk096n2c6hcg              |                                |   "component": "sw100ns038bg3qsho433vkg684heguv282qaggmrsh2u |                              |
|                                     |                                      |                                | gn1qk096n2c6hcg",                                            |                              |
|                                     |                                      |                                |   "component_type": "nvlink_switch",                         |                              |
|                                     |                                      |                                |   "health_status": "unhealthy",                              |                              |
|                                     |                                      |                                |   "nvl_domain": "11111111-1111-1111-1111-111111111115"       |                              |
|                                     |                                      |                                | }                                                            |                              |
+-------------------------------------+--------------------------------------+--------------------------------+--------------------------------------------------------------+------------------------------+

@jayzhudev jayzhudev force-pushed the feat/platform-and-nvl-records branch 5 times, most recently from 6642eae to f3eb9c2 Compare May 22, 2026 22:09
## Description
Adds NVLink domain health report support across the API, DB, admin CLI, and admin web UI.

This PR:
- Adds NVLink domain health-report RPCs for list, insert, and remove.
- Adds a standalone `nvlink_domain_health_reports` table using the existing health report JSONB storage pattern.
- Adds API handlers and RBAC permissions for NVLink domain health reports.
- Adds `carbide-admin-cli nvl-domain health-report show|remove|print-empty-template`.
- Adds admin web UI pages for NVLink domain health and links from NVLink-related views.
- Adds pagination/search for the NVLink domain health index page.
- Adds API, web, and CLI tests.

## Type of Change
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues
Closes NVIDIA#1832

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
@jayzhudev jayzhudev force-pushed the feat/platform-and-nvl-records branch from f3eb9c2 to 97b3e61 Compare May 26, 2026 14:27
// Lists all health report sources for an NVLink domain
rpc ListNVLinkDomainHealthReports(ListNVLinkDomainHealthReportsRequest) returns (ListHealthReportResponse);
// Adds a health report source for an NVLink domain
rpc InsertNVLinkDomainHealthReport(InsertNVLinkDomainHealthReportRequest) returns (google.protobuf.Empty);
Contributor

Do we want to call it NVLinkDomain, or ScaleUpDomain like it's called in other places?

Contributor Author

Technically, ScaleUp seems to include the NVLink domain but is not limited to it. However, in NICo core the common term is NVLink domain (see common.NVLinkDomainId), so I assume we probably don't want to call it ScaleUpDomain while using NVLinkDomainId? 😃

@@ -0,0 +1,4 @@
CREATE TABLE nvlink_domain_health_reports (
Contributor

I'm wondering whether we want to already introduce the NVLink domain as a separate object or whether we just take an intermediate step where the handler would resolve attached machines and apply the report there.

I think sooner or later we might introduce a top level NVLinkDomain object, but it would likely have more fields (including its own state handler).

Contributor Author

Agreed that the NVLinkDomain object may come in the future.

Because the health services now have a feature to send reports of associated machines (using existing RPCs) along with the NVL domain report, my thoughts are:

  • machines' NVL domain reports follow the existing ingestion paths and objects (resolved to machine level)
  • we use this NVL health reports table to hold NVL domain reports for now
    • this is an intermediate step because the table is not associated with anything else
  • in the future, migrate/adapt to the NVLinkDomain object once it's defined and is part of the system
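
To make the intermediate step concrete, here is a minimal in-memory sketch of the list/insert/remove semantics for domain-level reports. It is not the PR's code: `DomainReportStore`, the simplified `HealthReport`, and the rule that a second insert from the same source overwrites the first are all assumptions for illustration, and the domain ID is kept as a plain string for brevity.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical, simplified report; the real one also carries mode, timestamps, etc.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HealthReport {
    pub source: String,
    pub alerts: Vec<String>,
}

/// Hypothetical in-memory stand-in for the `nvlink_domain_health_reports` table.
#[derive(Default)]
pub struct DomainReportStore {
    // domain id -> source name -> report (assumes one report per source)
    reports: HashMap<String, BTreeMap<String, HealthReport>>,
}

impl DomainReportStore {
    /// Insert the report for `domain_id`, overwriting any earlier report
    /// from the same source (an assumption, not confirmed by the PR).
    pub fn insert(&mut self, domain_id: &str, report: HealthReport) {
        self.reports
            .entry(domain_id.to_owned())
            .or_default()
            .insert(report.source.clone(), report);
    }

    /// List all reports for a domain, ordered by source name.
    pub fn list(&self, domain_id: &str) -> Vec<&HealthReport> {
        self.reports
            .get(domain_id)
            .map(|by_source| by_source.values().collect::<Vec<_>>())
            .unwrap_or_default()
    }

    /// Remove one source's report; returns whether anything was removed.
    pub fn remove(&mut self, domain_id: &str, source: &str) -> bool {
        let removed = match self.reports.get_mut(domain_id) {
            Some(by_source) => by_source.remove(source).is_some(),
            None => return false,
        };
        // Drop the domain entry once its last report is gone.
        if self.reports.get(domain_id).map_or(false, |m| m.is_empty()) {
            self.reports.remove(domain_id);
        }
        removed
    }
}
```

Because nothing else references the store, swapping it for a future NVLinkDomain object would only need to touch these three operations.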

@@ -0,0 +1,4 @@
CREATE TABLE nvlink_domain_health_reports (
id uuid PRIMARY KEY,
Contributor

Will it always be a UUID, or could it be a different format?

Contributor Author

Yes, this must be a UUID in the current implementation, given that we use common.NVLinkDomainId and the existing implementation already has validation and comments for it:

/// NvLinkDomainId is a strongly typed UUID for NvLink domains.
pub type NvLinkDomainId = TypedUuid<NvLinkDomainIdMarker>;

And we have this to invoke the parser and validation for UUID:

rpc/build.rs

.extern_path(".common.NVLinkDomainId", "::carbide_uuid::nvlink::NvLinkDomainId")

A non-UUID is rejected. If we have new ideas in the future, the ID type in the table should be migrated along with the definition of common.NVLinkDomainId.
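
For readers unfamiliar with the pattern, the following std-only sketch shows how a marker-typed UUID can reject non-UUID input at parse time. It is an illustration, not the `carbide_uuid` implementation: the struct layout, `ParseIdError`, and the restriction to the hyphenated 8-4-4-4-12 form are assumptions made here.

```rust
use std::fmt;
use std::marker::PhantomData;
use std::str::FromStr;

/// Hypothetical stand-in for a typed UUID; the marker `M` keeps ID kinds apart.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TypedUuid<M> {
    bytes: [u8; 16],
    _marker: PhantomData<M>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct NvLinkDomainIdMarker;

pub type NvLinkDomainId = TypedUuid<NvLinkDomainIdMarker>;

/// Hypothetical error type returned for anything that is not a hyphenated UUID.
#[derive(Debug, PartialEq, Eq)]
pub struct ParseIdError;

impl<M> FromStr for TypedUuid<M> {
    type Err = ParseIdError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let input = s.as_bytes();
        if input.len() != 36 {
            return Err(ParseIdError);
        }
        let mut bytes = [0u8; 16];
        let mut nibbles = 0usize;
        for (i, &c) in input.iter().enumerate() {
            // Hyphens must sit at exactly these four positions.
            if matches!(i, 8 | 13 | 18 | 23) {
                if c != b'-' {
                    return Err(ParseIdError);
                }
                continue;
            }
            let value = (c as char).to_digit(16).ok_or(ParseIdError)? as u8;
            bytes[nibbles / 2] = (bytes[nibbles / 2] << 4) | value;
            nibbles += 1;
        }
        Ok(TypedUuid { bytes, _marker: PhantomData })
    }
}

impl<M> fmt::Display for TypedUuid<M> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (i, byte) in self.bytes.iter().enumerate() {
            if matches!(i, 4 | 6 | 8 | 10) {
                f.write_str("-")?;
            }
            write!(f, "{byte:02x}")?;
        }
        Ok(())
    }
}
```

With a type like this at the RPC boundary, a value such as `node-001` fails to parse before it can reach the `id uuid` column.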

@jayzhudev jayzhudev self-assigned this May 26, 2026
@jayzhudev jayzhudev requested a review from Matthias247 May 27, 2026 01:16
Development

Successfully merging this pull request may close these issues.

feat: add list/insert/remove RPCs for NVL domain health records

2 participants