feat(health): add API support for NVL domain health records#1854
feat(health): add API support for NVL domain health records#1854jayzhudev wants to merge 2 commits into
Conversation
bf9c383 to
7aed280
Compare
6642eae to
f3eb9c2
Compare
## Description Adds NVLink domain health report support across the API, DB, admin CLI, and admin web UI. This PR: - Adds NVLink domain health-report RPCs for list, insert, and remove. - Adds a standalone `nvlink_domain_health_reports` table using the existing health report JSONB storage pattern. - Adds API handlers and RBAC permissions for NVLink domain health reports. - Adds `carbide-admin-cli nvl-domain health-report show|remove|print-empty-template`. - Adds admin web UI pages for NVLink domain health and links from NVLink-related views. - Adds pagination/search for the NVLink domain health index page. - Adds API, web, and CLI tests. ## Type of Change - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues Closes NVIDIA#1832 ## Breaking Changes - [ ] This PR contains breaking changes ## Testing - [x] Unit tests added/updated - [x] Integration tests added/updated - [x] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
f3eb9c2 to
97b3e61
Compare
| // Lists all health report sources for an NVLink domain | ||
| rpc ListNVLinkDomainHealthReports(ListNVLinkDomainHealthReportsRequest) returns (ListHealthReportResponse); | ||
| // Adds a health report source for an NVLink domain | ||
| rpc InsertNVLinkDomainHealthReport(InsertNVLinkDomainHealthReportRequest) returns (google.protobuf.Empty); |
There was a problem hiding this comment.
do we want to call it NVLinkDomain or ScaleUpDomain like its called in other places?
There was a problem hiding this comment.
Seems like technically, ScaleUp includes NVLink Domain but is not limited to it. However, in NICo core the common term is NVLink domain (see commom.NVLinkDomainId), so we probably don't want to call it ScaleUpDomain while using NVLinkDomainId I assume? 😃
| @@ -0,0 +1,4 @@ | |||
| CREATE TABLE nvlink_domain_health_reports ( | |||
There was a problem hiding this comment.
I'm wondering whether we want to already introduce the NVLink domain as a separate object or whether we just take an intermediate step where the handler would resolve attached machines and apply the report there.
I think sooner or later we might introduce a top level NVLinkDomain object, but it would likely have more fields (including its own state handler).
There was a problem hiding this comment.
Agreed that the NVLinkDomain object may come in the future.
Because the health services now have a feature to send reports of associated machines (using existing RPCs) along with the NVL domain report, my thoughts are:
- machines' NVL domain reports follow the existing ingestion paths and objects (resolved to machine level)
- we use this NVL health reports table to hold NVL domain reports for now
- this is an intermediate step because the table is not associated to anything else
- in the future, migrate/adapt to the NVLinkDomain object once it's defined and is part of the system
| @@ -0,0 +1,4 @@ | |||
| CREATE TABLE nvlink_domain_health_reports ( | |||
| id uuid PRIMARY KEY, | |||
There was a problem hiding this comment.
will it always be a UUID, or could it be a different format?
There was a problem hiding this comment.
Yes, this must be a UUID in the current implementation given that we use commom.NVLinkDomainId and there's validation and comments for it in the existing implementation:
/// NvLinkDomainId is a strongly typed UUID for NvLink domains.
pub type NvLinkDomainId = TypedUuid<NvLinkDomainIdMarker>;And we have this to invoke the parser and validation for UUID:
rpc/build.rs
.extern_path(".common.NVLinkDomainId", "::carbide_uuid::nvlink::NvLinkDomainId")A non UUID is rejected. If we have new ideas in the future, the ID type in the table should be migrated along with the definition of commom.NVLinkDomainId.



Description
Adds NVLink domain health report support across the API, DB, admin CLI, and admin web UI.
This PR:
nvlink_domain_health_reportstable using the existing health report JSONB storage pattern.carbide-admin-cli nvl-domain health-report show|remove|print-empty-template.Type of Change
Related Issues
Closes #1832
Breaking Changes
Testing