[Question]: Is it possible to deploy NVSentinel on-prem on an HPC cluster with a GPU partition managed by SLURM? #1036

@ovalerio

Prerequisites

  • I searched existing issues and docs

Question

Dear NVSentinel Team,

I learned from the GTC news that NVSentinel is a new tool for self-remediation of an organization's GPU resources. I am interested in the hardware/node health detection components of the software.

Our system is not a Kubernetes/cloud deployment but a more traditional on-prem HPC setup. We already use the DCGM Exporter and have found it useful for logging and gathering intelligence on system status and utilization. This motivates my question:

Would it be possible to operate NVSentinel without all the bells and whistles, more like a traditional alerting system?

I would argue that this would be a very valuable tool for our local sysadmin team. :)

Thanks!

Category

Installation/Deployment

Context

No response

Metadata

Labels

question (Further information is requested)
