This guide covers project structure, development workflows, and best practices for contributing to NVSentinel.
- Getting Started
- Project Architecture
- Development Environment Setup
- Development Workflows
- Module Development
- Testing
- Code Standards
- CI/CD Pipeline
- Debugging
- Makefile Reference
git clone https://github.com/nvidia/nvsentinel.git
cd nvsentinel
make dev-env-setup # Install all dependencies
NVSentinel uses .versions.yaml for centralized version management across:
- Local development
- CI/CD pipelines
- Container builds
View current versions:
make show-versions
Core Tools (required):
- Go 1.25+ (see .versions.yaml for the exact version)
- Docker
- kubectl
- Helm 3.0+
- Protocol Buffers Compiler
- yq - YAML processor for version management
Development Tools:
Optional (for local Kubernetes development):
Quick install (installs and configures all tools):
make dev-env-setup
This will:
- Detect your OS (Linux/macOS) and architecture
- Install yq and check for required tools
- Install development and Go tools
- Configure Python gRPC tools
Automated installation: To skip interactive prompts and auto-install all dependencies:
make dev-env-setup AUTO_MODE=true
Debugging setup issues: If the setup script fails, enable debug mode for detailed output:
DEBUG=true make dev-env-setup AUTO_MODE=true
This will show:
- Architecture detection and mappings
- URL construction for all downloads
- HTTP response codes for failed downloads
- Detailed error messages with suggestions
Common setup issues and solutions are documented in the Debugging section.
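As an illustration of how a setup script can gate diagnostic output on the DEBUG variable, the sketch below defines a small logging helper. The function name `debug_log` and the message format are assumptions made for this example; they are not taken from scripts/setup-dev-env.sh.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a message to stderr only when DEBUG=true.
debug_log() {
  if [ "${DEBUG:-false}" = "true" ]; then
    printf 'DEBUG: %s\n' "$*" >&2
  fi
}

# The message below appears only when DEBUG=true is set in the environment.
debug_log "Raw ARCH: $(uname -m)"
```

Writing to stderr keeps diagnostics out of any value a function returns on stdout, which matters when command substitution is used to capture URLs or versions.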
Unified build system features:
- Consistent interface: All modules support common targets (all, lint-test, clean)
- Technology-aware: Appropriate tooling for Go, Python, and shell scripts
- Delegation pattern: Top-level Makefiles delegate to individual modules
- Repo-root context: Docker builds use consistent paths
- Multi-platform support: Built-in linux/arm64, linux/amd64 via Docker buildx
NVSentinel follows a microservices architecture with event-driven communication:
- Independence: Modules operate autonomously
- Event-Driven: Communication through MongoDB change streams
- Modular: Pluggable health monitors
- Cloud-Native: Kubernetes-first design
nvsentinel/
├── health-monitors/ # Hardware/software fault detection
│ ├── gpu-health-monitor/ # Python - DCGM GPU monitoring
│ ├── syslog-health-monitor/ # Go - System log monitoring
│ └── csp-health-monitor/ # Go - Cloud provider monitoring
├── platform-connectors/ # gRPC event ingestion service
├── fault-quarantine/ # CEL-based event quarantine logic
├── fault-remediation/ # Kubernetes controller for remediation
├── health-events-analyzer/ # Event analysis and correlation
├── health-event-client/ # Event streaming client
├── labeler/ # Node labeling controller
├── node-drainer/ # Graceful workload eviction
├── store-client/ # MongoDB interaction library (tested in CI)
└── log-collector/ # Log aggregation (shell scripts)
sequenceDiagram
participant HM as Health Monitor
participant PC as Platform Connectors
participant DB as MongoDB
participant FM as Fault Module
HM->>PC: gRPC health event
PC->>DB: Store event
DB->>FM: Change stream notification
FM->>DB: Query related events
FM->>K8s: Execute remediation action
Tilt provides the fastest development experience with hot reloading.
# Quick start - create cluster and start Tilt in one command
make dev-env # Creates cluster and starts Tilt
# Manual step-by-step approach
make cluster-create # Creates ctlptl-managed Kind cluster with registry
make tilt-up # Starts Tilt with UI (runs: tilt up -f tilt/Tiltfile)
# Check status
make cluster-status # Check cluster and registry status
# View Tilt UI
# Navigate to http://localhost:10350
# Stop everything when done
make dev-env-clean # Stops Tilt and deletes cluster
# Or stop individually
make tilt-down # Stops Tilt (runs: tilt down -f tilt/Tiltfile)
make cluster-delete # Deletes the cluster
ctlptl Cluster Features:
- Declarative cluster configuration with YAML
- Multi-node Kind cluster (3 control-plane, 2 worker nodes)
- Cluster name: kind-nvsentinel (required kind- prefix)
- Integrated local container registry at localhost:5001
- Automatic registry configuration for Tilt
- Simplified cluster lifecycle management
- No external dependencies beyond Docker, ctlptl, and Kind
For module-specific development without full cluster:
# Set up Go environment
export GOPATH=$(go env GOPATH)
export GO_CACHE_DIR=$(go env GOCACHE)
# Install development dependencies
go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest
go install gotest.tools/gotestsum@latest
go install github.com/boumenot/gocover-cobertura@latest
# For controller modules
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
Go module dependencies are handled automatically:
# Dependencies managed via go.mod files with replace directives for local development
# No manual GOPRIVATE configuration needed
# Private repository authentication handled via SSH keys
- Start Development Session
git checkout main
git pull origin main
git checkout -b feature/your-feature-name
# Start local development environment
make dev-env # Creates ctlptl-managed cluster and starts Tilt
- Develop with Live Reload
# Edit code - Tilt automatically rebuilds and redeploys
vim health-monitors/syslog-health-monitor/pkg/monitor/monitor.go
# View logs in Tilt UI at http://localhost:10350
# Or use kubectl for specific logs (note: syslog-health-monitor runs as a DaemonSet with -regular and -kata variants)
kubectl logs -f daemonset/nvsentinel-syslog-health-monitor-regular -n nvsentinel
- Test Changes
# Run tests locally (while Tilt is running)
make health-monitors-lint-test-all # All health monitors
make health-events-analyzer-lint-test # Specific Go module
make platform-connectors-lint-test # Another Go module
# Or run individual module tests directly (using standardized targets)
make -C health-monitors/syslog-health-monitor lint-test
make -C platform-connectors lint-test
make -C health-events-analyzer lint-test
# Test integration with other services via Tilt UI
# Access services via port-forwards set up by Tilt
- Validate Before Commit
# Run full test suite
make lint-test-all
# Stop Tilt for final testing if needed
make tilt-down
- Commit and Push
git add .
git commit -s -m "feat: add new monitoring capability"
git push origin feature/your-feature-name
# Clean up development environment
make dev-env-clean
When modifying .proto files:
# Generate protobuf files
make protos-lint
# This runs:
# - protoc generation for Go modules
# - Python protobuf generation for GPU monitor
# - Import path fixes for Python
# - Git diff check to ensure files are up to date
The project provides a unified Docker build system with consistent patterns across all modules. All builds support multi-platform architecture, build caching, and proper context management.
Set these for production-like builds:
# Docker configuration (standardized across all modules)
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username" # Defaults to repository owner
export CI_COMMIT_REF_NAME="feature-branch" # Or your branch name
# These are computed automatically by common.mk:
# SAFE_REF_NAME=$(echo $CI_COMMIT_REF_NAME | sed 's/\//-/g')
# PLATFORMS="linux/arm64,linux/amd64"
# MODULE_NAME=$(basename $(CURDIR))
Build System Overview
The Docker build system uses shared patterns via common.mk for Go modules, with specialized handling for Python and container-only modules. Each module maintains its own Docker configuration.
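Docker image tags cannot contain a slash, which is why branch names such as feature/foo need sanitizing before they are used as tags. The sketch below wraps the sed expression from the SAFE_REF_NAME comment above in a function; the name `safe_ref_name` is hypothetical and only serves this example.

```bash
#!/usr/bin/env bash
# Replace every "/" in a ref name with "-", using the same sed
# expression shown for SAFE_REF_NAME.
safe_ref_name() {
  printf '%s\n' "$1" | sed 's/\//-/g'
}

safe_ref_name "feature/my-change"   # prints: feature-my-change
```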
Main build targets (delegated to individual modules):
# Local Development (--load) - builds images into local Docker daemon
make docker-all # All images locally (delegates to docker/Makefile)
make docker-health-monitors # All health monitor images locally
make docker-main-modules # All non-health-monitor images locally
# CI/Production (--push) - builds and pushes directly to registry
make docker-publish-all # Build and push all images to registry
make docker-publish-health-monitors # Build and push health monitor images
make docker-publish-main-modules # Build and push main module images
# Individual module targets (via common.mk)
make docker-syslog-health-monitor # Build syslog health monitor locally
make docker-publish-syslog-health-monitor # Build and push to registry
make docker-platform-connectors # Build platform connectors locally
make docker-publish-platform-connectors # Build and push to registry
# Special cases
make docker-gpu-health-monitor # Both DCGM 3.x and 4.x versions locally
make docker-log-collector # Container-only module (shell + Python)
Direct docker/ Makefile usage:
cd docker
# Local development builds (--load)
make build-all # Build all 12 images locally
make build-health-monitors # Build health monitor group locally
make build-syslog-health-monitor # Build specific module locally
# CI/production builds (--push)
make publish-all # Build and push all images to registry
make publish-syslog-health-monitor # Build and push specific image to registry
# Utility commands
make setup-buildx # Setup multi-platform builder
make clean # Remove all nvsentinel images
make list # List built nvsentinel images
make help # Show all available targets
Individual module usage:
# Go modules (common.mk patterns)
make -C health-monitors/syslog-health-monitor docker-build # Local build with remote cache
make -C health-monitors/syslog-health-monitor docker-build-local # Local build, no remote cache (faster)
make -C health-monitors/syslog-health-monitor docker-publish # CI build
make -C platform-connectors docker-build # Local build with remote cache
make -C platform-connectors docker-build-local # Local build, no remote cache (faster)
make -C platform-connectors docker-publish # CI build
make -C health-events-analyzer docker-build-local # Local build, no remote cache (faster)
make -C health-events-analyzer docker-publish # CI build
# Python module (specialized patterns)
make -C health-monitors/gpu-health-monitor docker-build-dcgm3 # DCGM 3.x local
make -C health-monitors/gpu-health-monitor docker-publish-dcgm4 # DCGM 4.x CI
# Container-only module (shell + Python)
make -C log-collector docker-build-log-collector # Local build
make -C log-collector docker-publish-log-collector # CI build
Each module provides Docker targets with common patterns:
# Go modules (common.mk patterns)
make -C health-monitors/syslog-health-monitor docker-build # Local with remote cache
make -C health-monitors/syslog-health-monitor docker-build-local # Local, no remote cache (recommended)
make -C health-monitors/syslog-health-monitor docker-publish # CI/production
make -C platform-connectors docker-build # Local with remote cache
make -C platform-connectors docker-build-local # Local, no remote cache (recommended)
make -C platform-connectors docker-publish # CI/production
# Python module (specialized patterns)
make -C health-monitors/gpu-health-monitor docker-build-dcgm3 # DCGM 3.x local
make -C health-monitors/gpu-health-monitor docker-build-dcgm4 # DCGM 4.x local
make -C health-monitors/gpu-health-monitor docker-publish # Push both versions
# Container-only module (shell + Python)
make -C log-collector docker-build # Both log-collector and file-server-cleanup
make -C log-collector docker-publish # Push both components
# Legacy compatibility (all modules)
make -C [module] image # Calls docker-build
make -C [module] publish # Calls docker-publish
All builds support consistent features:
- Multi-Platform Support: linux/arm64, linux/amd64 via common.mk (docker-build); linux/amd64 for docker-build-local
- Build Caching: Registry-based build cache for faster builds (docker-build); local cache only (docker-build-local)
- Repo-Root Context: All builds use consistent repo-root context
- Dynamic Tagging: Uses branch/tag name (${SAFE_REF_NAME}) for docker-build; simple module:local for docker-build-local
- Registry Integration: NVCR.io registry paths for docker-build and docker-publish
- Module Auto-Detection: Automatic module name detection via $(MODULE_NAME)
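To make the two tagging schemes in the list above concrete, here is a minimal sketch that prints the image reference each one would produce. The function name `image_ref`, its mode argument, and the registry/org/module path layout are assumptions for illustration; the authoritative logic lives in common.mk and may differ.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the two tagging schemes.
# Usage: image_ref <local|registry> <module-name>
image_ref() {
  local mode="$1" module="$2"
  case "$mode" in
    local)
      # docker-build-local: simple "module:local" tag
      printf '%s:local\n' "$module"
      ;;
    registry)
      # docker-build / docker-publish: registry path plus sanitized ref name.
      # The path layout here is assumed, not taken from common.mk.
      local ref="${CI_COMMIT_REF_NAME:-main}"
      printf '%s/%s/%s:%s\n' \
        "${CONTAINER_REGISTRY:-ghcr.io}" \
        "${CONTAINER_ORG:-your-github-username}" \
        "$module" \
        "${ref//\//-}"
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 1
      ;;
  esac
}

image_ref local platform-connectors   # prints: platform-connectors:local
```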
For Local Development, use docker-build-local to avoid registry authentication issues and build faster.
Local Development:
# Recommended: Fast local build (single platform, no remote cache)
make -C health-monitors/syslog-health-monitor docker-build-local
# Alternative: Local build with remote cache (multi-platform, slower)
make -C health-monitors/syslog-health-monitor docker-build
# Legacy: Quick local build
make -C health-monitors/syslog-health-monitor image
CI-like Build:
# Set up environment like GitHub Actions
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username"
export CI_COMMIT_REF_NAME="main"
# Build all images with full CI features (standardized)
make docker-all
# Images will be tagged like:
# ghcr.io/your-github-username/syslog-health-monitor:main
# ghcr.io/nvidia/nvsentinel/gpu-health-monitor:main-dcgm-3.x
# ghcr.io/nvidia/nvsentinel/gpu-health-monitor:main-dcgm-4.x
Testing Specific Module:
# Recommended: Build and test individual module (fast, local)
make -C platform-connectors docker-build-local
docker run --rm platform-connectors:local --help
# Alternative: Build with full CI features
make docker-platform-connectors
docker run --rm ghcr.io/nvidia/nvsentinel/platform-connectors:fix-make-file-targets --help
# Build private repo module (fast, local)
make -C health-events-analyzer docker-build-local
The new system uses Docker BuildKit registry cache:
- First build: Downloads and caches layers
- Subsequent builds: Reuses cached layers for 10x+ speed improvement
- Multi-developer: Cache shared across team via registry
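For readers unfamiliar with registry-backed caching, the sketch below prints (without running) a `docker buildx build` command that reads from and writes to a cache image in a registry. The `--cache-from` and `--cache-to` options with `type=registry` are standard buildx flags; the function name, image name, and cache reference are placeholders, and the flags the project actually passes are defined in common.mk.

```bash
#!/usr/bin/env bash
# Print a buildx invocation that uses a registry-backed cache.
# Usage: buildx_cache_cmd <image-ref> <cache-ref>
buildx_cache_cmd() {
  local image="$1" cache_ref="$2"
  printf 'docker buildx build --platform %s --cache-from type=registry,ref=%s --cache-to type=registry,ref=%s,mode=max -t %s --push .\n' \
    "linux/arm64,linux/amd64" "$cache_ref" "$cache_ref" "$image"
}

# Placeholder names, shown as a dry run only.
buildx_cache_cmd "ghcr.io/example/my-module:main" "ghcr.io/example/my-module:buildcache"
```

Because the cache lives in the registry rather than on one machine, a layer built by one developer or CI job can be reused by the next build that points at the same cache reference.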
Build failures:
# Check buildx setup
make -C docker setup-buildx
# Clean and retry
make -C docker clean
docker system prune -f
make docker-syslog-health-monitor
Private repo access:
# Verify SSH key access
git ls-remote git@github.com:dgxcloud/mk8s/some-private-repo.git
# Build with debug output
BUILDKIT_PROGRESS=plain make docker-csp-health-monitor
Registry issues:
# Test registry login
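# --password-stdin keeps the password out of the process list and shell history
printf '%s' "$NGC_PASSWORD" | docker login nvcr.io -u '$oauthtoken' --password-stdin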
# Check image tags
make -C docker list
Problem: On macOS with Docker Desktop, Unix domain sockets require the /var/run directory to exist inside containers, but this directory is not created by default in minimal container images.
Symptoms:
- Services fail to start with errors like: failed to listen on unix socket /var/run/nvsentinel.sock: no such file or directory
- Tilt-based tests fail on macOS but pass on Linux
- gRPC Unix socket connections fail
Solution: The project includes a Tilt-specific Helm values file that creates the /var/run directory using an initContainer:
# File: distros/kubernetes/nvsentinel/values-tilt-socket.yaml
#
# This values file is automatically included when running Tilt on macOS/Docker Desktop.
# It adds an initContainer to create /var/run directory for Unix socket communication.
global:
  initContainers:
    - name: create-run-dir
      image: busybox:latest
      command: ['sh', '-c', 'mkdir -p /var/run']
      volumeMounts:
        - name: socket-dir
          mountPath: /var/run
How it works:
- The tilt/Tiltfile automatically includes values-tilt-socket.yaml for local development
- The initContainer runs before each service starts and creates the /var/run directory
- Services can then create Unix sockets at /var/run/nvsentinel.sock
- The socket directory is shared via an emptyDir volume mount
Platform-specific behavior:
- macOS/Docker Desktop: Requires the initContainer workaround (automatically applied in Tilt)
- Linux: The /var/run directory typically exists in the container runtime environment
- Production/Kubernetes: Uses standard Helm values without the initContainer (not needed)
Note: This is a development-only workaround for local macOS environments. Production deployments on Linux do not require this configuration.
- Create Module Structure
mkdir -p health-monitors/my-monitor/{cmd,pkg,internal}
cd health-monitors/my-monitor
- Initialize Go Module
go mod init github.com/nvidia/nvsentinel/health-monitors/my-monitor
- Create Module Makefile
# Copy template from existing health monitor
cp ../syslog-health-monitor/Makefile ./Makefile
# Update module-specific settings (on macOS/BSD sed, use: sed -i '' ...)
sed -i 's/syslog-health-monitor/my-monitor/g' Makefile
sed -i 's/Syslog Health Monitor/My Monitor/g' Makefile
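- Implement gRPC Client
// pkg/monitor/monitor.go
package monitor

import (
	"context"

	pb "github.com/nvidia/nvsentinel/platform-connectors/pkg/protos"
	"google.golang.org/grpc"
)

// Monitor sends health events to the platform-connectors service.
type Monitor struct {
	client pb.PlatformConnectorClient
}

// New wraps an existing gRPC connection. NewPlatformConnectorClient is the
// constructor name protoc-gen-go-grpc conventionally generates for this
// service; check the generated code for the exact name.
func New(conn grpc.ClientConnInterface) *Monitor {
	return &Monitor{client: pb.NewPlatformConnectorClient(conn)}
}

// SendEvent forwards a single health event and returns any gRPC error.
func (m *Monitor) SendEvent(ctx context.Context, event *pb.HealthEvent) error {
	_, err := m.client.SendHealthEvent(ctx, event)
	return err
}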
- Update health-monitors/Makefile
# Add your module to the health monitors list
# Edit health-monitors/Makefile:
# - Add 'my-monitor' to GO_HEALTH_MONITORS list
# - Add lint-test delegation target
# - Add build delegation target
# - Add clean delegation target
- Test Your Module
# Test the individual module
make -C health-monitors/my-monitor lint-test
# Test via health-monitors coordination
make -C health-monitors lint-test-my-monitor
# Test via main Makefile delegation
make health-monitors-lint-test-all
- Add to CI Pipeline
The module will automatically be included in GitHub Actions workflows due to the standardized patterns.
- Follow Kubernetes Controller Pattern
# Use controller-runtime for Kubernetes controllers
go get sigs.k8s.io/controller-runtime
- Implement MongoDB Change Streams
// Use store-client for MongoDB operations
import "github.com/nvidia/nvsentinel/store-client/pkg/client"
- Add Proper RBAC
Create Kubernetes RBAC manifests in distros/kubernetes/nvsentinel/templates/.
- Unit Tests: Test individual functions and methods
- Integration Tests: Test module interactions
- End-to-End Tests: Test complete workflows via CI
The unified Makefile structure provides consistent testing across all modules:
# Test all modules (delegates to all sub-Makefiles)
make lint-test-all # Main Makefile - runs everything
# Test by category
make health-monitors-lint-test-all # All health monitors
make go-lint-test-all # All Go modules (common.mk patterns)
# Test individual modules via delegation (main Makefile)
make health-events-analyzer-lint-test # Go module
make platform-connectors-lint-test # Go module
make store-client-lint-test # Go module
make log-collector-lint-test # Container module
# Test individual modules directly (common.mk patterns)
make -C health-monitors/syslog-health-monitor lint-test # Go module
make -C platform-connectors lint-test # Go module
make -C health-events-analyzer lint-test # Go module
make -C health-monitors/gpu-health-monitor lint-test # Python module
# Use individual targets for development (common.mk)
cd health-monitors/syslog-health-monitor
make vet # Just go vet
make lint # Just golangci-lint
make test # Just tests
make coverage # Tests + coverage
make build # Build module
make binary # Build main binary
# Run specific test with verbose output
cd platform-connectors
go test -v ./pkg/connectors/...
Each module must include:
- Unit tests with _test.go suffix
- Coverage reporting via go test -coverprofile
go test -coverprofile - Integration tests where applicable
- Mocks for external dependencies
# Using the module's Makefile (recommended)
make -C health-monitors/gpu-health-monitor lint-test # Full lint-test
make -C health-monitors/gpu-health-monitor setup # Just Poetry setup
make -C health-monitors/gpu-health-monitor lint # Just Black check
make -C health-monitors/gpu-health-monitor test # Just tests
make -C health-monitors/gpu-health-monitor format # Run Black formatter
# Manual Poetry commands
cd health-monitors/gpu-health-monitor
poetry install
poetry run pytest -v
poetry run black --check .
poetry run coverage run --source=gpu_health_monitor -m pytest
- Linting: Use golangci-lint with project configuration
- Formatting: Use gofmt (enforced by linting)
- Imports: Group standard, third-party, and local imports
- Error Handling: Always check and handle errors appropriately
- Context: Pass context.Context for cancellation and timeouts
- All tests pass
- Code coverage maintained or improved
- No linting violations
- Proper error handling
- Documentation updated
- License headers present
- Signed commits (git commit -s)
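The signed-commit item in the checklist above can be verified mechanically, since `git commit -s` appends a `Signed-off-by:` trailer to the commit message. The helper below is a minimal sketch of such a check; the name `has_signoff` is hypothetical and it is not part of the project's tooling.

```bash
#!/usr/bin/env bash
# Succeeds when the commit message read from stdin contains a
# "Signed-off-by: Name <email>" trailer line.
has_signoff() {
  grep -Eq '^Signed-off-by: .+ <.+>$'
}

# Example usage inside a repository (left as a comment so the sketch
# does not depend on being run from a git checkout):
#   git log -1 --format=%B | has_signoff && echo "HEAD is signed off"
```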
All source files must include the Apache 2.0 license header:
# Add license headers to new files
addlicense -f .github/headers/LICENSE .
# Check license headers
make license-headers-lint
The project uses GitHub Actions for continuous integration with the following workflows:
- lint-test.yml: Code quality and testing
  - Runs lint-test on all modules using standardized Makefile targets
  - Includes health monitors, Go modules, Python linting, shell script validation
  - Uses matrix strategy for parallel execution across components
- container-build-test.yml: Container build validation
  - Validates Docker builds for all modules can complete successfully
  - Uses the standardized docker-build targets from individual modules
  - Runs on pull requests affecting container-related files
- e2e-test.yml: End-to-end testing
  - Sets up Kind cluster with ctlptl for full integration testing
  - Uses Tilt for deployment and testing
- publish.yml: Container image publishing
- release.yml: Semantic release automation
# Run the same commands as GitHub Actions locally
make lint-test-all # Matches lint-test.yml workflow
# Individual module CI commands (common.mk patterns)
make -C health-monitors/syslog-health-monitor lint-test
make -C health-monitors/gpu-health-monitor lint-test
make -C platform-connectors lint-test # Uses common.mk patterns
make -C log-collector lint-test # Shell + Python linting
# Container builds (matches container-build-test.yml)
make -C health-monitors/syslog-health-monitor docker-build
make -C platform-connectors docker-build
# Or run individual steps for debugging (common.mk targets)
cd health-monitors/syslog-health-monitor
make vet # go vet ./...
make lint # golangci-lint run
make test # gotestsum with coverage
make coverage # generate coverage reports
# Manual commands (what common.mk executes)
go vet ./...
golangci-lint run --config ../.golangci.yml # Output format configured in .golangci.yml v2
gotestsum --junitfile report.xml -- -race $(go list ./...) -coverprofile=coverage.txt -covermode atomic
The CI environment uses:
- Consistent tool versions managed in .versions.yaml
- Shared build environment setup via .github/actions/setup-build-env
- Artifact uploads for test results and coverage reports
- Private repository access handled via SSH keys
- Tilt Debugging
# Start Tilt with Makefile (recommended)
make tilt-up
# Navigate to http://localhost:10350
# Or run Tilt in CI mode (no UI, good for debugging)
make tilt-ci
# Stream logs for specific service
kubectl logs -f deployment/platform-connectors -n nvsentinel
# Access Tilt logs and resource status
tilt get all
tilt logs platform-connectors
- gRPC Debugging
# Use grpcurl to test endpoints
grpcurl -plaintext localhost:50051 list
grpcurl -plaintext localhost:50051 platformconnector.PlatformConnector/SendHealthEvent
- Module Dependencies
# Clean module cache if dependency issues
go clean -modcache
go mod download
- Private Repository Access
# Verify SSH key configuration
ssh -T git@github.com
# Test access
git ls-remote git@github.com:dgxcloud/mk8s/k8s-addons/nvsentinel.git
- Container Build Issues
# Clean Docker cache
docker system prune -f
# Rebuild without cache
docker build --no-cache -t platform-connectors platform-connectors/
- Shellcheck Version Differences (Log Collector)
# GitHub Actions uses a specific shellcheck version from setup-build-env
# Local shellcheck version may differ, causing different linting results
# Use standardized linting (matches GitHub Actions):
make -C log-collector lint-test # Standardized pattern
make log-collector-lint # Main Makefile delegation
# Install shellcheck locally to match CI:
# macOS: brew install shellcheck
# Ubuntu: apt-get install shellcheck
# See: https://github.com/koalaman/shellcheck#installing
The scripts/setup-dev-env.sh script installs all development dependencies. If you encounter issues:
# Run with detailed debugging output
DEBUG=true make dev-env-setup AUTO_MODE=true
# Or run the script directly
DEBUG=true AUTO_MODE=true ./scripts/setup-dev-env.sh
Debug output includes:
- Architecture detection (x86_64 → amd64, aarch64 → arm64)
- Architecture mappings for different tools (GO_ARCH vs PROTOC_ARCH)
- Complete URLs being constructed for downloads
- HTTP response codes from URL verification
- Version information from .versions.yaml
1. Download Failures (404 errors)
If you see errors like "HTTP 404 Not Found":
# Enable debug mode to see the exact URL
DEBUG=true ./scripts/setup-dev-env.sh
# Verify the URL manually
curl -I "https://github.com/koalaman/shellcheck/releases/download/v0.11.0/shellcheck-v0.11.0.linux.x86_64.tar.xz"
# Check if the release exists on GitHub
# Visit: https://github.com/<owner>/<repo>/releases
Common causes:
- Version in .versions.yaml doesn't exist in GitHub releases
- Architecture-specific filename doesn't match release assets
- Tool project changed their release naming convention
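Each of the causes above comes down to a mismatch between the constructed URL and an asset that actually exists. The sketch below rebuilds the shellcheck URL from its parts, following the asset naming in the example URL earlier in this section, so that the version and architecture pieces can be checked one at a time. The function name `shellcheck_url` is hypothetical and is not taken from the setup script.

```bash
#!/usr/bin/env bash
# Build a shellcheck release URL from version, OS, and architecture.
# Usage: shellcheck_url <version> <os> <arch>
shellcheck_url() {
  local version="$1" os="$2" arch="$3"
  printf 'https://github.com/koalaman/shellcheck/releases/download/%s/shellcheck-%s.%s.%s.tar.xz\n' \
    "$version" "$version" "$os" "$arch"
}

url=$(shellcheck_url "v0.11.0" "linux" "x86_64")
echo "$url"
# To check that the asset exists (requires network access):
#   curl -sIL -o /dev/null -w '%{http_code}\n' "$url"
```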
2. Architecture Mismatch
Different tools use different architecture naming:
- Go tools (yq, kubectl, helm): amd64, arm64, darwin
- Protocol Buffers: x86_64, aarch_64, osx-universal_binary
- Shellcheck: x86_64, aarch64, darwin
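The naming differences listed above can be captured in a single lookup. The following is a minimal sketch of such a mapping, assuming `uname -m` reports x86_64, aarch64, or arm64; the function name `map_arch` and the family labels are hypothetical and do not come from scripts/setup-dev-env.sh.

```bash
#!/usr/bin/env bash
# Map a raw `uname -m` value to the name a given tool family expects.
# Usage: map_arch <go|protoc|shellcheck> [raw-arch]
map_arch() {
  local family="$1" raw="${2:-$(uname -m)}"
  case "$family:$raw" in
    go:x86_64)                            echo "amd64" ;;
    go:aarch64|go:arm64)                  echo "arm64" ;;
    protoc:x86_64)                        echo "x86_64" ;;
    protoc:aarch64|protoc:arm64)          echo "aarch_64" ;;
    shellcheck:x86_64)                    echo "x86_64" ;;
    shellcheck:aarch64|shellcheck:arm64)  echo "aarch64" ;;
    *) echo "unsupported: $family on $raw" >&2; return 1 ;;
  esac
}

map_arch go x86_64       # prints: amd64
map_arch protoc aarch64  # prints: aarch_64
```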
The script automatically maps these:
# See mappings in debug output
DEBUG=true ./scripts/setup-dev-env.sh
# Look for: "DEBUG: Architecture mappings: Raw ARCH: x86_64, GO_ARCH: amd64, PROTOC_ARCH: x86_64"
3. Permission Issues
# If installation to /usr/local/bin fails
sudo make dev-env-setup AUTO_MODE=true
# Or install to user directory (requires PATH modification)
mkdir -p ~/bin
export PATH="$HOME/bin:$PATH"
# Modify script to use ~/bin instead of /usr/local/bin
4. Network/Proxy Issues
# Test connectivity to GitHub
curl -I https://github.com
# If behind proxy, configure:
export HTTP_PROXY="http://proxy.example.com:8080"
export HTTPS_PROXY="http://proxy.example.com:8080"
export NO_PROXY="localhost,127.0.0.1"
# Then retry
DEBUG=true make dev-env-setup AUTO_MODE=true
5. Version Validation
# Check what versions are configured
cat .versions.yaml
# Verify specific tool version exists
TOOL_VERSION=$(yq eval '.SHELLCHECK_VERSION' .versions.yaml)
echo "Checking shellcheck version: $TOOL_VERSION"
curl -I "https://github.com/koalaman/shellcheck/releases/tag/$TOOL_VERSION"
# Update version if needed
yq eval '.SHELLCHECK_VERSION = "v0.11.0"' -i .versions.yaml
To test URL construction without running the full setup:
# Source the script functions
source scripts/setup-dev-env.sh
# Test specific tool URLs
echo "yq URL: $YQ_URL"
echo "kubectl URL: $KUBECTL_URL"
echo "protoc URL: $PROTOC_URL"
echo "shellcheck URL: $SHELLCHECK_URL"
# Test URL accessibility
curl -I "$SHELLCHECK_URL"
- Verify tool exists in releases:
# Check GitHub releases page (open is macOS; use xdg-open on Linux)
open "https://github.com/koalaman/shellcheck/releases"
# Or use API
curl -s "https://api.github.com/repos/koalaman/shellcheck/releases/latest" | grep browser_download_url
- Test download manually:
# Download specific asset
wget "https://github.com/koalaman/shellcheck/releases/download/v0.11.0/shellcheck-v0.11.0.linux.x86_64.tar.xz"
# Verify archive integrity (-J selects xz decompression for .tar.xz)
tar -tJf shellcheck-v0.11.0.linux.x86_64.tar.xz
- Validate script syntax:
# Check for syntax errors
bash -n scripts/setup-dev-env.sh
# Run shellcheck on the script itself
shellcheck scripts/setup-dev-env.sh
When reporting setup script issues, include:
- Debug output:
DEBUG=true make dev-env-setup AUTO_MODE=true 2>&1 | tee setup-debug.log
- Environment details:
echo "OS: $(uname -s)"
echo "Architecture: $(uname -m)"
echo "Shell: $SHELL"
echo "Bash version: $BASH_VERSION"
- Version file content:
cat .versions.yaml
- Failed URL (from debug output):
Look for lines like:
❌ ERROR: Failed to verify URL: https://...
HTTP Status: 404 Not Found
This information helps diagnose whether the issue is:
- Version-specific (tool version doesn't exist)
- Architecture-specific (wrong filename for platform)
- Network-related (connectivity or proxy issues)
- Script bug (incorrect URL construction logic)
The project uses a unified Makefile structure with shared patterns for consistency:
The main Makefile acts as the primary coordinator, delegating to specialized sub-Makefiles:
make help # Show all available targets
make lint-test-all # Run full test suite (delegates to all modules)
make health-monitors-lint-test-all # Delegate to health-monitors/Makefile
make docker-all # Delegate to docker/Makefile
make dev-env # Delegate to dev/Makefile
make kubernetes-distro-lint # Delegate to distros/kubernetes/Makefile
Shared build/test/Docker patterns for all Go modules:
# Included by all Go modules with: include ../common.mk
# Provides consistent targets:
all # Default target: lint-test
lint-test # Full lint and test (matches CI)
vet, lint, test, coverage, build, binary # Individual steps
docker-build, docker-publish # Docker targets (if HAS_DOCKER=1)
setup-buildx, clean, help # Utility targets
Coordinates all health monitoring modules:
make -C health-monitors help # Show health monitor targets
make -C health-monitors lint-test-all # Test all health monitors
make -C health-monitors go-lint-test-all # Test Go health monitors
make -C health-monitors python-lint-test-all # Test Python health monitors
make -C health-monitors build-all # Build all health monitors
Each Go module includes common.mk for consistent patterns:
# All Go modules have identical interface via common.mk
make -C health-monitors/syslog-health-monitor help # Help target
make -C health-monitors/syslog-health-monitor lint-test # Full lint-test
make -C platform-connectors lint-test # Same pattern
make -C health-events-analyzer lint-test # Same pattern
# Individual development steps (common.mk patterns)
make -C health-monitors/syslog-health-monitor vet # go vet ./...
make -C health-monitors/syslog-health-monitor lint # golangci-lint run
make -C health-monitors/syslog-health-monitor test # gotestsum
make -C health-monitors/syslog-health-monitor coverage # coverage reports
make -C health-monitors/syslog-health-monitor build # go build ./...
make -C health-monitors/syslog-health-monitor binary # go build main binary
Delegation-based Docker build system:
make -C docker help # Show all Docker targets and configuration
# Main build targets (delegates to individual modules)
make -C docker build-all # Build all images (delegates to modules)
make -C docker publish-all # Build and push all images
make -C docker setup-buildx # Setup Docker buildx builder
# Group targets
make -C docker build-health-monitors # Build all health monitor images
make -C docker build-main-modules # Build all non-health-monitor images
# Individual module targets (delegates to module Makefiles)
make -C docker build-syslog-health-monitor # Calls module's docker-build
make -C docker build-csp-health-monitor # Calls module's docker-build
make -C docker build-gpu-health-monitor-dcgm3 # Calls module's docker-build-dcgm3
make -C docker build-gpu-health-monitor-dcgm4 # Calls module's docker-build-dcgm4
make -C docker build-platform-connectors # Calls module's docker-build
make -C docker build-health-events-analyzer # Calls module's docker-build
make -C docker build-log-collector # Calls module's docker-build
# Publish targets (delegates to modules)
make -C docker publish-syslog-health-monitor # Calls module's docker-publish
make -C docker publish-all # Calls all modules' docker-publish
# Utility targets
make -C docker clean # Remove all nvsentinel images
make -C docker list # List built nvsentinel images
Key Features:
- Delegation-based: Each module is the single source of truth for its Docker config
- Multi-platform builds: linux/arm64, linux/amd64 via common.mk
- Build caching: Registry-based cache for faster builds
- Consistent patterns: Go modules use common.mk, specialized for Python/shell
- Dynamic tagging: Uses ${SAFE_REF_NAME} from branch/tag names
- Registry integration: Full NVCR.io paths and authentication
Focused on development environment:
make -C dev help # Show development targets
make -C dev env-up # Create cluster + start Tilt
make -C dev env-down # Stop Tilt + delete cluster
make -C dev cluster-create # Create Kind cluster
make -C dev tilt-up # Start Tilt
make -C dev cluster-status # Check cluster status
Helm and Kubernetes operations:
make -C distros/kubernetes help # Show Kubernetes targets
make -C distros/kubernetes lint # Lint Helm charts
make -C distros/kubernetes helm-publish # Publish Helm chart
# 1. Full development cycle
make dev-env # Start development environment
make lint-test-all # Test all modules
make docker-all # Build containers (delegates to modules)
make dev-env-clean # Clean up
# 2. Individual module development (common.mk patterns)
make platform-connectors-lint-test # Test specific Go module (main Makefile)
make -C platform-connectors lint-test # Test directly (common.mk pattern)
make -C platform-connectors docker-build # Build container (common.mk)
# 3. Focused development on specific module (common.mk targets)
cd platform-connectors
make lint-test # Full module test
make vet # Quick syntax check
make test # Run tests only
make build # Build module
make binary # Build main binary
# 4. Health monitors (coordination + common.mk patterns)
make health-monitors-lint-test-all # All health monitors
make -C health-monitors/syslog-health-monitor lint-test # Specific health monitor
All Go modules use consistent patterns via common.mk:
- Consistent targets: lint-test, vet, lint, test, build, binary
- Docker integration: docker-build, docker-publish (if HAS_DOCKER=1)
- Unified configuration: Same environment variables and build flags
- Backwards compatibility: Legacy targets (image, publish) still work
- For Go modules:
  cd your-module/
  go get github.com/new/dependency@v1.2.3
  go mod tidy
- For Python modules:
  cd health-monitors/gpu-health-monitor/
  poetry add new-dependency
- Edit .proto files in the protobufs/ directory
- Regenerate code:
  make protos-lint
- Update affected modules and test
- Update Helm values in distros/kubernetes/nvsentinel/values.yaml
- Update templates in distros/kubernetes/nvsentinel/templates/
- Update module code to read new configuration
- Test with Tilt or manual Helm install
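For the "update module code" step, one common pattern is for the Helm template to pass the new value to the container as an environment variable, which the module reads with a safe default. The sketch below assumes that pattern. The variable names `EXAMPLE_LOG_LEVEL` and `EXAMPLE_POLL_INTERVAL_SECONDS` are hypothetical, and modules that read flags or config files will look different.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envString returns the value of key, or def when the variable is unset or empty.
func envString(key, def string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return def
}

// envInt parses key as an integer and falls back to def when the variable
// is unset or does not hold a valid integer.
func envInt(key string, def int) int {
	v, ok := os.LookupEnv(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return def
	}
	return n
}

func main() {
	// Hypothetical variable names; in this sketch a Helm template would set them.
	level := envString("EXAMPLE_LOG_LEVEL", "info")
	interval := envInt("EXAMPLE_POLL_INTERVAL_SECONDS", 30)
	fmt.Printf("log level=%s poll interval=%ds\n", level, interval)
}
```

Keeping the default in code means the module still starts when it is deployed with an older chart that does not set the new value.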
// Enable pprof in Go applications (registers handlers on the default HTTP mux)
import _ "net/http/pprof"
# Access profiles (assumes the application serves HTTP on localhost:6060)
go tool pprof http://localhost:6060/debug/pprof/profile
go tool pprof http://localhost:6060/debug/pprof/heap
- Never break backward compatibility
- Add fields with default values
- Use MongoDB schema validation if needed
- Test with existing data
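The "add fields with default values" rule can be sketched as follows: set the defaults on the struct before decoding, so that documents written before the field existed still decode to a usable value. This is a minimal illustration that uses encoding/json as a stand-in for the database decoder. The `HealthEvent` type, its fields, and the `unknown` default are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HealthEvent is a hypothetical document type; Severity is the newly added field.
type HealthEvent struct {
	NodeName string `json:"nodeName"`
	Severity string `json:"severity"`
}

// decodeEvent pre-populates defaults before decoding. Fields that are absent
// from the input are left untouched, so old documents keep the default.
func decodeEvent(raw []byte) (HealthEvent, error) {
	ev := HealthEvent{Severity: "unknown"} // default for documents without the field
	err := json.Unmarshal(raw, &ev)
	return ev, err
}

func main() {
	oldDoc := []byte(`{"nodeName":"node-1"}`)
	newDoc := []byte(`{"nodeName":"node-2","severity":"fatal"}`)
	for _, raw := range [][]byte{oldDoc, newDoc} {
		ev, err := decodeEvent(raw)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s severity=%s\n", ev.NodeName, ev.Severity)
	}
}
```

Feeding both an old-shape and a new-shape document through the same decoder is also a cheap way to cover the "test with existing data" item in a unit test.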
- Start Small: Make incremental changes
- Test Early: Write tests alongside code
- Document Changes: Update relevant documentation
- Review Dependencies: Minimize external dependencies
- Monitor Resources: Be aware of CPU/memory usage
- Resource Limits: Always set resource requests/limits
- Health Checks: Implement readiness and liveness probes
- Graceful Shutdown: Handle SIGTERM properly
- Security Context: Run with minimal privileges
- Observability: Emit metrics and structured logs
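The health check and graceful shutdown items above can be combined in a small Go HTTP server. The following is a sketch under stated assumptions rather than the pattern NVSentinel modules necessarily use. The `/healthz` and `/readyz` paths, port 8080, and the 10-second drain timeout are illustrative choices.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// ready reports whether the process should receive traffic.
var ready atomic.Bool

// probeStatus maps readiness to the HTTP status the readiness probe returns.
func probeStatus(isReady bool) int {
	if isReady {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

func main() {
	mux := http.NewServeMux()
	// Liveness: the process is running and able to serve HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: starts failing once shutdown begins so traffic can drain.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(probeStatus(ready.Load()))
	})
	srv := &http.Server{Addr: ":8080", Handler: mux}

	// ctx is cancelled when SIGTERM or SIGINT arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("server error: %v", err)
		}
	}()
	ready.Store(true)

	<-ctx.Done()
	ready.Store(false)

	// Give in-flight requests a bounded time to finish.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```

The drain timeout should stay below the pod's terminationGracePeriodSeconds, otherwise the container is killed before Shutdown returns.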
- Indexes: Create appropriate indexes for queries
- Connection Pooling: Reuse connections efficiently
- Change Streams: Use resume tokens for reliability
- Error Handling: Handle network partitions gracefully
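One way to handle transient network failures is to retry with exponential backoff while honoring context cancellation. The sketch below is generic and not tied to the MongoDB driver, which has its own retry settings. The `retry` and `simulateFlakyCall` names are hypothetical, and the flaky operation only stands in for a database call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls op until it succeeds, attempts are exhausted, or ctx is done.
// The delay doubles after each failure, starting at base.
func retry(ctx context.Context, attempts int, base time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

// simulateFlakyCall fails twice before succeeding and returns how many
// times the operation was invoked. It stands in for a database call.
func simulateFlakyCall() (int, error) {
	calls := 0
	err := retry(context.Background(), 5, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("connection reset")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulateFlakyCall()
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

Only idempotent operations should be wrapped this way, because a write that timed out may already have been applied on the server.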
🎯 Usage Examples:
Local Development Workflow:
# Build for local testing (loads into local Docker daemon)
make -C docker build-syslog-health-monitor # Individual module
make -C docker build-all # All modules
make -C health-monitors/gpu-health-monitor docker-build-dcgm3 # Specific variant
# Test the built images locally
docker run ghcr.io/your-github-username/nvsentinel-syslog-health-monitor:local
CI/Production Workflow:
# Environment setup (matches GitHub Actions)
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username"
export CI_COMMIT_REF_NAME="main"
# Authentication handled by docker login to ghcr.io
# Build and push directly to registry (standardized patterns)
make -C docker publish-syslog-health-monitor # Individual module
make -C docker publish-all # All modules
make -C health-monitors/gpu-health-monitor docker-publish # Both DCGM variants
Development vs CI Behavior:
# Development: Fast local build (recommended)
make -C health-monitors/syslog-health-monitor docker-build-local
# Development: Full featured build (slower, like CI)
make -C health-monitors/syslog-health-monitor docker-build
# CI/Production: Build and push with --push (standardized)
make -C health-monitors/syslog-health-monitor docker-publish
- Internal Documentation: Check module-specific READMEs and make help targets
- GitHub Issues: Report bugs and feature requests
- Team Chat: Reach out to the development team
- Code Reviews: Learn from feedback on pull requests
- Makefile Help: Use make help in any module for target documentation
- Common Patterns: All Go modules follow common.mk patterns for consistency
Happy coding! 🚀
For questions about this guide or the development process, please reach out to the NVSentinel development team.