Skip to content

[Epic] Support User-Provided Custom Templates in Holodeck Config #565

@ArangoGutierrez

Description

@ArangoGutierrez

Epic: Support User-Provided Custom Templates in Holodeck Config

Summary

Enable users to specify custom scripts and templates in the Holodeck configuration file that are executed during the provisioning stage. This allows further customization of target host setup beyond Holodeck's built-in templates.

Motivation

Currently, Holodeck only executes built-in provisioning templates. Users with advanced requirements need to:

  • Install additional system packages or tools
  • Configure custom environment variables
  • Run organization-specific initialization scripts
  • Install proprietary software or drivers
  • Apply security hardening policies
  • Set up monitoring or observability agents

By allowing custom templates, Holodeck becomes extensible without modifying core code.

Design Considerations

Execution Phases

Custom scripts can run at different provisioning phases:

  1. pre-install: Before any Holodeck components are installed
  2. post-runtime: After container runtime, before toolkit
  3. post-toolkit: After toolkit, before Kubernetes
  4. post-kubernetes: After Kubernetes is ready
  5. post-install (default): After all Holodeck components

Security Model

  • Scripts execute with same privileges as provisioning (sudo available)
  • Input validation required
  • Optional dry-run/preview mode
  • Script checksums for integrity verification

Subtasks

Phase 1: API Schema Design

  • Extend EnvironmentSpec for custom templates

    apiVersion: holodeck.nvidia.com/v1alpha1
    kind: Environment
    metadata:
      name: custom-env
    spec:
      # ... existing fields ...
      customTemplates:
        # Inline script
        - name: install-monitoring
          phase: post-kubernetes  # pre-install, post-runtime, post-toolkit, post-kubernetes, post-install
          inline: |
            #!/bin/bash
            apt-get install -y prometheus-node-exporter
            systemctl enable prometheus-node-exporter
        
        # External file reference
        - name: security-hardening
          phase: pre-install
          file: ./scripts/harden.sh
          checksum: sha256:abc123...  # Optional integrity check
        
        # Remote URL
        - name: company-setup
          phase: post-install
          url: https://internal.company.com/setup/gpu-host.sh
          checksum: sha256:def456...
          timeout: 300  # seconds
  • Define CustomTemplate type in API

    type CustomTemplate struct {
        // Name is a human-readable identifier for the template
        Name string `json:"name"`
        
        // Phase determines when the template is executed
        // +kubebuilder:validation:Enum=pre-install;post-runtime;post-toolkit;post-kubernetes;post-install
        Phase TemplatePhase `json:"phase,omitempty"`
        
        // Inline contains the script content directly
        // +optional
        Inline string `json:"inline,omitempty"`
        
        // File is a path to a local script file
        // +optional
        File string `json:"file,omitempty"`
        
        // URL is a remote location to fetch the script from
        // +optional
        URL string `json:"url,omitempty"`
        
        // Checksum is an optional SHA256 checksum for verification
        // +optional
        Checksum string `json:"checksum,omitempty"`
        
        // Timeout in seconds for script execution (default: 600)
        // +optional
        Timeout int `json:"timeout,omitempty"`
        
        // ContinueOnError allows provisioning to continue if this script fails
        // +optional
        ContinueOnError bool `json:"continueOnError,omitempty"`
        
        // Environment variables to set for the script
        // +optional
        Env map[string]string `json:"env,omitempty"`
    }

Phase 2: Template Loading

  • Implement template source resolver

    • Load inline scripts directly
    • Read local files (absolute or relative to config)
    • Fetch remote URLs with timeout
    type TemplateLoader interface {
        Load(ctx context.Context, template CustomTemplate) ([]byte, error)
    }
  • Implement checksum verification

    • Support SHA256 checksums
    • Fail if checksum mismatch
    • Log warning if no checksum provided for remote URLs
  • Handle template preprocessing

    • Inject Holodeck common functions if requested
    • Variable substitution (environment info, hostnames, etc.)
    • Shebang detection/addition

Phase 3: Dependency Resolver Integration

  • Extend DependencyResolver for custom templates
    • Insert custom templates at appropriate phases
    • Maintain execution order within phases
    func (d *DependencyResolver) Resolve() []ProvisionFunc {
        // Pre-install custom templates
        d.addCustomTemplates(PhasePreInstall)
        
        // Kernel
        if d.env.Spec.Kernel.Version != "" {
            d.withKernel()
        }
        
        // Driver
        if d.env.Spec.NVIDIADriver.Install {
            d.withNVDriver()
        }
        
        // Container Runtime
        if d.env.Spec.ContainerRuntime.Install {
            d.withContainerRuntime()
        }
        d.addCustomTemplates(PhasePostRuntime)
        
        // ... continue pattern
    }

Phase 4: Execution Framework

  • Implement custom template executor

    • Execute via SSH (consistent with built-in templates)
    • Capture stdout/stderr
    • Enforce timeouts
    • Handle exit codes
    func (p *Provisioner) executeCustomTemplate(template CustomTemplate) error {
        script, err := p.loader.Load(ctx, template)
        if err != nil {
            return fmt.Errorf("failed to load template %s: %w", template.Name, err)
        }
        
        // Add environment variables
        envPrefix := formatEnvVars(template.Env)
        
        // Execute with timeout
        ctx, cancel := context.WithTimeout(ctx, time.Duration(template.Timeout)*time.Second)
        defer cancel()
        
        return p.executeWithTimeout(ctx, envPrefix+string(script))
    }
  • Add execution progress reporting

    • Log template name before execution
    • Report success/failure clearly
    • Show execution time

Phase 5: Error Handling

  • Implement per-template error handling

    • Respect continueOnError flag
    • Aggregate errors for final report
    • Clear error messages with template name
    [CUSTOM] Running template 'install-monitoring' (phase: post-kubernetes)
    [CUSTOM] Template 'install-monitoring' completed in 45s
    [CUSTOM] Running template 'security-hardening' (phase: post-install)
    [ERROR] Template 'security-hardening' failed: exit code 1
    [ERROR] Output: apt-get: package not found: nonexistent-package
    
  • Implement dry-run support

    • Show what scripts would be executed
    • Display script content (or summary for large scripts)
    • Validate checksums without executing

Phase 6: Security Measures

  • Implement input validation

    • Validate URLs (no file:// or unsafe protocols)
    • Check for obvious dangerous patterns (optional warning)
    • Enforce HTTPS for remote URLs
  • Add audit logging

    • Log all custom template executions
    • Record template content hashes
    • Include in provisioning report
  • Implement template preview in dryrun

    holodeck dryrun -f env.yaml
    # ... built-in templates ...
    
    === Custom Templates ===
    
    [Phase: post-kubernetes] install-monitoring
    Source: inline
    Content:
      #!/bin/bash
      apt-get install -y prometheus-node-exporter
      systemctl enable prometheus-node-exporter
    
    [Phase: post-install] security-hardening  
    Source: ./scripts/harden.sh
    Checksum: sha256:abc123... (verified ✓)

Phase 7: Template Library (Optional)

  • Create example template library

    • Common monitoring setups
    • Security hardening scripts
    • Development tools installation
    • Located in examples/templates/
  • Document template best practices

    • Idempotency guidelines
    • Error handling recommendations
    • Environment variable usage
    • Testing templates locally

Phase 8: CLI Integration

  • Add template management commands
    # Validate templates in config
    holodeck validate -f env.yaml --templates
    
    # Extract templates for review
    holodeck template list -f env.yaml
    holodeck template show -f env.yaml --name install-monitoring
    
    # Test template execution on existing instance
    holodeck template run <instance-id> --file ./test-script.sh

Phase 9: Documentation

  • Create custom templates guide

    • Schema reference
    • Execution phases explained
    • Best practices
    • Security considerations
    • Example use cases
  • Add examples

    • Monitoring agent installation
    • Custom package installation
    • Environment configuration
    • Multi-script workflows

Example Configurations

Simple Post-Install Script

spec:
  customTemplates:
    - name: install-tools
      phase: post-install
      inline: |
        #!/bin/bash
        apt-get update
        apt-get install -y htop vim tmux

Multi-Phase Setup

spec:
  customTemplates:
    - name: pre-flight-checks
      phase: pre-install
      inline: |
        #!/bin/bash
        echo "Starting provisioning at $(date)"
        df -h
        free -m
    
    - name: configure-runtime
      phase: post-runtime
      file: ./scripts/custom-containerd-config.sh
    
    - name: deploy-monitoring
      phase: post-kubernetes
      url: https://example.com/monitoring-setup.sh
      checksum: sha256:abc123def456...
      env:
        PROMETHEUS_ENDPOINT: http://prometheus.company.com
        GRAFANA_API_KEY: ${GRAFANA_KEY}
    
    - name: final-validation
      phase: post-install
      inline: |
        #!/bin/bash
        echo "Provisioning complete!"
        nvidia-smi
        kubectl get nodes

Acceptance Criteria

  • Users can specify inline scripts in config
  • Users can reference local script files
  • Users can specify remote URLs with checksum verification
  • Scripts execute at the correct provisioning phase
  • continueOnError works as expected
  • Dryrun shows custom template information
  • Clear error messages when templates fail
  • Documentation covers all use cases

Supersedes

Labels

feature customization extensibility

Metadata

Metadata

Assignees

No one assigned

    Labels

    featureissue/PR that proposes a new feature or functionality

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions