-
Notifications
You must be signed in to change notification settings - Fork 13
Description
Epic: Support User-Provided Custom Templates in Holodeck Config
Summary
Enable users to specify custom scripts and templates in the Holodeck configuration file that are executed during the provisioning stage. This allows further customization of target host setup beyond Holodeck's built-in templates.
Motivation
Currently, Holodeck only executes built-in provisioning templates. Users with advanced requirements need to:
- Install additional system packages or tools
- Configure custom environment variables
- Run organization-specific initialization scripts
- Install proprietary software or drivers
- Apply security hardening policies
- Set up monitoring or observability agents
By allowing custom templates, Holodeck becomes extensible without modifying core code.
Design Considerations
Execution Phases
Custom scripts can run at different provisioning phases:
- pre-install: Before any Holodeck components are installed
- post-runtime: After container runtime, before toolkit
- post-toolkit: After toolkit, before Kubernetes
- post-kubernetes: After Kubernetes is ready
- post-install (default): After all Holodeck components
Security Model
- Scripts execute with same privileges as provisioning (sudo available)
- Input validation required
- Optional dry-run/preview mode
- Script checksums for integrity verification
Subtasks
Phase 1: API Schema Design
-
Extend EnvironmentSpec for custom templates
apiVersion: holodeck.nvidia.com/v1alpha1 kind: Environment metadata: name: custom-env spec: # ... existing fields ... customTemplates: # Inline script - name: install-monitoring phase: post-kubernetes # pre-install, post-runtime, post-toolkit, post-kubernetes, post-install inline: | #!/bin/bash apt-get install -y prometheus-node-exporter systemctl enable prometheus-node-exporter # External file reference - name: security-hardening phase: pre-install file: ./scripts/harden.sh checksum: sha256:abc123... # Optional integrity check # Remote URL - name: company-setup phase: post-install url: https://internal.company.com/setup/gpu-host.sh checksum: sha256:def456... timeout: 300 # seconds
-
Define CustomTemplate type in API
type CustomTemplate struct { // Name is a human-readable identifier for the template Name string `json:"name"` // Phase determines when the template is executed // +kubebuilder:validation:Enum=pre-install;post-runtime;post-toolkit;post-kubernetes;post-install Phase TemplatePhase `json:"phase,omitempty"` // Inline contains the script content directly // +optional Inline string `json:"inline,omitempty"` // File is a path to a local script file // +optional File string `json:"file,omitempty"` // URL is a remote location to fetch the script from // +optional URL string `json:"url,omitempty"` // Checksum is an optional SHA256 checksum for verification // +optional Checksum string `json:"checksum,omitempty"` // Timeout in seconds for script execution (default: 600) // +optional Timeout int `json:"timeout,omitempty"` // ContinueOnError allows provisioning to continue if this script fails // +optional ContinueOnError bool `json:"continueOnError,omitempty"` // Environment variables to set for the script // +optional Env map[string]string `json:"env,omitempty"` }
Phase 2: Template Loading
-
Implement template source resolver
- Load inline scripts directly
- Read local files (absolute or relative to config)
- Fetch remote URLs with timeout
type TemplateLoader interface { Load(ctx context.Context, template CustomTemplate) ([]byte, error) }
-
Implement checksum verification
- Support SHA256 checksums
- Fail if checksum mismatch
- Log warning if no checksum provided for remote URLs
-
Handle template preprocessing
- Inject Holodeck common functions if requested
- Variable substitution (environment info, hostnames, etc.)
- Shebang detection/addition
Phase 3: Dependency Resolver Integration
- Extend DependencyResolver for custom templates
- Insert custom templates at appropriate phases
- Maintain execution order within phases
func (d *DependencyResolver) Resolve() []ProvisionFunc { // Pre-install custom templates d.addCustomTemplates(PhasePreInstall) // Kernel if d.env.Spec.Kernel.Version != "" { d.withKernel() } // Driver if d.env.Spec.NVIDIADriver.Install { d.withNVDriver() } // Container Runtime if d.env.Spec.ContainerRuntime.Install { d.withContainerRuntime() } d.addCustomTemplates(PhasePostRuntime) // ... continue pattern }
Phase 4: Execution Framework
-
Implement custom template executor
- Execute via SSH (consistent with built-in templates)
- Capture stdout/stderr
- Enforce timeouts
- Handle exit codes
func (p *Provisioner) executeCustomTemplate(template CustomTemplate) error { script, err := p.loader.Load(ctx, template) if err != nil { return fmt.Errorf("failed to load template %s: %w", template.Name, err) } // Add environment variables envPrefix := formatEnvVars(template.Env) // Execute with timeout ctx, cancel := context.WithTimeout(ctx, time.Duration(template.Timeout)*time.Second) defer cancel() return p.executeWithTimeout(ctx, envPrefix+string(script)) }
-
Add execution progress reporting
- Log template name before execution
- Report success/failure clearly
- Show execution time
Phase 5: Error Handling
-
Implement per-template error handling
- Respect
continueOnErrorflag - Aggregate errors for final report
- Clear error messages with template name
[CUSTOM] Running template 'install-monitoring' (phase: post-kubernetes) [CUSTOM] Template 'install-monitoring' completed in 45s [CUSTOM] Running template 'security-hardening' (phase: post-install) [ERROR] Template 'security-hardening' failed: exit code 1 [ERROR] Output: apt-get: package not found: nonexistent-package - Respect
-
Implement dry-run support
- Show what scripts would be executed
- Display script content (or summary for large scripts)
- Validate checksums without executing
Phase 6: Security Measures
-
Implement input validation
- Validate URLs (no file:// or unsafe protocols)
- Check for obvious dangerous patterns (optional warning)
- Enforce HTTPS for remote URLs
-
Add audit logging
- Log all custom template executions
- Record template content hashes
- Include in provisioning report
-
Implement template preview in dryrun
holodeck dryrun -f env.yaml # ... built-in templates ... === Custom Templates === [Phase: post-kubernetes] install-monitoring Source: inline Content: #!/bin/bash apt-get install -y prometheus-node-exporter systemctl enable prometheus-node-exporter [Phase: post-install] security-hardening Source: ./scripts/harden.sh Checksum: sha256:abc123... (verified ✓)
Phase 7: Template Library (Optional)
-
Create example template library
- Common monitoring setups
- Security hardening scripts
- Development tools installation
- Located in
examples/templates/
-
Document template best practices
- Idempotency guidelines
- Error handling recommendations
- Environment variable usage
- Testing templates locally
Phase 8: CLI Integration
- Add template management commands
# Validate templates in config holodeck validate -f env.yaml --templates # Extract templates for review holodeck template list -f env.yaml holodeck template show -f env.yaml --name install-monitoring # Test template execution on existing instance holodeck template run <instance-id> --file ./test-script.sh
Phase 9: Documentation
-
Create custom templates guide
- Schema reference
- Execution phases explained
- Best practices
- Security considerations
- Example use cases
-
Add examples
- Monitoring agent installation
- Custom package installation
- Environment configuration
- Multi-script workflows
Example Configurations
Simple Post-Install Script
spec:
customTemplates:
- name: install-tools
phase: post-install
inline: |
#!/bin/bash
apt-get update
apt-get install -y htop vim tmuxMulti-Phase Setup
spec:
customTemplates:
- name: pre-flight-checks
phase: pre-install
inline: |
#!/bin/bash
echo "Starting provisioning at $(date)"
df -h
free -m
- name: configure-runtime
phase: post-runtime
file: ./scripts/custom-containerd-config.sh
- name: deploy-monitoring
phase: post-kubernetes
url: https://example.com/monitoring-setup.sh
checksum: sha256:abc123def456...
env:
PROMETHEUS_ENDPOINT: http://prometheus.company.com
GRAFANA_API_KEY: ${GRAFANA_KEY}
- name: final-validation
phase: post-install
inline: |
#!/bin/bash
echo "Provisioning complete!"
nvidia-smi
kubectl get nodesAcceptance Criteria
- Users can specify inline scripts in config
- Users can reference local script files
- Users can specify remote URLs with checksum verification
- Scripts execute at the correct provisioning phase
-
continueOnErrorworks as expected - Dryrun shows custom template information
- Clear error messages when templates fail
- Documentation covers all use cases
Supersedes
- Closes Feature: Support Custom Templates in Holodeck Config for Provisioning #487 (original custom templates issue)
Labels
feature customization extensibility