Image/Model Pre-loading to GPU Nodes #205

@Code2Life

Feature Description

Implement a pre-loading mechanism in TensorFusion that proactively downloads and caches AI model files and container images on GPU nodes before they are needed for inference workloads. This feature will integrate with the TensorFusion operator to intelligently distribute and manage model artifacts across the GPU cluster, ensuring optimal resource utilization and reduced cold-start times.
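As a rough illustration of what the configuration surface could look like, here is a minimal Go sketch of custom resource types for a preloading job. All names here (PreloadingJob, ArtifactSpec, nodeSelector, and so on) are hypothetical assumptions for discussion, not the actual TensorFusion API:

```go
// Hypothetical CRD types for a preloading job; illustrative only.
package v1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PreloadingJob declares artifacts to pre-fetch onto matching GPU nodes.
type PreloadingJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PreloadingJobSpec   `json:"spec"`
	Status PreloadingJobStatus `json:"status,omitempty"`
}

// PreloadingJobSpec lists artifacts and the nodes they should land on.
type PreloadingJobSpec struct {
	// NodeSelector picks the GPU nodes that should cache the artifacts.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	// Artifacts are the container images and model files to pre-load.
	Artifacts []ArtifactSpec `json:"artifacts"`
}

// ArtifactSpec identifies one container image or model file to download.
type ArtifactSpec struct {
	// Type is either "image" or "model" (assumed taxonomy).
	Type string `json:"type"`
	// Source is an image reference or a model URI.
	Source string `json:"source"`
}

// PreloadingJobStatus reports per-node download progress.
type PreloadingJobStatus struct {
	// NodeProgress maps node name to percent downloaded (0-100).
	NodeProgress map[string]int32 `json:"nodeProgress,omitempty"`
}
```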

Motivation

AI models and inference container images are exceptionally large (often ranging from several GBs to hundreds of GBs), creating significant deployment bottlenecks in GPU clusters. Current challenges include:

  • Extended cold-start times: New model deployments can spend 10-30+ minutes just downloading images and model files
  • Resource waste: GPU nodes sit idle during lengthy downloads, reducing overall cluster utilization
  • Scaling delays: Horizontal pod autoscaling is severely slowed by download overhead
  • Network congestion: Multiple concurrent downloads can saturate cluster networking
  • Inconsistent performance: Unpredictable deployment times make SLA compliance harder to guarantee
  • Cost implications: Idle GPU time during downloads translates directly into operational cost

Acceptance Criteria

  • Pre-loading Controller: Implement a controller that monitors deployment patterns and proactively downloads artifacts according to preloading jobs defined in the TensorFusion custom resource
  • Cache Management: Develop intelligent caching with configurable retention policies and LRU eviction (a minimal sketch follows this list)
  • Progress Tracking: Sync download progress and cached-item status to the GPUNode status, similar to how kubelet reports node info for Kubernetes nodes
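
As a concrete starting point for the cache-management criterion, below is a small, self-contained LRU sketch in Go. The byte-budget model, the artifact struct, and the eviction hook are assumptions for illustration, not TensorFusion internals:

```go
// Minimal LRU cache for artifacts cached on a node; illustrative only.
package main

import (
	"container/list"
	"fmt"
)

type artifact struct {
	key  string // image reference or model URI
	size int64  // bytes on disk
}

type lruCache struct {
	capacity int64                    // byte budget for the node's cache
	used     int64                    // bytes currently cached
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
	evict    func(a artifact)         // hook: e.g. delete files from disk
}

func newLRUCache(capacity int64, evict func(artifact)) *lruCache {
	return &lruCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
		evict:    evict,
	}
}

// Touch records a cache hit, moving the artifact to the front.
func (c *lruCache) Touch(key string) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
	}
}

// Add inserts a newly downloaded artifact, evicting least recently
// used entries until the byte budget is satisfied.
func (c *lruCache) Add(a artifact) {
	if el, ok := c.items[a.key]; ok {
		c.order.MoveToFront(el)
		return
	}
	for c.used+a.size > c.capacity && c.order.Len() > 0 {
		oldest := c.order.Back()
		victim := oldest.Value.(artifact)
		c.order.Remove(oldest)
		delete(c.items, victim.key)
		c.used -= victim.size
		c.evict(victim)
	}
	c.items[a.key] = c.order.PushFront(a)
	c.used += a.size
}

func main() {
	cache := newLRUCache(100, func(a artifact) {
		fmt.Println("evicting", a.key)
	})
	cache.Add(artifact{key: "model-a", size: 60})
	cache.Add(artifact{key: "model-b", size: 50}) // evicts model-a
}
```

In practice the retention policy would also need pinning for artifacts referenced by active preloading jobs, so that a burst of downloads cannot evict a model a workload is about to use.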
