Description
Currently, KAITO allocates full GPUs to model workloads. For smaller LLMs (e.g., Phi-3, or Mistral 7B with quantization), a full A100 or H100 is often overkill, leading to underutilized hardware and higher costs. We propose adding support for NVIDIA MIG (Multi-Instance GPU), allowing users to partition a single physical GPU into multiple isolated instances and run several small models concurrently on a single node.
Use Case
Efficient Multi-Model Hosting: A user wants to run 4 small model instances (e.g., 4x Phi-4-mini) on a single Azure Standard_ND96asr_v4 node (which has 8x A100 GPUs). With MIG, each A100 can be partitioned into up to seven instances, so the same node could theoretically host dozens of small models (up to 56) by partitioning each GPU.
Proposed Scope: BYO (Bring Your Own) Node First
To reduce initial complexity, this feature will focus on BYO Nodes:
- The infrastructure admin is responsible for enabling MIG mode on the GPUs and creating the partitions (e.g., `1g.10gb`) at the OS/node level.
- KAITO will focus on discovery and scheduling rather than hardware partitioning management.
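For discovery, KAITO can rely on what the NVIDIA device plugin already advertises. On a node with MIG partitions created under the mixed strategy, the plugin exposes each profile as its own extended resource; a sketch of what the node status might look like (counts and profiles are illustrative):

```yaml
# Excerpt of a node's status as reported by the NVIDIA device plugin
# (mixed strategy); the counts and profiles below are illustrative.
status:
  allocatable:
    nvidia.com/mig-1g.10gb: "4"
    nvidia.com/mig-2g.20gb: "2"
```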
Technical Challenges
- Resource Specification: Kubernetes represents MIG instances as unique extended resources (e.g., `nvidia.com/mig-1g.10gb`). KAITO's current `Resource` spec needs to handle these instead of the standard `nvidia.com/gpu`.
- VRAM Estimator: The internal KAITO estimator must be updated to understand that a `1g.10gb` profile has specific memory limits, preventing it from trying to schedule a model that requires 20GB of VRAM onto a 10GB MIG partition.
- Workflow Decoupling: This should be decoupled from the main Node Autoprovisioning (NAP) workflow initially, as NAP requires complex integration with cloud-specific APIs to request MIG-enabled pools.
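To make the estimator point concrete, here is a minimal sketch of how a MIG resource name could be mapped to its memory capacity. The helper name and the regex are assumptions for illustration, not existing KAITO code; they assume the standard `<N>g.<M>gb` profile naming.

```python
import re
from typing import Optional

# Matches MIG extended-resource names such as "nvidia.com/mig-1g.10gb";
# the second capture group is the partition's memory size in GB.
MIG_RESOURCE_RE = re.compile(r"^nvidia\.com/mig-(\d+)g\.(\d+)gb$")

def mig_memory_gb(resource_name: str) -> Optional[int]:
    """Return the partition's memory in GB, or None for non-MIG resources."""
    match = MIG_RESOURCE_RE.match(resource_name)
    if match is None:
        return None  # e.g. the plain "nvidia.com/gpu" resource
    return int(match.group(2))

print(mig_memory_gb("nvidia.com/mig-1g.10gb"))  # 10
print(mig_memory_gb("nvidia.com/mig-3g.40gb"))  # 40
print(mig_memory_gb("nvidia.com/gpu"))          # None
```

The estimator could then compare a model's estimated VRAM footprint against this capacity before admitting the Workspace.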
Proposed API Changes
Add support for MIG resource strings in the KAITO `Workspace` CRD.
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: phi-3-mig
spec:
  offloading: false
  resource:
    instanceType: "" # Leave empty for BYO / manual scheduling
    labelSelector:
      matchLabels:
        apps: mig-enabled-node
    # New capability: requesting specific MIG resources
    resourceRequests:
      nvidia.com/mig-1g.10gb: 1
  inference:
    preset:
      name: "phi-3-mini-4k-instruct"
```
Implementation Strategy
- Partition Management: Customer Managed. The user configures the MIG strategy (Single Strategy or Mixed Strategy) on the node. KAITO simply consumes the available resources advertised by the NVIDIA Device Plugin.
- Scheduler: KAITO should ensure the `count` in the resource request aligns with how many MIG instances are required per replica.
- Estimator Logic:
  - Enhance the estimator to map `nvidia.com/mig-x.ygb` to a memory capacity of `y` GB.
  - If the model's estimated footprint > `y` GB, the admission controller should reject the Workspace creation with a descriptive error.
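The admission check above can be sketched as follows. The function name and error wording are hypothetical, meant only to show the rejection behavior under the assumption that the estimator has already produced a footprint number:

```python
def validate_workspace(estimated_vram_gb: float, profile_capacity_gb: int) -> None:
    """Reject (raise) if the model cannot fit in the requested MIG partition.

    Hypothetical admission-time check; in KAITO this would run in the
    admission webhook with values from the VRAM estimator.
    """
    if estimated_vram_gb > profile_capacity_gb:
        raise ValueError(
            f"model needs ~{estimated_vram_gb}GB VRAM but the requested MIG "
            f"partition provides only {profile_capacity_gb}GB; request a "
            f"larger profile or a full GPU"
        )

# A 20GB model cannot fit on a 1g.10gb partition:
try:
    validate_workspace(20.0, 10)
except ValueError as err:
    print(err)

# But it fits on a 3g.40gb partition (no exception raised):
validate_workspace(20.0, 40)
```

Failing fast at admission, rather than letting the pod OOM at load time, gives the user a descriptive error at Workspace creation.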