2.0/overview/index.html

<h1>Overview</h1>
<p>GPUStack is an open-source GPU cluster manager for running AI models.</p>
<h3 id="key-features">Key Features</h3>
<ul>
<li><strong>High Performance:</strong> Optimized for high-throughput and low-latency inference.</li>
<li><strong>GPU Cluster Management:</strong> Efficiently manage multiple GPU clusters across different providers, including Docker-based and Kubernetes environments as well as cloud platforms such as DigitalOcean.</li>
<li><strong>Broad GPU Compatibility:</strong> Seamless support for GPUs from various vendors.</li>
<li><strong>Extensive Model Support:</strong> Supports a wide range of models, including LLMs, VLMs, image models, audio models, embedding models, and rerank models.</li>
<li><strong>Flexible Inference Backends:</strong> Built-in support for fast inference engines such as vLLM and SGLang, with the ability to integrate custom backends.</li>
<li><strong>Multi-Version Backend Support:</strong> Run multiple versions of inference backends concurrently to meet diverse runtime requirements.</li>
<li><strong>Distributed Inference:</strong> Supports single-node and multi-node, multi-GPU inference, including heterogeneous GPUs across vendors and environments.</li>
<li><strong>Scalable GPU Architecture:</strong> Easily scale by adding more GPUs, nodes, or clusters to your infrastructure.</li>
<li><strong>Robust Model Stability:</strong> Ensures high availability through automatic failure recovery, multi-instance redundancy, and intelligent load balancing.</li>
<li><strong>Intelligent Deployment Evaluation:</strong> Automatically assesses model resource requirements, backend and architecture compatibility, OS compatibility, and other deployment factors.</li>
<li><strong>Automated Scheduling:</strong> Dynamically allocates models based on available resources.</li>
<li><strong>OpenAI-Compatible APIs:</strong> Fully compatible with OpenAI API specifications for seamless integration (see the example after this list).</li>
<li><strong>User & API Key Management:</strong> Simplified management of users and API keys.</li>
<li><strong>Real-Time GPU Monitoring:</strong> Monitor GPU performance and utilization in real time.</li>
<li><strong>Token and Rate Metrics:</strong> Track token usage and API request rates.</li>
</ul>
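<p>As a quick illustration of the OpenAI-compatible APIs, the sketch below calls a GPUStack deployment with the official OpenAI Python client. The server address, API key, and model name are placeholder assumptions, and the exact base path may vary by GPUStack version.</p>
<pre><code># A minimal sketch, assuming a GPUStack server that exposes
# OpenAI-compatible endpoints. base_url, api_key, and model are
# illustrative placeholders for your own installation.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1",  # hypothetical endpoint path
    api_key="your-api-key",                     # an API key created in GPUStack
)

response = client.chat.completions.create(
    model="your-deployed-model",  # any model deployed on the cluster
    messages=[{"role": "user", "content": "Hello, GPUStack!"}],
)
print(response.choices[0].message.content)
</code></pre>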
<p>This policy filters workers based on the cluster configuration of the model. Only those workers that belong to the specified cluster are retained for further evaluation.</p>
<p>This policy filters workers based on the label selectors configured for the model. If no label selectors are defined for the model, all workers are considered. Otherwise, the system checks whether the labels of each worker node match the model's label selectors, retaining only those workers that match.</p>
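<p>To make the filtering concrete, here is a small sketch of how the cluster and label policies described above could be expressed. The data shapes and function names are illustrative assumptions, not GPUStack's actual internals: each policy takes the candidate workers and returns the subset that passes.</p>
<pre><code># A minimal sketch of the cluster and label filter policies; Worker and
# all field names are hypothetical, mirroring only the behavior in the text.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cluster: str
    labels: dict[str, str] = field(default_factory=dict)

def filter_by_cluster(workers: list[Worker], model_cluster: str) -> list[Worker]:
    # Keep only workers that belong to the model's configured cluster.
    return [w for w in workers if w.cluster == model_cluster]

def filter_by_labels(workers: list[Worker], selectors: dict[str, str]) -> list[Worker]:
    # With no selectors defined, all workers are considered; otherwise a
    # worker matches only if it carries every selector key/value pair.
    if not selectors:
        return workers
    return [w for w in workers
            if all(w.labels.get(k) == v for k, v in selectors.items())]
</code></pre>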
<h4 id="status-policy">Status Policy</h4>
<p>This policy filters workers based on their status, retaining only those that are in a READY state.</p>
<p>This policy filters workers based on the backend framework required by the model (e.g., vLLM, SGLang). Only those workers with GPUs that support the specified backend framework are retained for further evaluation.</p>
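<p>These two policies admit an equally small sketch, again with hypothetical names: a status table mapping worker names to states, and a table of which backend frameworks each worker's GPUs support.</p>
<pre><code># Hypothetical sketches of the status and backend-framework filters.
def filter_by_status(workers: list[str], statuses: dict[str, str]) -> list[str]:
    # Retain only workers reported as READY.
    return [w for w in workers if statuses.get(w) == "READY"]

def filter_by_backend(workers: list[str],
                      gpu_backends: dict[str, set[str]],
                      required: str) -> list[str]:
    # Retain only workers with at least one GPU supporting the backend
    # framework the model requires (e.g. "vllm", "sglang").
    return [w for w in workers if required in gpu_backends.get(w, set())]
</code></pre>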
<h4 id="resource-fit-policy">Resource Fit Policy</h4>
<p>The Resource Fit Policy is a critical strategy in the scheduling system, used to filter workers or GPUs based on resource compatibility. Its goal is to ensure that model instances can run on the selected workers: the policy first estimates the model's resource requirements and then evaluates candidate placements in a fixed priority order.</p>
<p>Resource requirements are determined based on:</p>
<ul>
<li>
<p>For GGUF models: Uses the <a href="https://github.com/gpustack/gguf-parser-go">GGUF parser</a> to estimate the model's resource requirements.</p>
</li>
<li>
<p>For other model types: Estimated by the backend (e.g., vLLM, SGLang, MindIE, VoxBox).</p>
</li>
</ul>
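<p>As a rough illustration of what such an estimate involves, the sketch below adds up the main contributors to a model's memory footprint. The formula and every parameter name are simplified assumptions; real estimators such as the GGUF parser read quantization, architecture, and context settings from model metadata and account for many more factors.</p>
<pre><code># A deliberately simplified VRAM estimate, not the gguf-parser-go logic.
def estimate_vram_bytes(
    n_params: float,                  # total parameter count
    bytes_per_param: float,           # e.g. ~0.5 for 4-bit quantization
    n_layers: int,                    # transformer layer count
    kv_bytes_per_token_per_layer: int,
    context_len: int,                 # maximum context length
    overhead_bytes: int,              # runtime buffers, activations, etc.
) -> float:
    weights = n_params * bytes_per_param
    kv_cache = n_layers * kv_bytes_per_token_per_layer * context_len
    return weights + kv_cache + overhead_bytes
</code></pre>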
<p>Backends have different capabilities:</p>
<ul>
<li>
<p>vLLM, SGLang, MindIE: GPU-only; no CPU or partial offload.</p>
</li>
<li>
<p>Custom backends, VoxBox: Support GPU offload or CPU execution.</p>
</li>
</ul>
<p>Candidates are evaluated in the following order, and the process stops once the first valid placement is found:</p>
<ol>
<li>
<p>Single Worker, Single GPU (Full Fit)
A single GPU fully satisfies the model’s requirements.</p>
</li>
<li>
<p>Single Worker, Multiple GPUs (Full Fit)
Multiple GPUs on the same worker jointly satisfy the requirements.</p>
</li>
<li>
<p>Distributed Inference (Across Workers)
GPUs across multiple workers can be used when the backend supports distributed execution.</p>
</li>
<li>
<p>Single Worker, CPU Execution
CPU-only execution, supported only for Custom and VoxBox backends.</p>
</li>
</ol>
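<p>Putting the pieces together, the following sketch shows one way this first-fit priority order could be implemented. The GPU record, the placement labels, and the capability flags are illustrative assumptions based only on the description above.</p>
<pre><code># A minimal sketch of the first-fit candidate ordering described above.
from dataclasses import dataclass

@dataclass
class GPU:
    worker: str
    free_vram: int

def place(gpus: list[GPU], required_vram: int,
          supports_distributed: bool, supports_cpu: bool):
    # 1. Single worker, single GPU (full fit).
    for g in gpus:
        if g.free_vram >= required_vram:
            return ("single-gpu", [g])
    # 2. Single worker, multiple GPUs (full fit).
    by_worker: dict[str, list[GPU]] = {}
    for g in gpus:
        by_worker.setdefault(g.worker, []).append(g)
    for worker_gpus in by_worker.values():
        if sum(g.free_vram for g in worker_gpus) >= required_vram:
            return ("multi-gpu", worker_gpus)
    # 3. Distributed inference across workers, if the backend allows it.
    if supports_distributed and sum(g.free_vram for g in gpus) >= required_vram:
        return ("distributed", gpus)
    # 4. CPU execution, only for backends that support it (Custom, VoxBox).
    if supports_cpu:
        return ("cpu", [])
    return None  # no valid placement; the model cannot be scheduled
</code></pre>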
<h3 id="scoring-phase">Scoring Phase</h3>
<p>The scoring phase evaluates the filtered candidates, scoring them to select the optimal deployment location. The primary strategy involved is:</p>
<p>This strategy seeks to distribute multiple model instances across different workers as evenly as possible, improving system fault tolerance and load balancing.</p>
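<p>A compact sketch of such a spread strategy: score each worker by how many instances of the model it already hosts, so workers with fewer instances score higher. The scoring formula and its scale are assumptions for illustration, not GPUStack's actual implementation.</p>
<pre><code># An illustrative spread-style score; fewer existing instances on a
# worker yields a higher score, so new instances spread out evenly.
def spread_score(instances_on_worker: int, total_instances: int) -> float:
    if total_instances == 0:
        return 100.0
    return 100.0 * (1 - instances_on_worker / total_instances)

# Example: with 3 existing instances, an empty worker scores 100.0 while
# a worker already hosting 2 of them scores ~33.3, so the empty worker wins.
</code></pre>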