
Commit 27b6d75

Deployed 8e7fe69 to 2.0 with MkDocs 1.6.0 and mike 2.1.3

1 parent 8e7fe69 commit 27b6d75

File tree

9 files changed: +246 -147 lines changed


2.0/cli-reference/start/index.html

Lines changed: 5 additions & 0 deletions
@@ -3813,6 +3813,11 @@ <h3 id="server-options">Server Options</h3>
 <td>Port to bind the gpustack server to.</td>
 </tr>
 <tr>
+<td><code>--database-port</code> value</td>
+<td><code>5432</code></td>
+<td>Port of the embedded PostgreSQL database.</td>
+</tr>
+<tr>
 <td><code>--metrics-port</code> value</td>
 <td><code>10161</code></td>
 <td>Port to expose server metrics.</td>
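For example, a launch overriding the new default might look like the following. This is a hypothetical invocation, shown via Python's subprocess purely for illustration; only the `gpustack start` command and the `--database-port` flag come from the docs themselves:

```python
import subprocess

# Start the server with the embedded PostgreSQL database bound to a
# non-default port (the documented default is 5432).
subprocess.run(["gpustack", "start", "--database-port", "5433"], check=True)
```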

2.0/overview/index.html

Lines changed: 14 additions & 12 deletions
@@ -3658,19 +3658,21 @@ <h1>Overview</h1>
 <p>GPUStack is an open-source GPU cluster manager for running AI models.</p>
 <h3 id="key-features">Key Features</h3>
 <ul>
-<li><strong>Broad GPU Compatibility:</strong> Seamlessly supports GPUs from various vendors.</li>
-<li><strong>Extensive Model Support:</strong> Supports a wide range of models including LLMs, VLMs, image models, audio models, embedding models, and rerank models.</li>
-<li><strong>Flexible Inference Backends:</strong> Flexibly integrates with multiple inference backends including vLLM, SGLang, Ascend MindIE, and vox-box.</li>
-<li><strong>Multi-Version Backend Support:</strong> Run multiple versions of inference backends concurrently to meet the diverse runtime requirements of different models.</li>
-<li><strong>Distributed Inference:</strong> Supports single-node and multi-node multi-GPU inference, including heterogeneous GPUs across vendors and runtime environments.</li>
-<li><strong>Scalable GPU Architecture:</strong> Easily scale up by adding more GPUs or nodes to your infrastructure.</li>
-<li><strong>Robust Model Stability:</strong> Ensures high availability with automatic failure recovery, multi-instance redundancy, and load balancing for inference requests.</li>
-<li><strong>Intelligent Deployment Evaluation:</strong> Automatically assess model resource requirements, backend and architecture compatibility, OS compatibility, and other deployment-related factors.</li>
-<li><strong>Automated Scheduling:</strong> Dynamically allocate models based on available resources.</li>
-<li><strong>OpenAI-Compatible APIs:</strong> Fully compatible with OpenAI’s API specifications for seamless integration.</li>
+<li><strong>High Performance:</strong> Optimized for high-throughput and low-latency inference.</li>
+<li><strong>GPU Cluster Management:</strong> Efficiently manage multiple GPU clusters across different providers, including Docker-based, Kubernetes, and cloud platforms such as DigitalOcean.</li>
+<li><strong>Broad GPU Compatibility:</strong> Seamless support for GPUs from various vendors.</li>
+<li><strong>Extensive Model Support:</strong> Supports a wide range of models, including LLMs, VLMs, image models, audio models, embedding models, and rerank models.</li>
+<li><strong>Flexible Inference Backends:</strong> Built-in support for fast inference engines such as vLLM and SGLang, with the ability to integrate custom backends.</li>
+<li><strong>Multi-Version Backend Support:</strong> Run multiple versions of inference backends concurrently to meet diverse runtime requirements.</li>
+<li><strong>Distributed Inference:</strong> Supports single-node and multi-node, multi-GPU inference, including heterogeneous GPUs across vendors and environments.</li>
+<li><strong>Scalable GPU Architecture:</strong> Easily scale by adding more GPUs, nodes, or clusters to your infrastructure.</li>
+<li><strong>Robust Model Stability:</strong> Ensures high availability through automatic failure recovery, multi-instance redundancy, and intelligent load balancing.</li>
+<li><strong>Intelligent Deployment Evaluation:</strong> Automatically assesses model resource requirements, backend and architecture compatibility, OS compatibility, and other deployment factors.</li>
+<li><strong>Automated Scheduling:</strong> Dynamically allocates models based on available resources.</li>
+<li><strong>OpenAI-Compatible APIs:</strong> Fully compatible with OpenAI API specifications for seamless integration.</li>
 <li><strong>User &amp; API Key Management:</strong> Simplified management of users and API keys.</li>
-<li><strong>Real-Time GPU Monitoring:</strong> Track GPU performance and utilization in real time.</li>
-<li><strong>Token and Rate Metrics:</strong> Monitor token usage and API request rates.</li>
+<li><strong>Real-Time GPU Monitoring:</strong> Monitor GPU performance and utilization in real time.</li>
+<li><strong>Token and Rate Metrics:</strong> Track token usage and API request rates.</li>
 </ul>
 <h2 id="supported-accelerators">Supported Accelerators</h2>
 <p>GPUStack supports a variety of General-Purpose Accelerators, including:</p>

2.0/scheduler/index.html

Lines changed: 99 additions & 7 deletions
@@ -2018,6 +2018,24 @@
 <nav class="md-nav" aria-label="Filtering Phase">
 <ul class="md-nav__list">
 
+<li class="md-nav__item">
+<a href="#cluster-matching-policy" class="md-nav__link">
+<span class="md-ellipsis">
+Cluster Matching Policy
+</span>
+</a>
+
+</li>
+
+<li class="md-nav__item">
+<a href="#gpu-matching-policy" class="md-nav__link">
+<span class="md-ellipsis">
+GPU Matching Policy
+</span>
+</a>
+
+</li>
+
 <li class="md-nav__item">
 <a href="#label-matching-policy" class="md-nav__link">
 <span class="md-ellipsis">
@@ -2034,6 +2052,15 @@
 </span>
 </a>
 
+</li>
+
+<li class="md-nav__item">
+<a href="#backend-framework-matching-policy" class="md-nav__link">
+<span class="md-ellipsis">
+Backend Framework Matching Policy
+</span>
+</a>
+
 </li>
 
 <li class="md-nav__item">
@@ -3652,6 +3679,24 @@
 <nav class="md-nav" aria-label="Filtering Phase">
 <ul class="md-nav__list">
 
+<li class="md-nav__item">
+<a href="#cluster-matching-policy" class="md-nav__link">
+<span class="md-ellipsis">
+Cluster Matching Policy
+</span>
+</a>
+
+</li>
+
+<li class="md-nav__item">
+<a href="#gpu-matching-policy" class="md-nav__link">
+<span class="md-ellipsis">
+GPU Matching Policy
+</span>
+</a>
+
+</li>
+
 <li class="md-nav__item">
 <a href="#label-matching-policy" class="md-nav__link">
 <span class="md-ellipsis">
@@ -3668,6 +3713,15 @@
 </span>
 </a>
 
+</li>
+
+<li class="md-nav__item">
+<a href="#backend-framework-matching-policy" class="md-nav__link">
+<span class="md-ellipsis">
+Backend Framework Matching Policy
+</span>
+</a>
+
 </li>
 
 <li class="md-nav__item">
@@ -3751,23 +3805,61 @@ <h2 id="scheduling-process">Scheduling Process</h2>
 <h3 id="filtering-phase">Filtering Phase</h3>
 <p>The filtering phase aims to narrow down the available workers or GPUs to those that meet specific criteria. The main policies involved are:</p>
 <ul>
+<li>Cluster Matching Policy</li>
+<li>GPU Matching Policy</li>
 <li>Label Matching Policy</li>
 <li>Status Policy</li>
 <li>Resource Fit Policy</li>
 </ul>
+<h4 id="cluster-matching-policy">Cluster Matching Policy</h4>
+<p>This policy filters workers based on the cluster configuration of the model. Only those workers that belong to the specified cluster are retained for further evaluation.</p>
+<h4 id="gpu-matching-policy">GPU Matching Policy</h4>
+<p>This policy filters workers based on the user-selected GPUs. Only workers that include the selected GPUs are retained for further evaluation.</p>
 <h4 id="label-matching-policy">Label Matching Policy</h4>
 <p>This policy filters workers based on the label selectors configured for the model. If no label selectors are defined for the model, all workers are considered. Otherwise, the system checks whether the labels of each worker node match the model's label selectors, retaining only those workers that match.</p>
 <h4 id="status-policy">Status Policy</h4>
 <p>This policy filters workers based on their status, retaining only those that are in a READY state.</p>
+<h4 id="backend-framework-matching-policy">Backend Framework Matching Policy</h4>
+<p>This policy filters workers based on the backend framework required by the model (e.g., vLLM, SGLang). Only those workers with GPUs that support the specified backend framework are retained for further evaluation.</p>
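Taken together, the five policies above behave like successive predicate filters over the worker list. The following is a minimal, hypothetical Python sketch of such a chain; the `Worker` and `ModelSpec` shapes and every field name are illustrative assumptions, not GPUStack's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    cluster: str
    gpu_ids: set[str]
    labels: dict[str, str]
    status: str                   # e.g. "READY"
    backend_frameworks: set[str]  # frameworks the worker's GPUs can run

@dataclass
class ModelSpec:
    cluster: str
    selected_gpu_ids: set[str]    # user-selected GPUs; empty means "any"
    label_selectors: dict[str, str]
    backend_framework: str        # e.g. "vLLM", "SGLang"

def filter_workers(workers: list[Worker], model: ModelSpec) -> list[Worker]:
    """Apply each matching policy in turn, keeping only workers that pass."""
    # Cluster Matching: keep workers belonging to the model's cluster.
    workers = [w for w in workers if w.cluster == model.cluster]
    # GPU Matching: keep workers that include the user-selected GPUs.
    if model.selected_gpu_ids:
        workers = [w for w in workers if w.gpu_ids & model.selected_gpu_ids]
    # Label Matching: skipped entirely if the model defines no selectors.
    if model.label_selectors:
        workers = [w for w in workers
                   if all(w.labels.get(k) == v
                          for k, v in model.label_selectors.items())]
    # Status: only READY workers survive.
    workers = [w for w in workers if w.status == "READY"]
    # Backend Framework Matching: GPUs must support the required backend.
    workers = [w for w in workers
               if model.backend_framework in w.backend_frameworks]
    return workers
```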
 <h4 id="resource-fit-policy">Resource Fit Policy</h4>
-<p>The Resource Fit Policy is a critical strategy in the scheduling system, used to filter workers or GPUs based on resource compatibility. The goal of this policy is to ensure that model instances can run on the selected nodes without exceeding resource limits. The Resource Fit Policy prioritizes candidates in the following order:</p>
+<p>The Resource Fit Policy is a critical strategy in the scheduling system, used to filter workers or GPUs based on resource compatibility. The goal of this policy is to ensure that model instances can run on the selected workers.</p>
+<p>Resource requirements are determined based on:</p>
 <ul>
-<li>Single Worker Node, Single GPU Full Offload: Identifies candidates where a single GPU on a single worker can fully offload the model, which usually offers the best performance.</li>
-<li>Single Worker Node, Multiple GPU Full Offload: Identifies candidates where multiple GPUs on a single worker can fully the offload the model.</li>
-<li>Distributed Inference Across Multiple Workers: Identifies candidates where a combination of GPUs across multiple workers can handle full or partial offloading, used only when distributed inference across nodes is permitted.</li>
-<li>Single Worker Node Partial Offload: Identifies candidates on a single worker that can handle a partial offload, used only when partial offloading is allowed.</li>
-<li>Single Worker Node, CPU: When no GPUs are available, the system will use the CPU for inference, identifying candidates where memory resources on a single worker are sufficient.</li>
+<li>
+<p>For GGUF models: Uses the <a href="https://github.com/gpustack/gguf-parser-go">GGUF parser</a> to estimate the model's resource requirements.</p>
+</li>
+<li>
+<p>For other model types: Estimated by the backend (e.g., vLLM, SGLang, MindIE, VoxBox).</p>
+</li>
 </ul>
+<p>Backends have different capabilities:</p>
+<ul>
+<li>
+<p>vLLM, SGLang, MindIE: GPU-only, no CPU or partial offload.</p>
+</li>
+<li>
+<p>Custom backends, VoxBox: Support GPU offload or CPU execution.</p>
+</li>
+</ul>
+<p>Candidates are evaluated in the following order, and the process stops once the first valid placement is found:</p>
+<ol>
+<li>
+<p>Single Worker, Single GPU (Full Fit)
+A single GPU fully satisfies the model’s requirements.</p>
+</li>
+<li>
+<p>Single Worker, Multiple GPUs (Full Fit)
+Multiple GPUs on the same worker jointly satisfy the requirements.</p>
+</li>
+<li>
+<p>Distributed Inference (Across Workers)
+GPUs across multiple workers can be used when the backend supports distributed execution.</p>
+</li>
+<li>
+<p>Single Worker, CPU Execution
+CPU-only execution, supported only for Custom and VoxBox backends.</p>
+</li>
+</ol>
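The ordered fallback above amounts to a short-circuiting search over placement shapes. Below is a rough, assumption-laden Python sketch of that control flow; the `gpus` (free VRAM per GPU) and `free_ram` attributes are illustrative stand-ins for the real resource estimation:

```python
# Backends that can fall back to CPU execution, per the capability
# notes above (vLLM, SGLang, and MindIE are GPU-only).
CPU_CAPABLE_BACKENDS = {"Custom", "VoxBox"}

def find_placement(workers, backend, required_vram, required_ram,
                   distributed_ok=True):
    """Try placement shapes in priority order; stop at the first fit.

    Workers are assumed to expose `gpus` (list of free VRAM per GPU)
    and `free_ram` -- illustrative attributes only.
    """
    # 1. Single worker, single GPU (full fit).
    for w in workers:
        if any(vram >= required_vram for vram in w.gpus):
            return ("single-gpu", [w])
    # 2. Single worker, multiple GPUs (full fit).
    for w in workers:
        if len(w.gpus) > 1 and sum(w.gpus) >= required_vram:
            return ("multi-gpu", [w])
    # 3. Distributed inference across workers, if the backend allows it.
    if distributed_ok and sum(sum(w.gpus) for w in workers) >= required_vram:
        return ("distributed", list(workers))
    # 4. Single worker, CPU execution (CPU-capable backends only).
    if backend in CPU_CAPABLE_BACKENDS:
        for w in workers:
            if w.free_ram >= required_ram:
                return ("cpu", [w])
    return None  # no valid placement found
```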
 <h3 id="scoring-phase">Scoring Phase</h3>
 <p>The scoring phase evaluates the filtered candidates, scoring them to select the optimal deployment location. The primary strategy involved is:</p>
 <ul>
@@ -3781,7 +3873,7 @@ <h4 id="placement-strategy-policy">Placement Strategy Policy</h4>
 <ul>
 <li>Spread</li>
 </ul>
-<p>This strategy seeks to distribute multiple model instances across different worker nodes as evenly as possible, improving system fault tolerance and load balancing.</p>
+<p>This strategy seeks to distribute multiple model instances across different workers as evenly as possible, improving system fault tolerance and load balancing.</p>
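A toy approximation of Spread: score each candidate by how many instances of the same model it already hosts and pick the least loaded. The worker objects with a `name` attribute and the count map are assumptions for illustration; the real scoring is more involved:

```python
from collections import Counter

def pick_worker_spread(candidates, instance_counts):
    """Return the candidate currently hosting the fewest instances of
    this model, so repeated deployments spread across workers.

    `instance_counts` maps worker name -> current instance count;
    workers absent from the map count as zero.
    """
    counts = Counter(instance_counts)
    return min(candidates, key=lambda w: counts[w.name])
```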
2.0/search/search_index.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.
