2.0/overview/index.html

<h1>Overview</h1>
<p>GPUStack is an open-source GPU cluster manager for running AI models.</p>
<h3 id="key-features">Key Features</h3>
<ul>
<li><strong>High Performance:</strong> Optimized for high-throughput and low-latency inference.</li>
<li><strong>GPU Cluster Management:</strong> Efficiently manage multiple GPU clusters across different providers, including Docker-based and Kubernetes environments as well as cloud platforms such as DigitalOcean.</li>
<li><strong>Broad GPU Compatibility:</strong> Seamless support for GPUs from various vendors.</li>
<li><strong>Extensive Model Support:</strong> Supports a wide range of models, including LLMs, VLMs, image models, audio models, embedding models, and rerank models.</li>
<li><strong>Flexible Inference Backends:</strong> Built-in support for fast inference engines such as vLLM and SGLang, with the ability to integrate custom backends.</li>
<li><strong>Multi-Version Backend Support:</strong> Run multiple versions of inference backends concurrently to meet diverse runtime requirements.</li>
<li><strong>Distributed Inference:</strong> Supports single-node and multi-node, multi-GPU inference, including heterogeneous GPUs across vendors and environments.</li>
<li><strong>Scalable GPU Architecture:</strong> Easily scale by adding more GPUs, nodes, or clusters to your infrastructure.</li>
<li><strong>Robust Model Stability:</strong> Ensures high availability through automatic failure recovery, multi-instance redundancy, and intelligent load balancing.</li>
<li><strong>Intelligent Deployment Evaluation:</strong> Automatically assesses model resource requirements, backend and architecture compatibility, OS compatibility, and other deployment factors.</li>
<li><strong>Automated Scheduling:</strong> Dynamically allocates models based on available resources.</li>
<li><strong>OpenAI-Compatible APIs:</strong> Fully compatible with OpenAI API specifications for seamless integration (see the example after this list).</li>
<li><strong>User & API Key Management:</strong> Simplified management of users and API keys.</li>
<li><strong>Real-Time GPU Monitoring:</strong> Monitor GPU performance and utilization in real time.</li>
<li><strong>Token and Rate Metrics:</strong> Track token usage and API request rates.</li>
</ul>
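<p>As a quick illustration of the OpenAI-compatible APIs, the sketch below calls a GPUStack deployment with the official OpenAI Python client. The server address, API key, and model name are placeholder assumptions, and the exact base path may vary by GPUStack version.</p>
<pre><code># A minimal sketch, assuming a GPUStack server that exposes
# OpenAI-compatible endpoints. base_url, api_key, and model are
# illustrative placeholders for your own installation.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1",  # hypothetical endpoint path
    api_key="your-api-key",                     # an API key created in GPUStack
)

response = client.chat.completions.create(
    model="your-deployed-model",  # any model deployed on the cluster
    messages=[{"role": "user", "content": "Hello, GPUStack!"}],
)
print(response.choices[0].message.content)
</code></pre>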
<p>This policy filters workers based on the cluster configuration of the model. Only those workers that belong to the specified cluster are retained for further evaluation.</p>
<p>This policy filters workers based on the label selectors configured for the model. If no label selectors are defined for the model, all workers are considered. Otherwise, the system checks whether the labels of each worker node match the model's label selectors, retaining only those workers that match.</p>
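<p>To make the filtering concrete, here is a small sketch of how the cluster and label policies described above could be expressed. The data shapes and function names are illustrative assumptions, not GPUStack's actual internals: each policy takes the candidate workers and returns the subset that passes.</p>
<pre><code># A minimal sketch of the cluster and label filter policies; Worker and
# all field names are hypothetical, mirroring only the behavior in the text.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cluster: str
    labels: dict[str, str] = field(default_factory=dict)

def filter_by_cluster(workers: list[Worker], model_cluster: str) -> list[Worker]:
    # Keep only workers that belong to the model's configured cluster.
    return [w for w in workers if w.cluster == model_cluster]

def filter_by_labels(workers: list[Worker], selectors: dict[str, str]) -> list[Worker]:
    # With no selectors defined, all workers are considered; otherwise a
    # worker matches only if it carries every selector key/value pair.
    if not selectors:
        return workers
    return [w for w in workers
            if all(w.labels.get(k) == v for k, v in selectors.items())]
</code></pre>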
<h4 id="status-policy">Status Policy</h4>
<p>This policy filters workers based on their status, retaining only those that are in a READY state.</p>
<p>This policy filters workers based on the backend framework required by the model (e.g., vLLM, SGLang). Only those workers with GPUs that support the specified backend framework are retained for further evaluation.</p>
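<p>These two policies admit an equally small sketch, again with hypothetical names: a status table mapping worker names to states, and a table of which backend frameworks each worker's GPUs support.</p>
<pre><code># Hypothetical sketches of the status and backend-framework filters.
def filter_by_status(workers: list[str], statuses: dict[str, str]) -> list[str]:
    # Retain only workers reported as READY.
    return [w for w in workers if statuses.get(w) == "READY"]

def filter_by_backend(workers: list[str],
                      gpu_backends: dict[str, set[str]],
                      required: str) -> list[str]:
    # Retain only workers with at least one GPU supporting the backend
    # framework the model requires (e.g. "vllm", "sglang").
    return [w for w in workers if required in gpu_backends.get(w, set())]
</code></pre>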
<h4 id="resource-fit-policy">Resource Fit Policy</h4>
<p>The Resource Fit Policy is a critical strategy in the scheduling system, used to filter workers or GPUs based on resource compatibility. Its goal is to ensure that model instances can run on the selected workers: the policy first estimates the model's resource requirements and then evaluates candidate placements in a fixed priority order.</p>
<p>Resource requirements are determined based on:</p>
<ul>
<li>
<p>For GGUF models: Uses the <a href="https://github.com/gpustack/gguf-parser-go">GGUF parser</a> to estimate the model's resource requirements.</p>
</li>
<li>
<p>For other model types: Estimated by the backend (e.g., vLLM, SGLang, MindIE, VoxBox).</p>
</li>
</ul>
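<p>As a rough illustration of what such an estimate involves, the sketch below adds up the main contributors to a model's memory footprint. The formula and every parameter name are simplified assumptions; real estimators such as the GGUF parser read quantization, architecture, and context settings from model metadata and account for many more factors.</p>
<pre><code># A deliberately simplified VRAM estimate, not the gguf-parser-go logic.
def estimate_vram_bytes(
    n_params: float,                  # total parameter count
    bytes_per_param: float,           # e.g. ~0.5 for 4-bit quantization
    n_layers: int,                    # transformer layer count
    kv_bytes_per_token_per_layer: int,
    context_len: int,                 # maximum context length
    overhead_bytes: int,              # runtime buffers, activations, etc.
) -> float:
    weights = n_params * bytes_per_param
    kv_cache = n_layers * kv_bytes_per_token_per_layer * context_len
    return weights + kv_cache + overhead_bytes
</code></pre>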
<p>Backends have different capabilities:</p>
<ul>
<li>
<p>vLLM, SGLang, MindIE: GPU-only; no CPU or partial offload.</p>
</li>
<li>
<p>Custom backends, VoxBox: Support GPU offload or CPU execution.</p>
</li>
</ul>
<p>Candidates are evaluated in the following order, and the process stops once the first valid placement is found:</p>
<ol>
<li>
<p>Single Worker, Single GPU (Full Fit)
A single GPU fully satisfies the model’s requirements.</p>
</li>
<li>
<p>Single Worker, Multiple GPUs (Full Fit)
Multiple GPUs on the same worker jointly satisfy the requirements.</p>
</li>
<li>
<p>Distributed Inference (Across Workers)
GPUs across multiple workers can be used when the backend supports distributed execution.</p>
</li>
<li>
<p>Single Worker, CPU Execution
CPU-only execution, supported only for Custom and VoxBox backends.</p>
</li>
</ol>
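<p>Putting the pieces together, the following sketch shows one way this first-fit priority order could be implemented. The GPU record, the placement labels, and the capability flags are illustrative assumptions based only on the description above.</p>
<pre><code># A minimal sketch of the first-fit candidate ordering described above.
from dataclasses import dataclass

@dataclass
class GPU:
    worker: str
    free_vram: int

def place(gpus: list[GPU], required_vram: int,
          supports_distributed: bool, supports_cpu: bool):
    # 1. Single worker, single GPU (full fit).
    for g in gpus:
        if g.free_vram >= required_vram:
            return ("single-gpu", [g])
    # 2. Single worker, multiple GPUs (full fit).
    by_worker: dict[str, list[GPU]] = {}
    for g in gpus:
        by_worker.setdefault(g.worker, []).append(g)
    for worker_gpus in by_worker.values():
        if sum(g.free_vram for g in worker_gpus) >= required_vram:
            return ("multi-gpu", worker_gpus)
    # 3. Distributed inference across workers, if the backend allows it.
    if supports_distributed and sum(g.free_vram for g in gpus) >= required_vram:
        return ("distributed", gpus)
    # 4. CPU execution, only for backends that support it (Custom, VoxBox).
    if supports_cpu:
        return ("cpu", [])
    return None  # no valid placement; the model cannot be scheduled
</code></pre>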
<h3 id="scoring-phase">Scoring Phase</h3>
<p>The scoring phase evaluates the filtered candidates, scoring them to select the optimal deployment location. The primary strategy involved is:</p>
<p>This strategy seeks to distribute multiple model instances across different workers as evenly as possible, improving system fault tolerance and load balancing.</p>
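<p>A compact sketch of such a spread strategy: score each worker by how many instances of the model it already hosts, so workers with fewer instances score higher. The scoring formula and its scale are assumptions for illustration, not GPUStack's actual implementation.</p>
<pre><code># An illustrative spread-style score; fewer existing instances on a
# worker yields a higher score, so new instances spread out evenly.
def spread_score(instances_on_worker: int, total_instances: int) -> float:
    if total_instances == 0:
        return 100.0
    return 100.0 * (1 - instances_on_worker / total_instances)

# Example: with 3 existing instances, an empty worker scores 100.0 while
# a worker already hosting 2 of them scores ~33.3, so the empty worker wins.
</code></pre>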