doc/source/user_guide/launch.rst

Xinference automatically load-balances requests to ensure even distribution across multiple GPUs.
Meanwhile, users see it as a single model, which greatly improves overall resource utilization.
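
The load balancing above can be pictured as a simple round-robin over replicas. This is a toy sketch of the idea only, not Xinference's actual dispatch logic; the replica names are made up for illustration:

```python
from itertools import cycle

# Two replicas of the same model, addressed as one logical endpoint.
replicas = cycle(["replica-0", "replica-1"])

def route(request: str) -> tuple[str, str]:
    # Each incoming request is handed to the next replica in turn,
    # spreading load evenly across the copies.
    return request, next(replicas)

[route(f"req-{i}") for i in range(4)]
# [('req-0', 'replica-0'), ('req-1', 'replica-1'),
#  ('req-2', 'replica-0'), ('req-3', 'replica-1')]
```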

Traditional Multi-Instance Deployment:

When you have multiple GPU cards, each capable of hosting one model instance, you can set the number of instances equal to the number of GPUs. For example:

- 2 GPUs, 2 instances: Each GPU runs one model instance
- 4 GPUs, 4 instances: Each GPU runs one model instance
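
This one-to-one mapping can be written down as a trivial placement table. A toy sketch only; Xinference handles this assignment internally:

```python
def one_instance_per_gpu(n_gpus: int) -> dict[int, int]:
    # Traditional deployment: instance i is pinned to GPU i,
    # so the instance count always equals the GPU count.
    return {instance: instance for instance in range(n_gpus)}

one_instance_per_gpu(2)   # {0: 0, 1: 1}
one_instance_per_gpu(4)   # {0: 0, 1: 1, 2: 2, 3: 3}
```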

.. versionadded:: v1.11.1

Introduces a new environment variable:

.. code-block:: bash

   XINFERENCE_ENABLE_SINGLE_GPU_MULTI_REPLICA

Controls whether the single-GPU multi-replica feature is enabled. Default value: ``1`` (enabled).
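
For example, to turn the feature off for a session, set the variable before the Xinference server process starts (shown here from Python; exporting it in the shell works the same way):

```python
import os

# Set to "0" to disable single-GPU multi-replica; it defaults to "1" (enabled).
# This must happen before the Xinference server process starts.
os.environ["XINFERENCE_ENABLE_SINGLE_GPU_MULTI_REPLICA"] = "0"
```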

New Feature: Smart Replica Deployment

1. Single GPU Multi-Replica

New Support: Run multiple model replicas even with just one GPU.

- Scenario: You have 1 GPU with sufficient VRAM
- Configuration: Replica Count = 3, GPU Count = 1
- Result: 3 model instances running on the same GPU, sharing GPU resources
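
The "sufficient VRAM" condition amounts to a back-of-the-envelope check: the combined footprint of all replicas must fit on the one GPU. The GPU and model sizes below are hypothetical, chosen only for illustration:

```python
def replicas_fit(gpu_vram_gb: float, model_vram_gb: float, n_replicas: int) -> bool:
    # All replicas share the single GPU, so their combined memory
    # footprint must stay within its VRAM.
    return model_vram_gb * n_replicas <= gpu_vram_gb

replicas_fit(24, 7, 3)   # True: 3 x 7 GB = 21 GB fits in 24 GB
replicas_fit(24, 7, 4)   # False: 28 GB exceeds 24 GB
```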

2. Hybrid GPU Allocation

Smart Allocation: The number of replicas may differ from the GPU count; the system intelligently distributes the replicas across the available GPUs.
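
One simple way to picture such hybrid allocation is round-robin placement, which keeps per-GPU replica counts within one of each other. This is a toy illustration of the idea, not Xinference's actual scheduler:

```python
def distribute_replicas(n_replicas: int, n_gpus: int) -> dict[int, list[int]]:
    # Round-robin placement: replica r lands on GPU r % n_gpus, so the
    # replica counts per GPU differ by at most one.
    placement: dict[int, list[int]] = {gpu: [] for gpu in range(n_gpus)}
    for r in range(n_replicas):
        placement[r % n_gpus].append(r)
    return placement

distribute_replicas(3, 2)   # {0: [0, 2], 1: [1]}  (replicas != GPUs)
distribute_replicas(3, 1)   # {0: [0, 1, 2]}       (single GPU, 3 replicas)
```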