
Commit 5f611c7

Kyrie336 and Lei Guo authored
Support Metax SGPU to sharing GPU (#895)
Signed-off-by: Lei Guo <[email protected]>
Co-authored-by: Lei Guo <[email protected]>
1 parent 7114445 commit 5f611c7

21 files changed: +1031 −28 lines

charts/hami/templates/scheduler/configmap.yaml

Lines changed: 12 additions & 0 deletions
```diff
@@ -79,6 +79,18 @@ data:
       {
         "name": "{{ .Values.iluvatarResourceName }}",
         "ignoredByScheduler": true
+      },
+      {
+        "name": "{{ .Values.metaxResourceName }}",
+        "ignoredByScheduler": true
+      },
+      {
+        "name": "{{ .Values.metaxResourceCore }}",
+        "ignoredByScheduler": true
+      },
+      {
+        "name": "{{ .Values.metaxResourceMem }}",
+        "ignoredByScheduler": true
       }
     ],
     "ignoreable": false
```

charts/hami/templates/scheduler/configmapnew.yaml

Lines changed: 6 additions & 0 deletions
```diff
@@ -49,6 +49,12 @@ data:
         ignoredByScheduler: true
       - name: {{ .Values.iluvatarResourceName }}
         ignoredByScheduler: true
+      - name: {{ .Values.metaxResourceName }}
+        ignoredByScheduler: true
+      - name: {{ .Values.metaxResourceCore }}
+        ignoredByScheduler: true
+      - name: {{ .Values.metaxResourceMem }}
+        ignoredByScheduler: true
       {{- if .Values.devices.ascend.enabled }}
       {{- range .Values.devices.ascend.customresources }}
       - name: {{ . }}
```

charts/hami/templates/scheduler/device-configmap.yaml

Lines changed: 4 additions & 0 deletions
```diff
@@ -90,6 +90,10 @@ data:
       resourceCoreName: {{ .Values.dcuResourceCores }}
     metax:
       resourceCountName: "metax-tech.com/gpu"
+
+      resourceVCountName: {{ .Values.metaxResourceName }}
+      resourceVMemoryName: {{ .Values.metaxResourceMem }}
+      resourceVCoreName: {{ .Values.metaxResourceCore }}
     mthreads:
       resourceCountName: "mthreads.com/vgpu"
       resourceMemoryName: "mthreads.com/sgpu-memory"
```
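
Rendered with the chart's default values, the `metax` entry of the device ConfigMap would come out roughly as follows (a sketch assuming the defaults added to values.yaml in this commit; deployments may override these names):

```yaml
# Sketch of the rendered metax section, assuming default chart values.
metax:
  resourceCountName: "metax-tech.com/gpu"      # whole-GPU requests
  resourceVCountName: metax-tech.com/sgpu      # shared-GPU count
  resourceVMemoryName: metax-tech.com/vmemory  # per-GPU device memory (GiB)
  resourceVCoreName: metax-tech.com/vcore      # per-GPU compute percentage
```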

charts/hami/values.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -27,6 +27,11 @@ iluvatarResourceName: "iluvatar.ai/vgpu"
 iluvatarResourceMem: "iluvatar.ai/vcuda-memory"
 iluvatarResourceCore: "iluvatar.ai/vcuda-core"
 
+# Metax SGPU parameters
+metaxResourceName: "metax-tech.com/sgpu"
+metaxResourceCore: "metax-tech.com/vcore"
+metaxResourceMem: "metax-tech.com/vmemory"
+
 schedulerName: "hami-scheduler"
 
 podSecurityPolicy:
```
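
These values can be overridden at install time if a cluster's Metax device plugin advertises different resource names; a minimal sketch of an override file (the file name `my-values.yaml` is illustrative, and the names shown are simply the chart defaults):

```yaml
# my-values.yaml — hypothetical override file; the names must match the
# extended resources advertised by the Metax device plugin on your nodes.
metaxResourceName: "metax-tech.com/sgpu"
metaxResourceCore: "metax-tech.com/vcore"
metaxResourceMem: "metax-tech.com/vmemory"
```

It would be passed to Helm in the usual way, e.g. `helm install hami charts/hami -f my-values.yaml`.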

docs/metax-support.md

Lines changed: 60 additions & 10 deletions
````diff
@@ -1,6 +1,58 @@
 ## Introduction
 
-**We now support metax.com/gpu by implementing topo-awareness among metax GPUs**:
+We support metax.com/gpu as follows:
+
+- Most of the device-sharing features available for NVIDIA GPUs
+- Topology-aware scheduling among Metax GPUs
+
+## Device sharing
+
+The device-sharing features include the following:
+
+***GPU sharing***: Each task can allocate a portion of a GPU instead of a whole card, so a GPU can be shared among multiple tasks.
+
+***Device memory control***: GPUs can be allocated a specific amount of device memory, and the task is guaranteed not to exceed that boundary.
+
+***Device compute core limitation***: GPUs can be allocated a percentage of their compute cores (60 means the container uses 60% of the device's compute cores).
+
+### Prerequisites
+
+* Metax Driver >= 2.31.0
+* Metax GPU Operator >= 0.10.1
+* Kubernetes >= 1.23
+
+### Enabling GPU-sharing support
+
+* Deploy the Metax GPU Operator on Metax nodes (please consult your device provider to acquire its package and documentation)
+
+* Deploy HAMi according to README.md
+
+### Running Metax jobs
+
+Metax GPUs can now be requested by a container using the `metax-tech.com/sgpu` resource type:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-pod1
+spec:
+  containers:
+    - name: ubuntu-container
+      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
+      imagePullPolicy: IfNotPresent
+      command: ["sleep","infinity"]
+      resources:
+        limits:
+          metax-tech.com/sgpu: 1 # requesting 1 GPU
+          metax-tech.com/vcore: 60 # each GPU uses 60% of its compute cores
+          metax-tech.com/vmemory: 4 # each GPU requires 4 GiB of device memory
+```
+
+> **NOTICE1:** *You can find more examples in the [examples/metax folder](../examples/metax/sgpu)*
+
+## Topology-aware scheduling
 
 When multiple GPUs are installed on a single server, the cards have a near-far relationship depending on whether they are attached to the same PCIe Switch or connected via MetaXLink. This forms a topology among all the cards on the server, as shown in the following figure:
@@ -21,29 +73,29 @@ Equipped with MetaXLink interconnected resources.
 
 ![img](../imgs/metax_binpack.png)
 
-## Important Notes
+### Important notes
 
 1. Device sharing is not supported yet.
 
 2. These features are tested on MXC500.
 
-## Prerequisites
+### Prerequisites
 
 * Metax GPU extensions >= 0.8.0
 * Kubernetes >= 1.23
 
-## Enabling topo-awareness scheduling
+### Enabling topology-aware scheduling
 
 * Deploy Metax GPU Extensions on Metax nodes (please consult your device provider to acquire its package and documentation)
 
 * Deploy HAMi according to README.md
 
-## Running Metax jobs
+### Running Metax jobs
 
-Mthreads GPUs can now be requested by a container
+Metax GPUs can now be requested by a container
 using the `metax-tech.com/gpu` resource type:
 
-```
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -60,6 +112,4 @@ spec:
       metax-tech.com/gpu: 1 # requesting 1 vGPU
 ```
 
-> **NOTICE2:** *You can find more examples in [examples/metax folder](../examples/metax/)*
-
-
+> **NOTICE2:** *You can find more examples in the [examples/metax folder](../examples/metax/gpu)*
````
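
Because `vcore` and `vmemory` are per-GPU limits, several pods can land on the same physical card as long as their combined requests fit within its capacity. A sketch of a second, smaller pod that could share a card with the `gpu-pod1` example in the doc above (the pod and container names are illustrative, and actual placement is up to the scheduler):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2   # illustrative name
spec:
  containers:
    - name: worker  # illustrative name
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      command: ["sleep", "infinity"]
      resources:
        limits:
          metax-tech.com/sgpu: 1     # one shared GPU
          metax-tech.com/vcore: 30   # 30% of the card's compute cores
          metax-tech.com/vmemory: 8  # 8 GiB of device memory
```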

docs/metax-support_cn.md

Lines changed: 57 additions & 9 deletions
````diff
@@ -1,6 +1,56 @@
 ## Introduction
 
-**We support optimized, topology-aware scheduling of Metax devices**:
+We support Metax devices as follows:
+
+- Sharing Metax GPU devices, with vGPU-like sharing features
+- Topology-aware scheduling optimization for Metax devices
+
+## Sharing Metax GPU devices
+
+The sharing features include the following:
+
+***GPU sharing***: Each task can occupy only part of a GPU, so multiple tasks can share one card.
+
+***Configurable device memory limit***: You can now allocate GPUs by device memory size (e.g. 4G), and the component ensures a task never uses more memory than allocated.
+
+***Configurable compute core limit***: You can now allocate GPUs by compute ratio (e.g. 60 means 60% of the cores), and the component ensures a task never exceeds that share.
+
+### Prerequisites
+
+* Metax Driver >= 2.31.0
+* Metax GPU Operator >= 0.10.1
+* Kubernetes >= 1.23
+
+### Enabling Metax device sharing
+
+* Deploy the Metax GPU Operator (please contact your device provider to obtain it)
+* Deploy HAMi according to README.md
+
+### Running Metax jobs
+
+A typical Metax job looks like this:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-pod1
+spec:
+  containers:
+    - name: ubuntu-container
+      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
+      imagePullPolicy: IfNotPresent
+      command: ["sleep","infinity"]
+      resources:
+        limits:
+          metax-tech.com/sgpu: 1 # requesting 1 GPU
+          metax-tech.com/vcore: 60 # each GPU uses 60% of its compute cores
+          metax-tech.com/vmemory: 4 # each GPU requires 4 GiB of device memory
+```
+
+> **NOTICE1:** *You can find more examples in the [examples/metax folder](../examples/metax/sgpu)*
+
+## Topology-aware scheduling optimization for Metax devices
 
 When multiple GPUs are installed on a single server, the cards have a near (higher-bandwidth) or far relationship depending on whether they are attached to the same PCIe Switch or connected via MetaXLink. All the cards on the server form a topology accordingly, as shown in the figure below.
@@ -23,28 +73,28 @@
 
 ![img](../imgs/metax_binpack.png)
 
-## Notes:
+### Notes:
 
 1. Slicing of Metax devices is not supported yet; only whole cards can be requested.
 
 2. This feature is tested on MXC500.
 
-## Prerequisites
+### Prerequisites
 
 * Metax GPU extensions >= 0.8.0
 * Kubernetes >= 1.23
 
-## Enabling topology-aware scheduling for Metax devices
+### Enabling topology-aware scheduling for Metax devices
 
 * Deploy Metax GPU Extensions (please contact your device provider to obtain them)
 
 * Deploy HAMi according to README.md
 
-## Running Metax jobs
+### Running Metax jobs
 
 A typical Metax job looks like this:
 
-```
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -61,6 +111,4 @@ spec:
       metax-tech.com/gpu: 1 # requesting 1 vGPU
 ```
 
-> **NOTICE2:** *You can find more examples in [examples/metax folder](../examples/metax/)*
-
-
+> **NOTICE2:** *You can find more examples in the [examples/metax folder](../examples/metax/gpu)*
````
File renamed without changes.
File renamed without changes.
File renamed without changes.
Lines changed: 13 additions & 0 deletions
```diff
@@ -0,0 +1,13 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-pod
+spec:
+  containers:
+    - name: ubuntu-container
+      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
+      imagePullPolicy: IfNotPresent
+      command: ["sleep","infinity"]
+      resources:
+        limits:
+          metax-tech.com/sgpu: 1 # requesting 1 exclusive GPU
```
