Commit 8fbec81

Improved support for Iluvatar GPUs (#1399)
Signed-off-by: 魏强 <[email protected]>
1 parent f13abdd commit 8fbec81

21 files changed, with 716 additions and 315 deletions.

charts/hami/README.md

Lines changed: 6 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -43,13 +43,6 @@ This document provides detailed descriptions of all configurable values paramete
4343
| `dcuResourceMem` | DCU memory resource name | `"hygon.com/dcumem"` |
4444
| `dcuResourceCores` | DCU core resource name | `"hygon.com/dcucores"` |
4545

46-
### Iluvatar GPU Resources
47-
| Parameter | Description | Default Value |
48-
|-----------|-------------|---------------|
49-
| `iluvatarResourceName` | GPU resource name | `"iluvatar.ai/vgpu"` |
50-
| `iluvatarResourceMem` | GPU memory resource name | `"iluvatar.ai/vcuda-memory"` |
51-
| `iluvatarResourceCore` | GPU core resource name | `"iluvatar.ai/vcuda-core"` |
52-
5346
### Metax GPU Resources
5447
| Parameter | Description | Default Value |
5548
|-----------|-------------|---------------|
@@ -231,3 +224,9 @@ This document provides detailed descriptions of all configurable values paramete
231224
| `devices.ascend.nodeSelector` | Node selector | `{"ascend": "on"}` |
232225
| `devices.ascend.tolerations` | Tolerations | `[]` |
233226
| `devices.ascend.customresources` | Custom resources | `["huawei.com/Ascend910A", "huawei.com/Ascend910A-memory", ...]` |
227+
228+
### Iluvatar
229+
| Parameter | Description | Default Value |
230+
|-----------|-------------|---------------|
231+
| `devices.iluvatar.enabled` | Whether to enable | `false` |
232+
| `devices.iluvatar.customresources` | Custom resources | `["iluvatar.ai/BI-V150-vgpu", "iluvatar.ai/BI-V150.vMem", "iluvatar.ai/BI-V150.vCore", ...]` |

charts/hami/templates/scheduler/configmap.yaml

Lines changed: 8 additions & 4 deletions
@@ -63,6 +63,14 @@ data:
6363
"ignoredByScheduler": true
6464
},
6565
{{- end }}
66+
{{- if .Values.devices.iluvatar.enabled }}
67+
{{- range .Values.devices.iluvatar.customresources }}
68+
{
69+
"name": "{{ . }}",
70+
"ignoredByScheduler": true
71+
},
72+
{{- end }}
73+
{{- end }}
6674
{
6775
"name": "{{ .Values.resourceName }}",
6876
"ignoredByScheduler": true
@@ -99,10 +107,6 @@ data:
99107
"name": "{{ .Values.dcuResourceCores }}",
100108
"ignoredByScheduler": true
101109
},
102-
{
103-
"name": "{{ .Values.iluvatarResourceName }}",
104-
"ignoredByScheduler": true
105-
},
106110
{
107111
"name": "metax-tech.com/gpu",
108112
"ignoredByScheduler": true

charts/hami/templates/scheduler/configmapnew.yaml

Lines changed: 6 additions & 2 deletions
@@ -48,8 +48,6 @@ data:
4848
ignoredByScheduler: true
4949
- name: {{ .Values.dcuResourceCores }}
5050
ignoredByScheduler: true
51-
- name: {{ .Values.iluvatarResourceName }}
52-
ignoredByScheduler: true
5351
- name: "metax-tech.com/gpu"
5452
ignoredByScheduler: true
5553
- name: {{ .Values.metaxResourceName }}
@@ -86,4 +84,10 @@ data:
8684
- name: {{ . }}
8785
ignoredByScheduler: true
8886
{{- end }}
87+
{{- if .Values.devices.iluvatar.enabled }}
88+
{{- range .Values.devices.iluvatar.customresources }}
89+
- name: {{ . }}
90+
ignoredByScheduler: true
91+
{{- end }}
92+
{{- end }}
8993
{{- end }}
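
As a rough illustration of what the added `range` block is expected to render when `devices.iluvatar.enabled` is `true` and the default `customresources` list from `values.yaml` is left unchanged, the first chip's entries would look something like the fragment below. The indentation shown is an assumption (the real output follows the template's own indentation), and the remaining nine default names follow the same pattern.

```yaml
# Sketch of the rendered scheduler config entries (not copied from a real render).
- name: iluvatar.ai/BI-V100-vgpu
  ignoredByScheduler: true
- name: iluvatar.ai/BI-V100.vCore
  ignoredByScheduler: true
- name: iluvatar.ai/BI-V100.vMem
  ignoredByScheduler: true
```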

charts/hami/templates/scheduler/deployment.yaml

Lines changed: 3 additions & 0 deletions
@@ -108,6 +108,9 @@ spec:
108108
{{- if .Values.devices.ascend.enabled }}
109109
- --enable-ascend=true
110110
{{- end }}
111+
{{- if .Values.devices.iluvatar.enabled }}
112+
- --enable-iluvatar=true
113+
{{- end }}
111114
{{- if .Values.scheduler.nodeLabelSelector }}
112115
- --node-label-selector={{- $first := true -}}
113116
{{- range $key, $value := .Values.scheduler.nodeLabelSelector -}}

charts/hami/templates/scheduler/device-configmap.yaml

Lines changed: 21 additions & 4 deletions
@@ -106,10 +106,27 @@ data:
106106
resourceCountName: "mthreads.com/vgpu"
107107
resourceMemoryName: "mthreads.com/sgpu-memory"
108108
resourceCoreName: "mthreads.com/sgpu-core"
109-
iluvatar:
110-
resourceCountName: {{ .Values.iluvatarResourceName }}
111-
resourceMemoryName: {{ .Values.iluvatarResourceMem }}
112-
resourceCoreName: {{ .Values.iluvatarResourceCore }}
109+
iluvatars:
110+
- chipName: MR-V100
111+
commonWord: MR-V100
112+
resourceCountName: iluvatar.ai/MR-V100-vgpu
113+
resourceMemoryName: iluvatar.ai/MR-V100.vMem
114+
resourceCoreName: iluvatar.ai/MR-V100.vCore
115+
- chipName: MR-V50
116+
commonWord: MR-V50
117+
resourceCountName: iluvatar.ai/MR-V50-vgpu
118+
resourceMemoryName: iluvatar.ai/MR-V50.vMem
119+
resourceCoreName: iluvatar.ai/MR-V50.vCore
120+
- chipName: BI-V150
121+
commonWord: BI-V150
122+
resourceCountName: iluvatar.ai/BI-V150-vgpu
123+
resourceMemoryName: iluvatar.ai/BI-V150.vMem
124+
resourceCoreName: iluvatar.ai/BI-V150.vCore
125+
- chipName: BI-V100
126+
commonWord: BI-V100
127+
resourceCountName: iluvatar.ai/BI-V100-vgpu
128+
resourceMemoryName: iluvatar.ai/BI-V100.vMem
129+
resourceCoreName: iluvatar.ai/BI-V100.vCore
113130
kunlun:
114131
resourceCountName: {{ .Values.kunlunResourceName }}
115132
resourceVCountName: {{ .Values.kunlunResourceVCountName }}

charts/hami/values.yaml

Lines changed: 16 additions & 5 deletions
@@ -38,11 +38,6 @@ dcuResourceName: "hygon.com/dcunum"
3838
dcuResourceMem: "hygon.com/dcumem"
3939
dcuResourceCores: "hygon.com/dcucores"
4040

41-
#Iluvatar GPU Parameters
42-
iluvatarResourceName: "iluvatar.ai/vgpu"
43-
iluvatarResourceMem: "iluvatar.ai/vcuda-memory"
44-
iluvatarResourceCore: "iluvatar.ai/vcuda-core"
45-
4641
#Metax sGPU Parameters
4742
metaxResourceName: "metax-tech.com/sgpu"
4843
metaxResourceCore: "metax-tech.com/vcore"
@@ -389,4 +384,20 @@ devices:
389384
- huawei.com/Ascend910B4-1-memory
390385
- huawei.com/Ascend310P
391386
- huawei.com/Ascend310P-memory
387+
iluvatar:
388+
enabled: false
389+
customresources:
390+
- iluvatar.ai/BI-V100-vgpu
391+
- iluvatar.ai/BI-V100.vCore
392+
- iluvatar.ai/BI-V100.vMem
393+
- iluvatar.ai/BI-V150-vgpu
394+
- iluvatar.ai/BI-V150.vCore
395+
- iluvatar.ai/BI-V150.vMem
396+
- iluvatar.ai/MR-V100-vgpu
397+
- iluvatar.ai/MR-V100.vCore
398+
- iluvatar.ai/MR-V100.vMem
399+
- iluvatar.ai/MR-V50-vgpu
400+
- iluvatar.ai/MR-V50.vCore
401+
- iluvatar.ai/MR-V50.vMem
402+
392403
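
For readers who prefer a values file over `--set`, the new `devices.iluvatar` block could be enabled with an override along these lines. The file name `values-iluvatar.yaml` is hypothetical, and this is a minimal sketch rather than a recommended configuration.

```yaml
# values-iluvatar.yaml (hypothetical file name)
devices:
  iluvatar:
    # Turns on the --enable-iluvatar=true scheduler flag and the
    # Iluvatar entries in the scheduler config templates.
    enabled: true
    # customresources is intentionally omitted so the chart defaults apply;
    # Helm replaces lists wholesale, so setting it here would drop any
    # default names that are not repeated.
```

Such a file would be passed with Helm's standard `-f` flag, for example `helm install hami hami-charts/hami -f values-iluvatar.yaml`.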

docs/iluvatar-gpu-support.md

Lines changed: 43 additions & 30 deletions
@@ -23,52 +23,67 @@
2323

2424
> **NOTICE:** *Install only gpu-manager, don't install gpu-admission package.*
2525
26-
* Identify the resource name about core and memory usage(i.e 'iluvatar.ai/vcuda-core', 'iluvatar.ai/vcuda-memory')
27-
28-
* set the 'iluvatarResourceMem' and 'iluvatarResourceCore' parameters when install hami
29-
26+
* Set `devices.iluvatar.enabled=true` when installing HAMi:
3027
```
31-
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set iluvatarResourceMem=iluvatar.ai/vcuda-memory --set iluvatarResourceCore=iluvatar.ai/vcuda-core -n kube-system
28+
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set devices.iluvatar.enabled=true
3229
```
3330

34-
> **NOTE:** The default resource names are:
35-
> - `iluvatar.ai/vgpu` for GPU count
36-
> - `iluvatar.ai/vcuda-memory` for memory allocation
37-
> - `iluvatar.ai/vcuda-core` for core allocation
38-
>
39-
> You can customize these names using the parameters above.
31+
**Note:** The currently supported GPU models and resource names are defined in [device-configmap.yaml](https://github.com/Project-HAMi/HAMi/blob/master/charts/hami/templates/scheduler/device-configmap.yaml):
32+
```yaml
33+
iluvatars:
34+
- chipName: MR-V100
35+
commonWord: MR-V100
36+
resourceCountName: iluvatar.ai/MR-V100-vgpu
37+
resourceMemoryName: iluvatar.ai/MR-V100.vMem
38+
resourceCoreName: iluvatar.ai/MR-V100.vCore
39+
- chipName: MR-V50
40+
commonWord: MR-V50
41+
resourceCountName: iluvatar.ai/MR-V50-vgpu
42+
resourceMemoryName: iluvatar.ai/MR-V50.vMem
43+
resourceCoreName: iluvatar.ai/MR-V50.vCore
44+
- chipName: BI-V150
45+
commonWord: BI-V150
46+
resourceCountName: iluvatar.ai/BI-V150-vgpu
47+
resourceMemoryName: iluvatar.ai/BI-V150.vMem
48+
resourceCoreName: iluvatar.ai/BI-V150.vCore
49+
- chipName: BI-V100
50+
commonWord: BI-V100
51+
resourceCountName: iluvatar.ai/BI-V100-vgpu
52+
resourceMemoryName: iluvatar.ai/BI-V100.vMem
53+
resourceCoreName: iluvatar.ai/BI-V100.vCore
54+
```
4055
4156
## Device Granularity
4257
4358
HAMi divides each Iluvatar GPU into 100 units for resource allocation. When you request a portion of a GPU, you're actually requesting a certain number of these units.
4459
4560
### Memory Allocation
4661
47-
- Each unit of `iluvatar.ai/vcuda-memory` represents 256MB of device memory
62+
- Each unit of `iluvatar.ai/<card-type>.vMem` represents 256MB of device memory
4863
- If you don't specify a memory request, the system will default to using 100% of the available memory
4964
- Memory allocation is enforced with hard limits to ensure tasks don't exceed their allocated memory
5065

5166
### Core Allocation
5267

53-
- Each unit of `iluvatar.ai/vcuda-core` represents 1% of the available compute cores
68+
- Each unit of `iluvatar.ai/<card-type>.vCore` represents 1% of the available compute cores
5469
- Core allocation is enforced with hard limits to ensure tasks don't exceed their allocated cores
5570
- When requesting multiple GPUs, the system will automatically set the core resources based on the number of GPUs requested
5671

5772
## Running Iluvatar jobs
5873

5974
Iluvatar GPUs can now be requested by a container
60-
using the `iluvatar.ai/vgpu`, `iluvatar.ai/vcuda-memory` and `iluvatar.ai/vcuda-core` resource type:
75+
using the `iluvatar.ai/BI-V150-vgpu`, `iluvatar.ai/BI-V150.vMem` and `iluvatar.ai/BI-V150.vCore` resource types:
6176

6277
```yaml
6378
apiVersion: v1
6479
kind: Pod
6580
metadata:
66-
name: poddemo
81+
name: bi-v150-poddemo
6782
spec:
6883
restartPolicy: Never
6984
containers:
70-
- name: poddemo
71-
image: harbor.4pd.io/vgpu/corex_transformers@sha256:36a01ec452e6ee63c7aa08bfa1fa16d469ad19cc1e6000cf120ada83e4ceec1e
85+
- name: bi-v150-poddemo
86+
image: registry.iluvatar.com.cn:10443/saas/mr-bi150-4.3.0-x86-ubuntu22.04-py3.10-base-base:v1.0
7287
command:
7388
- bash
7489
args:
@@ -82,13 +97,13 @@ spec:
8297
sleep 360000
8398
resources:
8499
requests:
85-
iluvatar.ai/vgpu: 1
86-
iluvatar.ai/vcuda-core: 50
87-
iluvatar.ai/vcuda-memory: 64
100+
iluvatar.ai/BI-V150-vgpu: 1
101+
iluvatar.ai/BI-V150.vCore: 50
102+
iluvatar.ai/BI-V150.vMem: 64
88103
limits:
89-
iluvatar.ai/vgpu: 1
90-
iluvatar.ai/vcuda-core: 50
91-
iluvatar.ai/vcuda-memory: 64
104+
iluvatar.ai/BI-V150-vgpu: 1
105+
iluvatar.ai/BI-V150.vCore: 50
106+
iluvatar.ai/BI-V150.vMem: 64
92107
```
93108

94109
> **NOTICE1:** *Each unit of `<card-type>.vMem` indicates 256MB of device memory*
@@ -106,15 +121,13 @@ metadata:
106121
name: poddemo
107122
annotations:
108123
# Use specific GPU devices (comma-separated list)
109-
iluvatar.ai/use-gpuuuid: "node1-iluvatar-0,node1-iluvatar-1"
124+
hami.io/use-<card-type>-uuid: "device-uuid-1,device-uuid-2"
110125
# Or exclude specific GPU devices (comma-separated list)
111-
iluvatar.ai/nouse-gpuuuid: "node1-iluvatar-2,node1-iluvatar-3"
126+
hami.io/no-use-<card-type>-uuid: "device-uuid-1,device-uuid-2"
112127
spec:
113128
# ... rest of pod spec
114129
```
115130

116-
> **NOTE:** The device ID format is `{node-name}-iluvatar-{index}`. You can find the available device IDs in the node status.
117-
118131
### Finding Device UUIDs
119132

120133
You can find the UUIDs of Iluvatar GPUs on a node using the following command:
@@ -126,7 +139,7 @@ kubectl get pod <pod-name> -o yaml | grep -A 10 "hami.io/<card-type>-devices-all
126139
Or by examining the node annotations:
127140

128141
```bash
129-
kubectl get node <node-name> -o yaml | grep -A 10 "hami.io/node-register-<card-type>"
142+
kubectl get node <node-name> -o yaml | grep -A 10 "hami.io/node-<card-type>-register"
130143
```
131144

132145
Look for annotations containing device information in the node status.
@@ -144,6 +157,6 @@ Look for annotations containing device information in the node status.
144157

145158
2. Virtualization takes effect only for containers that request exactly one GPU (i.e., `iluvatar.ai/<card-type>-vgpu=1`). When requesting multiple GPUs, the system automatically sets the core resources based on the number of GPUs requested.
146159

147-
3. The `iluvatar.ai/vcuda-memory` resource is only effective when `iluvatar.ai/vgpu=1`.
160+
3. The `iluvatar.ai/<card-type>.vMem` resource is only effective when `iluvatar.ai/<card-type>-vgpu=1`.
148161

149-
4. Multi-device requests (`iluvatar.ai/vgpu > 1`) do not support vGPU mode.
162+
4. Multi-device requests (`iluvatar.ai/<card-type>-vgpu > 1`) do not support vGPU mode.
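
To make the unit definitions from the Device Granularity section concrete, the fragment below annotates the BI-V150 limits used in the example pod. It is an illustrative sketch only, assuming 1 vMem unit = 256MB and 1 vCore unit = 1% as the updated doc states.

```yaml
# Illustrative annotation of the example pod's limits; values are from the doc's example.
resources:
  limits:
    iluvatar.ai/BI-V150-vgpu: 1    # exactly one GPU, so vGPU mode applies
    iluvatar.ai/BI-V150.vCore: 50  # 50 x 1% = half of the card's compute cores
    iluvatar.ai/BI-V150.vMem: 64   # 64 x 256MB = 16384MB of device memory
```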
