
Commit 5f611c7

Kyrie336 and Lei Guo authored
Support Metax SGPU to sharing GPU (#895)
Signed-off-by: Lei Guo <[email protected]>
Co-authored-by: Lei Guo <[email protected]>
1 parent 7114445 commit 5f611c7

21 files changed: +1031 −28 lines

charts/hami/templates/scheduler/configmap.yaml

Lines changed: 12 additions & 0 deletions
```diff
@@ -79,6 +79,18 @@ data:
       {
         "name": "{{ .Values.iluvatarResourceName }}",
         "ignoredByScheduler": true
+      },
+      {
+        "name": "{{ .Values.metaxResourceName }}",
+        "ignoredByScheduler": true
+      },
+      {
+        "name": "{{ .Values.metaxResourceCore }}",
+        "ignoredByScheduler": true
+      },
+      {
+        "name": "{{ .Values.metaxResourceMem }}",
+        "ignoredByScheduler": true
       }
     ],
     "ignoreable": false
```

charts/hami/templates/scheduler/configmapnew.yaml

Lines changed: 6 additions & 0 deletions
```diff
@@ -49,6 +49,12 @@ data:
         ignoredByScheduler: true
       - name: {{ .Values.iluvatarResourceName }}
         ignoredByScheduler: true
+      - name: {{ .Values.metaxResourceName }}
+        ignoredByScheduler: true
+      - name: {{ .Values.metaxResourceCore }}
+        ignoredByScheduler: true
+      - name: {{ .Values.metaxResourceMem }}
+        ignoredByScheduler: true
       {{- if .Values.devices.ascend.enabled }}
       {{- range .Values.devices.ascend.customresources }}
       - name: {{ . }}
```

charts/hami/templates/scheduler/device-configmap.yaml

Lines changed: 4 additions & 0 deletions
```diff
@@ -90,6 +90,10 @@ data:
       resourceCoreName: {{ .Values.dcuResourceCores }}
     metax:
       resourceCountName: "metax-tech.com/gpu"
+
+      resourceVCountName: {{ .Values.metaxResourceName }}
+      resourceVMemoryName: {{ .Values.metaxResourceMem }}
+      resourceVCoreName: {{ .Values.metaxResourceCore }}
     mthreads:
       resourceCountName: "mthreads.com/vgpu"
       resourceMemoryName: "mthreads.com/sgpu-memory"
```
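
Rendered with the chart's default values, the `metax` entry of the device ConfigMap would come out roughly as follows (a sketch assuming the defaults added to values.yaml in this commit; deployments may override these names):

```yaml
# Sketch of the rendered metax section, assuming default chart values.
metax:
  resourceCountName: "metax-tech.com/gpu"      # whole-GPU requests
  resourceVCountName: metax-tech.com/sgpu      # shared-GPU count
  resourceVMemoryName: metax-tech.com/vmemory  # per-GPU device memory (GiB)
  resourceVCoreName: metax-tech.com/vcore      # per-GPU compute percentage
```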

charts/hami/values.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -27,6 +27,11 @@ iluvatarResourceName: "iluvatar.ai/vgpu"
 iluvatarResourceMem: "iluvatar.ai/vcuda-memory"
 iluvatarResourceCore: "iluvatar.ai/vcuda-core"
 
+# Metax SGPU parameters
+metaxResourceName: "metax-tech.com/sgpu"
+metaxResourceCore: "metax-tech.com/vcore"
+metaxResourceMem: "metax-tech.com/vmemory"
+
 schedulerName: "hami-scheduler"
 
 podSecurityPolicy:
```
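
These values can be overridden at install time if a cluster's Metax device plugin advertises different resource names; a minimal sketch of an override file (the file name `my-values.yaml` is illustrative, and the names shown are simply the chart defaults):

```yaml
# my-values.yaml — hypothetical override file; the names must match the
# extended resources advertised by the Metax device plugin on your nodes.
metaxResourceName: "metax-tech.com/sgpu"
metaxResourceCore: "metax-tech.com/vcore"
metaxResourceMem: "metax-tech.com/vmemory"
```

It would be passed to Helm in the usual way, e.g. `helm install hami charts/hami -f my-values.yaml`.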

docs/metax-support.md

Lines changed: 60 additions & 10 deletions
````diff
@@ -1,6 +1,58 @@
 ## Introduction
 
-**We now support metax.com/gpu by implementing topo-awareness among metax GPUs**:
+We support metax.com/gpu as follows:
+
+- Most of the device-sharing features available for NVIDIA GPUs
+- Topology-aware scheduling among Metax GPUs
+
+## Device sharing
+
+The device-sharing features include the following:
+
+***GPU sharing***: Each task can allocate a portion of a GPU instead of a whole card, so a GPU can be shared among multiple tasks.
+
+***Device memory control***: GPUs can be allocated a specific amount of device memory, and the task is guaranteed not to exceed that boundary.
+
+***Device compute core limitation***: GPUs can be allocated a percentage of their compute cores (60 means the container uses 60% of the device's compute cores).
+
+### Prerequisites
+
+* Metax Driver >= 2.31.0
+* Metax GPU Operator >= 0.10.1
+* Kubernetes >= 1.23
+
+### Enabling GPU-sharing support
+
+* Deploy the Metax GPU Operator on Metax nodes (please consult your device provider to acquire its package and documentation)
+
+* Deploy HAMi according to README.md
+
+### Running Metax jobs
+
+Metax GPUs can now be requested by a container using the `metax-tech.com/sgpu` resource type:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-pod1
+spec:
+  containers:
+    - name: ubuntu-container
+      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
+      imagePullPolicy: IfNotPresent
+      command: ["sleep","infinity"]
+      resources:
+        limits:
+          metax-tech.com/sgpu: 1 # requesting 1 GPU
+          metax-tech.com/vcore: 60 # each GPU uses 60% of its compute cores
+          metax-tech.com/vmemory: 4 # each GPU requires 4 GiB of device memory
+```
+
+> **NOTICE1:** *You can find more examples in the [examples/metax folder](../examples/metax/sgpu)*
+
+## Topology-aware scheduling
 
 When multiple GPUs are installed on a single server, the cards have a near-far relationship depending on whether they are attached to the same PCIe Switch or connected via MetaXLink. This forms a topology among all the cards on the server, as shown in the following figure:
@@ -21,29 +73,29 @@ Equipped with MetaXLink interconnected resources.
 
 ![img](../imgs/metax_binpack.png)
 
-## Important Notes
+### Important notes
 
 1. Device sharing is not supported yet.
 
 2. These features are tested on MXC500.
 
-## Prerequisites
+### Prerequisites
 
 * Metax GPU extensions >= 0.8.0
 * Kubernetes >= 1.23
 
-## Enabling topo-awareness scheduling
+### Enabling topology-aware scheduling
 
 * Deploy Metax GPU Extensions on Metax nodes (please consult your device provider to acquire its package and documentation)
 
 * Deploy HAMi according to README.md
 
-## Running Metax jobs
+### Running Metax jobs
 
-Mthreads GPUs can now be requested by a container
+Metax GPUs can now be requested by a container
 using the `metax-tech.com/gpu` resource type:
 
-```
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -60,6 +112,4 @@ spec:
       metax-tech.com/gpu: 1 # requesting 1 vGPU
 ```
 
-> **NOTICE2:** *You can find more examples in [examples/metax folder](../examples/metax/)*
-
-
+> **NOTICE2:** *You can find more examples in the [examples/metax folder](../examples/metax/gpu)*
````
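
Because `vcore` and `vmemory` are per-GPU limits, several pods can land on the same physical card as long as their combined requests fit within its capacity. A sketch of a second, smaller pod that could share a card with the `gpu-pod1` example in the doc above (the pod and container names are illustrative, and actual placement is up to the scheduler):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2   # illustrative name
spec:
  containers:
    - name: worker  # illustrative name
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      command: ["sleep", "infinity"]
      resources:
        limits:
          metax-tech.com/sgpu: 1     # one shared GPU
          metax-tech.com/vcore: 30   # 30% of the card's compute cores
          metax-tech.com/vmemory: 8  # 8 GiB of device memory
```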

docs/metax-support_cn.md

Lines changed: 57 additions & 9 deletions
````diff
@@ -1,6 +1,56 @@
 ## Introduction
 
-**We support optimized, topology-aware scheduling of Metax devices**:
+We support Metax devices as follows:
+
+- Sharing Metax GPU devices, with vGPU-like sharing features
+- Topology-aware scheduling optimization for Metax devices
+
+## Sharing Metax GPU devices
+
+The sharing features include the following:
+
+***GPU sharing***: Each task can occupy only part of a GPU, so multiple tasks can share one card.
+
+***Configurable device memory limit***: You can now allocate GPUs by device memory size (e.g. 4G), and the component ensures a task never uses more memory than allocated.
+
+***Configurable compute core limit***: You can now allocate GPUs by compute ratio (e.g. 60 means 60% of the cores), and the component ensures a task never exceeds that share.
+
+### Prerequisites
+
+* Metax Driver >= 2.31.0
+* Metax GPU Operator >= 0.10.1
+* Kubernetes >= 1.23
+
+### Enabling Metax device sharing
+
+* Deploy the Metax GPU Operator (please contact your device provider to obtain it)
+* Deploy HAMi according to README.md
+
+### Running Metax jobs
+
+A typical Metax job looks like this:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-pod1
+spec:
+  containers:
+    - name: ubuntu-container
+      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
+      imagePullPolicy: IfNotPresent
+      command: ["sleep","infinity"]
+      resources:
+        limits:
+          metax-tech.com/sgpu: 1 # requesting 1 GPU
+          metax-tech.com/vcore: 60 # each GPU uses 60% of its compute cores
+          metax-tech.com/vmemory: 4 # each GPU requires 4 GiB of device memory
+```
+
+> **NOTICE1:** *You can find more examples in the [examples/metax folder](../examples/metax/sgpu)*
+
+## Topology-aware scheduling optimization for Metax devices
 
 When multiple GPUs are installed on a single server, the cards have a near (higher-bandwidth) or far relationship depending on whether they are attached to the same PCIe Switch or connected via MetaXLink. All the cards on the server form a topology accordingly, as shown in the figure below.
@@ -23,28 +73,28 @@
 
 ![img](../imgs/metax_binpack.png)
 
-## Notes:
+### Notes:
 
 1. Slicing of Metax devices is not supported yet; only whole cards can be requested.
 
 2. This feature is tested on MXC500.
 
-## Prerequisites
+### Prerequisites
 
 * Metax GPU extensions >= 0.8.0
 * Kubernetes >= 1.23
 
-## Enabling topology-aware scheduling for Metax devices
+### Enabling topology-aware scheduling for Metax devices
 
 * Deploy Metax GPU Extensions (please contact your device provider to obtain them)
 
 * Deploy HAMi according to README.md
 
-## Running Metax jobs
+### Running Metax jobs
 
 A typical Metax job looks like this:
 
-```
+```yaml
 apiVersion: v1
 kind: Pod
 metadata:
@@ -61,6 +111,4 @@ spec:
       metax-tech.com/gpu: 1 # requesting 1 vGPU
 ```
 
-> **NOTICE2:** *You can find more examples in [examples/metax folder](../examples/metax/)*
-
-
+> **NOTICE2:** *You can find more examples in the [examples/metax folder](../examples/metax/gpu)*
````
File renamed without changes.
File renamed without changes.
File renamed without changes.
Lines changed: 13 additions & 0 deletions
```diff
@@ -0,0 +1,13 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-pod
+spec:
+  containers:
+    - name: ubuntu-container
+      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
+      imagePullPolicy: IfNotPresent
+      command: ["sleep","infinity"]
+      resources:
+        limits:
+          metax-tech.com/sgpu: 1 # requesting 1 exclusive GPU
```
