@@ -250,10 +250,23 @@ limitations of the current approach for the following use cases:
containers should be able to use other free resources on the same
device.

- *Limitation*: Current implementation of the device plugin doesn’t
- allow one to allocate part of the device because parameters are too limited
- and Kubernetes doesn't have enough information about the extended
- resources on a node to decide whether they can be shared.
+ *Limitation*: For example, newer generations of NVIDIA GPUs have a mode of
+ operation called MIG that allows them to be sub-divided into a set of
+ mini-GPUs (called MIG devices) with varying amounts of memory and compute
+ resources provided by each. From a hardware standpoint, configuring a GPU
+ into a set of MIG devices is highly dynamic, and creating a MIG device
+ tailored to the resource needs of a particular application is well
+ supported. However, with the current device plugin API, the only way to make
+ use of this feature is to pre-partition a GPU into a set of MIG devices and
+ advertise them to the kubelet in the same way a full / static GPU is
+ advertised. The user must then pick from this set of pre-partitioned MIG
+ devices instead of having one created for them on the fly based on their
+ particular resource constraints. Without the ability to create MIG devices
+ dynamically (i.e., at the time they are requested), the set of pre-defined MIG
+ devices must be carefully tuned to ensure that GPU resources do not go unused
+ because some of the pre-partitioned devices are in low demand. It also puts
+ the burden on the user to pick a particular MIG device type, rather than
+ declaring the resource constraints more abstractly.

- *Optional allocation*: When deploying a workload I’d like to specify
soft(optional) device requirements. If a device exists and it’s