Closed
Labels: feature (issue/PR that proposes a new feature or functionality)
Description
I understand DRA will finally be promoted to Beta in v1.32. Thank you very much, contributors, for your hard work standardizing flexible device scheduling and implementing NVIDIA's dra-driver.
Do you have a plan to expose intra-node topology as device attributes? In particular, the distances between GPU<->GPU and GPU<->NIC or HCA (I imagine information equivalent to nvidia-smi topo -m)? Or do you plan to provide an extension point for adding user-defined device attributes in this dra-driver?
I imagine the following use cases for optimizing training performance:
- Single Node Multi GPUs:
  - a user wants to have 1 pod with 2 GPUs which are connected to each other via NVLink (NV# in nvidia-smi topo -m)
    → discussed in Support for NVLINK Aware Scheduling? #214
- Multi Node Multi GPUs:
  - a user wants to have N pods with 4 GPUs each, where each GPU has an adjacent NIC or HCA (PIX in nvidia-smi topo -m), in a specific zone (achieved by node selector)
    - probably, it needs integration with CNI and network device plugins (e.g. https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin)
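To make the two use cases concrete, here is a minimal sketch of the kind of selection I have in mind, assuming the link types reported by nvidia-smi topo -m were available as data. The device names, the `TOPOLOGY` table and the helper functions are all hypothetical and only for illustration; they are not part of the dra-driver or any existing API.

```python
from itertools import combinations

# Hypothetical link types between devices on one node, in the style of
# `nvidia-smi topo -m` (NV# = NVLink, PIX = at most a single PCIe bridge,
# SYS = traversal across sockets). Names and values are made up.
TOPOLOGY = {
    ("GPU0", "GPU1"): "NV4",
    ("GPU0", "GPU2"): "SYS",
    ("GPU1", "GPU2"): "SYS",
    ("GPU0", "NIC0"): "PIX",
    ("GPU1", "NIC0"): "PIX",
    ("GPU2", "NIC0"): "SYS",
}


def link(a, b):
    """Return the link type between two devices, regardless of argument order."""
    return TOPOLOGY.get((a, b)) or TOPOLOGY.get((b, a))


def nvlink_pairs(gpus):
    """Use case 1: GPU pairs that are connected to each other via NVLink."""
    return [
        (a, b)
        for a, b in combinations(gpus, 2)
        if (link(a, b) or "").startswith("NV")
    ]


def gpus_adjacent_to_nic(gpus, nic):
    """Use case 2: GPUs whose link to the given NIC or HCA is PIX."""
    return [g for g in gpus if link(g, nic) == "PIX"]


gpus = ["GPU0", "GPU1", "GPU2"]
print(nvlink_pairs(gpus))
print(gpus_adjacent_to_nic(gpus, "NIC0"))
```

If such link types were exposed as device attributes, I assume the same filtering could be expressed by a claim's selector rather than by user code.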
Thanks in advance.