
compute device score bug #1699

@Wang-Kai

Description

What happened:

In pkg/scheduler/policy/gpu_policy.go, the ComputeScore function computes a score for each device to determine scheduling order. However, there are two issues in how the usedScore component is calculated:

Issue 1: container.Nums is incorrectly used as slot consumption

func (ds *DeviceListsScore) ComputeScore(requests device.ContainerDeviceRequests) {
    request, core, mem := int32(0), int32(0), int32(0)
    for _, container := range requests {
        request += container.Nums  // ← BUG: Nums is the number of GPUs requested, not slot usage per card
        // ...
    }
    usedScore := float32(request+ds.Device.Used) / float32(ds.Device.Count)
    // ...
}

container.Nums represents how many GPU cards the container requests (e.g., hami.io/gpu: 4 yields Nums = 4). However, when a container is allocated to a specific card, it occupies only 1 time-slicing slot on that card (as seen in AddResourceUsage, where n.Used++).

The current code adds Nums (e.g., 4) to the used count, implying this single container would consume 4 slots on one card, which is incorrect.

Issue 2: No device type filtering when iterating over requests

requests is of type ContainerDeviceRequests (map[string]ContainerDeviceRequest), where the key is the device type (e.g., "NVIDIA", "DCU"). The function iterates over all device types without filtering:

for _, container := range requests {  // iterates over ALL device types
    request += container.Nums
    core += container.Coresreq
    mem += container.Memreq
}

When a container requests multiple device types (e.g., 2 NVIDIA GPUs + 1 Hygon DCU), the score for a single NVIDIA GPU card would incorrectly include the DCU request's Nums, Coresreq, and Memreq.

What you expected to happen:

  1. Each container should contribute at most 1 to the slot usage prediction per card (not Nums).
  2. ComputeScore should only accumulate requests that match ds.Device.Type, ignoring requests for other device types.

A corrected version might look like:

func (ds *DeviceListsScore) ComputeScore(requests device.ContainerDeviceRequests) {
    request, core, mem := int32(0), int32(0), int32(0)
    for devType, container := range requests {
        if devType != ds.Device.Type {
            continue  // only consider same-type requests
        }
        request += 1  // one container occupies one slot, regardless of Nums
        core += container.Coresreq
        if container.MemPercentagereq != 0 && container.MemPercentagereq != 101 {
            mem += ds.Device.Totalmem * container.MemPercentagereq / 100 // multiply before dividing: int32 division truncates
            continue
        }
        mem += container.Memreq
    }
    // ...
}

How to reproduce it (as minimally and precisely as possible):

  1. Deploy a pod that requests multiple device types, or requests more than 1 GPU (e.g., hami.io/gpu: 4)
  2. Observe the computed usedScore in scheduler logs (log level V(2)):
    device GPU-xxxx computer score is <value>
    
  3. The usedScore component will be inflated because request equals Nums (e.g., 4) instead of 1

Anything else we need to know?:

Practical impact is limited for single-type requests

Since the inflated request term is the same constant added to every card's score, the relative ordering among cards of the same type is still driven by differences in ds.Device.Used, so the sorting result remains correct in most cases.

However, the inflated usedScore can cause the score to exceed 1.0, which breaks the implicit normalization assumption across the three scoring dimensions (slot usage, core usage, memory usage). This may cause the usedScore to disproportionately outweigh coreScore and memScore in the final weighted sum:

ds.Score = float32(util.Weight) * (usedScore + coreScore + memScore)

For example, with Nums=4, Used=2, Count=10:

  • Current (incorrect): usedScore = (4 + 2) / 10 = 0.6
  • Expected: usedScore = (1 + 2) / 10 = 0.3

The comment // Here we are required to use the same type device also acknowledges the type-filtering assumption but does not enforce it.

Environment:

  • HAMi version: master branch (commit 2ca2ae1)
  • Affected file: pkg/scheduler/policy/gpu_policy.go, function ComputeScore (lines 59-78)

Labels: kind/bug (Something isn't working)