Avoid submitting jobs for overloaded and offline devices #1403

@nuclearcat

Description

One of the major features still missing compared with legacy KernelCI is the ability to validate lab (LAVA) device availability before job submission.
This is important to avoid job failures caused by device unavailability, which waste resources and time.
Sometimes devices are simply offline, yet jobs are still queued for them. Sometimes there is a single device in the lab and multiple jobs compete for it, so the queue grows so long that the jobs won't complete before the job timeout.
To address this, we need to implement a new feature in the LAVA runtime that fetches all device names for a specific device type in the lab and checks their availability (status, job queue size) before job submission.
We will then add safe defaults and configurable options for job submission, to avoid wasting resources on jobs that are unlikely to complete successfully.
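As a rough illustration of the check described above, the decision could look something like the sketch below. It is not the KernelCI or LAVA implementation: `DeviceStatus`, `should_submit`, and the `max_queue_per_device` default are hypothetical names and values, and the code assumes the device list and queue size have already been fetched from the lab.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DeviceStatus:
    """Hypothetical snapshot of one lab device (field names are assumptions)."""
    name: str
    online: bool  # False when the lab reports the device as offline or unhealthy


def should_submit(devices: Iterable[DeviceStatus],
                  queued_jobs: int,
                  max_queue_per_device: int = 5) -> bool:
    """Decide whether a job for this device type is worth submitting.

    devices: all devices of the requested type in the lab.
    queued_jobs: jobs already waiting for this device type.
    max_queue_per_device: assumed configurable limit, not a real default.
    """
    online = [d for d in devices if d.online]
    if not online:
        # Every device is offline: the job would only sit in the queue.
        return False
    # Skip submission when the backlog per usable device is too long
    # for the job to plausibly start before its timeout.
    return queued_jobs / len(online) <= max_queue_per_device


if __name__ == "__main__":
    lab = [DeviceStatus("board-01", online=True),
           DeviceStatus("board-02", online=False)]
    print(should_submit(lab, queued_jobs=3))
    print(should_submit(lab, queued_jobs=12))
```

Keeping the decision separate from the code that queries the lab would make the thresholds easy to expose as per-lab or per-device-type configuration.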
