Commit 593faf6

lava: Skip devices with queue more than queue_depth limit (default 50)
Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
1 parent b7d359f commit 593faf6

2 files changed: +145 -0 lines changed

doc/config-reference.md

Lines changed: 98 additions & 0 deletions
@@ -294,3 +294,101 @@ nodes when all jobs have daily frequency limits.
**Note**: The `--force` flag on the trigger service overrides the frequency
check, allowing checkout creation regardless of the frequency setting.

## Runtimes configuration

The `runtimes` section defines the runtime environments where jobs are executed.
Each runtime is defined as a dictionary entry with configuration parameters
specific to the runtime type.

Runtimes are defined in `config/pipeline.yaml`.

### Common parameters

- **lab_type**: string (required) - Type of runtime (`lava`, `kubernetes`, `docker`, `shell`, `pull_labs`)
- **rules**: dictionary (optional) - Filtering rules for trees/branches

### LAVA runtime parameters

LAVA runtimes are used to submit test jobs to LAVA labs.

#### Required parameters

- **lab_type**: `lava`
- **url**: string - URL of the LAVA server API (e.g., `https://lava.example.com/`)
- **notify.callback.token**: string - Token name for LAVA callbacks

#### Optional parameters

- **priority**: integer or string (`low`, `medium`, `high`) - Job priority (0-100)
- **priority_min**: integer - Minimum priority level for the lab (0-100)
- **priority_max**: integer - Maximum priority level for the lab (0-100)
- **queue_timeout**: dictionary - Timeout for jobs in queue
  - **days**: integer
  - **hours**: integer
- **max_queue_depth**: integer - Maximum number of queued jobs per device type before skipping new submissions
  - **Default**: 50
  - When the queue depth for a device type reaches this limit, new jobs will be skipped
  - Jobs are also skipped if no online devices are available for the device type
- **rules**: dictionary - Tree/branch filtering rules

#### Example

```yaml
runtimes:

  lava-collabora:
    lab_type: lava
    url: https://lava.collabora.dev/
    priority_min: 40
    priority_max: 60
    max_queue_depth: 100  # Higher limit for larger lab
    notify:
      callback:
        token: kernelci-api-token
    rules:
      tree:
        - '!android'

  lava-small-lab:
    lab_type: lava
    url: https://small-lab.example.com/
    max_queue_depth: 20  # Lower limit for smaller lab
    notify:
      callback:
        token: small-lab-token
```

#### Queue depth behavior

The `max_queue_depth` parameter controls job submission throttling per LAVA lab:

1. Before submitting a job, the scheduler queries the LAVA API to check:
   - If there are online devices for the target device type
   - The current number of queued jobs for that device type

2. The job is **skipped** (not submitted) if:
   - No online devices are available for the device type
   - The queue depth is >= `max_queue_depth`

3. When a job is skipped, a log message is generated:
   ```
   Skipping job <job-name> for <lab-name>: device_type=<type> queue_depth=<N> >= max=<limit>
   ```

4. This helps prevent queue overload in busy labs and avoids submitting jobs
   to device types with no available hardware.

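The skip decision described above can be sketched in isolation. This is a minimal illustration, not the scheduler code itself: `FakeLavaLab`, `should_skip`, and the device type names are hypothetical, while the two query method names and the fail-open handling of query errors mirror the `src/scheduler.py` change in this commit.

```python
import logging

DEFAULT_MAX_QUEUE_DEPTH = 50  # default value stated for max_queue_depth

log = logging.getLogger("queue-depth-sketch")


class FakeLavaLab:
    """Hypothetical stand-in for a LAVA runtime, used only for this sketch."""

    def __init__(self, online_devices, queued_jobs):
        self._online = online_devices  # {device_type: [online device names]}
        self._queued = queued_jobs     # {device_type: number of queued jobs}

    def get_device_names_by_type(self, device_type, online_only=True):
        return self._online.get(device_type, [])

    def get_devicetype_job_count(self, device_type):
        # A missing entry raises KeyError, which simulates a failed API query.
        return self._queued[device_type]


def should_skip(lab, device_type, max_queue_depth=DEFAULT_MAX_QUEUE_DEPTH):
    """Return True if the job should be skipped, False if it may be submitted."""
    try:
        if not lab.get_device_names_by_type(device_type, online_only=True):
            log.info("device_type=%s has no online devices", device_type)
            return True
        queued = lab.get_devicetype_job_count(device_type)
        if queued >= max_queue_depth:
            log.info("device_type=%s queue_depth=%d >= max=%d",
                     device_type, queued, max_queue_depth)
            return True
        return False
    except Exception as exc:
        log.warning("Failed to check queue depth for %s: %s", device_type, exc)
        return False  # fail open: a failed query never blocks submission


lab = FakeLavaLab(
    online_devices={"rk3399": ["rk3399-01"], "imx8": [], "qemu": ["qemu-01"]},
    queued_jobs={"rk3399": 50, "imx8": 3},
)
results = {dt: should_skip(lab, dt) for dt in ("rk3399", "imx8", "qemu")}
```

Here `rk3399` is skipped because its queue has reached the limit, `imx8` is skipped because no device is online, and `qemu` is submitted because the queue query fails and the check fails open. Raising the limit, for example `should_skip(lab, "rk3399", max_queue_depth=100)`, lets the `rk3399` job through.
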
### LAVA token configuration

LAVA tokens are stored separately in the TOML settings file (`kernelci.toml`),
not in the YAML config:

```toml
[runtime.lava-collabora]
runtime_token = "YOUR_LAVA_API_TOKEN"
callback_token = "YOUR_CALLBACK_TOKEN"  # Optional, if different from runtime_token
```

The token **name** (description) goes in the YAML config's `notify.callback.token` field,
while the token **value** (secret) goes in the TOML file.

src/scheduler.py

Lines changed: 47 additions & 0 deletions
@@ -374,6 +374,50 @@ def _log_lava_queue_status(self, runtime, params, platform):
                f"Failed to query LAVA queue status for {device_type}: {exc}"
            )

    def _should_skip_due_to_queue_depth(self, runtime, job_config, platform):
        """Check if job should be skipped due to LAVA queue depth.

        Returns True if job should be skipped, False otherwise.
        """
        if runtime.config.lab_type != 'lava':
            return False

        if not hasattr(runtime, 'get_devicetype_job_count'):
            return False

        max_queue_depth = runtime.config.max_queue_depth
        device_type = job_config.params.get('device_type') if job_config.params else None
        if not device_type:
            device_type = platform.name

        try:
            if hasattr(runtime, 'get_device_names_by_type'):
                device_names = runtime.get_device_names_by_type(
                    device_type, online_only=True
                )
                if not device_names:
                    self.log.info(
                        f"Skipping job {job_config.name} for {runtime.config.name}: "
                        f"device_type={device_type} has no online devices"
                    )
                    return True  # Skip submission when no online devices

            queued = runtime.get_devicetype_job_count(device_type)

            if queued >= max_queue_depth:
                self.log.info(
                    f"Skipping job {job_config.name} for {runtime.config.name}: "
                    f"device_type={device_type} queue_depth={queued} >= "
                    f"max={max_queue_depth}"
                )
                return True
            return False
        except Exception as exc:
            self.log.warning(
                f"Failed to check LAVA queue depth for {device_type}: {exc}"
            )
            return False  # Fail-open: don't skip on errors

    def _run_job(self, job_config, runtime, platform, input_node, retry_counter):
        try:
            node = self._api_helper.create_job_node(
@@ -697,6 +741,9 @@ def _run_scheduler(self, channel, sub_id):
                with self._api_helper_lock:
                    flag = self._api_helper.should_create_node(rules, input_node)
                if flag:
                    # Check LAVA queue depth before creating job node
                    if self._should_skip_due_to_queue_depth(runtime, job, platform):
                        continue
                    retry_counter = event.get('retry_counter', 0)
                    self._run_job(job, runtime, platform, input_node, retry_counter)
