
Add input files metrics to Seqera executor task submission #6790

Closed
pditommaso wants to merge 1 commit into sched-a2 from input-files-metrics

Conversation

@pditommaso
Member

Summary

  • Add InputFilesComputer to compute file metrics from task input files
  • Modify SeqeraBatchSubmitter for async metrics computation using a dedicated thread pool
  • Include input file statistics in task submission payload to Sched API

Changes

| File | Change |
|---|---|
| InputFilesComputer.groovy | New class to compute file count, total size, and size distribution |
| SeqeraBatchSubmitter.groovy | Async metrics computation with configurable timeout |
| InputFilesComputerTest.groovy | Unit tests for the computer class |
| build.gradle | Update sched-client to 0.16.0 |

Metrics Payload Structure

{
  "inputFilesMetrics": {
    "count": 12,
    "totalBytes": 4500000000,
    "bins": [
      {"range": "<=1MB", "count": 2},
      {"range": "<=10MB", "count": 5},
      {"range": "<=100MB", "count": 3},
      {"range": "<=1GB", "count": 2},
      {"range": ">1GB", "count": 0}
    ]
  }
}
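
To make the payload shape concrete, here is a minimal Java sketch of how the count, total size, and size bins could be derived from a list of file sizes. It is not the PR's `InputFilesComputer`: the class and method names are hypothetical, and it assumes binary units (1 MB = 1024 * 1024 bytes), which the PR does not state. The example counts above sum to the file count (2 + 5 + 3 + 2 + 0 = 12), so the sketch assigns each file to the smallest range that fits it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not the PR's InputFilesComputer.
final class InputFilesMetricsSketch {
    // Assumption: binary units (1 MB = 1024 * 1024 bytes); the PR does not say.
    static final long MB = 1024L * 1024L;
    static final long[] UPPER_BOUNDS = {MB, 10 * MB, 100 * MB, 1024 * MB};
    static final String[] LABELS = {"<=1MB", "<=10MB", "<=100MB", "<=1GB", ">1GB"};

    // Index of the first bin whose upper bound is >= size; the last bin is open-ended.
    static int binIndex(long size) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (size <= UPPER_BOUNDS[i]) {
                return i;
            }
        }
        return UPPER_BOUNDS.length; // falls into ">1GB"
    }

    // Builds a map mirroring the "inputFilesMetrics" object shown above.
    static Map<String, Object> compute(List<Long> fileSizes) {
        long total = 0;
        long[] counts = new long[LABELS.length];
        for (long size : fileSizes) {
            total += size;
            counts[binIndex(size)]++;
        }
        List<Map<String, Object>> bins = new ArrayList<>();
        for (int i = 0; i < LABELS.length; i++) {
            Map<String, Object> bin = new LinkedHashMap<>();
            bin.put("range", LABELS[i]);
            bin.put("count", counts[i]);
            bins.add(bin);
        }
        Map<String, Object> metrics = new LinkedHashMap<>();
        metrics.put("count", fileSizes.size());
        metrics.put("totalBytes", total);
        metrics.put("bins", bins);
        return metrics;
    }
}
```

A file of exactly 1024 * 1024 bytes lands in `<=1MB` under this assumption; with decimal units the boundaries would shift accordingly.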

Configuration

| Variable | Default | Description |
|---|---|---|
| NXF_SEQERA_METRICS_TIMEOUT | 30 sec | Timeout for metrics computation |
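
As a rough illustration of what a timeout on an async computation looks like, the Java sketch below starts work on a fixed pool and waits for the result only up to a deadline. It is not the `SeqeraBatchSubmitter` code: the class and method names are hypothetical, the pool size of 10 is borrowed from this PR's commit message, and returning `null` on timeout or failure is an assumption about what "no metrics" would look like.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not the PR's SeqeraBatchSubmitter.
final class AsyncMetricsSketch<T> {
    // Assumption: a fixed pool of 10 threads stands in for the dedicated pool.
    private final ExecutorService pool = Executors.newFixedThreadPool(10);
    private final long timeoutMillis;

    AsyncMetricsSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Start the computation without blocking the caller (e.g. when a task is enqueued).
    Future<T> start(Callable<T> computation) {
        return pool.submit(computation);
    }

    // Wait up to the timeout for the result (e.g. at flush time).
    // Returns null if the computation timed out, failed, or was interrupted.
    T resolve(Future<T> future) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // stop wasting a pool thread on a late result
            return null;
        } catch (ExecutionException e) {
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With the default above, the helper would be constructed as `new AsyncMetricsSketch<>(30_000)`. Starting the work early and resolving it late means the deadline only matters when the computation is still running at submission time.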

Test plan

  • Unit tests for InputFilesComputer
  • Integration test with actual Sched API

🤖 Generated with Claude Code

This change adds telemetry about input files (count, total size, and size
distribution) to the task submission payload sent to the Sched API.

Key changes:
- Add InputFilesComputer to compute file metrics from TaskRun.inputFiles
  - Follows symlinks and recursively computes directory sizes
  - Logs a warning on access failures and degrades gracefully
- Modify SeqeraBatchSubmitter for async metrics computation:
  - Uses dedicated thread pool (max 10 threads) for parallel computation
  - Computation starts at enqueue(), resolved at flush time
  - Configurable timeout via NXF_SEQERA_METRICS_TIMEOUT (default 30s)
- Update sched-client dependency to 0.16.0
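
The symlink-following, recursive size computation with graceful degradation can be sketched in Java roughly as follows. This is an illustration, not the PR's `InputFilesComputer`: the class name, the `-1` sentinel for an unreadable path, and the use of `System.err` in place of a logger are all assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitOption;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical illustration only; not the PR's InputFilesComputer.
final class PathSizeSketch {
    // Total bytes of all regular files at or under the path, following symlinks.
    // Returns -1 (an assumed sentinel) if the path cannot be traversed.
    static long sizeOf(Path path) {
        try (Stream<Path> walk = Files.walk(path, FileVisitOption.FOLLOW_LINKS)) {
            return walk.filter(Files::isRegularFile)
                       .mapToLong(PathSizeSketch::fileSize)
                       .sum();
        } catch (IOException | UncheckedIOException e) {
            // Missing paths, permission errors, and symlink loops end up here.
            System.err.println("WARN: cannot compute size of " + path + ": " + e);
            return -1L;
        }
    }

    // Size of one file; an unreadable file counts as 0 rather than failing the whole walk.
    private static long fileSize(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            System.err.println("WARN: cannot read size of " + file + ": " + e);
            return 0L;
        }
    }
}
```

Passing a regular file returns just that file's size, since `Files.walk` yields the start path itself. Following links makes symlink cycles possible; `Files.walk` reports those as an `UncheckedIOException`, which the sketch treats like any other access failure.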

The metrics payload structure:
{
  "inputFilesMetrics": {
    "count": 12,
    "totalBytes": 4500000000,
    "bins": [
      {"range": "<=1MB", "count": 2},
      {"range": "<=10MB", "count": 5},
      ...
    ]
  }
}

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso changed the base branch from sched to sched-a2 February 9, 2026 20:52
@pditommaso
Member Author

Using a simpler model for the sake of a first POC:

model InputFilesMetrics {
  /**
   * Number of input files.
   */
  @example(12)
  count?: int32;

  /**
   * Total size of all input files in bytes.
   */
  @example(4500000000)
  totalBytes?: int64;

  /**
   * Size of the largest input file in bytes.
   */
  @example(1200000000)
  maxFileBytes?: int64;

  /**
   * Size of the smallest input file in bytes.
   */
  @example(50000000)
  minFileBytes?: int64;
}
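
For comparison, the four fields of this simpler model can be filled in a single pass over the file sizes. The Java sketch below is an illustration under assumptions: the class name is hypothetical, and leaving `maxFileBytes` and `minFileBytes` as `null` for an empty input is a guess that leans on the fields being optional in the model.

```java
import java.util.List;
import java.util.LongSummaryStatistics;

// Hypothetical value class mirroring the InputFilesMetrics model above.
final class SimpleInputFilesMetrics {
    final int count;
    final long totalBytes;
    final Long maxFileBytes; // null when there are no input files (assumption)
    final Long minFileBytes; // null when there are no input files (assumption)

    private SimpleInputFilesMetrics(int count, long totalBytes, Long max, Long min) {
        this.count = count;
        this.totalBytes = totalBytes;
        this.maxFileBytes = max;
        this.minFileBytes = min;
    }

    static SimpleInputFilesMetrics of(List<Long> fileSizes) {
        LongSummaryStatistics stats =
            fileSizes.stream().mapToLong(Long::longValue).summaryStatistics();
        // With no values, getMax()/getMin() return Long.MIN_VALUE/Long.MAX_VALUE,
        // so guard the empty case instead of reporting those placeholders.
        boolean empty = stats.getCount() == 0;
        return new SimpleInputFilesMetrics(
            (int) stats.getCount(),
            stats.getSum(),
            empty ? null : stats.getMax(),
            empty ? null : stats.getMin());
    }
}
```

Compared with the binned payload, this drops the size distribution but needs no bin boundaries to be agreed on.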

@pditommaso pditommaso closed this Feb 10, 2026