Combining

Alan B. Christie edited this page Sep 16, 2025 · 5 revisions

Combining, or "fan in", refers to a step Job that expects multiple input files and creates a single output file. Simple concatenation of the files is one example. "Combiner" Jobs typically join a large number of files (produced by a prior parallel step) in a single step instance.

The Workflow Engine determines that a step is combining multiple files by inspecting the plumbing that refers to a prior step's output. If a step input variable is (according to the Job Definition) of type files then the step is assumed to be a combiner of files generated by multiple instances of a prior step.

Here's an example workflow excerpt: -

```yaml
- name: parallel
  description: Add some params
  specification:
    collection: demo
    job: append-col
    version: "1.0.0"

- name: combine
  description: Combine the parallel files
  specification:
    collection: demo
    job: concatenate
    version: "1.0.0"
  plumbing:
  - variable: inputFile
    from-step:
      name: parallel
      variable: outputFile
  - variable: inputDirPrefix
    from-predefined:
      variable: link-glob
```

In the above example, the combine step uses an inputFile variable (whose value is the value of the outputFile variable of the parallel step). When the workflow engine decides to run the combine step it inspects the step's Job Definition (version 1.0.0 of the concatenate job in the demo collection), looking specifically for the definition of the job's inputFile variable. If the variable is found to be of type files then the concatenate step will be launched once and given a glob (here through the inputDirPrefix variable, plumbed from the predefined link-glob variable). The step uses that glob to find the instance directories (hard-linked into its own instance directory) where the outputFile files can be found, one in each incoming instance directory.
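The check described above can be sketched in Python. This is only an illustration and not the engine's actual code: the function name `is_combiner` is hypothetical, and the `input_types` mapping is an assumed, simplified stand-in for whatever the real Job Definition provides. Only the type name files comes from this page; the value "file" used below is a placeholder for "any other type".

```python
from typing import Any, Mapping, Sequence


def is_combiner(
    plumbing: Sequence[Mapping[str, Any]],
    input_types: Mapping[str, str],
) -> bool:
    """Return True if any input plumbed from a prior step is of type 'files'.

    `input_types` is a hypothetical mapping of the job's input variable
    names to their declared types, standing in for the Job Definition.
    """
    for item in plumbing:
        # Only plumbing that refers to a prior step's output is relevant.
        if "from-step" not in item:
            continue
        if input_types.get(item["variable"]) == "files":
            return True
    return False


# The plumbing of the 'combine' step from the workflow excerpt.
combine_plumbing = [
    {"variable": "inputFile",
     "from-step": {"name": "parallel", "variable": "outputFile"}},
    {"variable": "inputDirPrefix",
     "from-predefined": {"variable": "link-glob"}},
]

# Assumed type declarations, for illustration only.
as_combiner = is_combiner(combine_plumbing, {"inputFile": "files"})
as_ordinary = is_combiner(combine_plumbing, {"inputFile": "file"})
```

Under these assumptions `as_combiner` is true and `as_ordinary` is false: the same plumbing is treated as a combiner only when the job declares the plumbed input to be of type files.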

Important properties of "combiner" steps

  1. When the workflow engine discovers a combining step, it does not launch the step until all of the prior (parallel) steps have successfully completed. If a parallel step fails the workflow will not progress.
  2. A combining step must use the pre-defined (built-in) variable exposed by the workflow engine (a filesystem glob), which it can use to locate the directories where the files to be combined can be found.
  3. Combining Jobs must provide two variables: one to accept the output filename of the prior step and another that accepts a filesystem glob value to identify the instance directories where each file can be found.
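From the Job's side, the two variables might be consumed roughly as in the Python sketch below. It is not the real concatenate job. The function name, the parameter names `input_file` and `dir_glob`, and the `instance-*` directory pattern in the demo are all hypothetical, and the sketch assumes the glob value can be passed directly to a standard filesystem glob.

```python
import glob
import os
import tempfile


def concatenate(input_file: str, dir_glob: str, output_path: str) -> int:
    """Concatenate `input_file` from every directory matching `dir_glob`.

    Returns the number of files that were combined.
    """
    # Sort the matches so the output order is deterministic.
    dirs = sorted(d for d in glob.glob(dir_glob) if os.path.isdir(d))
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for directory in dirs:
            candidate = os.path.join(directory, input_file)
            if not os.path.isfile(candidate):
                continue  # This directory holds no file of that name.
            with open(candidate, encoding="utf-8") as src:
                out.write(src.read())
            count += 1
    return count


# Demo: two made-up instance directories, each holding one output file.
with tempfile.TemporaryDirectory() as root:
    for name, text in (("instance-a", "a\n"), ("instance-b", "b\n")):
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "out.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    output_path = os.path.join(root, "combined.txt")
    count = concatenate("out.txt", os.path.join(root, "instance-*"), output_path)
    with open(output_path, encoding="utf-8") as f:
        combined = f.read()
```

The filename and the glob are kept as separate inputs because, as described above, the prior step only supplies the name of its output file, while the engine supplies the locations of the instance directories.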
