# The configuration file

The configuration file specifies the data flows for one single data Pipeline. It contains:

* The **Extractors** list (all the Data sources to read in input)
* The **Transformers** list (all the needed transformations)
* The **Loaders** list (all the Data sources to write in output)
* Some **global Data Pipeline configuration**, like the logging configuration

This file is a **JSON** file and must strictly respect the JSON specification.

There are 5 "sections" (JSON parent nodes):

* The first node {classname} specifies the Data Pipeline type. This type (or class) decides how the pipeline is executed (like a stack, read before, etc.)
* The node {extractors} lists all the data sources in input
* The node {loaders} lists all the data sources in output
* The node {transformers} lists all the transformers
* The node {config} details the global configuration

*Example:* This Data Pipeline reads an XES file and converts it into a CSV file ...

```mermaid
graph TD;
    Read-XES-File-->Dataset-S1;
    Dataset-S1-->Transform-Nothing;
    Transform-Nothing-->Dataset-S2;
    Dataset-S2-->Write-CSV-File
```

is configured like this:

```json
{
    "classname" : "pipelines.type.directPipeline",
    "extractors" : [
        {
            "id": "S1",
            "classname": "datasources.xesFileDS",
            "parameters": {
                "separator": ",",
                "filename": "test.xes",
                "path": "tests/data/"
            }
        }
    ],
    "loaders" : [
        {
            "id": "S2",
            "classname": "pipelite.datasources.csvFileDS",
            "parameters": {
                "separator": ",",
                "filename": "test-xes.csv",
                "path": "tests/data/out",
                "encoding": "utf-8"
            }
        }
    ],
    "transformers": [
        {
            "id": "T",
            "classname": "pipelite.transformers.doNothingTR",
            "inputs" : [ "S1" ],
            "outputs" : [ "S2" ]
        }
    ],
    "config": {
        "logger" : {
            "level": "DEBUG",
            "format" : "%(asctime)s|%(name)s|%(levelname)s|%(message)s",
            "path": "logs/",
            "filename" : "xes2csv_direct.log",
            "maxbytes" : 1000000
        }
    }
}
```

## The {extractors} section

The {extractors} section must embed an array []. This array contains the description of one or more Data Sources the Data Pipeline will have to read. Each entry has some mandatory parameters/values to fill in, plus other attributes which depend on the nature of the Data source itself.

These are the mandatory attributes:

* **id**: this id is important and must identify the data source (and, afterwards, the dataset generated) within the Data Pipeline, so it has to be unique.
* **classname**: this is the Python class which manages the reading of the data source.
* **parameters**: this is where you specify the parameters which depend on the nature of the data source.

Below we're reading an XES file (class datasources.xesFileDS): its id is S1 and the file read is test.xes.

```json
{
    "id": "S1",
    "classname": "datasources.xesFileDS",
    "parameters": {
        "separator": ",",
        "filename": "test.xes",
        "path": "tests/data/"
    }
}
```

## The {loaders} section

The {loaders} section is strictly similar to the {extractors} section and follows the same rules; a sample loader entry is shown below.
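For instance, here is the loader entry taken from the full example above. It writes the dataset S2 out as a CSV file, and the attributes under parameters are specific to the csvFileDS data source:

```json
{
    "id": "S2",
    "classname": "pipelite.datasources.csvFileDS",
    "parameters": {
        "separator": ",",
        "filename": "test-xes.csv",
        "path": "tests/data/out",
        "encoding": "utf-8"
    }
}
```

As with extractors, the id (here S2) is the name the transformers use to refer to this output dataset.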
## The {transformers} section

The {transformers} section lists all the transformations the Data Pipeline will need to perform. So, like {extractors} and {loaders}, this section must embed an array [] with all those transformations. Each transformation also has some mandatory attributes (common to all transformations) and other attributes which depend on the transformer itself.

These are the mandatory attributes:

* **id**: to identify a transformer
* **classname**: this is the Python class which manages the transformation
* **inputs**: an array (or Python list) which lists all the data sources (or datasets) needed and used by the transformer
* **outputs**: an array (or Python list) which lists all the datasets generated by the transformer in output
* **parameters**: this is where you specify the parameters which depend on the transformer

In the example below there are 3 different Transformers which manage different Datasets (T3, a simple passthrough, does not appear in the diagram):

```mermaid
graph TD;
    Read-I1-->I1;
    Read-I2-->I2;
    Read-I3-->I3;
    I1-->T1;
    I2-->T1;
    T1-->T01;
    I3-->T2;
    T01-->T2;
    T2-->T02;
```

This is the {transformers} configuration needed:

```json
"transformers": [
    {
        "id": "T1",
        "classname": "pipelite.transformers.concatTR",
        "inputs" : [ "I1", "I2" ],
        "outputs" : [ "T01" ]
    },
    {
        "id": "T2",
        "classname": "pipelite.transformers.lookupTR",
        "inputs" : [ "T01", "I3" ],
        "outputs" : [ "T02" ],
        "parameters" : {
            "main" : { "ds-id" : "E1", "key" : "col2"},
            "lookup" : { "ds-id" : "E2", "key" : "tcol1", "keep" : "tcol2"}
        }
    },
    {
        "id": "T3",
        "classname": "pipelite.transformers.passthroughTR"
    }
]
```

## Examples

[Many examples can be found in the repository](https://github.com/datacorner/pipelite/tree/main/src/config/pipelines)
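Since the configuration must strictly respect the JSON specification (no comments, no trailing commas), it can be worth checking a file before running the pipeline. Below is a minimal, illustrative sketch using only the Python standard library; check_config and the config.json filename are hypothetical, and the checks simply mirror the rules described in this document (the five top-level nodes, plus unique ids):

```python
import json

# The five top-level nodes described in this document.
EXPECTED_NODES = {"classname", "extractors", "loaders", "transformers", "config"}

def check_config(path: str) -> None:
    """Hypothetical helper: load a pipeline configuration file and apply
    a few sanity checks based on the rules described in this document."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)  # raises a ValueError if the file is not strict JSON

    missing = EXPECTED_NODES - cfg.keys()
    if missing:
        raise ValueError(f"missing top-level node(s): {sorted(missing)}")

    # ids identify the datasets within the pipeline, so they have to be
    # unique across extractors, loaders and transformers.
    ids = [entry["id"]
           for section in ("extractors", "loaders", "transformers")
           for entry in cfg[section]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate id found in the configuration")

    print(f"{path}: structure looks fine")

if __name__ == "__main__":
    check_config("config.json")  # replace with your own configuration file
```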