# The configuration file

The configuration file specifies the data flows for one single data Pipeline. It contains:

* The **Extractors** list (all the Data sources to read in input)
* The **Transformers** list (all the needed transformations)
* The **Loaders** list (all the Data sources to write in output)
* Some **global Data Pipeline configuration**, like the logging configuration

This file is a **JSON** file and must strictly respect the JSON specification.

There are 5 "sections" (JSON parent nodes):

* The first node {classname} specifies the Data Pipeline type. This type (or class) decides how the pipeline is executed (like a stack, read before, etc.)
* The node {extractors} lists all the data sources in input
* The node {loaders} lists all the data sources in output
* The node {transformers} lists all the transformers
* The node {config} details the global configuration

*Example:* This Data Pipeline reads an XES file and converts it into a CSV file ...

```mermaid
graph TD;
    Read-XES-File-->Dataset-S1;
    Dataset-S1-->Transform-Nothing;
    Transform-Nothing-->Dataset-S2;
    Dataset-S2-->Write-CSV-File
```

is configured like this:

```json
{
    "classname" : "pipelines.type.directPipeline",
    "extractors" : [
        {
            "id": "S1",
            "classname": "datasources.xesFileDS",
            "parameters": {
                "separator": ",",
                "filename": "test.xes",
                "path": "tests/data/"
            }
        }
    ],
    "loaders" : [
        {
            "id": "S2",
            "classname": "pipelite.datasources.csvFileDS",
            "parameters": {
                "separator": ",",
                "filename": "test-xes.csv",
                "path": "tests/data/out",
                "encoding": "utf-8"
            }
        }
    ],
    "transformers": [
        {
            "id": "T",
            "classname": "pipelite.transformers.doNothingTR",
            "inputs" : [ "S1" ],
            "outputs" : [ "S2" ]
        }
    ],
    "config": {
        "logger" : {
            "level": "DEBUG",
            "format" : "%(asctime)s|%(name)s|%(levelname)s|%(message)s",
            "path": "logs/",
            "filename" : "xes2csv_direct.log",
            "maxbytes" : 1000000
        }
    }
}
```

## The {extractors} section

The {extractors} section must embed an array []. This array contains the description of one or more Data Sources the Data Pipeline will have to read. Each entry has some mandatory parameters/values to fill in, plus other attributes which depend on the nature of the Data source itself.

These are the mandatory attributes:

* **id**: this id is important and must identify the data source (and, afterwards, the dataset generated) within the Data Pipeline, so it has to be unique.
* **classname**: this is the Python class which manages the reading of the data source.
* **parameters**: this is where you specify the parameters which depend on the nature of the data source.

Below we're reading an XES file (class datasources.xesFileDS): its id is S1 and the file read is test.xes.

```json
{
    "id": "S1",
    "classname": "datasources.xesFileDS",
    "parameters": {
        "separator": ",",
        "filename": "test.xes",
        "path": "tests/data/"
    }
}
```

## The {loaders} section

The {loaders} section is strictly similar to the {extractors} section and follows the same rules; a sample loader entry is shown below.
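For instance, here is the loader entry taken from the full example above. It writes the dataset S2 out as a CSV file, and the attributes under parameters are specific to the csvFileDS data source:

```json
{
    "id": "S2",
    "classname": "pipelite.datasources.csvFileDS",
    "parameters": {
        "separator": ",",
        "filename": "test-xes.csv",
        "path": "tests/data/out",
        "encoding": "utf-8"
    }
}
```

As with extractors, the id (here S2) is the name the transformers use to refer to this output dataset.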
## The {transformers} section

The {transformers} section lists all the transformations the Data Pipeline will need to perform. So, like {extractors} and {loaders}, this section must embed an array [] with all those transformations. Each transformation also has some mandatory attributes (common to all transformations) and other attributes which depend on the transformer itself.

These are the mandatory attributes:

* **id**: to identify a transformer
* **classname**: this is the Python class which manages the transformation
* **inputs**: an array (or Python list) which lists all the data sources (or datasets) needed and used by the transformer
* **outputs**: an array (or Python list) which lists all the datasets generated by the transformer in output
* **parameters**: this is where you specify the parameters which depend on the transformer

In the example below there are 3 different Transformers which manage different Datasets (T3, a simple passthrough, does not appear in the diagram):

```mermaid
graph TD;
    Read-I1-->I1;
    Read-I2-->I2;
    Read-I3-->I3;
    I1-->T1;
    I2-->T1;
    T1-->T01;
    I3-->T2;
    T01-->T2;
    T2-->T02;
```

This is the {transformers} configuration needed:

```json
"transformers": [
    {
        "id": "T1",
        "classname": "pipelite.transformers.concatTR",
        "inputs" : [ "I1", "I2" ],
        "outputs" : [ "T01" ]
    },
    {
        "id": "T2",
        "classname": "pipelite.transformers.lookupTR",
        "inputs" : [ "T01", "I3" ],
        "outputs" : [ "T02" ],
        "parameters" : {
            "main" : { "ds-id" : "E1", "key" : "col2"},
            "lookup" : { "ds-id" : "E2", "key" : "tcol1", "keep" : "tcol2"}
        }
    },
    {
        "id": "T3",
        "classname": "pipelite.transformers.passthroughTR"
    }
]
```

## Examples

[Many examples can be found in the repository](https://github.com/datacorner/pipelite/tree/main/src/config/pipelines)
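Since the configuration must strictly respect the JSON specification (no comments, no trailing commas), it can be worth checking a file before running the pipeline. Below is a minimal, illustrative sketch using only the Python standard library; check_config and the config.json filename are hypothetical, and the checks simply mirror the rules described in this document (the five top-level nodes, plus unique ids):

```python
import json

# The five top-level nodes described in this document.
EXPECTED_NODES = {"classname", "extractors", "loaders", "transformers", "config"}

def check_config(path: str) -> None:
    """Hypothetical helper: load a pipeline configuration file and apply
    a few sanity checks based on the rules described in this document."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)  # raises a ValueError if the file is not strict JSON

    missing = EXPECTED_NODES - cfg.keys()
    if missing:
        raise ValueError(f"missing top-level node(s): {sorted(missing)}")

    # ids identify the datasets within the pipeline, so they have to be
    # unique across extractors, loaders and transformers.
    ids = [entry["id"]
           for section in ("extractors", "loaders", "transformers")
           for entry in cfg[section]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate id found in the configuration")

    print(f"{path}: structure looks fine")

if __name__ == "__main__":
    check_config("config.json")  # replace with your own configuration file
```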