Separating repeated processing from classifier models

Between different runs of ATM, the outputs of all the steps of the pipeline are "static," except for the input and output of the classifier that is chosen by BTB. For example, if PCA is in the pipeline, then every time ATM/BTB chooses a new model to run, it recomputes the PCA for the same dataset. Unless I'm misunderstanding the flow of data, this seems inefficient. Although the current pipeline is pretty simple (scaling/PCA), people may want to add more computationally intensive elements to it.

We can separate the pipeline into two pipelines: a "static" one, whose outputs are stored on disk so that they can be reused between runs, and a "dynamic" one, which is essentially the classifier plus any blocks that change based on the ATM/BTB model being run.
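A minimal sketch of the split, not ATM's actual code. It assumes a hypothetical step representation, a `(name, function, depends_on_model)` tuple, and a hypothetical `split_pipeline` helper. The split happens at the first model-dependent step, since everything downstream of that step sees model-dependent input and cannot be reused either.

```python
from typing import Callable, List, Tuple

# Hypothetical step representation: (name, function, depends_on_model).
Step = Tuple[str, Callable, bool]


def split_pipeline(steps: List[Step]) -> Tuple[List[Step], List[Step]]:
    """Split steps into a reusable static prefix and a dynamic remainder.

    The split point is the first step that depends on the chosen model;
    all later steps are treated as dynamic too, because their input changes.
    """
    for i, (_name, _func, depends_on_model) in enumerate(steps):
        if depends_on_model:
            return steps[:i], steps[i:]
    return steps, []


def identity(data):
    # Placeholder for a real transform or classifier.
    return data


steps = [
    ("scale", identity, False),
    ("pca", identity, False),
    ("classifier", identity, True),
]
static_steps, dynamic_steps = split_pipeline(steps)
static_names = [name for name, _, _ in static_steps]
dynamic_names = [name for name, _, _ in dynamic_steps]
```

Here `static_names` holds the scaling and PCA steps and `dynamic_names` holds only the classifier.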

If you think this is a good idea, how do we want to architect this from a software perspective? One approach is to compute the static pipeline before the `test_classifier` method is run and save its output to the data directory where the train/test dataset is saved.
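One way the "compute once, save next to the dataset" approach could look, as a rough sketch under stated assumptions. The helpers `dataset_key` and `load_or_compute` are hypothetical and not part of ATM or BTB. `run_static_steps` is a pure-Python stand-in for scaling/PCA, and the cache is a pickle file keyed by a hash of the training file contents and the static step names.

```python
import hashlib
import os
import pickle
import tempfile


def dataset_key(train_path, static_step_names):
    """Build a cache key from the dataset contents and the static step names."""
    digest = hashlib.sha256()
    with open(train_path, "rb") as f:
        digest.update(f.read())
    digest.update("|".join(static_step_names).encode("utf-8"))
    return digest.hexdigest()


def load_or_compute(cache_dir, key, compute):
    """Return the cached result for key, computing and saving it on a miss."""
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records how often the static steps actually run


def run_static_steps():
    # Stand-in for scaling/PCA: min-max scale a small list of values.
    calls.append(1)
    values = [2.0, 4.0, 6.0]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


with tempfile.TemporaryDirectory() as data_dir:
    train_path = os.path.join(data_dir, "train.csv")
    with open(train_path, "w") as f:
        f.write("x\n2.0\n4.0\n6.0\n")
    key = dataset_key(train_path, ["scale", "pca"])
    # Two "classifier runs" on the same dataset: only the first computes.
    first = load_or_compute(data_dir, key, run_static_steps)
    second = load_or_compute(data_dir, key, run_static_steps)
```

Keying on the file contents means a changed dataset or a changed list of static steps produces a new cache entry instead of silently reusing stale output. Pickle files should only be loaded from a directory you trust.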

Separating repeated processing from classifier models #70

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Separating repeated processing from classifier models #70

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions