How to implement Featuretools into my ML Process without data leakage? #14

@kilincali35

I am exploring whether I can integrate Featuretools into my pipeline to create new features from my DataFrame.

Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, such as STD(column), I feel it is susceptible to data leakage. Your FAQ gives an example approach to tackle this, but it is not suitable for the Pipeline structure I am using.

Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem to be compatible with Pipelines. Ideally it would construct the features from each fold's train data and transform that fold's test data, K times. At the end, during the refit=True stage of GridSearchCV, it would use the whole data to construct the features. If you have an example showing that this is possible after all, I would be glad to see it.
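To make Idea 0 concrete, here is a minimal sketch of the behavior I am after, written against scikit-learn only. `GroupAggFeatures` is a hypothetical stand-in that I wrote by hand, not a Featuretools class, and the `group` and `value` columns and the toy data are made up for illustration. It learns per-group aggregates in `fit` and joins them in `transform`, so GridSearchCV only ever computes them from the training part of each fold.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class GroupAggFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a feature-construction step (not Featuretools)."""

    def __init__(self, group_col="group", value_col="value"):
        self.group_col = group_col
        self.value_col = value_col

    def fit(self, X, y=None):
        # Aggregates are computed only from the rows passed to fit(),
        # i.e. the training part of the current fold.
        grouped = X.groupby(self.group_col)[self.value_col]
        self.stats_ = grouped.agg(["mean", "std"]).add_prefix(f"{self.value_col}_")
        # Global training statistics, used for unseen groups or undefined std.
        self.fallback_ = (
            X[self.value_col].agg(["mean", "std"]).add_prefix(f"{self.value_col}_")
        )
        return self

    def transform(self, X):
        # Attach the stored training aggregates; nothing is recomputed here.
        out = X[[self.group_col, self.value_col]].join(self.stats_, on=self.group_col)
        return out.drop(columns=self.group_col).fillna(self.fallback_)


# Toy data, invented for this sketch.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "group": rng.integers(0, 5, size=200),
    "value": rng.normal(size=200),
})
y = 2.0 * X["value"] + X["group"] + rng.normal(scale=0.1, size=200)

pipe = Pipeline([("features", GroupAggFeatures()), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=5, refit=True)
search.fit(X, y)

# With refit=True, the final feature step has been fitted on the whole data.
final_features = search.best_estimator_.named_steps["features"]
```

Whether a Featuretools call can be wrapped in `fit` and `transform` in the same way is exactly what I am unsure about.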

Idea 1: I can switch to a manual CV structure that is not embedded in a Pipeline. Inside it, I can use the train data to construct the new features and transform the test data with them. This runs K times. At the end, all the data can be used to build the final model.

This is the safest option, at the cost of extra time and complexity.
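A rough sketch of what I mean by the manual CV loop in Idea 1 follows. `build_features` is a hypothetical helper that uses plain pandas aggregations in place of Featuretools, and the column names, toy data, and Ridge model are assumptions made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def build_features(train, apply_to, group_col="group", value_col="value"):
    """Learn per-group statistics on `train`, then attach them to `apply_to`."""
    stats = train.groupby(group_col)[value_col].agg(["mean", "std"]).add_prefix("grp_")
    # Global training statistics for groups missing from `train` or undefined std.
    fallback = train[value_col].agg(["mean", "std"]).add_prefix("grp_")
    out = apply_to[[group_col, value_col]].join(stats, on=group_col)
    return out.drop(columns=group_col).fillna(fallback)


# Toy data, invented for this sketch.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.integers(0, 5, size=200),
    "value": rng.normal(size=200),
})
y = 2.0 * df["value"] + df["group"] + rng.normal(scale=0.1, size=200)

fold_mse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    X_train = build_features(train, train)  # construct from train data
    X_test = build_features(train, test)    # transform test data with train statistics
    model = Ridge(alpha=1.0).fit(X_train, y.iloc[train_idx])
    fold_mse.append(mean_squared_error(y.iloc[test_idx], model.predict(X_test)))

# After the K folds, construct features from all data and fit the final model.
X_all = build_features(df, df)
final_model = Ridge(alpha=1.0).fit(X_all, y)
```

The loop has to be repeated for every hyperparameter combination, which is where the time and complexity cost comes from.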

Idea 2: Use it on the whole data and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at your GitHub page, all the examples combine the train and test data, create the features from the whole data, and only then split into train and test for modeling.

For example:
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb

Actually, if you, as the developers of the project, consider this acceptable, I could give the whole-data approach a chance. Don't you think there is a leakage risk in the approach used in these Taxi Trip Duration examples?

What do you think? I would love to hear your intuition about Featuretools.
