# by the `fit` method, since this allows us to access the transformed
# dataframe. For other models we could use the `transform` method, but
# the GradientBoostingRegressor does not have a `transform` method.
X_transformed, _ = pipeline.price_pipe._fit(X_train, y_train)
@solegalli previously I was able to use the _fit method to access the transformed dataframe of the pipeline. This method seems to have been updated (possibly here: scikit-learn/scikit-learn@88ce8cd) and is throwing an error.
Any idea how to access the transformed dataframe for the GradientBoostingRegressor (since it doesn't have a transform method?)
Not straightforward. Here are some potential solutions:
https://stackoverflow.com/questions/54332654/how-to-analyse-the-intermediate-steps-of-sklearn-pipeline
https://stackoverflow.com/questions/48743032/get-intermediate-data-state-in-scikit-learn-pipeline?
In short, I think you need to apply all the transformations one after the other, just up to the one before the model.
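Applying the fitted transformers one after the other can be done by iterating over the pipeline's `steps` attribute, stopping before the final estimator. A minimal sketch with a toy pipeline (the scalers and data here are assumptions for illustration, not the course pipeline):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_regression(n_samples=20, n_features=3, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("minmax", MinMaxScaler()),
    ("model", GradientBoostingRegressor(random_state=0)),
])
pipe.fit(X, y)

# Apply each fitted transformer in order, stopping before the model,
# to recover the data exactly as the regressor sees it.
X_t = X
for name, step in pipe.steps[:-1]:
    X_t = step.transform(X_t)

print(X_t.shape)  # same number of rows as X, transformed features
```

Because `MinMaxScaler` was fitted on the same data, every column of `X_t` spans exactly [0, 1] here; the loop produces the same result as the slicing approach discussed below.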
Ok, I asked the sklearn community through the email list and one of the developers told me to do this:
Using slicing: model[:-1].transform(X)
That worked perfectly :)
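For anyone following along, the slicing trick works because `pipeline[:-1]` returns a new `Pipeline` containing every step except the last, and that sub-pipeline exposes `transform` even when the final estimator does not. A minimal, self-contained sketch (the scaler and data are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

pipe = make_pipeline(StandardScaler(), GradientBoostingRegressor(n_estimators=5, random_state=0))
pipe.fit(X, y)

# pipe[:-1] is itself a Pipeline of every step except the regressor,
# so .transform works even though GradientBoostingRegressor has no transform.
X_transformed = pipe[:-1].transform(X)
print(X_transformed.mean(axis=0))  # standardized: ~0 mean per column
```

Note that pipeline slicing requires scikit-learn >= 0.21.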
from gradient_boosting_model import pipeline
from gradient_boosting_model.config.core import config


def test_pipeline_drops_unnecessary_features(pipeline_inputs):
    # Given
    X_train, X_test, y_train, y_test = pipeline_inputs
    assert config.model_config.drop_features in X_train.columns

    pipeline.price_pipe.fit(X_train, y_train)

    # When
    # We access the transformed inputs with slicing
    transformed_inputs = pipeline.price_pipe[:-1].transform(X_train)

    # Then
    assert config.model_config.drop_features in X_train.columns
    assert config.model_config.drop_features not in transformed_inputs.columns


def test_pipeline_transforms_temporal_features(pipeline_inputs):
    # Given
    X_train, X_test, y_train, y_test = pipeline_inputs
    pipeline.price_pipe.fit(X_train, y_train)

    # When
    # We access the transformed inputs with slicing
    transformed_inputs = pipeline.price_pipe[:-1].transform(X_train)

    # Then
    assert (
        transformed_inputs.iloc[0]["YearRemodAdd"]
        == X_train.iloc[0]["YrSold"] - X_train.iloc[0]["YearRemodAdd"]
    )
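The temporal test above expects the pipeline to replace `YearRemodAdd` with the years elapsed relative to `YrSold`. A minimal sketch of such a transformer (the class and parameter names here are assumptions for illustration; the course package's actual transformer may differ):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ElapsedYearsTransformer(BaseEstimator, TransformerMixin):
    """Replace each temporal variable with the years elapsed
    until the reference variable (hypothetical name)."""

    def __init__(self, variables, reference_variable):
        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        X = X.copy()  # never mutate the caller's dataframe
        for var in self.variables:
            X[var] = X[self.reference_variable] - X[var]
        return X


df = pd.DataFrame({"YrSold": [2010, 2008], "YearRemodAdd": [2000, 1999]})
out = ElapsedYearsTransformer(["YearRemodAdd"], "YrSold").transform(df)
print(out["YearRemodAdd"].tolist())  # [10, 9]
```

Because `transform` copies the input, the original `df` is left untouched, which is exactly what the first assertion in the drop-features test relies on.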
@@ -1,2 +1,3 @@
 flask>=1.1.1,<1.2.0
+markupsafe==2.0.1  # https://github.com/aws/aws-sam-cli/issues/3661
Key Section 5 fix
feature_engine>=0.3.1,<0.4.0
joblib>=0.14.1,<0.15.0
numpy>=1.20.0,<1.21.0
pandas>=1.3.5,<1.4.0
Key section 4 fix 1
 # old model for testing purposes
 # source code: https://github.com/trainindata/deploying-machine-learning-models/tree/master/packages/regression_model
-tid-regression-model>=2.0.20,<2.1.0
+tid-regression-model==3.1.2
Key section 4 fix 2
@@ -1,4 +1,5 @@
 Flask>=1.1.1,<1.2.0
+markupsafe==2.0.1  # https://github.com/aws/aws-sam-cli/issues/3661
key fix for section 9
@@ -1,12 +1,13 @@
 # ML Model
-tid-gradient-boosting-model>=0.1.18,<0.2.0
+tid-gradient-boosting-model>=0.3.0,<0.4.0
Key fix for section 9 (2/2)
Note this required publishing a new version of the gradient boosting model (so that the scikit-learn versions are compatible).
Force-pushed from 1716ef4 to 72ae086
Update: For students checking this PR, the fixes have now been applied to the commit history. Please update your local copy from the current master branch.
Fix for the "Unit Testing a Production ML Model" section