Skip to content

GridSearch with pipelines of dataframes #24

@mratsim

Description

@mratsim

Hello again Cédric,

Following your help on transformer I am now trying to use a GridSearch to optimize the hyperparameters of a RandomForest.

I have a pipeline with lots of transformer which works great with Cross Validation and actual prediction, however I get a type error when trying to use it in a GridSearchCV, it seems like there is an extra argument of type ScikitLearn.Skcore.ParameterGrid in my setup :

pipe = Pipelines.Pipeline([ # This is working fine for cross validation, fitting and predicting
    ("extract_deck",PP_DeckTransformer()),
     ... # A list of 15 transformers
     ("featurize", mapper), # This is a DataFrameMapper to convert to Array
    ("forest", RandomForestClassifier(ntrees=200)) #Hyperparam: nsubfeatures, partialsampling, maxdepth
    ])

X_train = train
Y_train = convert(Array, train[:Survived])

# #Cross Validation - check model accuracy -- This is working fine
# crossval = round(cross_val_score(pipe, X_train, Y_train, cv =10), 2)
# print("\n",crossval,"\n")
# print(mean(crossval))

# GridSearch
grid = Dict(:ntrees => 10:30:240,
            :nsubfeatures => 0:1:13,
            :partialsampling => 0.2:0.1:1.0,
            :maxdepth => -1:2:13
)

gridsearch = GridSearchCV(pipe, grid)
fit!(gridsearch, X_train, Y_train)
println("Best hyper-parameters: $(gridsearch.best_params_)")

The error I get is :

ERROR: LoadError: MethodError: no method matching _fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}, ::ScikitLearn.Skcore.ParameterGrid)
Closest candidates are:
  _fit!(::ScikitLearn.Skcore.BaseSearchCV, !Matched::AbstractArray{T,N}, ::Any, ::Any) at /Users/<user>/.julia/v0.5/ScikitLearn/src/grid_search.jl:254
 in fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}) at /Users/<user>/.julia/v0.5/ScikitLearn/src/grid_search.jl:526
 in include_from_node1(::String) at ./loading.jl:488
 in include_from_node1(::String) at /usr/local/Cellar/julia/0.5.0/lib/julia/sys.dylib:?
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318
 in _start() at /usr/local/Cellar/julia/0.5.0/lib/julia/sys.dylib:?
while loading /Users/<path>/Kaggle-001-Julia-MagicalForest.jl, in expression starting on line 538

So the proc is receiving _fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}, ::ScikitLearn.Skcore.ParameterGrid) but expecting an array instead of a Dataframe. The thing is it should have been converted away by the DataFrameMapper.

If needed the full code is there https://github.com/mratsim/MachineLearning_Kaggle/blob/9c07a64a981a6512e021ae01623212a278fd05d1/Kaggle%20-%20001%20-%20Titanic%20Survivors/Kaggle-001-Julia-MagicalForest.jl#L530

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions