| Linux | Coverage | Code Style |
| :------------ | :------- | :------------- |
| [Build Status](https://github.com/JuliaAI/FeatureSelection.jl/actions) | [codecov](https://codecov.io/github/JuliaAI/FeatureSelection.jl?branch=dev) | [BlueStyle](https://github.com/invenia/BlueStyle) |

Repository housing feature selection algorithms for use with the machine learning toolbox
[MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/).

The `FeatureSelector` model builds on contributions originally residing at
[MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl/blob/v0.16.15/src/builtins/Transformers.jl#L189-L266).
# Installation
On a running instance of Julia with at least version 1.6, run
```julia
import Pkg;
Pkg.add("FeatureSelection")
```

# Example Usage
Let's build a supervised recursive feature eliminator with `RandomForestRegressor`
from DecisionTree.jl as our base model.
But first we need a dataset to train on. We shall create a synthetic dataset, popularly
known in the R community as the Friedman #1 dataset. Notice that the target for this
dataset depends on only the first five columns of the feature table, so we expect our
recursive feature elimination to return those first five columns as the important features.
```julia
using MLJ, FeatureSelection
using StableRNGs
rng = StableRNG(10)
A = rand(rng, 50, 10)
X = MLJ.table(A) # features
y = @views(
    10 .* sin.(
        pi .* A[:, 1] .* A[:, 2]
    ) .+ 20 .* (A[:, 3] .- 0.5).^2 .+ 10 .* A[:, 4] .+ 5 .* A[:, 5]
) # target
```
Now that we have our data, we can create our recursive feature elimination model and
train it on our dataset:
```julia
RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
forest = RandomForestRegressor(rng=rng)
rfe = RecursiveFeatureElimination(
    model = forest, n_features=5, step=1
) # see the docstring for a description of the defaults
mach = machine(rfe, X, y)
fit!(mach)
```
We can inspect the feature importances in two ways:
```julia
# A feature with a lower ranking is more significant than one with a higher ranking,
# while a feature with a higher importance is more significant than one with a lower
# importance.
report(mach).ranking # returns [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_importances(mach) # returns a dict of feature => importance pairs
```
We can view the important features used by our model by inspecting the `fitted_params`
object:
```julia
p = fitted_params(mach)
p.features_left == [:x1, :x2, :x3, :x4, :x5]
```
We can also call the `predict` method on the fitted machine, to predict using a
random forest regressor trained on only the important features, or call the `transform`
method, to select just those features from some new table that includes all the original
features. For more information, type `?RecursiveFeatureElimination` in the Julia REPL.
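For instance, here is a minimal sketch (the table `Xfresh` and its row count are made up
for illustration; any table with the same ten column names as `X` will do):

```julia
Xfresh = MLJ.table(rand(rng, 5, 10)) # hypothetical new data with columns :x1, ..., :x10
transform(mach, Xfresh)              # keeps only the five selected features
predict(mach, Xfresh)                # predictions from the forest trained on those features
```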
Okay, let's say that we didn't know that our synthetic dataset depends on only five
columns of our feature table. We could apply 5-fold cross-validation, `CV(nfolds=5)`,
with our recursive feature elimination model to select the optimal value of `n_features`
for our model. In this case we will use a simple grid search with root mean squared
error as the measure.
```julia
rfe = RecursiveFeatureElimination(model = forest)
tuning_rfe_model = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng=rng),
    resampling = CV(nfolds = 5),
    range = range(
        rfe, :n_features, values = 1:10
    )
)
self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
fit!(self_tuning_rfe_mach)
```
As before, we can inspect the important features via the object returned by
`fitted_params`, or via `feature_importances`, as shown below.
```julia
fitted_params(self_tuning_rfe_mach).best_fitted_params.features_left == [:x1, :x2, :x3, :x4, :x5]
feature_importances(self_tuning_rfe_mach) # returns a dict of feature => importance pairs
```
We can also call `predict` on the tuned model machine, as shown below:
```julia
Xnew = MLJ.table(rand(rng, 50, 10)) # create test data
predict(self_tuning_rfe_mach, Xnew)
```
In this case, prediction is done using the best recursive feature elimination model
obtained from the tuning process above.
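To check which value of `n_features` the grid search settled on, we can inspect the
tuning report; the sketch below assumes the standard `best_model` entry that `TunedModel`
machines expose in their report:

```julia
best_rfe = report(self_tuning_rfe_mach).best_model # best rfe model found during tuning
best_rfe.n_features # for this synthetic data we expect a value close to 5
```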
For resampling methods different from cross-validation, and for other `TunedModel`
options, such as parallelization, see the
[Tuning Models](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/) section of the MLJ manual.
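For example, a rough sketch combining holdout resampling with multithreaded grid
evaluation (the particular options shown are illustrative, not prescriptive):

```julia
tuned_rfe_alt = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng = rng),
    resampling = Holdout(fraction_train = 0.7, rng = rng), # holdout instead of cross-validation
    range = range(rfe, :n_features, values = 1:10),
    acceleration = CPUThreads() # evaluate grid points on multiple threads
)
fit!(machine(tuned_rfe_alt, X, y))
```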
See also the [MLJ Documentation](https://alan-turing-institute.github.io/MLJ.jl/dev/).