| :------------ | :------- | :------------- |
| [Build Status](https://github.com/JuliaAI/FeatureSelection.jl/actions) | [Coverage](https://codecov.io/github/JuliaAI/FeatureSelection.jl?branch=dev) | [Code Style: Blue](https://github.com/invenia/BlueStyle) |

Repository housing feature selection algorithms for use with the machine learning toolbox [MLJ](https://juliaai.github.io/MLJ.jl/dev/).

The `FeatureSelector` model builds on contributions originally residing at [MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl/blob/v0.16.15/src/builtins/Transformers.jl#L189-L266).

# Installation
On a running instance of Julia with at least version 1.6, run
```julia
import Pkg;
Pkg.add("FeatureSelection")
```

# Example Usage
Let's build a supervised recursive feature eliminator with `RandomForestRegressor`
from DecisionTree.jl as our base model.
But first we need a dataset to train on. We shall create a synthetic dataset, popularly
known in the R community as the Friedman #1 dataset. Notice how the target vector for this
dataset depends on only the first five columns of the feature table, so we expect the
recursive feature elimination to return those first five columns as the important features.
```julia
using MLJ, FeatureSelection
using StableRNGs
rng = StableRNG(10)
A = rand(rng, 50, 10)
X = MLJ.table(A) # feature table with columns :x1, ..., :x10
y = @views(
    10 .* sin.(
        pi .* A[:, 1] .* A[:, 2]
    ) .+ 20 .* (A[:, 3] .- 0.5).^2 .+ 10 .* A[:, 4] .+ 5 .* A[:, 5]
) # target, a function of the first five features only
```
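If you want to confirm how MLJ names the columns of the wrapped matrix, one quick, purely illustrative check is to look at the table's schema:
```julia
schema(X) # displays the column names (:x1, ..., :x10) and their scitypes
```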
Now that we have our data, we can create our recursive feature elimination model and
train it on our dataset:
```julia
RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
forest = RandomForestRegressor(rng=rng)
rfe = RecursiveFeatureElimination(
    model=forest, n_features=5, step=1
) # see the docstring for a description of the defaults
mach = machine(rfe, X, y)
fit!(mach)
```
We can inspect the feature importances in two ways:
```julia
# A feature with a lower rank is more significant than one with a higher rank;
# a feature with a higher importance is more significant than one with a lower importance.
report(mach).ranking # returns [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_importances(mach) # returns feature => importance pairs
```
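For example, to list the features from most to least important, one could sort those pairs (a minimal sketch, assuming the importances come back as a list of `feature => importance` pairs, as above):
```julia
sort(feature_importances(mach), by=last, rev=true) # most important features first
```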
We can view the important features used by our model by inspecting the `fitted_params`
object.
```julia
p = fitted_params(mach)
p.features_left == [:x1, :x2, :x3, :x4, :x5]
```
We can also call the `predict` method on the fitted machine, to predict using a
random forest regressor trained using only the important features, or call the `transform`
method, to select just those features from some new table that includes all the original
features. For more information, type `?RecursiveFeatureElimination` at the Julia REPL.

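For instance, a minimal sketch (`Xfresh` below is just hypothetical new data with the same ten columns):
```julia
Xfresh = MLJ.table(rand(StableRNG(20), 5, 10)) # hypothetical new data
predict(mach, Xfresh)   # predictions from the forest fitted on the selected features
transform(mach, Xfresh) # Xfresh restricted to the selected features
```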
Okay, let's say that we didn't know that our synthetic dataset depends on only five
columns of our feature table. We could apply cross-validation,
`StratifiedCV(nfolds=5)`, with our recursive feature elimination model to select the
optimal value of `n_features` for our model. In this case we will use a simple grid
search with root mean squared error as the measure.
```julia
rfe = RecursiveFeatureElimination(model=forest)
tuning_rfe_model = TunedModel(
    model=rfe,
    measure=rms,
    tuning=Grid(rng=rng),
    resampling=StratifiedCV(nfolds=5),
    range=range(
        rfe, :n_features, values=1:10
    )
)
self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
fit!(self_tuning_rfe_mach)
```
As before, we can inspect the important features via the object returned by
`fitted_params`, or via `feature_importances`, as shown below.
```julia
fitted_params(self_tuning_rfe_mach).best_fitted_params.features_left == [:x1, :x2, :x3, :x4, :x5]
feature_importances(self_tuning_rfe_mach) # returns feature => importance pairs
```
We can also call `predict` on the tuned model machine, as shown below:
```julia
Xnew = MLJ.table(rand(rng, 50, 10)) # create test data
predict(self_tuning_rfe_mach, Xnew)
```
In this case, prediction is done using the best recursive feature elimination model
obtained from the tuning process above.

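To check which value of `n_features` the search settled on, you can inspect the tuned machine's fitted parameters; a minimal sketch (`best_model` is part of the standard `TunedModel` output):
```julia
best_rfe = fitted_params(self_tuning_rfe_mach).best_model
best_rfe.n_features # should recover 5 for this synthetic dataset, if tuning behaves as expected
```
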
For resampling methods different from cross-validation, and for other
`TunedModel` options, such as parallelization, see the
[Tuning Models](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/) section of the MLJ manual.

For more on MLJ in general, see the [MLJ Documentation](https://alan-turing-institute.github.io/MLJ.jl/dev/).