# FeatureSelection.jl

| Linux | Coverage | Code Style |
| :---- | :------- | :--------- |
| [Build Status](https://github.com/JuliaAI/FeatureSelection.jl/actions) | [Coverage](https://codecov.io/github/JuliaAI/FeatureSelection.jl?branch=dev) | [Code Style: Blue](https://github.com/invenia/BlueStyle) |

Repository housing feature selection algorithms for use with the machine learning toolbox
[MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/).

The `FeatureSelector` model builds on contributions originally residing in
[MLJModels.jl](https://github.com/JuliaAI/MLJModels.jl/blob/v0.16.15/src/builtins/Transformers.jl#L189-L266).

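As a quick illustration, here is a minimal sketch of `FeatureSelector`, which simply
restricts a table to a chosen subset of its columns (the column names below are
hypothetical):
```julia
using MLJ, FeatureSelection

X = (x1 = rand(5), x2 = rand(5), x3 = rand(5)) # a table with three columns

selector = FeatureSelector(features = [:x1, :x3]) # keep only :x1 and :x3
mach = machine(selector, X)
fit!(mach)
transform(mach, X) # a table with columns :x1 and :x3 only
```
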
# Installation
In a running Julia session (version 1.6 or later), run
```julia
import Pkg
Pkg.add("FeatureSelection")
```

# Example Usage
Let's build a supervised recursive feature eliminator with `RandomForestRegressor`
from DecisionTree.jl as our base model.
But first we need a dataset to train on. We shall create a synthetic dataset, popularly
known in the R community as the Friedman #1 dataset. Notice that the target vector for
this dataset depends on only the first five columns of the feature table, so we expect
recursive feature elimination to return those first five columns as the important
features.
```julia
using MLJ, FeatureSelection
using StableRNGs
rng = StableRNG(10)
A = rand(rng, 50, 10)
X = MLJ.table(A) # features
y = @views(
    10 .* sin.(
        pi .* A[:, 1] .* A[:, 2]
    ) .+ 20 .* (A[:, 3] .- 0.5).^2 .+ 10 .* A[:, 4] .+ 5 .* A[:, 5]
) # target
```
Now that we have our data, we can create our recursive feature elimination model and
train it on our dataset:
```julia
RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
forest = RandomForestRegressor(rng=rng)
rfe = RecursiveFeatureElimination(
    model = forest, n_features=5, step=1
) # see the docstring for a description of the defaults
mach = machine(rfe, X, y)
fit!(mach)
```
We can inspect the feature importances in two ways:
```julia
# A feature with a lower rank is more significant than one with a higher rank, while
# a feature with higher feature importance is more significant than one with lower
# feature importance.
report(mach).ranking # returns [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_importances(mach) # returns dict of feature => importance pairs
```
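For example, to list the features from most to least important, one can sort these
pairs (a small sketch; the `collect` call works whether the return value is a
dictionary or a vector of pairs):
```julia
imp = feature_importances(mach)
sort(collect(imp), by = last, rev = true) # most important features first
```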
We can view the important features used by our model by inspecting the `fitted_params`
object:
```julia
p = fitted_params(mach)
p.features_left == [:x1, :x2, :x3, :x4, :x5]
```
We can also call the `predict` method on the fitted machine, to predict using a
random forest regressor trained using only the important features, or call the
`transform` method, to select just those features from some new table including all the
original features; both calls are sketched below. For more information, type
`?RecursiveFeatureElimination` in the Julia REPL.
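
Here is a brief sketch of both calls, reusing `rng` from above (the column names are
those generated automatically by `MLJ.table`):
```julia
Xsmall = MLJ.table(rand(rng, 3, 10)) # some new data
predict(mach, Xsmall)   # predictions from the forest retrained on the selected features
transform(mach, Xsmall) # a table containing only the five selected features
```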

Okay, let's say that we didn't know that our synthetic dataset depends on only five
columns of our feature table. We could apply cross-validation, `CV(nfolds=5)`, with our
recursive feature elimination model to select the optimal value of `n_features` for our
model. In this case we will use a simple grid search with root mean squared error
(`rms`) as the measure.
```julia
rfe = RecursiveFeatureElimination(model = forest)
tuning_rfe_model = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng=rng),
    resampling = CV(nfolds = 5), # stratified resampling requires a classification target
    range = range(
        rfe, :n_features, values = 1:10
    )
)
self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
fit!(self_tuning_rfe_mach)
```
As before, we can inspect the important features by inspecting the object returned by
`fitted_params` or `feature_importances`, as shown below.
```julia
fitted_params(self_tuning_rfe_mach).best_fitted_params.features_left == [:x1, :x2, :x3, :x4, :x5]
feature_importances(self_tuning_rfe_mach) # returns dict of feature => importance pairs
```
and call `predict` on the tuned model machine as shown below:
```julia
Xnew = MLJ.table(rand(rng, 50, 10)) # create test data
predict(self_tuning_rfe_mach, Xnew)
```
In this case, prediction is done using the best recursive feature elimination model
obtained from the tuning process above.
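
To see which value of `n_features` the search settled on, one can inspect the best
model directly (a short sketch, using the standard `fitted_params` field of a
`TunedModel` machine):
```julia
best_rfe = fitted_params(self_tuning_rfe_mach).best_model
best_rfe.n_features # the optimal number of features found by the grid search
```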

For resampling methods different from cross-validation, and for other
`TunedModel` options, such as parallelization, see the
[Tuning Models](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/)
section of the MLJ manual. A multithreaded variant of the tuning above is sketched
below.
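
This sketch assumes Julia was started with multiple threads (e.g. `julia --threads=4`)
and that `CPUThreads` is available via MLJ's re-exports from ComputationalResources.jl:
```julia
tuning_rfe_model = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng = rng),
    resampling = CV(nfolds = 5),
    range = range(rfe, :n_features, values = 1:10),
    acceleration = CPUThreads() # distribute grid evaluations over threads
)
```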

For more on MLJ generally, see the
[MLJ Documentation](https://alan-turing-institute.github.io/MLJ.jl/dev/).