Commit b78ef9d

Merge pull request #15 from ngoix/general
general review
2 parents b542716 + d15a13d commit b78ef9d

6 files changed: 41 additions and 33 deletions

AUTHORS.rst

Lines changed: 1 addition & 1 deletion
@@ -7,8 +7,8 @@ People

 .. hlist::

-* `Ronan Gauthier <[email protected]>`_
 * `Florian Gardin <[email protected]>`_
+* `Ronan Gautier <[email protected]>`_
 * `Nicolas Goix <[email protected]>`_
 * `Bibi Ndiaye <[email protected]>`_
 * `Jean-Matthieu Schertzer <[email protected]>`_

doc/Makefile

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ clean:
 	-rm -rf modules/generated/*

 html:
-	# These two lines make the build a bit more lengthy, and the
+	# These two lines make the build a bit more lengthy, and
 	# the embedding of images more robust
 	rm -rf $(BUILDDIR)/html/_images
 	#rm -rf _build/doctrees/

doc/index.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ rules have to be used for classifying data.
 This project is particularly suitable for supervised anomaly detection,
 i.e. imbalanced classification.
 Application domains include fraud detection, predictive
-maintenance, intrusion detection, churn detection.
+maintenance, intrusion detection, churn detection...

 This project comes with a `skrules` module which contains a single
 estimator with unit tests, along with examples and benchmarks.

examples/plot_credit_default.py

Lines changed: 9 additions & 20 deletions
@@ -5,17 +5,14 @@


 SkopeRules finds logical rules with high precision and fuse them. Finding
-good rules is done by fitting classification or regression trees
+good rules is done by fitting classification and regression trees
 to sub-samples.
 A fitted tree defines a set of rules (each tree node defines a rule); rules
 are then tested out of the bag, and the ones with higher precision are kept.
-This set of rules is decision function, reflecting for
-each new samples how many rules have find it abnormal.

 This example aims at finding logical rules to predict credit defaults. The
 analysis shows that setting.

-The dataset comes from BLABLABLA.
 """

 ###############################################################################
@@ -54,8 +51,6 @@
 for col in ['ID']:
     del data[col]

-# data = pd.get_dummies(data, columns = ['SEX', 'EDUCATION', 'MARRIAGE'])
-
 # Quick feature engineering
 data = data.rename(columns={"PAY_0": "PAY_1"})
 old_PAY = ['PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
@@ -80,7 +75,7 @@

 # Creating the train/test split
 feature_names = list(data.columns)
-print(feature_names)
+print("List of variables used to train models : " + str(feature_names))
 data = data.values
 n_samples = data.shape[0]
 n_samples_train = int(n_samples / 2)
@@ -90,14 +85,11 @@
 X_test = data[n_samples_train:]

 ###############################################################################
-# Benchmark with a Decision Tree and Random Forests
+# Benchmark with a Random Forest classifier.
 # ..................
 #
-# This part shows the training and performance evaluation of
-# two tree-based models.
-# The objective remains to extract rules which targets credit defaults.
-# This benchmark shows the performance reached with a decision tree and a
-# random forest.
+# This part shows the training and performance evaluation of a random forest
+# model. The objective remains to extract rules which targets credit defaults.

 RF = GridSearchCV(
     RandomForestClassifier(
@@ -106,24 +98,22 @@
         class_weight='balanced'),
     param_grid={
         'max_depth': range(3, 8, 1),
-        'max_features': np.linspace(0.1, 0.2, 1.)
+        'max_features': np.linspace(0.1, 1., 5)
     },
     scoring={'AUC': 'roc_auc'}, cv=5,
     refit='AUC', n_jobs=-1)

 RF.fit(X_train, y_train)
 scoring_RF = RF.predict_proba(X_test)[:, 1]

-# print("Decision Tree selected parameters : "+str(DT.best_params_))
-print("Random Forest selected parameters : "+str(RF.best_params_))
+print("Random Forest selected parameters : " + str(RF.best_params_))

 # Plot ROC and PR curves

 fig, axes = plt.subplots(1, 2, figsize=(12, 5),
                          sharex=True, sharey=True)

 ax = axes[0]
-# fpr_DT, tpr_DT, _ = roc_curve(y_test, scoring_DT)
 fpr_RF, tpr_RF, _ = roc_curve(y_test, scoring_RF)
 ax.step(fpr_RF, tpr_RF, linestyle='-.', c='g', lw=1, where='post')
 ax.set_title("ROC", fontsize=20)
@@ -132,7 +122,6 @@
 ax.set_ylabel('True Positive Rate (Recall)', fontsize=18)

 ax = axes[1]
-# precision_DT, recall_DT, _ = precision_recall_curve(y_test, scoring_DT)
 precision_RF, recall_RF, _ = precision_recall_curve(y_test, scoring_RF)
 ax.step(recall_RF, precision_RF, linestyle='-.', c='g', lw=1, where='post')
 ax.set_title("Precision-Recall", fontsize=20)
@@ -209,8 +198,8 @@

 ###############################################################################
 # The ROC and Precision-Recall curves show the performance of the rules
-# generated by SkopeRulesthe (blue points) and the performance of the Random
-# Forest classifier fitted above.
+# generated by SkopeRulesthe (the blue points) and the performance of the
+# Random Forest classifier fitted above.
 # Each blue point represents the performance of a set of rules: The kth point
 # represents the score associated to the concatenation (union) of the k first
 # rules, etc. Thus, each blue point is associated with an interpretable
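
Note on the `max_features` grid changed in this file: the third positional argument of `np.linspace` is the number of samples, so the old call `np.linspace(0.1, 0.2, 1.)` passed a float where a count is expected, while the new call asks for five evenly spaced fractions between 0.1 and 1.0. A minimal sketch of what the new grid contains, assuming only NumPy is installed (the variable names are illustrative and do not come from the example script):

```python
import numpy as np

# New grid from the commit: 5 evenly spaced fractions of the features.
max_features_grid = np.linspace(0.1, 1.0, 5)
print(max_features_grid)

# Same max_depth range as in the param_grid of the example.
max_depth_grid = range(3, 8, 1)

# Number of (max_depth, max_features) combinations the grid search explores.
n_candidates = len(max_depth_grid) * len(max_features_grid)
print(n_candidates)
```

With `cv=5`, each of these candidates is fitted once per fold, so the grid size directly drives the run time of the benchmark.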

skrules/datasets/credit_data.py

Lines changed: 18 additions & 0 deletions
@@ -1,3 +1,21 @@
+"""default of credit card clients dataset.
+
+The original database is available from UCI Machine Learning Repository:
+
+https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
+
+The data contains 30000 observations on 24 variables.
+
+References
+----------
+
+Lichman, M. (2013). UCI Machine Learning Repository
+[http://archive.ics.uci.edu/ml].
+Irvine, CA: University of California, School of Information and Computer
+Science.
+
+"""
+
 import pandas as pd
 import numpy as np
 from sklearn.datasets.base import get_data_home, Bunch

skrules/skope_rules.py

Lines changed: 11 additions & 10 deletions
@@ -22,15 +22,15 @@ class SkopeRules(BaseEstimator):
     Parameters
     ----------

-    feature_names: list of str, optional
+    feature_names : list of str, optional
         The names of each feature to be used for returning rules in string
         format.

-    precision_min: float, optional (default=0.5)
-        minimal precision of a rule to be selected.
+    precision_min : float, optional (default=0.5)
+        The minimal precision of a rule to be selected.

-    recall_min: float, optional (default=0.01)
-        minimal recall of a rule to be selected.
+    recall_min : float, optional (default=0.01)
+        The minimal recall of a rule to be selected.

     n_estimators : int, optional (default=10)
         The number of base estimators (rules) to use for prediction. More are
@@ -50,7 +50,8 @@ class SkopeRules(BaseEstimator):
         all samples will be used for all trees (no sampling).

     max_samples_features : int or float, optional (default=1.0)
-        The number of features to draw from X to train each decision tree.
+        The number of features to draw from X to train each decision tree, from
+        which rules are generated and selected.
             - If int, then draw `max_features` features.
             - If float, then draw `max_features * X.shape[1]` features.

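
Note on the int/float convention documented for `max_samples_features` in the hunk above: it can be sketched as follows. `resolve_n_features` is a hypothetical helper written for illustration, not a function from `skrules`, and truncating the fractional count with `int()` is an assumption about rounding that the docstring does not settle.

```python
def resolve_n_features(max_samples_features, n_total_features):
    """Hypothetical helper: turn an int-or-float setting into a feature count."""
    if isinstance(max_samples_features, int):
        # An int is taken as an absolute number of features.
        return max_samples_features
    # A float is taken as a fraction of the available features
    # (truncation is an assumption made for this sketch).
    return int(max_samples_features * n_total_features)


# With 23 features: an absolute count, a fraction, and the default of 1.0.
print(resolve_n_features(5, 23))
print(resolve_n_features(0.5, 23))
print(resolve_n_features(1.0, 23))
```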
@@ -95,9 +96,9 @@ class SkopeRules(BaseEstimator):
         If -1, then the number of jobs is set to the number of cores.

     random_state : int, RandomState instance or None, optional
-        If int, random_state is the seed used by the random number generator;
-        If RandomState instance, random_state is the random number generator;
-        If None, the random number generator is the RandomState instance used
+        - If int, random_state is the seed used by the random number generator.
+        - If RandomState instance, random_state is the random number generator.
+        - If None, the random number generator is the RandomState instance used
         by `np.random`.

     verbose : int, optional (default=0)
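
Note on the three `random_state` cases listed in the hunk above: they can be sketched with NumPy alone. `resolve_random_state` is a hypothetical helper written for illustration; whether `SkopeRules` resolves the parameter this way internally is not shown in this diff.

```python
import numbers

import numpy as np


def resolve_random_state(random_state=None):
    """Hypothetical helper mirroring the three documented cases."""
    if random_state is None:
        # The global RandomState instance used by `np.random`.
        return np.random.mtrand._rand
    if isinstance(random_state, numbers.Integral):
        # An int is used as the seed of a fresh generator.
        return np.random.RandomState(random_state)
    if isinstance(random_state, np.random.RandomState):
        # An existing generator is used as-is.
        return random_state
    raise ValueError("random_state must be None, an int or a RandomState")


# Two generators built from the same seed produce the same draws.
rng_a = resolve_random_state(42)
rng_b = resolve_random_state(42)
same_draws = np.array_equal(rng_a.randint(0, 100, size=5),
                            rng_b.randint(0, 100, size=5))
print(same_draws)
```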
@@ -442,7 +443,7 @@ def decision_function(self, X):

         scores = np.zeros(X.shape[0])
         for (r, w) in selected_rules:
-            scores[list(df.query(r).index)] += 1  # w[0]
+            scores[list(df.query(r).index)] += w[0]

         return scores
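
Note on the `decision_function` change above: each matching rule now adds its weight `w[0]` to a sample's score instead of a flat 1. The sketch below contrasts the two behaviours on toy data. The column names, rule strings and weights are made up for illustration, and the reading of `w[0]` as a per-rule quality measure such as precision is an assumption that this diff does not confirm.

```python
import numpy as np
import pandas as pd

# Toy data with a default 0..n-1 index, so index labels equal row positions.
df = pd.DataFrame({"PAY_1": [2, 0, 3, 0], "LIMIT_BAL": [10, 50, 20, 5]})

# Each entry pairs a pandas query string with a weight tuple; only w[0] is
# used. The weights are hypothetical values chosen for this sketch.
selected_rules = [
    ("PAY_1 > 1", (0.8,)),
    ("LIMIT_BAL <= 20", (0.6,)),
]

unweighted = np.zeros(df.shape[0])
weighted = np.zeros(df.shape[0])
for r, w in selected_rules:
    matched = list(df.query(r).index)
    unweighted[matched] += 1      # before the commit: count matching rules
    weighted[matched] += w[0]     # after the commit: sum the rule weights

print(unweighted)
print(weighted)
```

Under the weighted version, a sample matched by one strong rule can outrank a sample matched by one weak rule, which a plain count cannot express.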