Commit 199d062

added interest measures descriptions
1 parent 5cacca0 commit 199d062

2 files changed

+211
-11
lines changed

README.md

Lines changed: 17 additions & 11 deletions
@@ -92,7 +92,9 @@ data = Dataset('datasets/Abalone.csv')
9292
print(data)
9393
```
9494

95-
### Data Squashing
95+
### Preprocessing
96+
97+
#### Data Squashing
9698

9799
Optionally, a preprocessing technique, called data squashing [5], can be applied. This will significantly reduce the number of transactions, while providing similar results to the original dataset.
98100

@@ -104,7 +106,9 @@ squashed = squash(dataset, threshold=0.9, similarity='euclidean')
104106
print(squashed)
105107
```
106108

107-
### Mining association rules the easy way (recommended)
109+
### Mining association rules
110+
111+
#### The easy way (recommended)
108112

109113
Association rule mining can be easily performed using the `get_rules` function:
110114

@@ -124,7 +128,7 @@ print(f'Run Time: {run_time}')
124128
rules.to_csv('output.csv')
125129
```
126130

127-
### Mining association rules the hard way
131+
#### The hard way
128132

129133
The above example can also be implemented using a lower-level interface,
130134
with the `NiaARM` class directly:
@@ -137,7 +141,7 @@ from niapy.task import Task, OptimizationType
137141

138142
data = Dataset("datasets/Abalone.csv")
139143

140-
# Create a problem:::
144+
# Create a problem
141145
# dimension represents the dimension of the problem;
142146
# features represent the list of features, while transactions depicts the list of transactions
143147
# metrics is a sequence of metrics to be taken into account when computing the fitness;
@@ -162,6 +166,12 @@ problem.rules.sort()
162166
problem.rules.to_csv('output.csv')
163167
```
164168

169+
#### Interest measures
170+
171+
The framework implements several popular interest measures, which can be used to compute the fitness function value of rules
172+
and for assessing the quality of the mined rules. A full list of the implemented interest measures along with their descriptions
173+
and equations can be found [here](interest_measures.md).
174+
165175
### Visualization
166176

167177
The framework currently supports the hill slopes visualization method presented in [4]. More visualization methods are planned
@@ -207,13 +217,9 @@ algorithm = ParticleSwarmOptimization(population_size=200, seed=123)
207217
metrics = ('support', 'confidence', 'aws')
208218
rules, time = get_text_rules(corpus, max_terms=5, algorithm=algorithm, metrics=metrics, max_evals=10000, logging=True)
209219

210-
if len(rules):
211-
print(rules)
212-
print(f'Run time: {time:.2f}s')
213-
rules.to_csv('output.csv')
214-
else:
215-
print('No rules generated')
216-
print(f'Run time: {time:.2f}s')
220+
print(rules)
221+
print(f'Run time: {time:.2f}s')
222+
rules.to_csv('output.csv')
217223
```
218224

219225
**Note:** You may need to download stopwords and the punkt tokenizer from nltk by running `import nltk; nltk.download('stopwords'); nltk.download('punkt')`.

interest_measures.md

Lines changed: 194 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,194 @@
1+
# Interest Measures
2+
3+
## Support
4+
5+
Support is defined on an itemset as the proportion of transactions that contain the attribute $`X`$.
6+
7+
```math
8+
supp(X) = \frac{n_{X}}{|D|},
9+
```
10+
11+
where $`n_{X}`$ is the number of transactions that contain $`X`$ and $`|D|`$ is the number of records in the transactional database.
12+
13+
For an association rule, support is defined as the support of all the attributes in the rule.
14+
15+
```math
16+
supp(X \implies Y) = \frac{n_{XY}}{|D|}
17+
```
18+
19+
**Range:** $`[0, 1]`$
20+
21+
**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
22+
2015, URL: https://mhahsler.github.io/arules/docs/measures
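
As a minimal sketch of the two support definitions above, the following plain-Python snippet counts matching transactions in a small made-up database. The `support` and `rule_support` helpers and the toy data are illustrative assumptions, not part of the framework's API.

```python
# Toy database: each transaction is the set of items it contains (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"butter"},
]


def support(itemset, transactions):
    """Return n_X / |D|: the share of transactions containing every item in itemset."""
    itemset = set(itemset)
    n_x = sum(1 for t in transactions if itemset <= t)
    return n_x / len(transactions)


def rule_support(antecedent, consequent, transactions):
    """Return n_XY / |D| for the rule antecedent => consequent."""
    return support(set(antecedent) | set(consequent), transactions)


supp_bread = support({"bread"}, transactions)                # 3 of 5 transactions
supp_rule = rule_support({"bread"}, {"milk"}, transactions)  # 2 of 5 transactions
print(supp_bread, supp_rule)
```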
23+
24+
## Confidence
25+
26+
Confidence of a rule is defined as the proportion of transactions that contain
27+
the consequent in the set of transactions that contain the antecedent. This proportion is an estimate
28+
of the probability of seeing the consequent, if the antecedent is present in the transaction.
29+
30+
```math
31+
conf(X \implies Y) = \frac{supp(X \implies Y)}{supp(X)}
32+
```
33+
34+
**Range:** $`[0, 1]`$
35+
36+
**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
37+
2015, URL: https://mhahsler.github.io/arules/docs/measures
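
The ratio above can be checked by hand on a small example. The sketch below is an illustration only: the `confidence` helper and the toy transactions are assumptions, and it works with raw counts, since $`|D|`$ cancels out of the ratio.

```python
from fractions import Fraction

# Made-up transactions, each given as the set of items it contains.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"butter"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """supp(X => Y) / supp(X), computed as n_XY / n_X."""
    n_x = count(antecedent, transactions)
    if n_x == 0:
        return None  # undefined when the antecedent never occurs
    n_xy = count(set(antecedent) | set(consequent), transactions)
    return Fraction(n_xy, n_x)


conf = confidence({"bread"}, {"milk"}, transactions)
print(conf)  # bread occurs 3 times, bread and milk together 2 times
```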
38+
39+
## Lift
40+
41+
Lift measures how many times more often the antecedent $`X`$ and the consequent $`Y`$
42+
occur together than expected if they were statistically independent.
43+
44+
```math
45+
lift(X \implies Y) = \frac{conf(X \implies Y)}{supp(Y)}
46+
```
47+
48+
**Range:** $`[0, \infty]`$ (1 means independence)
49+
50+
**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
51+
2015, URL: https://mhahsler.github.io/arules/docs/measures
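
Since $`conf(X \implies Y) = supp(X \implies Y) / supp(X)`$, lift can equivalently be computed from three supports. The following sketch assumes the supports are already known; the `lift` helper and the input values are hypothetical and only illustrate the arithmetic.

```python
from fractions import Fraction


def lift(supp_xy, supp_x, supp_y):
    """conf(X => Y) / supp(Y), rewritten as supp(X => Y) / (supp(X) * supp(Y))."""
    if supp_x == 0 or supp_y == 0:
        return None  # undefined when either side never occurs
    return supp_xy / (supp_x * supp_y)


# Assumed supports: supp(X) = 3/5, supp(Y) = 3/5, supp(X => Y) = 2/5.
value = lift(Fraction(2, 5), Fraction(3, 5), Fraction(3, 5))
print(value)  # a value above 1 indicates X and Y co-occur more often than under independence

# Under independence, supp(X => Y) = supp(X) * supp(Y), so lift is exactly 1.
independent = lift(Fraction(1, 4), Fraction(1, 2), Fraction(1, 2))
print(independent)
```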
52+
53+
## Coverage
54+
55+
Coverage, also known as antecedent support, is an estimate of the probability that
56+
the rule applies to a randomly selected transaction. It is the proportion of transactions
57+
that contain the antecedent.
58+
59+
```math
60+
cover(X \implies Y) = supp(X)
61+
```
62+
63+
**Range:** $`[0, 1]`$
64+
65+
**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
66+
2015, URL: https://mhahsler.github.io/arules/docs/measures
67+
68+
## RHS Support
69+
70+
Support of the consequent.
71+
72+
```math
73+
RHSsupp(X \implies Y) = supp(Y)
74+
```
75+
76+
**Range:** $`[0, 1]`$
77+
78+
**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
79+
2015, URL: https://mhahsler.github.io/arules/docs/measures
80+
81+
## Conviction
82+
83+
Conviction can be interpreted as the ratio of the expected frequency that the antecedent occurs without
84+
the consequent, assuming independence, to the observed frequency of incorrect predictions.
85+
86+
```math
87+
conv(X \implies Y) = \frac{1 - supp(Y)}{1 - conf(X \implies Y)}
88+
```
89+
90+
**Range:** $`[0, \infty]`$ (1 means independence, $`\infty`$ means the rule always holds)
91+
92+
**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
93+
2015, URL: https://mhahsler.github.io/arules/docs/measures
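
The formula divides by zero when the confidence is 1, which is the case the range note maps to $`\infty`$. A minimal sketch of that handling follows; the `conviction` helper is an illustrative assumption and not the framework's implementation.

```python
import math
from fractions import Fraction


def conviction(supp_y, conf):
    """(1 - supp(Y)) / (1 - conf(X => Y)), with infinity when the rule always holds."""
    if conf == 1:
        return math.inf
    return (1 - supp_y) / (1 - conf)


# Assumed inputs: supp(Y) = 3/5 and conf(X => Y) = 2/3.
value = conviction(Fraction(3, 5), Fraction(2, 3))
print(value)

# When conf(X => Y) equals supp(Y), X and Y are independent and conviction is 1.
independent = conviction(Fraction(1, 2), Fraction(1, 2))
always_holds = conviction(Fraction(3, 5), Fraction(1))
print(independent, always_holds)
```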
94+
95+
## Inclusion
96+
97+
Inclusion is defined as the ratio between the number of attributes of the rule
98+
and all attributes in the database.
99+
100+
```math
101+
inclusion(X \implies Y) = \frac{|X \cup Y|}{m},
102+
```
103+
104+
where $`m`$ is the total number of attributes in the transactional database.
105+
106+
107+
**Range:** $`[0, 1]`$
108+
109+
**Reference:** I. Fister Jr., V. Podgorelec, I. Fister. Improved Nature-Inspired Algorithms for Numeric Association
110+
Rule Mining. In: Vasant P., Zelinka I., Weber GW. (eds) Intelligent Computing and Optimization. ICO 2020. Advances in
111+
Intelligent Systems and Computing, vol 1324. Springer, Cham.
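
For illustration, inclusion only needs the attributes used by the rule and the attribute count of the dataset. In the sketch below, the `inclusion` helper, the attribute names, and the total of 8 attributes are all hypothetical.

```python
def inclusion(antecedent, consequent, num_attributes):
    """|X u Y| / m: the share of the dataset's attributes that appear in the rule."""
    used = set(antecedent) | set(consequent)
    return len(used) / num_attributes


# Hypothetical rule over attributes a, b => c in a dataset with m = 8 attributes.
value = inclusion({"a", "b"}, {"c"}, num_attributes=8)
print(value)
```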
112+
113+
## Amplitude
114+
115+
Amplitude measures the quality of a rule, preferring attributes with smaller intervals.
116+
117+
```math
118+
ampl(X \implies Y) = 1 - \frac{1}{n}\sum_{k = 1}^{n}{\frac{Ub_k - Lb_k}{max(o_k) - min(o_k)}},
119+
```
120+
121+
where $`n`$ is the total number of attributes in the rule, $`Ub_k`$ and $`Lb_k`$ are upper and lower
122+
bounds of the selected attribute, and $`max(o_k)`$ and $`min(o_k)`$ are the maximum and minimum
123+
feasible values of the attribute $`o_k`$ in the transactional database.
124+
125+
**Range:** $`[0, 1]`$
126+
127+
**Reference:** I. Fister Jr., I. Fister. A brief overview of swarm intelligence-based algorithms for numerical
128+
association rule mining. arXiv preprint arXiv:2010.15524 (2020).
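
A small worked example may help with the averaging in the amplitude formula. The sketch below assumes every attribute is numeric with a non-degenerate feasible range ($`max(o_k) > min(o_k)`$); the `amplitude` helper and the interval values are illustrative assumptions.

```python
def amplitude(intervals):
    """1 - mean((Ub_k - Lb_k) / (max(o_k) - min(o_k))) over the rule's attributes.

    intervals holds one (lower, upper, feasible_min, feasible_max) tuple per attribute.
    """
    ratios = [(ub - lb) / (hi - lo) for lb, ub, lo, hi in intervals]
    return 1 - sum(ratios) / len(ratios)


# Two hypothetical attributes: the first interval covers 20% of its feasible
# range, the second covers 25%, so the mean relative width is 0.225.
rule_intervals = [
    (0.2, 0.4, 0.0, 1.0),
    (10.0, 20.0, 0.0, 40.0),
]
value = amplitude(rule_intervals)
print(value)

# An interval spanning the whole feasible range gives the lowest amplitude.
widest = amplitude([(0.0, 1.0, 0.0, 1.0)])
print(widest)
```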
129+
130+
## Interestingness
131+
132+
Interestingness of the rule, defined as:
133+
134+
```math
135+
interest(X \implies Y) = \frac{supp(X \implies Y)}{supp(X)} \cdot \frac{supp(X \implies Y)}{supp(Y)}
136+
\cdot (1 - \frac{supp(X \implies Y)}{|D|})
137+
```
138+
139+
Here, the first part gives us the probability of generating the rule based on the antecedent, the second part
140+
gives us the probability of generating the rule based on the consequent and the third part is the probability
141+
that the rule won't be generated. Thus, rules with very high support will be deemed uninteresting.
142+
143+
**Range:** $`[0, 1]`$
144+
145+
**Reference:** I. Fister Jr., I. Fister. A brief overview of swarm intelligence-based algorithms for numerical
146+
association rule mining. arXiv preprint arXiv:2010.15524 (2020).
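
The three factors can be multiplied out directly. The following sketch follows the formula exactly as written above, including the division of the rule's support by $`|D|`$ in the third factor; the `interestingness` helper and its inputs are assumptions for illustration, and the framework's own implementation may differ.

```python
from fractions import Fraction


def interestingness(supp_xy, supp_x, supp_y, num_transactions):
    """Product of the three factors in the formula above."""
    if supp_x == 0 or supp_y == 0:
        return None  # the first two factors are undefined
    from_antecedent = supp_xy / supp_x
    from_consequent = supp_xy / supp_y
    not_generated = 1 - supp_xy / num_transactions
    return from_antecedent * from_consequent * not_generated


# Assumed inputs: supp(X => Y) = 2/5, supp(X) = supp(Y) = 3/5, |D| = 5.
value = interestingness(Fraction(2, 5), Fraction(3, 5), Fraction(3, 5), 5)
print(value)  # (2/3) * (2/3) * (23/25)
```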
147+
148+
## Comprehensibility
149+
150+
Comprehensibility of the rule. Rules with fewer attributes in the consequent are more
151+
comprehensible.
152+
153+
```math
154+
comp(X \implies Y) = \frac{log(1 + |Y|)}{log(1 + |X \cup Y|)}
155+
```
156+
157+
**Range:** $`[0, 1]`$
158+
159+
**Reference:** I. Fister Jr., I. Fister. A brief overview of swarm intelligence-based algorithms for numerical
160+
association rule mining. arXiv preprint arXiv:2010.15524 (2020).
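
Comprehensibility depends only on attribute counts, so it can be evaluated without any data. The sketch below is a hedged illustration; the `comprehensibility` helper and the attribute names are hypothetical.

```python
import math


def comprehensibility(antecedent, consequent):
    """log(1 + |Y|) / log(1 + |X u Y|) for a rule X => Y with at least one attribute."""
    n_y = len(set(consequent))
    n_xy = len(set(antecedent) | set(consequent))
    return math.log(1 + n_y) / math.log(1 + n_xy)


# Hypothetical rule a, b => c: |Y| = 1 and |X u Y| = 3, giving log(2) / log(4).
value = comprehensibility({"a", "b"}, {"c"})
print(value)
```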
161+
162+
## Netconf
163+
164+
The netconf metric evaluates the interestingness of
165+
association rules depending on the support of the rule and the
166+
support of the antecedent and consequent of the rule.
167+
168+
```math
169+
netconf(X \implies Y) = \frac{supp(X \implies Y) - supp(X)supp(Y)}{supp(X)(1 - supp(X))}
170+
```
171+
172+
**Range:** $`[-1, 1]`$ (Negative values represent negative dependence, positive values represent positive
173+
dependence and 0 represents independence)
174+
175+
**Reference:** E. V. Altay and B. Alatas, "Sensitivity Analysis of MODENAR Method for Mining of Numeric Association
176+
Rules," 2019 1st International Informatics and Software Engineering Conference (UBMYK), 2019, pp. 1-6,
177+
doi: 10.1109/UBMYK48245.2019.8965539.
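
As a sketch of the netconf formula above, the snippet below evaluates it from three supports. The `netconf` helper and the input values are assumptions; returning `None` when the denominator is zero is a choice made here, not something the source specifies.

```python
from fractions import Fraction


def netconf(supp_xy, supp_x, supp_y):
    """(supp(X => Y) - supp(X) supp(Y)) / (supp(X) (1 - supp(X)))."""
    denominator = supp_x * (1 - supp_x)
    if denominator == 0:
        return None  # X occurs in none or all of the transactions
    return (supp_xy - supp_x * supp_y) / denominator


# Assumed supports: supp(X => Y) = 2/5, supp(X) = supp(Y) = 3/5.
value = netconf(Fraction(2, 5), Fraction(3, 5), Fraction(3, 5))
print(value)  # positive, so X and Y are positively dependent

# Independent case: supp(X => Y) equals supp(X) * supp(Y), so netconf is 0.
independent = netconf(Fraction(1, 4), Fraction(1, 2), Fraction(1, 2))
print(independent)
```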
178+
179+
## Yule's Q
180+
181+
The Yule's Q metric represents the correlation between two possibly related dichotomous events.
182+
183+
```math
184+
yulesq(X \implies Y) =
185+
\frac{supp(X \implies Y)supp(\neg X \implies \neg Y) - supp(X \implies \neg Y)supp(\neg X \implies Y)}
186+
{supp(X \implies Y)supp(\neg X \implies \neg Y) + supp(X \implies \neg Y)supp(\neg X \implies Y)}
187+
```
188+
189+
**Range:** $`[-1, 1]`$ (-1 reflects total negative association, 1 reflects perfect positive association
190+
and 0 reflects independence)
191+
192+
**Reference:** E. V. Altay and B. Alatas, "Sensitivity Analysis of MODENAR Method for Mining of Numeric Association
193+
Rules," 2019 1st International Informatics and Software Engineering Conference (UBMYK), 2019, pp. 1-6,
194+
doi: 10.1109/UBMYK48245.2019.8965539.
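
Yule's Q is computed from the four cells of the 2x2 contingency table of $`X`$ and $`Y`$. The following sketch assumes those four supports are given; the `yules_q` helper, its argument names, and the example values are illustrative, and returning `None` for a zero denominator is a choice made here.

```python
from fractions import Fraction


def yules_q(supp_xy, supp_x_not_y, supp_not_x_y, supp_not_x_not_y):
    """(ad - bc) / (ad + bc) over the four contingency-table supports."""
    ad = supp_xy * supp_not_x_not_y
    bc = supp_x_not_y * supp_not_x_y
    if ad + bc == 0:
        return None  # undefined when both products vanish
    return (ad - bc) / (ad + bc)


# Assumed cells (they sum to 1): XY = 2/5, X without Y = 1/5,
# Y without X = 1/5, neither = 1/5.
value = yules_q(Fraction(2, 5), Fraction(1, 5), Fraction(1, 5), Fraction(1, 5))
print(value)

# If X never occurs without Y and vice versa, the association is perfect.
perfect = yules_q(Fraction(1, 2), Fraction(0), Fraction(0), Fraction(1, 2))
print(perfect)
```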
