Commit fc7ae85

Merge pull request #101 from zStupan/main
Updated docs and examples
2 parents 5cacca0 + f714e70 commit fc7ae85

File tree: 8 files changed (+240, -22 lines)

README.md

Lines changed: 17 additions & 11 deletions
@@ -92,7 +92,9 @@ data = Dataset('datasets/Abalone.csv')
 print(data)
 ```

-### Data Squashing
+### Preprocessing
+
+#### Data Squashing

 Optionally, a preprocessing technique, called data squashing [5], can be applied. This will significantly reduce the number of transactions, while providing similar results to the original dataset.

@@ -104,7 +106,9 @@ squashed = squash(dataset, threshold=0.9, similarity='euclidean')
 print(squashed)
 ```

-### Mining association rules the easy way (recommended)
+### Mining association rules
+
+#### The easy way (recommended)

 Association rule mining can be easily performed using the `get_rules` function:

@@ -124,7 +128,7 @@ print(f'Run Time: {run_time}')
 rules.to_csv('output.csv')
 ```

-### Mining association rules the hard way
+#### The hard way

 The above example can also be implemented using a more low-level interface,
 with the `NiaARM` class directly:
@@ -137,7 +141,7 @@ from niapy.task import Task, OptimizationType

 data = Dataset("datasets/Abalone.csv")

-# Create a problem:::
+# Create a problem
 # dimension represents the dimension of the problem;
 # features represent the list of features, while transactions depicts the list of transactions
 # metrics is a sequence of metrics to be taken into account when computing the fitness;
@@ -162,6 +166,12 @@ problem.rules.sort()
 problem.rules.to_csv('output.csv')
 ```

+#### Interest measures
+
+The framework implements several popular interest measures, which can be used to compute the fitness function value of rules
+and for assessing the quality of the mined rules. A full list of the implemented interest measures along with their descriptions
+and equations can be found [here](interest_measures.md).
+
 ### Visualization

 The framework currently supports the hill slopes visualization method presented in [4]. More visualization methods are planned
@@ -207,13 +217,9 @@ algorithm = ParticleSwarmOptimization(population_size=200, seed=123)
 metrics = ('support', 'confidence', 'aws')
 rules, time = get_text_rules(corpus, max_terms=5, algorithm=algorithm, metrics=metrics, max_evals=10000, logging=True)

-if len(rules):
-    print(rules)
-    print(f'Run time: {time:.2f}s')
-    rules.to_csv('output.csv')
-else:
-    print('No rules generated')
-    print(f'Run time: {time:.2f}s')
+print(rules)
+print(f'Run time: {time:.2f}s')
+rules.to_csv('output.csv')
 ```

 **Note:** You may need to download stopwords and the punkt tokenizer from nltk by running `import nltk; nltk.download('stopwords'); nltk.download('punkt')`.

examples/basic_run.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 # load and preprocess the dataset from csv
 data = Dataset("datasets/Abalone.csv")

-# Create a problem:::
+# Create a problem
 # dimension represents the dimension of the problem;
 # features represent the list of features, while transactions depicts the list of transactions
 # the following 4 elements represent weights (support, confidence, coverage, shrinkage)

examples/basic_run_with_get_rules.py

Lines changed: 6 additions & 1 deletion
@@ -1,13 +1,18 @@
 from niaarm import Dataset, get_rules
 from niapy.algorithms.basic import DifferentialEvolution

-
+# load dataset
 data = Dataset("datasets/Abalone.csv")
+
+# initialize the algorithm
 algo = DifferentialEvolution(
     population_size=50, differential_weight=0.5, crossover_probability=0.9
 )
+
+# define metrics to be used in fitness computation
 metrics = ("support", "confidence")

+# mine association rules
 res = get_rules(data, algo, metrics, max_iters=30, logging=True)
 # or rules, run_time = get_rules(...)

examples/data_squashing.py

Lines changed: 5 additions & 1 deletion
@@ -1,7 +1,11 @@
 from niaarm.dataset import Dataset
 from niaarm.preprocessing import squash

-
+# load dataset
 dataset = Dataset("datasets/Abalone.csv")
+
+# squash the dataset with a threshold of 0.9, using Euclidean distance as a similarity measure
 squashed = squash(dataset, threshold=0.9, similarity="euclidean")
+
+# print the squashed dataset
 print(squashed)

examples/text_mining.py

Lines changed: 10 additions & 8 deletions
@@ -3,9 +3,11 @@
 from niaarm.mine import get_text_rules
 from niapy.algorithms.basic import ParticleSwarmOptimization

+# load corpus and extract the documents as a list of strings
 df = pd.read_json("datasets/text/artm_test_dataset.json", orient="records")
 documents = df["text"].tolist()

+# create a Corpus object from the documents (requires nltk's punkt tokenizer and the stopwords list)
 try:
     corpus = Corpus.from_list(documents)
 except LookupError:
@@ -15,21 +17,21 @@
     nltk.download("stopwords")
     corpus = Corpus.from_list(documents)

+# the rest is pretty much the same as with the numerical association rules
+# 1. Init algorithm
+# 2. Define metrics
+# 3. Run algorithm
 algorithm = ParticleSwarmOptimization(population_size=200, seed=123)
 metrics = ("support", "confidence", "aws")
 rules, time = get_text_rules(
     corpus,
-    max_terms=5,
+    max_terms=8,
     algorithm=algorithm,
     metrics=metrics,
     max_evals=10000,
     logging=True,
 )

-if len(rules):
-    print(rules)
-    print(f"Run time: {time:.2f}s")
-    rules.to_csv("output.csv")
-else:
-    print("No rules generated")
-    print(f"Run time: {time:.2f}s")
+print(rules)
+print(f"Run time: {time:.2f}s")
+rules.to_csv("output.csv")

examples/visualization.py

Lines changed: 3 additions & 0 deletions
@@ -2,11 +2,14 @@
 from niaarm import Dataset, get_rules
 from niaarm.visualize import hill_slopes

+# Load dataset and mine rules
 dataset = Dataset("datasets/Abalone.csv")
 metrics = ("support", "confidence")
 rules, _ = get_rules(
     dataset, "DifferentialEvolution", metrics, max_evals=1000, seed=1234
 )
+
+# Visualize any rule using the hill_slope function like so:
 some_rule = rules[150]
 print(some_rule)
 fig, ax = hill_slopes(some_rule, dataset.transactions)

interest_measures.md

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
# Interest Measures

## Support

Support is defined on an itemset as the proportion of transactions that contain the attribute $`X`$.

```math
supp(X) = \frac{n_{X}}{|D|},
```

where $`|D|`$ is the number of records in the transactional database.

For an association rule, support is defined as the support of all the attributes in the rule.

```math
supp(X \implies Y) = \frac{n_{XY}}{|D|}
```

**Range:** $`[0, 1]`$

**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
2015, URL: https://mhahsler.github.io/arules/docs/measures

## Confidence

Confidence of the rule is defined as the proportion of transactions that contain
the consequent in the set of transactions that contain the antecedent. This proportion is an estimate
of the probability of seeing the consequent, if the antecedent is present in the transaction.

```math
conf(X \implies Y) = \frac{supp(X \implies Y)}{supp(X)}
```

**Range:** $`[0, 1]`$

**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
2015, URL: https://mhahsler.github.io/arules/docs/measures

## Lift

Lift measures how many times more often the antecedent and the consequent
occur together than expected if they were statistically independent.

```math
lift(X \implies Y) = \frac{conf(X \implies Y)}{supp(Y)}
```

**Range:** $`[0, \infty]`$ (1 means independence)

**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
2015, URL: https://mhahsler.github.io/arules/docs/measures

## Coverage

Coverage, also known as antecedent support, is an estimate of the probability that
the rule applies to a randomly selected transaction. It is the proportion of transactions
that contain the antecedent.

```math
cover(X \implies Y) = supp(X)
```

**Range:** $`[0, 1]`$

**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
2015, URL: https://mhahsler.github.io/arules/docs/measures

## RHS Support

Support of the consequent.

```math
RHSsupp(X \implies Y) = supp(Y)
```

**Range:** $`[0, 1]`$

**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
2015, URL: https://mhahsler.github.io/arules/docs/measures

## Conviction

Conviction can be interpreted as the ratio of the expected frequency that the antecedent occurs without
the consequent (if they were independent) to the observed frequency of incorrect predictions.

```math
conv(X \implies Y) = \frac{1 - supp(Y)}{1 - conf(X \implies Y)}
```

**Range:** $`[0, \infty]`$ (1 means independence, $`\infty`$ means the rule always holds)

**Reference:** Michael Hahsler, A Probabilistic Comparison of Commonly Used Interest Measures for Association Rules,
2015, URL: https://mhahsler.github.io/arules/docs/measures

## Inclusion

Inclusion is defined as the ratio between the number of attributes of the rule
and all attributes in the database.

```math
inclusion(X \implies Y) = \frac{|X \cup Y|}{m},
```

where $`m`$ is the total number of attributes in the transactional database.

**Range:** $`[0, 1]`$

**Reference:** I. Fister Jr., V. Podgorelec, I. Fister. Improved Nature-Inspired Algorithms for Numeric Association
Rule Mining. In: Vasant P., Zelinka I., Weber GW. (eds) Intelligent Computing and Optimization. ICO 2020. Advances in
Intelligent Systems and Computing, vol 1324. Springer, Cham.

## Amplitude

Amplitude measures the quality of a rule, preferring attributes with smaller intervals.

```math
ampl(X \implies Y) = 1 - \frac{1}{n}\sum_{k = 1}^{n}{\frac{Ub_k - Lb_k}{max(o_k) - min(o_k)}},
```

where $`n`$ is the total number of attributes in the rule, $`Ub_k`$ and $`Lb_k`$ are the upper and lower
bounds of the selected attribute, and $`max(o_k)`$ and $`min(o_k)`$ are the maximum and minimum
feasible values of the attribute $`o_k`$ in the transactional database.

**Range:** $`[0, 1]`$

**Reference:** I. Fister Jr., I. Fister. A brief overview of swarm intelligence-based algorithms for numerical
association rule mining. arXiv preprint arXiv:2010.15524 (2020).

## Interestingness

Interestingness of the rule, defined as:

```math
interest(X \implies Y) = \frac{supp(X \implies Y)}{supp(X)} \cdot \frac{supp(X \implies Y)}{supp(Y)}
\cdot (1 - \frac{supp(X \implies Y)}{|D|})
```

Here, the first part gives us the probability of generating the rule based on the antecedent, the second part
gives us the probability of generating the rule based on the consequent, and the third part is the probability
that the rule won't be generated. Thus, rules with very high support will be deemed uninteresting.

**Range:** $`[0, 1]`$

**Reference:** I. Fister Jr., I. Fister. A brief overview of swarm intelligence-based algorithms for numerical
association rule mining. arXiv preprint arXiv:2010.15524 (2020).

## Comprehensibility

Comprehensibility of the rule. Rules with fewer attributes in the consequent are more
comprehensible.

```math
comp(X \implies Y) = \frac{log(1 + |Y|)}{log(1 + |X \cup Y|)}
```

**Range:** $`[0, 1]`$

**Reference:** I. Fister Jr., I. Fister. A brief overview of swarm intelligence-based algorithms for numerical
association rule mining. arXiv preprint arXiv:2010.15524 (2020).

## Netconf

The netconf metric evaluates the interestingness of
association rules depending on the support of the rule and the
support of the antecedent and consequent of the rule.

```math
netconf(X \implies Y) = \frac{supp(X \implies Y) - supp(X)supp(Y)}{supp(X)(1 - supp(X))}
```

**Range:** $`[-1, 1]`$ (negative values represent negative dependence, positive values represent positive
dependence and 0 represents independence)

**Reference:** E. V. Altay and B. Alatas, "Sensitivity Analysis of MODENAR Method for Mining of Numeric Association
Rules," 2019 1st International Informatics and Software Engineering Conference (UBMYK), 2019, pp. 1-6,
doi: 10.1109/UBMYK48245.2019.8965539.

## Yule's Q

The Yule's Q metric represents the correlation between two possibly related dichotomous events.

```math
yulesq(X \implies Y) =
\frac{supp(X \implies Y)supp(\neg X \implies \neg Y) - supp(X \implies \neg Y)supp(\neg X \implies Y)}
{supp(X \implies Y)supp(\neg X \implies \neg Y) + supp(X \implies \neg Y)supp(\neg X \implies Y)}
```

**Range:** $`[-1, 1]`$ (-1 reflects total negative association, 1 reflects perfect positive association
and 0 reflects independence)

**Reference:** E. V. Altay and B. Alatas, "Sensitivity Analysis of MODENAR Method for Mining of Numeric Association
Rules," 2019 1st International Informatics and Software Engineering Conference (UBMYK), 2019, pp. 1-6,
doi: 10.1109/UBMYK48245.2019.8965539.
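The basic measures defined in this new file can be illustrated with a short, self-contained sketch. This is a toy illustration of support, confidence, and lift over set-valued transactions, assuming a made-up market-basket database; it is not NiaARM's implementation, which works on numeric and categorical attributes.

```python
# Toy transactional database: each transaction is a set of items.
# Illustrative only -- NiaARM computes these measures internally.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, db):
    # proportion of transactions that contain every item in `itemset`
    return sum(itemset <= t for t in db) / len(db)


def confidence(antecedent, consequent, db):
    # supp(X => Y) / supp(X)
    return support(antecedent | consequent, db) / support(antecedent, db)


def lift(antecedent, consequent, db):
    # conf(X => Y) / supp(Y); 1 would mean X and Y are independent
    return confidence(antecedent, consequent, db) / support(consequent, db)


X, Y = {"bread"}, {"milk"}
print(support(X | Y, transactions))    # 2/4 = 0.5
print(confidence(X, Y, transactions))  # 0.5 / 0.75 = 0.666...
print(lift(X, Y, transactions))        # 0.666... / 0.75 = 0.888...
```

Here lift is below 1, so bread and milk co-occur slightly less often than independence would predict.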

niaarm/rule_list.py

Lines changed: 4 additions & 0 deletions
@@ -83,6 +83,10 @@ def to_csv(self, filename):
             filename (str): File to save the rules to.

         """
+        if not self:
+            print("No rules to output")
+            return
+
         with open(filename, "w", newline="") as f:
             writer = csv.writer(f)
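The guard added above works because an empty sequence is falsy in Python, which lets the examples drop their own `if len(rules):` checks. A minimal, self-contained sketch of the same pattern follows; `MiniRuleList` is a hypothetical stand-in for NiaARM's `RuleList` (which subclasses a list type and writes to a filename rather than a file object), with an assumed two-column row format.

```python
import csv
import io


class MiniRuleList(list):
    """Hypothetical stand-in for a rule list; not NiaARM's RuleList."""

    def to_csv(self, f):
        # same guard as in the commit: an empty list is falsy,
        # so nothing is written and the output stays untouched
        if not self:
            print("No rules to output")
            return
        writer = csv.writer(f)
        writer.writerow(["antecedent", "consequent"])
        writer.writerows(self)


empty_buf = io.StringIO()
MiniRuleList().to_csv(empty_buf)  # prints "No rules to output", writes nothing

buf = io.StringIO()
MiniRuleList([("bread", "milk")]).to_csv(buf)  # writes header plus one row
print(buf.getvalue())
```

Centralizing the emptiness check in `to_csv` keeps every caller simple and avoids writing a header-only CSV when mining produced no rules.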
