Commit d6e746e

Merge pull request #62 from zStupan/text-mining
Experimental support for text mining
2 parents efccf11 + f96c58c commit d6e746e

13 files changed, 827 additions and 158 deletions

README.md

Lines changed: 37 additions & 1 deletion
@@ -28,7 +28,8 @@ The current version includes (but is not limited to) the following functions:
 - searching for association rules,
 - providing output of mined association rules,
 - generating statistics about mined association rules,
-- visualization of association rules.
+- visualization of association rules,
+- association rule text mining (experimental).
 
 ## Installation
 
@@ -159,6 +160,37 @@ plt.show()
 </p>
 
 
+### Text Mining (Experimental)
+
+An experimental implementation of association rule text mining using nature-inspired algorithms, based on ideas from [5]
+is also provided. The `niaarm.text` module contains the `Corpus` and `Document` classes for loading and preprocessing corpora,
+a `TextRule` class, representing a text rule, and the `NiaARTM` class, implementing association rule text mining
+as a continuous optimization problem. The `get_text_rules` function, equivalent to `get_rules`, but for text mining, was also
+added to the `niaarm.mine` module.
+
+```python
+import pandas as pd
+from niaarm.text import Corpus
+from niaarm.mine import get_text_rules
+from niapy.algorithms.basic import ParticleSwarmOptimization
+
+df = pd.read_json('datasets/text/artm_test_dataset.json', orient='records')
+documents = df['text'].tolist()
+corpus = Corpus.from_list(documents)
+
+algorithm = ParticleSwarmOptimization(population_size=200, seed=123)
+metrics = ('support', 'confidence', 'aws')
+rules, time = get_text_rules(corpus, max_terms=5, algorithm=algorithm, metrics=metrics, max_evals=10000, logging=True)
+
+if len(rules):
+    print(rules)
+    print(f'Run time: {time:.2f}s')
+    rules.to_csv('output.csv')
+else:
+    print('No rules generated')
+    print(f'Run time: {time:.2f}s')
+```
+
 For a full list of examples see the [examples folder](https://github.com/firefly-cpp/NiaARM/tree/main/examples)
 in the GitHub repository.
 
@@ -218,6 +250,10 @@ Ideas are based on the following research papers:
 In: Analide, C., Novais, P., Camacho, D., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2020.
 IDEAL 2020. Lecture Notes in Computer Science(), vol 12489. Springer, Cham. https://doi.org/10.1007/978-3-030-62362-3_10
 
+[5] I. Fister, S. Deb, I. Fister, „Population-based metaheuristics for Association Rule Text Mining“,
+In: Proceedings of the 2020 4th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence,
+New York, NY, USA, mar. 2020, pp. 19–23. doi: 10.1145/3396474.3396493.
+
 ## License
 
 This package is distributed under the MIT License. This license can be found online at <http://www.opensource.org/licenses/MIT>.

docs/api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,4 +9,5 @@ API Reference
     niaarm
     rule
     rule_list
+    text
     visualize

docs/api/text.rst

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+Text
+====
+
+.. automodule:: niaarm.text
+    :members:
+    :show-inheritance:

docs/getting_started.rst

Lines changed: 62 additions & 0 deletions
@@ -217,6 +217,68 @@ presented in `this paper <https://link.springer.com/chapter/10.1007/978-3-030-62
 
 .. image:: _static/hill_slopes.png
 
+Text Mining (Experimental)
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An experimental implementation of association rule text mining using nature-inspired algorithms
+is also provided. The :mod:`niaarm.text` module contains the :class:`~niaarm.text.Corpus` and :class:`~niaarm.text.Document` classes for loading and preprocessing corpora,
+a :class:`~niaarm.text.TextRule` class, representing a text rule, and the :class:`~niaarm.text.NiaARTM` class, implementing association rule text mining
+as a continuous optimization problem. The :func:`~niaarm.mine.get_text_rules` function, equivalent to :func:`~niaarm.mine.get_rules`, but for text mining, was also
+added to the :mod:`niaarm.mine` module.
+
+.. code:: python
+
+    import pandas as pd
+    from niaarm.text import Corpus
+    from niaarm.mine import get_text_rules
+    from niapy.algorithms.basic import ParticleSwarmOptimization
+
+    df = pd.read_json('datasets/text/artm_test_dataset.json', orient='records')
+    documents = df['text'].tolist()
+    corpus = Corpus.from_list(documents)
+
+    algorithm = ParticleSwarmOptimization(population_size=200, seed=123)
+    metrics = ('support', 'confidence', 'aws')
+    rules, time = get_text_rules(corpus, max_terms=5, algorithm=algorithm, metrics=metrics, max_evals=10000, logging=True)
+
+    if len(rules):
+        print(rules)
+        print(f'Run time: {time:.2f}s')
+        rules.to_csv('output.csv')
+    else:
+        print('No rules generated')
+        print(f'Run time: {time:.2f}s')
+
+**Output:**
+
+.. code:: text
+
+    Fitness: 0.53345778328699, Support: 0.1111111111111111, Confidence: 1.0, Aws: 0.48926223874985886
+    Fitness: 0.7155830770302328, Support: 0.1111111111111111, Confidence: 1.0, Aws: 1.0356381199795872
+    Fitness: 0.7279963436805833, Support: 0.1111111111111111, Confidence: 1.0, Aws: 1.072877919930639
+    Fitness: 0.7875917299029188, Support: 0.1111111111111111, Confidence: 1.0, Aws: 1.251664078597645
+    Fitness: 0.8071206688346807, Support: 0.1111111111111111, Confidence: 1.0, Aws: 1.310250895392931
+    STATS:
+    Total rules: 52
+    Average fitness: 0.5179965084882088
+    Average support: 0.11538461538461527
+    Average confidence: 0.7115384615384616
+    Average lift: 5.524038461538462
+    Average coverage: 0.17948717948717943
+    Average consequent support: 0.1517094017094015
+    Average conviction: 1568561408678185.8
+    Average amplitude: nan
+    Average inclusion: 0.007735042735042727
+    Average interestingness: 0.6170069642291859
+    Average comprehensibility: 0.6763685578758655
+    Average netconf: 0.6675824175824177
+    Average Yule's Q: 0.9670329670329672
+    Average antecedent length: 1.6346153846153846
+    Average consequent length: 1.8461538461538463
+
+    Run time: 13.37s
+    Rules exported to output.csv
+
 Interest Measures
 -----------------
 
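The logged lines in the getting-started output above are consistent with the fitness being the weighted average of the selected metrics, with every weight equal to 1 when `metrics` is passed as a sequence. The sketch below checks that reading against the first logged line. The helper `weighted_fitness` is hypothetical and not part of niaarm; the metric values are copied from the output shown in the diff.

```python
def weighted_fitness(values, weights=None):
    """Weighted average of metric values (hypothetical helper, not niaarm API)."""
    if weights is None:
        # A plain sequence of metric names implies a weight of 1 for each metric.
        weights = {name: 1.0 for name in values}
    total = sum(weights[name] * values[name] for name in values)
    return total / sum(weights[name] for name in values)


# Metric values copied from the first logged line of the output above.
first = {'support': 0.1111111111111111, 'confidence': 1.0, 'aws': 0.48926223874985886}
fitness = weighted_fitness(first)
print(fitness)  # approximately 0.5334577832869, matching the logged fitness
```

Passing a dict such as `{'support': 2.0, 'confidence': 1.0, 'aws': 1.0}` as `weights` would shift the average towards support, which mirrors the dict form of `metrics` described in the `get_text_rules` docstring.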

docs/index.rst

Lines changed: 2 additions & 1 deletion
@@ -24,7 +24,8 @@ The current version includes (but is not limited to) the following functions:
 - searching for association rules,
 - providing output of mined association rules,
 - generating statistics about mined association rules,
-- visualization of association rules.
+- visualization of association rules,
+- association rule text mining (experimental).
 
 Documentation
 =============

docs/refs.bib

Lines changed: 60 additions & 21 deletions
@@ -1,27 +1,66 @@
-@inproceedings{fister2018differential,
-title={Differential evolution for association rule mining using categorical and numerical attributes},
-author={Fister Jr., Iztok and Iglesias, Andres and Galvez, Akemi and Ser, Javier Del and Osaba, Eneko and Fister, Iztok},
-booktitle={International conference on intelligent data engineering and automated learning},
-pages={79--88},
-year={2018},
-organization={Springer}
+@inproceedings{fister_differential_2018,
+address = {Cham},
+title = {Differential {Evolution} for {Association} {Rule} {Mining} {Using} {Categorical} and {Numerical} {Attributes}},
+isbn = {9783030034931},
+doi = {10.1007/978-3-030-03493-1_9},
+language = {en},
+booktitle = {Intelligent {Data} {Engineering} and {Automated} {Learning} – {IDEAL} 2018},
+publisher = {Springer International Publishing},
+author = {Fister, Iztok and Iglesias, Andres and Galvez, Akemi and Del Ser, Javier and Osaba, Eneko and Fister, Iztok},
+editor = {Yin, Hujun and Camacho, David and Novais, Paulo and Tallón-Ballesteros, Antonio J.},
+year = {2018},
+pages = {79--88},
 }
 
-@inproceedings{fister2020improved,
-title={Improved nature-inspired algorithms for numeric association rule mining},
-author={Fister Jr, Iztok and Podgorelec, Vili and Fister, Iztok},
-booktitle={International Conference on Intelligent Computing \& Optimization},
-pages={187--195},
-year={2020},
-organization={Springer}
+@inproceedings{fister_jr_improved_2021,
+address = {Cham},
+title = {Improved {Nature}-{Inspired} {Algorithms} for {Numeric} {Association} {Rule} {Mining}},
+isbn = {9783030681548},
+doi = {10.1007/978-3-030-68154-8_19},
+language = {en},
+booktitle = {Intelligent {Computing} and {Optimization}},
+publisher = {Springer International Publishing},
+author = {Fister Jr., Iztok and Podgorelec, Vili and Fister, Iztok},
+editor = {Vasant, Pandian and Zelinka, Ivan and Weber, Gerhard-Wilhelm},
+year = {2021},
+pages = {187--195},
 }
 
+@article{fister_jr_brief_2020,
+title = {A brief overview of swarm intelligence-based algorithms for numerical association rule mining},
+doi = {10.48550/ARXIV.2010.15524},
+abstract = {Numerical Association Rule Mining is a popular variant of Association Rule Mining, where numerical attributes are handled without discretization. This means that the algorithms for dealing with this problem can operate directly, not only with categorical, but also with numerical attributes. Until recently, a big portion of these algorithms were based on a stochastic nature-inspired population-based paradigm. As a result, evolutionary and swarm intelligence-based algorithms showed big efficiency for dealing with the problem. In line with this, the main mission of this chapter is to make a historical overview of swarm intelligence-based algorithms for Numerical Association Rule Mining, as well as to present the main features of these algorithms for the observed problem. A taxonomy of the algorithms was proposed on the basis of the applied features found in this overview. Challenges, waiting in the future, finish this paper.},
+journal = {arXiv:2010.15524 [cs]},
+author = {Fister Jr. , Iztok and Fister, Iztok},
+month = oct,
+year = {2020},
+}
+
+@inproceedings{fister_population-based_2020,
+address = {New York, NY, USA},
+series = {{ISMSI} '20},
+title = {Population-based metaheuristics for {Association} {Rule} {Text} {Mining}},
+isbn = {9781450377614},
+doi = {10.1145/3396474.3396493},
+booktitle = {Proceedings of the 2020 4th {International} {Conference} on {Intelligent} {Systems}, {Metaheuristics} \& {Swarm} {Intelligence}},
+publisher = {Association for Computing Machinery},
+author = {Fister, Iztok and Deb, Suash and Fister, Iztok},
+month = mar,
+year = {2020},
+keywords = {association rule text mining, particle swarm optimization, triathlon, natural language processing, optimization},
+pages = {19--23},
+}
 
-@article{fister2021brief,
-title={A Brief Overview of Swarm Intelligence-Based Algorithms for Numerical Association Rule Mining},
-author={Fister Jr, Iztok and Fister, Iztok},
-journal={Applied Optimization and Swarm Intelligence},
-pages={47--59},
-year={2021},
-publisher={Springer}
+@inproceedings{fister_visualization_2020,
+address = {Cham},
+title = {Visualization of {Numerical} {Association} {Rules} by {Hill} {Slopes}},
+isbn = {9783030623623},
+doi = {10.1007/978-3-030-62362-3_10},
+language = {en},
+booktitle = {Intelligent {Data} {Engineering} and {Automated} {Learning} – {IDEAL} 2020},
+publisher = {Springer International Publishing},
+author = {Fister, Iztok and Fister, Dušan and Iglesias, Andres and Galvez, Akemi and Osaba, Eneko and Del Ser, Javier and Fister, Iztok},
+editor = {Analide, Cesar and Novais, Paulo and Camacho, David and Yin, Hujun},
+year = {2020},
+pages = {101--111},
 }

examples/text_mining.py

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+import pandas as pd
+from niaarm.text import Corpus
+from niaarm.mine import get_text_rules
+from niapy.algorithms.basic import ParticleSwarmOptimization
+
+df = pd.read_json('datasets/text/artm_test_dataset.json', orient='records')
+documents = df['text'].tolist()
+corpus = Corpus.from_list(documents)
+
+algorithm = ParticleSwarmOptimization(population_size=200, seed=123)
+metrics = ('support', 'confidence', 'aws')
+rules, time = get_text_rules(corpus, max_terms=5, algorithm=algorithm, metrics=metrics, max_evals=10000, logging=True)
+
+if len(rules):
+    print(rules)
+    print(f'Run time: {time:.2f}s')
+    rules.to_csv('output.csv')
+else:
+    print('No rules generated')
+    print(f'Run time: {time:.2f}s')
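`get_text_rules` returns the `Result` named tuple defined in `niaarm/mine.py`, so the tuple unpacking used in the example can also be written with attribute access, which avoids naming a variable `time`. The sketch below uses a stand-in function, `fake_mine`, purely for illustration; it is hypothetical, performs no mining, and only mirrors the field names visible in the diff.

```python
import time
from collections import namedtuple

# Same field names as the Result class shown in niaarm/mine.py.
Result = namedtuple('Result', ('rules', 'run_time'))


def fake_mine():
    """Hypothetical stand-in for get_text_rules; returns placeholder rules."""
    start = time.perf_counter()
    rules = ['rule A', 'rule B']  # placeholder values, not real mined rules
    return Result(rules, time.perf_counter() - start)


result = fake_mine()
rules, run_time = result  # positional unpacking, as in the example script
print(len(result.rules), f'{result.run_time:.2f}s')  # access by field name
```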

niaarm/mine.py

Lines changed: 40 additions & 0 deletions
@@ -4,6 +4,7 @@
 from niaarm.niaarm import NiaARM
 from niapy.task import OptimizationType, Task
 from niapy.util.factory import get_algorithm
+from niaarm.text import NiaARTM
 
 
 class Result(namedtuple('Result', ('rules', 'run_time'))):
@@ -51,3 +52,42 @@ def get_rules(dataset, algorithm, metrics, max_evals=np.inf, max_iters=np.inf, l
     problem.rules.sort()
 
     return Result(problem.rules, stop_time - start_time)
+
+
+def get_text_rules(corpus, max_terms, algorithm, metrics, smooth=True, norm=2, max_evals=np.inf, max_iters=np.inf,
+                   logging=False, **kwargs):
+    """Mine association rules in a text corpus.
+
+    Args:
+        corpus (Corpus): Dataset to mine rules on.
+        max_terms (int): Maximum number of terms in association rule.
+        algorithm (Union[niapy.algorithms.Algorithm, str]): Algorithm to use.
+            Can be either an Algorithm object or the class name as a string.
+            In that case, algorithm parameters can be passed in as keyword arguments.
+        metrics (Union[Dict[str, float], Sequence[str]]): Metrics to take into account when computing the fitness.
+            Metrics can either be passed as a Dict of pairs {'metric_name': <weight of metric>} or
+            a sequence of metrics as strings, in which case, the weights of the metrics will be set to 1.
+        smooth (bool): Smooth idf to prevent division by 0 error. Default: ``True``.
+        norm (int): Order of norm for normalizing the tf-idf matrix. Default: 2.
+        max_evals (Optional[int]): Maximum number of fitness evaluations. Default: ``inf``. At least one of ``max_evals`` or
+            ``max_iters`` must be provided.
+        max_iters (Optional[int]): Maximum number of iterations. Default: ``inf``.
+        logging (bool): Enable logging of fitness improvements. Default: ``False``.
+
+    Returns:
+        Result: A named tuple containing the list of mined rules and the algorithm's run time in seconds.
+
+    """
+    problem = NiaARTM(max_terms, corpus.terms(), corpus.tf_idf_matrix(smooth=smooth, norm=norm), metrics, logging)
+    task = Task(problem, max_evals=max_evals, max_iters=max_iters, optimization_type=OptimizationType.MAXIMIZATION)
+
+    if isinstance(algorithm, str):
+        algorithm = get_algorithm(algorithm, **kwargs)
+
+    start_time = time.perf_counter()
+    algorithm.run(task)
+    stop_time = time.perf_counter()
+
+    problem.rules.sort()
+
+    return Result(problem.rules, stop_time - start_time)
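The `smooth` and `norm` arguments are forwarded to `corpus.tf_idf_matrix`. The sketch below shows one common formulation of those two options, assuming the smoothed idf `ln((1 + N) / (1 + df)) + 1` and row-wise p-norm scaling. Whether `Corpus.tf_idf_matrix` uses exactly these formulas is an assumption, and the function `tf_idf` is a hypothetical illustration, not niaarm code.

```python
import math


def tf_idf(documents, smooth=True, norm=2):
    """Hypothetical tf-idf sketch; documents is a list of token lists."""
    terms = sorted({t for doc in documents for t in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = {t: sum(t in doc for doc in documents) for t in terms}
    matrix = []
    for doc in documents:
        row = []
        for t in terms:
            tf = doc.count(t)
            if smooth:
                # Adding 1 to both counts keeps the denominator non-zero.
                idf = math.log((1 + n_docs) / (1 + df[t])) + 1
            else:
                idf = math.log(n_docs / df[t]) + 1
            row.append(tf * idf)
        # Scale each row to unit length under the chosen p-norm.
        length = sum(abs(v) ** norm for v in row) ** (1 / norm)
        matrix.append([v / length if length else 0.0 for v in row])
    return terms, matrix


docs = [['swim', 'bike', 'run'], ['run', 'run', 'rest'], ['bike', 'rest']]
terms, matrix = tf_idf(docs)
print(terms)
print([round(v, 3) for v in matrix[0]])
```

With `norm=2` every non-empty row has unit Euclidean length, so documents of different lengths become comparable before any rule is evaluated.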

niaarm/niaarm.py

Lines changed: 0 additions & 1 deletion
@@ -68,7 +68,6 @@ def __init__(self, dimension, features, transactions, metrics, logging=False):
         super().__init__(dimension, 0.0, 1.0)
 
     def build_rule(self, vector):
-        r"""Build association rule from the candidate solution."""
         rule = []
 
         permutation = vector[-self.num_features:]
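The context line `permutation = vector[-self.num_features:]` shows that the trailing part of the candidate solution is used to order the features. A minimal sketch of that kind of random-key decoding follows, assuming the features are ranked by descending key value and that one extra component chooses the antecedent/consequent cut point. Both details, and the function `decode`, are illustrative assumptions rather than the actual `build_rule` logic.

```python
def decode(vector, features):
    """Hypothetical random-key decoding of a vector with components in [0, 1]."""
    n = len(features)
    cut_value, keys = vector[0], vector[-n:]
    # Rank the feature indices by their key, largest key first.
    order = sorted(range(n), key=lambda i: keys[i], reverse=True)
    # Map cut_value to a cut position in 1..n-1 so both sides are non-empty.
    cut = 1 + min(int(cut_value * (n - 1)), n - 2)
    antecedent = [features[i] for i in order[:cut]]
    consequent = [features[i] for i in order[cut:]]
    return antecedent, consequent


features = ['swim', 'bike', 'run', 'rest']
antecedent, consequent = decode([0.5, 0.2, 0.9, 0.4, 0.7], features)
print(antecedent, '=>', consequent)
```

Encoding an ordering as real-valued keys is what lets a continuous optimizer such as particle swarm optimization search over discrete rule structures.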

niaarm/rule_list.py

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,6 @@
 from collections import UserList
 import csv
 import numpy as np
-from niaarm.rule import Rule
 
 
 class RuleList(UserList):
@@ -87,12 +86,14 @@ def to_csv(self, filename):
         with open(filename, 'w', newline='') as f:
             writer = csv.writer(f)
 
+            metrics = self.data[0].metrics
+
             # write header
-            writer.writerow(("antecedent", "consequent", "fitness") + Rule.metrics)
+            writer.writerow(("antecedent", "consequent", "fitness") + metrics)
 
             for rule in self.data:
                 writer.writerow(
-                    [rule.antecedent, rule.consequent, rule.fitness] + [getattr(rule, metric) for metric in Rule.metrics])
+                    [rule.antecedent, rule.consequent, rule.fitness] + [getattr(rule, metric) for metric in metrics])
         print(f"Rules exported to {filename}")
 
     def __str__(self):
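The `to_csv` change reads the metric names from the first rule in the list instead of from the `Rule` class, presumably so that a list of text rules exports its own set of metrics. A minimal sketch of that pattern follows. The `SimpleRule` class and `rules_to_csv` function are hypothetical stand-ins, and the early return for an empty list is an added assumption, since indexing `self.data[0]` raises `IndexError` when the list is empty.

```python
import csv
import io


class SimpleRule:
    """Hypothetical stand-in for a rule that names its own metrics."""
    metrics = ('support', 'confidence')

    def __init__(self, antecedent, consequent, fitness, support, confidence):
        self.antecedent, self.consequent, self.fitness = antecedent, consequent, fitness
        self.support, self.confidence = support, confidence


def rules_to_csv(rules, stream):
    """Write rules to a text stream, taking column names from the first rule."""
    if not rules:
        return  # nothing to write; rules[0] would raise IndexError
    metrics = rules[0].metrics
    writer = csv.writer(stream)
    writer.writerow(('antecedent', 'consequent', 'fitness') + metrics)
    for rule in rules:
        writer.writerow([rule.antecedent, rule.consequent, rule.fitness]
                        + [getattr(rule, metric) for metric in metrics])


buffer = io.StringIO()
rules_to_csv([SimpleRule(['run'], ['rest'], 0.8, 0.1, 1.0)], buffer)
print(buffer.getvalue())
```

The example script in this commit only calls `rules.to_csv` inside `if len(rules):`, so the empty-list case does not arise there; the guard matters for callers that export unconditionally.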
