Commit 2671fbb

Merge pull request #1 from CompNet/ECML
Ecml
2 parents bb44f4e + 94c2d47 commit 2671fbb

File tree

18 files changed

+1605
-417313
lines changed

README.md

Lines changed: 106 additions & 16 deletions
@@ -1,26 +1,116 @@
1-
# Pang
2-
Pattern Mining for the Classification of Public Procurement Fraud
3-
4-
Copyright 2022 Lucas Potin, Rosa Figueiredo, Christine Largeron, Vincent Labatut
1+
Pang
2+
=======
3+
*Pattern-Based Anomaly Detection in Graphs*
54

65
Pang is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see licence.txt
76

8-
* Contact: Lucas Potin [email protected]
7+
-----------------------------------------------------------------------
8+
9+
# Description
10+
Pang is an algorithm which represents and classifies a collection of graphs according to their frequent patterns (subgraphs).
11+
12+
13+
# Organization
14+
This repository is composed of the following elements:
15+
* `requirements.txt` : list of the Python packages used by `PANG.py`.
16+
* `PANG.py` : Python script to run the algorithm.
17+
* `ECML.py` : Python script to compute the results of the experiments of the ECML paper.
18+
* `ProcessingPattern.py` : Python script to compute the number of occurrences and the set of induced patterns.
19+
* `data` : folder with the input data files. There is one folder for each dataset, which are described in the [Datasets](#datasets) section.
20+
21+
22+
# Installation
23+
You first need to install `python` and the required packages:
24+
25+
1. Install the [`python` language](https://www.python.org)
26+
2. Download this project from GitHub and unzip.
27+
3. Execute `pip install -r requirements.txt` to install the required packages (see also the *Dependencies* Section).
28+
29+
SPMF, which is needed to run gSpan and cgSpan, is available [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=download.php).
30+
SPMF is available in two versions:
31+
* a jar file that can be run from the command line. Currently, this version can be used with gSpan, but not with cgSpan.
32+
* the source code. Installing this version is more complicated, but it makes it possible to use cgSpan. You can find the instructions [here](https://www.philippe-fournier-viger.com/spmf/how_to_install.php).
33+
34+
In order to use Pang, you need to unzip each dataset in its own folder in the `data` folder.
35+
36+
# Use
37+
We provide two scripts to use Pang:
38+
* `ECML.py` : a python script in order to compute the results of the ECML paper.
39+
* `PANG.py` : a python script in order to use Pang with your own data.
40+
41+
## To Replicate the Paper Experiments
42+
To replicate the experiments:
43+
1. Open the Python console.
44+
2. Run `ECML.py`.
45+
46+
The script computes the results of the experiments and saves those associated with Tables 2, 5 and 6 in the `results` folder.
47+
48+
49+
## To Apply PANG to Other Data
50+
If you want to use Pang with your own data, you need to create an `XXX` folder in the `data` folder and put your data in it. This folder must contain the following files:
51+
* `XXX_graph.txt` : a file containing the graphs.
52+
* `XXX_label.txt` : a file containing the labels of the graphs.
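
Before running anything, it can help to verify that the dataset folder is laid out as described above. The following is a minimal sketch, not part of the repository: the helper name `missing_input_files` and its `data_dir` parameter are hypothetical, and only the two file names come from this README.

```python
from pathlib import Path


def missing_input_files(name, data_dir="data"):
    """Return the expected input files that are absent for dataset `name`.

    Hypothetical helper: it only checks the two files named in this README,
    <data_dir>/<name>/<name>_graph.txt and <data_dir>/<name>/<name>_label.txt.
    """
    folder = Path(data_dir) / name
    expected = [folder / f"{name}_graph.txt", folder / f"{name}_label.txt"]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Lists whatever is missing for a dataset called "XXX" (empty list if complete).
    print(missing_input_files("XXX"))
```

An empty list means both input files are in place.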
53+
54+
Then you need to run a script to produce the data files that will be used by Pang:
55+
1. Open the Python console.
56+
2. Run the script `Patterns.sh` in order to create the files `XXX_patterns.txt`.
57+
3. Run `ProcessingPattern.py` with the option `-d XXX` in order to create the files `XXX_mono.txt` and `XXX_iso.txt`.
58+
4. Run `PANG.py` with the option `-d XXX` in order to run Pang on the data `XXX`.
59+
60+
For each value of the parameter `k`, Pang will create a file `KResults.txt` containing the results of the classification and a file `KPatterns.txt` containing the patterns.
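
The steps above could be chained from a single Python script. This is a sketch under assumptions: the `-d` option comes from this README, but the way `Patterns.sh` is invoked (through `bash`, without arguments) is a guess, and the function names are hypothetical. By default the script only prints the commands.

```python
import subprocess
import sys


def pang_commands(dataset):
    """Build the command lines for steps 2 to 4, in order."""
    return [
        ["bash", "Patterns.sh"],  # assumption: the script takes no arguments
        [sys.executable, "ProcessingPattern.py", "-d", dataset],
        [sys.executable, "PANG.py", "-d", dataset],
    ]


def run_pipeline(dataset, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in pang_commands(dataset):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("XXX")  # dry run: shows the commands without executing them
```

Calling `run_pipeline("XXX", dry_run=False)` from the repository root would execute the three steps in sequence.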
61+
62+
## Data Format
63+
We use the same format as SPMF for the graph input files. Each graph is defined as follows:
64+
65+
1. `t # N`: N is the graph id
66+
2. `v M L`: M is the node id, L the node label
67+
3. `e P Q L`: P is the source node id, Q the destination node id, L the edge label
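
To make the `t`, `v` and `e` lines described above concrete, here is a minimal reader for this format. It is an illustrative sketch, not the parser used by `PANG.py`: the function name `parse_graphs` and the dictionary layout are assumptions, and labels are kept as strings.

```python
def parse_graphs(lines):
    """Parse t/v/e lines into {graph id: {"nodes": {...}, "edges": [...]}}."""
    graphs = {}
    current = None
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "t":  # t # N  -> start graph N
            current = {"nodes": {}, "edges": []}
            graphs[int(parts[2])] = current
        elif parts[0] == "v" and current is not None:  # v M L -> node M with label L
            current["nodes"][int(parts[1])] = parts[2]
        elif parts[0] == "e" and current is not None:  # e P Q L -> edge P-Q with label L
            current["edges"].append((int(parts[1]), int(parts[2]), parts[3]))
    return graphs


# Two tiny graphs written in the format above (made-up labels).
example = """t # 0
v 0 10
v 1 11
e 0 1 20
t # 1
v 0 10
""".splitlines()

graphs = parse_graphs(example)
print(graphs[0]["edges"])
```

Graph 0 has two labelled nodes joined by one labelled edge, and graph 1 has a single isolated node.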
68+
69+
For the patterns output files, each pattern contains one more line than the graphs:
70+
71+
4. `x A B C`: A, B, C are the ids of the graphs containing the pattern
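
The `x` line is what links patterns back to graphs. As an illustration, the sketch below reads these lines and builds one binary presence vector per graph. This is one simple way to represent graphs by their patterns and is not necessarily the representation computed by `PANG.py`. The function names are hypothetical.

```python
def pattern_occurrences(lines):
    """Map each pattern id (from its `t # N` line) to the graph ids on its `x` line."""
    occurrences = {}
    current = None
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue
        if parts[0] == "t":
            current = int(parts[2])
            occurrences[current] = set()
        elif parts[0] == "x" and current is not None:
            occurrences[current].update(int(g) for g in parts[1:])
    return occurrences


def binary_vectors(occurrences, graph_ids):
    """One vector per graph, one entry per pattern (sorted by id): 1 if listed, else 0."""
    patterns = sorted(occurrences)
    return {g: [1 if g in occurrences[p] else 0 for p in patterns] for g in graph_ids}


# Two made-up patterns: pattern 0 occurs in graphs 0 and 2, pattern 1 in graphs 0, 1 and 2.
example = """t # 0
v 0 10
v 1 11
e 0 1 20
x 0 2
t # 1
v 0 10
x 0 1 2
""".splitlines()

occ = pattern_occurrences(example)
vectors = binary_vectors(occ, graph_ids=[0, 1, 2])
print(vectors)
```

Each vector can then be fed to any standard classifier as a feature row.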
972

10-
Description
73+
## Datasets
74+
The datasets used in the paper are available in the `data` folder. The following datasets are available:
75+
* `MUTAG` : MUTAG dataset, representing chemical compounds and their mutagenic properties [[D'91](#references)],
76+
* `NCI1` : NCI1 dataset, representing molecules and classified according to carcinogenicity [[W'06](#references)],
77+
* `PTC` : PTC dataset, representing molecules and classified according to carcinogenicity [[T'03](#references)],
78+
* `DD` : DD dataset, representing amino acids and their interactions [[D'03](#references)],
1179

12-
## Organization
13-
TBC
80+
Each of these datasets can be found [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php).
81+
* `FOPPA` : dataset extracted from FOPPA, a database of French public procurement notices [[P'22](#references)].
82+
# Dependencies
83+
Tested with `SPMF` version 2.54, and `python` version 3.6.13 with the following packages:
84+
* [`pandas`](https://pypi.org/project/pandas/): version 1.1.5
85+
* [`numpy`](https://pypi.org/project/numpy/): version 1.19.5
86+
* [`networkx`](https://pypi.org/project/networkx/): version 2.5.1
87+
* [`sklearn`](https://pypi.org/project/scikit-learn/): version 0.24.2
88+
* [`matplotlib`](https://pypi.org/project/matplotlib/): version 3.3.4
89+
* [`grakel`](https://pypi.org/project/grakel/): version 0.1.8
90+
* [`karateclub`](https://pypi.org/project/karateclub/): version 1.3.3
91+
* [`stellargraph`](https://pypi.org/project/stellargraph/): version 1.2.1
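
To compare a local environment against the versions listed above, one option is to parse the output of `pip freeze`. The sketch below is only a convenience check under assumptions: the PyPI distribution names are assumed to be the lowercase names used here, with `scikit-learn` standing for the package imported as `sklearn`, and the function names are hypothetical.

```python
import subprocess
import sys

# Versions listed in this README, keyed by assumed (lowercase) distribution name.
TESTED = {
    "pandas": "1.1.5",
    "numpy": "1.19.5",
    "networkx": "2.5.1",
    "scikit-learn": "0.24.2",  # imported as `sklearn`
    "matplotlib": "3.3.4",
    "grakel": "0.1.8",
    "karateclub": "1.3.3",
    "stellargraph": "1.2.1",
}


def parse_freeze(text):
    """Turn `pip freeze` output into {lowercased name: version}; other line forms are skipped."""
    versions = {}
    for line in text.splitlines():
        if "==" in line:
            name, version = line.strip().split("==", 1)
            versions[name.lower()] = version
    return versions


def version_mismatches(installed, tested=TESTED):
    """Return {name: (installed version or None, tested version)} for every difference."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in tested.items()
        if installed.get(name) != wanted
    }


def check_environment():
    """Run `pip freeze` for the current interpreter and report the differences."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return version_mismatches(parse_freeze(result.stdout))
```

A mismatch does not necessarily mean that Pang will fail, only that the environment differs from the tested one.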
1492

15-
## Installation
16-
TBC
1793

18-
## Use
19-
TBC
94+
The VF2 and ISMAGS algorithms are included in the [`NetworkX` library](https://networkx.org/).
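
As an illustration of the VF2 matcher in NetworkX, the sketch below checks whether a small pattern occurs in a host graph, first as an induced subgraph and then as a monomorphism (extra edges allowed in the host). It is a toy example on unlabelled graphs, not the code of `ProcessingPattern.py`. It uses only `GraphMatcher`, so ISMAGS is not shown.

```python
import networkx as nx
from networkx.algorithms import isomorphism

host = nx.Graph([(0, 1), (1, 2), (0, 2)])  # a triangle
pattern = nx.path_graph(3)                 # a path on three nodes

# Induced occurrence: some node-induced subgraph of `host` is isomorphic to `pattern`.
induced = isomorphism.GraphMatcher(host, pattern).subgraph_is_isomorphic()

# Monomorphism: the pattern's edges must exist in `host`, but `host` may have more.
mono = isomorphism.GraphMatcher(host, pattern).subgraph_is_monomorphic()

print(induced, mono)
```

The path is found as a monomorphism but not as an induced subgraph, because any three nodes of the triangle induce all three edges. For labelled graphs, `GraphMatcher` also accepts `node_match` and `edge_match` functions.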
2095

21-
## Dependencies
22-
TBC
96+
For the baselines:
97+
* The WL and WLOA algorithms are included in the Grakel library, documentation available [here](https://ysig.github.io/GraKeL/0.1a8/benchmarks.html)
98+
* Graph2Vec is included in the karateclub library, documentation available [here](https://karateclub.readthedocs.io/en/latest/)
99+
* DGCNN is included in the stellargraph library, documentation available [here](https://stellargraph.readthedocs.io/en/stable/).
100+
* We use the implementation of CORK from Marisa Thoma. This implementation is available in the `CORKcpp.zip` archive.
23101

24-
##References
25102

26-
[XX'19] X. Yyyyyy & A. Bbbbbb, My title, My journal, X(X)/XX-XX, 201X. doi: XXXXXXXXXX - ⟨hal-XXXXXXXX⟩
103+
# References
104+
* **[P'22]** L. Potin, V. Labatut, R. Figueiredo, C. Largeron, P.-H. Morand. *FOPPA: A database of French Open Public Procurement Award notices*, Technical Report, Avignon University, 2022. [⟨hal-03796734⟩](https://hal.archives-ouvertes.fr/hal-03796734)
105+
* **[D'91]** A.S. Debnath, R.L. Lopez, G. Debnath, A. Shusterman, C. Hansch. *Structure-activity
106+
relationship of mutagenic aromatic and heteroaromatic nitro compounds.
107+
Correlation with molecular orbital energies and hydrophobicity*, Journal of Medicinal
108+
Chemistry 34(2), 786–797, 1991.
109+
* **[W'06]** N. Wale, G. Karypis. *Comparison of descriptor spaces for chemical compound
110+
retrieval and classification*, 6th International Conference on Data Mining, pp.
111+
678–689, 2006.
112+
* **[T'03]** H. Toivonen, A. Srinivasan, R.D. King, S. Kramer, C. Helma. *Statistical evaluation
113+
of the predictive toxicology challenge 2000-2001*, Bioinformatics 19(10),
114+
1183–1193, 2003.
115+
* **[D'03]** P.D. Dobson, A.J. Doig. *Distinguishing enzyme structures from non-enzymes
116+
without alignments*, Journal of Molecular Biology 330(4), 771–783, 2003.

data/DD/DD.zip

34.5 MB
Binary file not shown.

data/FOPPA/FOPPA.zip

2.12 MB
Binary file not shown.
