|
1 | | -# Pang |
2 | | -Pattern Mining for the Classification of Public Procurement Fraud |
3 | | - |
4 | | -Copyright 2022 Lucas Potin, Rosa Figueiredo, Christine Largeron, Vincent Labatut |
| 1 | +Pang |
| 2 | +======= |
| 3 | +*Pattern-Based Anomaly Detection in Graphs* |
5 | 4 |
|
6 | 5 | Pang is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see licence.txt |
7 | 6 |
|
8 | | -* Contact: Lucas Potin [email protected] |
| 7 | +----------------------------------------------------------------------- |
| 8 | + |
| 9 | +# Description |
| 10 | +Pang is an algorithm which represents and classifies a collection of graphs according to their frequent patterns (subgraphs). |
| 11 | + |
| 12 | + |
| 13 | +# Organization |
| 14 | +This repository is composed of the following elements: |
| 15 | +* `requirements.txt` : list of the Python packages used by `PANG.py`.
| 16 | +* `PANG.py` : Python script that runs the algorithm.
| 17 | +* `EMCL.py` : Python script that computes the results of the experiments of the ECML paper.
| 18 | +* `ProcessingPattern.py` : Python script that computes the number of occurrences and the set of induced patterns.
| 19 | +* `data` : folder with the input data files. There is one folder per dataset; the datasets are described in the [Datasets](#datasets) section.
| 20 | + |
| 21 | + |
| 22 | +# Installation |
| 23 | +You first need to install `python` and the required packages: |
| 24 | + |
| 25 | +1. Install the [`python` language](https://www.python.org) |
| 26 | +2. Download this project from GitHub and unzip it.
| 27 | +3. Execute `pip install -r requirements.txt` to install the required packages (see also the *Dependencies* Section). |
| 28 | + |
| 29 | +SPMF, which is needed to run gSpan and cgSpan, is available [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=download.php).
| 30 | +SPMF is available in two versions:
| 31 | +* a jar file that can be run from the command line. Currently, this version can be used with gSpan, but not with cgSpan.
| 32 | +* the source code. Installing this version is more complicated, but it allows the use of cgSpan. You can find the instructions [here](https://www.philippe-fournier-viger.com/spmf/how_to_install.php).
| 33 | + |
| 34 | +In order to use Pang, you need to unzip each dataset in its own folder in the `data` folder. |
| 35 | + |
| 36 | +# Use |
| 37 | +We provide two scripts to use Pang: |
| 38 | +* `ECML.py` : a Python script that computes the results of the ECML paper.
| 39 | +* `PANG.py` : a Python script that runs Pang on your own data.
| 40 | + |
| 41 | +## To Replicate the Paper Experiments |
| 42 | +To replicate the experiments of the paper:
| 43 | +1. Open the Python console. |
| 44 | +2. Run `EMCL.py` |
| 45 | + |
| 46 | +The script computes the results of the experiments and saves those associated with Tables 2, 5 and 6 in the `results` folder.
| 47 | + |
| 48 | + |
| 49 | +## To Apply PANG to Other Data |
| 50 | +If you want to use Pang with your own data, you need to create an `XXX` folder in the `data` folder and put your data in it. This folder must contain the following files: |
| 51 | +* `XXX_graph.txt` : a file containing the graphs. |
| 52 | +* `XXX_label.txt` : a file containing the labels of the graphs. |
| 53 | + |
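Before running the pipeline, it can help to check that the dataset folder follows the layout described above. The following is a minimal sketch, not part of the Pang scripts: the helper name `missing_input_files` is hypothetical, and it only assumes the two file names listed above.

```python
from pathlib import Path


def missing_input_files(data_dir, name):
    """Return the required input files that are absent from data_dir/name.

    Hypothetical helper: it only checks the two files named in this README,
    `<name>_graph.txt` and `<name>_label.txt`.
    """
    folder = Path(data_dir) / name
    required = [f"{name}_graph.txt", f"{name}_label.txt"]
    return [f for f in required if not (folder / f).is_file()]


if __name__ == "__main__":
    # Report what is missing for a dataset called XXX in the data folder.
    missing = missing_input_files("data", "XXX")
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```

An empty list means both files are in place and the next steps can be run.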
| 54 | +Then you need to run a script to produce the data files that will be used by Pang: |
| 55 | +1. Open the Python console. |
| 56 | +2. Run the script `Patterns.sh` in order to create the file `XXX_patterns.txt`.
| 57 | +3. Run `ProcessingPattern.py` with the option `-d XXX` in order to create the files `XXX_mono.txt` and `XXX_iso.txt`.
| 58 | +4. Run `PANG.py` with the option `-d XXX` in order to run Pang on the data `XXX`. |
| 59 | + |
| 60 | +For each value of the parameter `k`, Pang will create a file `KResults.txt` containing the results of the classification and a file `KPatterns.txt` containing the patterns. |
| 61 | + |
| 62 | +## Data Format |
| 63 | +We use the same format as SPMF for the graph input files. Each graph is defined as follows: |
| 64 | + |
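| 65 | +1. `t # N`: declares a graph, where `N` is the graph id
| 66 | +2. `v M L`: declares a node, where `M` is the node id and `L` is the node label
| 67 | +3. `e P Q L`: declares an edge, where `P` is the source node id, `Q` is the destination node id and `L` is the edge label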
| 68 | + |
| 69 | +In the pattern output files, each pattern has one additional line:
| 70 | +
| 71 | +4. `x A B C`: lists the graphs containing the pattern, where `A`, `B` and `C` are graph ids
9 | 72 |
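The line types described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not the parser used by `PANG.py`: the function name `parse_spmf_graphs` and the dictionary layout are hypothetical, labels are kept as strings, and unknown line types are ignored.

```python
def parse_spmf_graphs(text):
    """Parse SPMF-style graph text into a list of dicts.

    Each dict has: 'id' (int), 'nodes' {node id: label},
    'edges' [(source id, destination id, label)] and 'support' [graph ids].
    'support' is filled only when an `x` line is present (pattern files).
    """
    graphs = []
    current = None
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # `t # N`: start a new graph with id N
            current = {"id": int(parts[2]), "nodes": {}, "edges": [], "support": []}
            graphs.append(current)
        elif current is None:
            raise ValueError("line found before the first 't' line: " + raw)
        elif kind == "v":
            # `v M L`: node M with label L
            current["nodes"][int(parts[1])] = parts[2]
        elif kind == "e":
            # `e P Q L`: edge from P to Q with label L
            current["edges"].append((int(parts[1]), int(parts[2]), parts[3]))
        elif kind == "x":
            # `x A B C`: ids of the graphs containing this pattern
            current["support"] = [int(g) for g in parts[1:]]
    return graphs


# Illustrative input (made-up values): one pattern found in graphs 0 and 2.
sample = """t # 0
v 0 10
v 1 11
e 0 1 20
x 0 2
"""

if __name__ == "__main__":
    for graph in parse_spmf_graphs(sample):
        print(graph)
```

The same function reads both graph files and pattern files, since the `x` line is simply optional.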
|
10 | | -Description |
| 73 | +## Datasets |
| 74 | +The following datasets, used in the paper, are available in the `data` folder:
| 75 | +* `MUTAG` : MUTAG dataset, representing chemical compounds and their mutagenic properties [[D'91](#references)], |
| 76 | +* `NCI1` : NCI1 dataset, representing molecules and classified according to carcinogenicity [[W'06](#references)], |
| 77 | +* `PTC` : PTC dataset, representing molecules and classified according to carcinogenicity [[T'03](#references)], |
| 78 | +* `DD` : DD dataset, representing amino acids and their interactions [[D'03](#references)], |
11 | 79 |
|
12 | | -## Organization |
13 | | -TBC |
| 80 | +Each of these datasets can be found [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php). |
| 81 | +* `FOPPA` : dataset extracted from FOPPA, a database of French public procurement notices [[P'22](#references)]. |
| 82 | +# Dependencies |
| 83 | +Tested with `SPMF` version 2.54, and `python` version 3.6.13 with the following packages: |
| 84 | +* [`pandas`](https://pypi.org/project/pandas/): version 1.1.5 |
| 85 | +* [`numpy`](https://pypi.org/project/numpy/): version 1.19.5 |
| 86 | +* [`networkx`](https://pypi.org/project/networkx/): version 2.5.1
| 87 | +* [`sklearn`](https://pypi.org/project/scikit-learn/): version 0.24.2
| 88 | +* [`matplotlib`](https://pypi.org/project/matplotlib/): version 3.3.4
| 89 | +* [`grakel`](https://pypi.org/project/grakel/): version 0.1.8
| 90 | +* [`karateclub`](https://pypi.org/project/karateclub/): version 1.3.3
| 91 | +* [`stellargraph`](https://pypi.org/project/stellargraph/): version 1.2.1
14 | 92 |
|
15 | | -## Installation |
16 | | -TBC |
17 | 93 |
|
18 | | -## Use |
19 | | -TBC |
| 94 | +The VF2 and ISMAGS algorithms are included in the [`NetworkX` library](https://networkx.org/).
20 | 95 |
|
21 | | -## Dependencies |
22 | | -TBC |
| 96 | +For the baselines: |
| 97 | +* The WL and WLOA algorithms are included in the Grakel library, documentation available [here](https://ysig.github.io/GraKeL/0.1a8/benchmarks.html) |
| 98 | +* Graph2Vec is included in the karateclub library, documentation available [here](https://karateclub.readthedocs.io/en/latest/) |
| 99 | +* DGCNN is included in the stellargraph library, documentation available [here](https://stellargraph.readthedocs.io/en/stable/). |
| 100 | +* We use the implementation of CORK from Marisa Thoma. This implementation is available in the `CORKcpp.zip` archive. |
23 | 101 |
|
24 | | -##References |
25 | 102 |
|
26 | | -[XX'19] X. Yyyyyy & A. Bbbbbb, My title, My journal, X(X)/XX-XX, 201X. doi: XXXXXXXXXX - ⟨hal-XXXXXXXX⟩ |
| 103 | +# References |
| 104 | +* **[P'22]** L. Potin, V. Labatut, R. Figueiredo, C. Largeron, P.-H. Morand. *FOPPA: A database of French Open Public Procurement Award notices*, Technical Report, Avignon University, 2022. [⟨hal-03796734⟩](https://hal.archives-ouvertes.fr/hal-03796734) |
| 105 | +* **[D'91]** A.S. Debnath, R.L. Lopez, G. Debnath, A. Shusterman, C. Hansch. *Structure-activity
| 106 | +relationship of mutagenic aromatic and heteroaromatic nitro compounds.
| 107 | +Correlation with molecular orbital energies and hydrophobicity*, Journal of Medicinal
| 108 | +Chemistry 34(2), 786–797, 1991.
| 109 | +* **[W'06]** N. Wale, G. Karypis. *Comparison of descriptor spaces for chemical compound
| 110 | +retrieval and classification*, 6th International Conference on Data Mining, pp.
| 111 | +678–689, 2006.
| 112 | +* **[T'03]** H. Toivonen, A. Srinivasan, R.D. King, S. Kramer, C. Helma. *Statistical evaluation
| 113 | +of the predictive toxicology challenge 2000-2001*, Bioinformatics 19(10),
| 114 | +1183–1193, 2003.
| 115 | +* **[D'03]** P.D. Dobson, A.J. Doig. *Distinguishing enzyme structures from non-enzymes
| 116 | +without alignments*, Journal of Molecular Biology 330(4), 771–783, 2003.