|
5 | 5 | Pang is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see licence.txt |
6 | 6 |
|
7 | 7 | ----------------------------------------------------------------------- |
8 | | - |
9 | | -# Description |
10 | | -Pang is an algorithm which represents and classifies a collection of graphs according to their frequent patterns (subgraphs). |
11 | | - |
12 | | - |
13 | | -# Organization |
14 | | -This repository is composed of the following elements: |
15 | | -* `requirements.txt` : list of Python packages used by `PANG.py`. |
16 | | -* `PANG.py` : Python script that runs the algorithm. |
17 | | -* `ECML.py` : Python script that computes the results of the experiments of the ECML paper. |
18 | | -* `ProcessingPattern.py` : Python script that computes the number of occurrences and the set of induced patterns. |
19 | | -* `data` : folder with the input data files. There is one folder per dataset; the datasets are described in the [Datasets](#datasets) section. |
20 | | - |
21 | | - |
22 | | -# Installation |
23 | | -You first need to install `python` and the required packages: |
24 | | - |
25 | | -1. Install the [`python` language](https://www.python.org) |
26 | | -2. Download this project from GitHub and unzip it. |
27 | | -3. Execute `pip install -r requirements.txt` to install the required packages (see also the *Dependencies* Section). |
28 | | - |
29 | | -SPMF, which is needed to run gSpan and cgSpan, can be downloaded [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=download.php). |
30 | | -SPMF is available in two versions: |
31 | | -* a jar file that can be run from the command line. Currently, this version can be used with gSpan, but not with cgSpan. |
32 | | -* the source code. Installing this version is more complicated, but it supports cgSpan. You can find the instructions [here](https://www.philippe-fournier-viger.com/spmf/how_to_install.php). |
33 | | - |
34 | | -To use Pang, unzip each dataset into its own folder inside the `data` folder. |
35 | | - |
36 | | -# Use |
37 | | -We provide two scripts to use Pang: |
38 | | -* `ECML.py` : a Python script that computes the results of the ECML paper. |
39 | | -* `PANG.py` : a Python script that runs Pang on your own data. |
40 | | - |
41 | | -## To Replicate the Paper Experiments |
42 | | -To replicate the experiments: |
43 | | -1. Open the Python console. |
44 | | -2. Run `ECML.py`. |
45 | | - |
46 | | -The script computes the results of the experiments and saves those associated with Tables 2, 5 and 6 in the `results` folder. |
47 | | - |
48 | | - |
49 | | -## To Apply PANG to Other Data |
50 | | -If you want to use Pang with your own data, you need to create an `XXX` folder in the `data` folder and put your data in it. This folder must contain the following files: |
51 | | -* `XXX_graph.txt` : a file containing the graphs. |
52 | | -* `XXX_label.txt` : a file containing the labels of the graphs. |
53 | | - |
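Before running the processing steps, it can help to confirm that the dataset folder holds both input files. The snippet below is a minimal sketch of such a check; the helper name `missing_input_files` and its `data_dir` parameter are hypothetical and not part of Pang, and the file naming simply follows the `XXX_graph.txt` / `XXX_label.txt` convention described above.

```python
from pathlib import Path


def missing_input_files(dataset, data_dir="data"):
    """Return the expected input files that are absent for `dataset`.

    Hypothetical helper (not part of Pang): it only checks the naming
    convention data/XXX/XXX_graph.txt and data/XXX/XXX_label.txt.
    """
    folder = Path(data_dir) / dataset
    expected = [f"{dataset}_graph.txt", f"{dataset}_label.txt"]
    # Keep only the names that do not exist as regular files.
    return [name for name in expected if not (folder / name).is_file()]
```

Calling `missing_input_files("XXX")` from the repository root returns an empty list when both files are in place, and the names of the missing files otherwise.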
54 | | -Then you need to run a script to produce the data files that will be used by Pang: |
55 | | -1. Open the Python console. |
56 | | -2. Run the script `Patterns.sh` to create the file `XXX_patterns.txt`. |
57 | | -3. Run `ProcessingPattern.py` with the option `-d XXX` to create the files `XXX_mono.txt` and `XXX_iso.txt`. |
58 | | -4. Run `PANG.py` with the option `-d XXX` in order to run Pang on the data `XXX`. |
59 | | - |
60 | | -For each value of the parameter `k`, Pang will create a file `KResults.txt` containing the results of the classification and a file `KPatterns.txt` containing the patterns. |
61 | | - |
62 | | -## Data Format |
63 | | -We use the same format as SPMF for the graph input files. Each graph is defined as follows: |
64 | | - |
65 | | -1. `t # N` where N is the graph id |
66 | | -2. `v M L` where M is the node id and L is the node label |
67 | | -3. `e P Q L` where P is the source node id, Q is the destination node id and L is the edge label |
68 | | - |
69 | | -For the patterns output files, each pattern contains one more line than the graphs: |
70 | | - |
71 | | -4. `x A B C` where A, B and C are the ids of the graphs containing the pattern |
72 | | - |
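To make the format concrete, here is a minimal sketch of a reader for it. It is not the parser used by Pang; the function name `parse_spmf_graphs` and the dictionary layout are assumptions chosen for illustration. It handles the `t`, `v` and `e` lines of graph files and the extra `x` line of pattern files, and keeps labels as strings.

```python
from io import StringIO


def parse_spmf_graphs(handle):
    """Parse graphs (or patterns) from an SPMF-style text stream.

    Illustrative sketch only: returns a list of dicts with the graph id,
    node labels, edges and, for patterns, the ids of supporting graphs.
    """
    graphs = []
    current = None
    for raw in handle:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        tag = parts[0]
        if tag == "t":
            # "t # N": start a new graph with id N
            current = {"id": int(parts[2]), "nodes": {}, "edges": [], "support": []}
            graphs.append(current)
        elif current is None:
            continue  # ignore anything before the first "t" line
        elif tag == "v":
            # "v M L": node M with label L
            current["nodes"][int(parts[1])] = parts[2]
        elif tag == "e":
            # "e P Q L": edge from P to Q with label L
            current["edges"].append((int(parts[1]), int(parts[2]), parts[3]))
        elif tag == "x":
            # "x A B C": ids of the graphs containing the pattern
            current["support"] = [int(g) for g in parts[1:]]
    return graphs


# Small made-up example: one pattern with two nodes and one edge.
sample = StringIO("t # 0\nv 0 C\nv 1 O\ne 0 1 1\nx 3 7 12\n")
graphs = parse_spmf_graphs(sample)
```

For real data, pass an open file instead of the `StringIO` sample, for example `open("data/XXX/XXX_graph.txt")`.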
73 | | -## Datasets |
74 | | -The datasets used in the paper are available in the `data` folder. The following datasets are available: |
75 | | -* `MUTAG` : MUTAG dataset, representing chemical compounds and their mutagenic properties [[D'91](#references)], |
76 | | -* `NCI1` : NCI1 dataset, representing molecules classified according to carcinogenicity [[W'06](#references)], |
77 | | -* `PTC` : PTC dataset, representing molecules classified according to carcinogenicity [[T'03](#references)], |
78 | | -* `DD` : DD dataset, representing amino acids and their interactions [[D'03](#references)]. |
79 | | - |
80 | | -Each of these datasets can be found [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php). |
81 | | -* `FOPPA` : dataset extracted from FOPPA, a database of French public procurement notices [[P'22](#references)]. |
82 | | -# Dependencies |
83 | | -Tested with `SPMF` version 2.54, and `python` version 3.6.13 with the following packages: |
84 | | -* [`pandas`](https://pypi.org/project/pandas/): version 1.1.5 |
85 | | -* [`numpy`](https://pypi.org/project/numpy/): version 1.19.5 |
86 | | -* [`networkx`](https://pypi.org/project/networkx/): version 2.5.1 |
87 | | -* [`sklearn`](https://pypi.org/project/scikit-learn/): version 0.24.2 |
88 | | -* [`matplotlib`](https://pypi.org/project/matplotlib/): version 3.3.4 |
89 | | -* [`grakel`](https://pypi.org/project/grakel/): version 0.1.8 |
90 | | -* [`karateclub`](https://pypi.org/project/karateclub/): version 1.3.3 |
91 | | -* [`stellargraph`](https://pypi.org/project/stellargraph/): version 1.2.1 |
92 | | - |
93 | | - |
94 | | -The VF2 and ISMAGS algorithms are included in the [`NetworkX` library](https://networkx.org/). |
95 | | - |
96 | | -For the baselines: |
97 | | -* The WL and WLOA algorithms are included in the Grakel library, documentation available [here](https://ysig.github.io/GraKeL/0.1a8/benchmarks.html) |
98 | | -* Graph2Vec is included in the karateclub library, documentation available [here](https://karateclub.readthedocs.io/en/latest/) |
99 | | -* DGCNN is included in the stellargraph library, documentation available [here](https://stellargraph.readthedocs.io/en/stable/). |
100 | | -* We use the implementation of CORK from Marisa Thoma. This implementation is available in the `CORKcpp.zip` archive. |
101 | | - |
102 | | - |
103 | | -# References |
104 | | -* **[P'22]** L. Potin, V. Labatut, R. Figueiredo, C. Largeron, P.-H. Morand. *FOPPA: A database of French Open Public Procurement Award notices*, Technical Report, Avignon University, 2022. [⟨hal-03796734⟩](https://hal.archives-ouvertes.fr/hal-03796734) |
105 | | -* **[D'91]** A.S. Debnath, R.L. Lopez, G. Debnath, A. Shusterman, C. Hansch. *Structure-activity |
106 | | -relationship of mutagenic aromatic and heteroaromatic nitro compounds. |
107 | | -Correlation with molecular orbital energies and hydrophobicity*, Journal of Medicinal |
108 | | -Chemistry 34(2), 786–797, 1991. |
109 | | -* **[W'06]** N. Wale, G. Karypis. *Comparison of descriptor spaces for chemical compound |
110 | | -retrieval and classification*, 6th International Conference on Data Mining, pp. |
111 | | -678–689, 2006. |
112 | | -* **[T'03]** H. Toivonen, A. Srinivasan, R.D. King, S. Kramer, C. Helma. *Statistical evaluation |
113 | | -of the predictive toxicology challenge 2000-2001*, Bioinformatics 19(10), |
114 | | -1183–1193, 2003. |
115 | | -* **[D'03]** P.D. Dobson, A.J. Doig. *Distinguishing enzyme structures from non-enzymes |
116 | | -without alignments*, Journal of Molecular Biology 330(4), 771–783, 2003. |