
Commit 97ce924

Merge branch 'ECML' of https://github.com/CompNet/Pang into ECML
2 parents c454c6d + 2714e2a commit 97ce924

2 files changed

+97
-59
lines changed


README.md

Lines changed: 95 additions & 59 deletions
Original file line number | Diff line number | Diff line change
@@ -12,75 +12,112 @@ Pang is an algorithm which represents and classifies a collection of graphs acco
1212

1313
# Organization
1414
This repository is composed of the following elements:
15-
* `requirements.txt` : List of Python packages used in pang.py.
16-
* `PANG.py` : Python script in order to use the algorithm.
17-
* `EMCL.py` : Python script in order to compute the results of the experiments of the ECML paper.
18-
* `ProcessingPattern.py` : Python script in order to compute the number of occurences and the set of induced patterns
19-
* `data` : folder with the input data files. There is one folder for each dataset, which are described in the [Datasets](#datasets) section.
15+
16+
* `requirements.txt`: List of required Python packages.
17+
* `src`: folder containing the source code
18+
* `EMCL.py`: script that reproduces the experiments of our paper submitted to ECML PKDD.
19+
* `PANG.py`: script that implements the Pang method.
20+
* `ProcessingPattern.py`: script that computes the number of occurrences and the set of induced patterns.
21+
* `Pattern.sh`: **TODO (identifies the patterns with SPMF and counts them with `ProcessingPattern.py` ?).**
22+
* `CORKcpp.zip`: archive containing the CORK source code (used in `EMCL.py`), cf. Section [Installation](#installation).
23+
* `data`: folder containing the input data. Each subfolder corresponds to a distinct dataset, cf. Section [Datasets](#datasets).
24+
* `results`: files produced by the processing.
2025

2126

2227
# Installation
23-
You first need to install `python` and the required packages:
2428

25-
1. Install the [`python` language](https://www.python.org)
29+
## Python and Packages
30+
First, you need to install the `Python` language and the required packages:
31+
32+
1. Install the [`Python` language](https://www.python.org)
2633
2. Download this project from GitHub and unzip.
27-
3. Execute `pip install -r requirements.txt` to install the required packages (see also the *Dependencies* Section).
34+
3. Execute `pip install -r requirements.txt` to install the required packages (see also Section [Dependencies](#dependencies)).
35+
36+
## Non-Python Dependencies
37+
Second, one of the dependencies, SPMF, is not a Python package, but rather a Java program, and therefore requires a specific installation process:
38+
39+
* Download its source code on [Philippe Fournier-Viger's website](https://www.philippe-fournier-viger.com/spmf/index.php?link=download.php).
40+
* Follow the installation instructions provided on the [same website](https://www.philippe-fournier-viger.com/spmf/how_to_install.php).
2841

29-
The source code of SPMF in order to use gSpan and cgSpan is available [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=download.php).
30-
SPMF is available in two versions:
31-
* a jar file that can be run from the command line. Actually, this version can be use with gSpan, but not with cgSpan.
32-
* a source code. The installation of this version is more complicated, but it allows to use cgSpan. You can find the instructions [here](https://www.philippe-fournier-viger.com/spmf/how_to_install.php).
42+
Note that SPMF is available both as a JAR and as a source code archive. However, the former does not contain all the features required by Pang, so one should use only the latter.
43+
44+
**TODO In order to run the script that reproduces our ECML PKDD experiments, you also need to install CORK.**
45+
46+
## Data
47+
Third, you need to set up the data to which you want to apply Pang. This can be the datasets from our paper, in which case you will need to unzip several archives, or your own data, in which case they need to respect the appropriate format. In both cases, see Section [Use](#use).
3348

34-
In order to use Pang, you need to unzip each dataset in its own folder in the `data` folder.
3549

3650
# Use
3751
We provide two scripts to use Pang:
38-
* `ECML.py` : a python script in order to compute the results of the ECML paper.
39-
* `PANG.py` : a python script in order to use Pang with your own data.
52+
53+
* `ECML.py`: reproduces the experiments described in our paper submitted to ECML PKDD.
54+
* `PANG.py`: applies Pang in the general case, possibly to your own data.
4055

4156
## To Replicate the Paper Experiments
42-
In order to use Pang:
43-
1. Open the Python console.
44-
2. Run `EMCL.py`
57+
To replicate our ECML PKDD experiments, first unzip the provided datasets, then run Pang on them.
4558

46-
The script will compute the results of the experiments and save the results associated with Table 2, 5 and 6 in the `results` folder.
59+
### Data Preparation
60+
To unzip the datasets used in our experiments:
4761

62+
1. Go to the `data` folder.
63+
2. In each subfolder, you will find an archive that you need to unzip.
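
The two steps above can also be scripted. The following is a minimal sketch, not part of the repository: `unzip_datasets` is a hypothetical helper, and it assumes that the archives are `.zip` files located directly inside each dataset subfolder of `data`.

```python
import zipfile
from pathlib import Path


def unzip_datasets(data_dir="data"):
    """Extract every .zip archive found in the immediate subfolders of data_dir.

    Assumption: one or more .zip archives sit directly in each dataset subfolder.
    Returns the list of archives that were extracted.
    """
    extracted = []
    for archive in sorted(Path(data_dir).glob("*/*.zip")):
        # Extract next to the archive, i.e. inside the dataset's own subfolder.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted
```

Calling `unzip_datasets()` from the root of the project would process the `data` folder; pass another path if your layout differs.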
4864

49-
## To Apply PANG to Other Data
50-
If you want to use Pang with your own data, you need to create an `XXX` folder in the `data` folder and put your data in it. This folder must contain the following files:
51-
* `XXX_graph.txt` : a file containing the graphs.
52-
* `XXX_label.txt` : a file containing the labels of the graphs.
65+
We retrieved the benchmark datasets from the [SPMF website](https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php); they include:
66+
* `MUTAG` : MUTAG dataset, representing chemical compounds and their mutagenic properties [[D'91](#references)]
67+
* `NCI1` : NCI1 dataset, representing molecules and classified according to carcinogenicity [[W'06](#references)]
68+
* `PTC` : PTC dataset, representing molecules and classified according to carcinogenicity [[T'03](#references)]
69+
* `DD` : DD dataset, representing amino acids and their interactions [[D'03](#references)]
5370

54-
Then you need to run a script to produce the data files that will be used by Pang:
55-
1. Open the Python console.
56-
2. Run the script `Patterns.sh` in order to create the files `XXX_patterns.txt`.
57-
3. Run `ProcessingPattern.py`with the option `-d XXX` in order to create the files `XXX_mono.txt` and `XXX_iso.txt`.
58-
4. Run `PANG.py` with the option `-d XXX` in order to run Pang on the data `XXX`.
71+
The public procurement dataset contains graphs extracted from the FOPPA database:
72+
* `FOPPA` : dataset extracted from FOPPA, a database of French public procurement notices [[P'22](#references)].
73+
74+
75+
### Processing
76+
Then, run the appropriate script:
77+
78+
3. Open the Python console.
79+
4. Run `EMCL.py`
80+
81+
The script will compute the results of the experiments and save the results associated with Tables 2, 5 and 6 of the paper in the `results` folder.
5982

60-
For each value of the parameter `k`, Pang will create a file `KResults.txt` containing the results of the classification and a file `KPatterns.txt` containing the patterns.
6183

62-
## Data Format
63-
We use the same format as SPMF for the graph input files. Each graph is defined as follows:
84+
## To Apply Pang to Other Data
85+
If you want to use Pang with your own data, you need to set up the data, then identify the patterns, and finally perform the classification.
86+
87+
### Data Preparation
88+
Create an `XXX` folder in the `data` folder (where `XXX` is the name of your dataset), in order to host your data. This folder must contain the following files:
89+
90+
* `XXX_graph.txt` : a file containing all the graphs.
91+
* `XXX_label.txt` : a file indicating the labels (classes) of these graphs.
92+
93+
We use the same format as SPMF for the graph input files, i.e.:
6494

6595
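1. `t # N`: start of a graph, where `N` is the graph id
6696
2. `v M L`: a node, where `M` is the node id and `L` the node label
6797
3. `e P Q L`: an edge, where `P` is the source node id, `Q` the destination node id, and `L` the edge label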
6898

69-
For the patterns output files, each pattern contains one more line than the graphs:
99+
For information, the files produced by our scripts to list the identified patterns are similar, except they contain an extra line:
70100

71101
4. `x A B C`: the ids `A`, `B`, `C` of the graphs containing the pattern
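
To make the record types listed above concrete, here is a minimal parsing sketch. It is an illustration only, not the code used by Pang: `parse_graphs` is a hypothetical helper, it assumes well-formed input in which every `v`, `e` and `x` line follows a `t` line, and it keeps labels as strings.

```python
def parse_graphs(lines):
    """Parse SPMF-style graph records into a list of dictionaries.

    Hypothetical helper for illustration; assumes well-formed input.
    """
    graphs = []
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "t":
            # "t # N": start of graph N
            graphs.append({"id": int(parts[2]), "nodes": {}, "edges": [], "support": []})
        elif parts[0] == "v":
            # "v M L": node M with label L
            graphs[-1]["nodes"][int(parts[1])] = parts[2]
        elif parts[0] == "e":
            # "e P Q L": edge from P to Q with label L
            graphs[-1]["edges"].append((int(parts[1]), int(parts[2]), parts[3]))
        elif parts[0] == "x":
            # "x A B C": ids of the graphs containing the pattern (pattern files only)
            graphs[-1]["support"] = [int(p) for p in parts[1:]]
    return graphs
```

The function accepts any iterable of lines, so an open file handle on `XXX_graph.txt` can be passed directly.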
72102

73-
## Datasets
74-
The datasets used in the paper are available in the `data` folder. The following datasets are available:
75-
* `MUTAG` : MUTAG dataset, representing chemical compounds and their mutagenic properties [[D'91](#references)],
76-
* `NCI1` : NCI1 dataset, representing molecules and classified according to carcinogenicity [[W'06](#references)],
77-
* `PTC` : PTC dataset, representing molecules and classified according to carcinogenicity [[T'03](#references)],
78-
* `DD` : DD dataset, representing amino acids and their interactions [[D'03](#references)],
103+
The format of the file containing the graph labels is as follows:
104+
105+
**TODO**
106+
107+
### Processing
108+
109+
Once the data are ready, you need to run a script to identify the patterns, and produce the files required by Pang:
110+
111+
1. Open the `Python` console.
112+
2. Run the script `Patterns.sh` in order to create the files `XXX_patterns.txt`.
113+
3. Run `ProcessingPattern.py` with the option `-d XXX` in order to create the files `XXX_mono.txt` and `XXX_iso.txt`.
114+
4. Run `PANG.py` with the option `-d XXX` in order to run Pang on the data `XXX`.
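
Steps 3 and 4 above can be chained from Python. The sketch below is an illustration rather than part of the repository: `pang_commands` and `run_pipeline` are hypothetical helpers, and the `src` location of the scripts is an assumption based on the Organization section. Step 2 (`Patterns.sh`) is left out because its arguments are not documented here.

```python
import subprocess
import sys


def pang_commands(dataset, src_dir="src"):
    """Build the command lines for steps 3 and 4 for the given dataset name.

    Assumption: both scripts live in src_dir and accept the -d option.
    """
    return [
        [sys.executable, f"{src_dir}/ProcessingPattern.py", "-d", dataset],
        [sys.executable, f"{src_dir}/PANG.py", "-d", dataset],
    ]


def run_pipeline(dataset, src_dir="src"):
    """Run both steps in order, stopping at the first failure."""
    for cmd in pang_commands(dataset, src_dir):
        subprocess.run(cmd, check=True)
```

For a dataset stored in `data/XXX`, `run_pipeline("XXX")` would run both scripts with `-d XXX`, provided `XXX_patterns.txt` has already been produced.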
115+
116+
For each value of the parameter `k` **TODO what is this k?**, Pang will create a file `KResults.txt` containing the results of the classification and a file `KPatterns.txt` containing the patterns.
117+
79118

80-
Each of these datasets can be found [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php).
81-
* `FOPPA` : dataset extracted from FOPPA, a database of French public procurement notices [[P'22](#references)].
82119
# Dependencies
83-
Tested with `SPMF` version 2.54, and `python` version 3.6.13 with the following packages:
120+
Tested with `python` version 3.6.13 and the following packages:
84121
* [`pandas`](https://pypi.org/project/pandas/): version 1.1.5
85122
* [`numpy`](https://pypi.org/project/numpy/): version 1.19.5
86123
* [`networkx`](https://pypi.org/project/networkx/): version 2.5.1
@@ -90,27 +127,26 @@ Tested with `SPMF` version 2.54, and `python` version 3.6.13 with the following
90127
* [`karateclub`](https://pypi.org/project/karateclub/): version 1.3.3
91128
* [`stellargraph`](https://pypi.org/project/stellargraph/): version 1.2.1
92129

130+
The VF2 [[C'04](#references)] and ISMAGS [[H'14](#references)] algorithms are included in the [`NetworkX` library](https://networkx.org/).
131+
132+
Tested with `SPMF` version 2.54, which implements gSpan [[Y'02](#references)] (to mine frequent patterns) and cgSpan [[S'21](#references)] (closed frequent patterns).
93133

94-
The VF2 and ISMAGS algortihms are included in the [`Networkx` library](https://networkx.org/)
134+
For the ECML PKDD assessment, we use the following algorithms for the sake of comparison:
95135

96-
For the baselines:
97-
* The WL and WLOA algorithms are included in the Grakel library, documentation available [here](https://ysig.github.io/GraKeL/0.1a8/benchmarks.html)
98-
* Graph2Vec is included in the karateclub library, documentation available [here](https://karateclub.readthedocs.io/en/latest/)
99-
* DGCNN is included in the stellargraph library, documentation available [here](https://stellargraph.readthedocs.io/en/stable/).
100-
* We use the implementation of CORK from Marisa Thoma. This implementation is available in the `CORKcpp.zip` archive.
136+
* The `WL` and `WLOA` algorithms are included in the `Grakel` library, documentation available [here](https://ysig.github.io/GraKeL/0.1a8/benchmarks.html)
137+
* `Graph2Vec` is included in the `karateclub` library, documentation available [here](https://karateclub.readthedocs.io/en/latest/)
138+
* `DGCNN` is included in the `stellargraph` library, documentation available [here](https://stellargraph.readthedocs.io/en/stable/).
139+
* We use the implementation of `CORK` from Marisa Thoma. This implementation is available in the `CORKcpp.zip` archive.
101140

102141

103142
# References
143+
* **[D'91]** A. S. Debnath, R. L. Lopez, G. Debnath, A. Shusterman, C. Hansch. *Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity*, Journal of Medicinal Chemistry 34(2):786–797, 1991. DOI: [10.1021/jm00106a046](https://doi.org/10.1021/jm00106a046)
144+
* **[D'03]** P. D. Dobson, A. J. Doig. *Distinguishing enzyme structures from non-enzymes without alignments*, Journal of Molecular Biology 330(4):771–783, 2003. DOI: [10.1016/S0022-2836(03)00628-4](https://doi.org/10.1016/S0022-2836(03)00628-4)
145+
* **[H'14]** M. Houbraken, S. Demeyer, T. Michoel, P. Audenaert, D. Colle, M. Pickavet. *The Index-Based Subgraph Matching Algorithm with General Symmetries (ISMAGS): Exploiting Symmetry for Faster Subgraph Enumeration*, PLoS ONE 9(5):e97896, 2014. DOI: [10.1371/journal.pone.0097896](https://doi.org/10.1371/journal.pone.0097896)
104146
* **[P'22]** L. Potin, V. Labatut, R. Figueiredo, C. Largeron, P.-H. Morand. *FOPPA: A database of French Open Public Procurement Award notices*, Technical Report, Avignon University, 2022. [⟨hal-03796734⟩](https://hal.archives-ouvertes.fr/hal-03796734)
105-
* **[D'91]** A.S. Debnath, R.L. Lopez, G. Debnath, A. Shusterman, C. Hansch. *Structure-
106-
activity relationship of mutagenic aromatic and heteroaromatic nitro compounds.
107-
correlation with molecular orbital energies and hydrophobicity*, Journal of Medic-
108-
inal Chemistry 34(2), 786–797, 1991.
109-
* **[W'06]** N.Wale, G. Karypis. *Comparison of descriptor spaces for chemical compound
110-
retrieval and classification*, 6th International Conference on Data Mining, pp.
111-
678–689, 2006.
112-
* **[T'03]** H . Toivonen, A. Srinivasan, R.D. King, S. Kramer, C. Helma.*Statistical eval-
113-
uation of the predictive toxicology challenge 2000-2001*, Bioinformatics 19(10),
114-
1183–1193, 2003.
115-
* **[D'03]** P.D. Dobson, A.J. Doig. *Distinguishing enzyme structures from non-enzymes
116-
without alignments*, Journal of Molecular Biology 330(4), 771–783 ,2003.
147+
* **[S'21]** Z. Shaul, S. Naaz. *cgSpan: Closed Graph-Based Substructure Pattern Mining*, IEEE International Conference on Big Data, pp. 4989–4998, 2021. DOI: [10.1109/bigdata52589.2021.9671995](https://doi.org/10.1109/bigdata52589.2021.9671995)
148+
* **[T'03]** H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, C. Helma. *Statistical evaluation of the predictive toxicology challenge 2000-2001*, Bioinformatics 19(10):1183–1193, 2003. DOI: [10.1093/bioinformatics/btg130](https://doi.org/10.1093/bioinformatics/btg130)
149+
* **[W'06]** N. Wale, G. Karypis. *Comparison of descriptor spaces for chemical compound retrieval and classification*, 6th International Conference on Data Mining, pp. 678–689, 2006. DOI: [10.1007/s10115-007-0103-5](https://doi.org/10.1007/s10115-007-0103-5)
150+
* **[Y'02]** X. Yan, J. Han. *gSpan: Graph-based substructure pattern mining*, IEEE International Conference on Data Mining, pp. 721–724, 2002. DOI: [10.1109/ICDM.2002.1184038](https://doi.org/10.1109/ICDM.2002.1184038)
151+
* **[C'04]** L. P. Cordella, P. Foggia, C. Sansone, M. Vento. *A (sub)graph isomorphism algorithm for matching large graphs*, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(10):1367–1372, 2004. DOI: [10.1109/tpami.2004.75](https://doi.org/10.1109/tpami.2004.75)
152+
*

results/.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
*
2+
!.gitignore
