Merge branch 'ECML' of https://github.com/CompNet/Pang into ECML

LucasPotin98 · LucasPotin98 · commit 97ce9248cc53 · 2023-04-01T17:04:42.000+02:00
diff --git a/README.md b/README.md
@@ -12,75 +12,112 @@ Pang is an algorithm which represents and classifies a collection of graphs acco
 
 # Organization
 This repository is composed of the following elements:
-* `requirements.txt` : List of Python packages used in pang.py.
-* `PANG.py` : Python script in order to use the algorithm.
-* `EMCL.py` : Python script in order to compute the results of the experiments of the ECML paper.
-* `ProcessingPattern.py` : Python script in order to compute the number of occurences and the set of induced patterns
-* `data` : folder with the input data files. There is one folder for each dataset, which are described in the [Datasets](#datasets) section.
+
+* `requirements.txt`: List of required Python packages.
+* `src`: folder containing the source code
+  * `EMCL.py`: script that reproduces the experiments of our paper submitted to ECML PKDD.
+  * `PANG.py`: script that implements the Pang method.
+  * `ProcessingPattern.py`: script that computes the number of occurrences and the set of induced patterns.
+  * `Pattern.sh`: **TODO (identifies the patterns with SPMF and counts them with `ProcessingPattern.py` ?).**
+  * `CORKcpp.zip`: archive containing the CORK source code (used in `EMCL.py`), cf. Section [Installation](#installation).
+* `data`: folder containing the input data. Each subfolder corresponds to a distinct dataset, cf. Section [Datasets](#datasets).
+* `results`: files produced by the processing.
 
 
 # Installation
-You first need to install `python` and the required packages:
 
-1. Install the [`python` language](https://www.python.org)
+## Python and Packages
+First, you need to install the `Python` language and the required packages:
+
+1. Install the [`Python` language](https://www.python.org)
 2. Download this project from GitHub and unzip.
-3. Execute `pip install -r requirements.txt` to install the required packages (see also the *Dependencies* Section).
+3. Execute `pip install -r requirements.txt` to install the required packages (see also Section [Dependencies](#dependencies)).
+
+## Non-Python Dependencies
+Second, one of the dependencies, SPMF, is not a Python package, but rather a Java program, and therefore requires a specific installation process:
+
+* Download its source code on [Philippe Fournier-Viger's website](https://www.philippe-fournier-viger.com/spmf/index.php?link=download.php).
+* Follow the installation instructions provided on the [same website](https://www.philippe-fournier-viger.com/spmf/how_to_install.php).
 
-The source code of SPMF in order to use gSpan and cgSpan is available [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=download.php).
-SPMF is available in two versions:
-* a jar file that can be run from the command line. Actually, this version can be use with gSpan, but not with cgSpan.
-* a source code. The installation of this version is more complicated, but it allows to use cgSpan. You can find the instructions [here](https://www.philippe-fournier-viger.com/spmf/how_to_install.php).
+Note that SPMF is available both as a JAR file and as a source code archive. However, the former does not contain all the features required by Pang, so only the latter should be used.
+
+**TODO In order to run the script that reproduces our ECML PKDD experiments, you also need to install CORK.**
+
+## Data
+Third, you need to set up the data to which you want to apply Pang. This can be the datasets from our paper, in which case you will need to unzip several archives, or your own data, in which case they need to respect the appropriate format. In both cases, see Section [Use](#use).
 
-In order to use Pang, you need to unzip each dataset in its own folder in the `data` folder. 
 
 # Use
 We provide two scripts to use Pang:
-* `ECML.py` : a python script in order to compute the results of the ECML paper.
-* `PANG.py` : a python script in order to use Pang with your own data.
+
+* `ECML.py`: reproduces the experiments described in our paper submitted to ECML PKDD.
+* `PANG.py`: applies Pang in the general case, possibly to your own data.
 
 ## To Replicate the Paper Experiments
-In order to use Pang:
-1. Open the Python console.
-2. Run `EMCL.py`
+To replicate our ECML PKDD experiments, first unzip the provided datasets, then run Pang on them.
 
-The script will compute the results of the experiments and save the results associated with Table 2, 5 and 6 in the `results` folder.
+### Data Preparation
+To unzip the datasets used in our experiments:
 
+1. Go to the `data` folder.
+2. In each subfolder, you will find an archive that you need to unzip.
 
-## To Apply PANG to Other Data
-If you want to use Pang with your own data, you need to create an `XXX` folder in the `data` folder and put your data in it. This folder must contain the following files:
-* `XXX_graph.txt` : a file containing the graphs.
-* `XXX_label.txt` : a file containing the labels of the graphs.
+We retrieved the benchmark datasets from the [SPMF website](https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php); they include:
+* `MUTAG` : MUTAG dataset, representing chemical compounds and their mutagenic properties [[D'91](#references)]
+* `NCI1` : NCI1 dataset, representing molecules and classified according to carcinogenicity [[W'06](#references)]
+* `PTC` : PTC dataset, representing molecules and classified according to carcinogenicity [[T'03](#references)] 
+* `DD` : DD dataset, representing amino acids and their interactions [[D'03](#references)]
 
-Then you need to run a script to produce the data files that will be used by Pang:
-1. Open the Python console.
-2. Run the script `Patterns.sh` in order to create the files `XXX_patterns.txt`.
-3. Run `ProcessingPattern.py`with the option `-d XXX` in order to create the files `XXX_mono.txt` and `XXX_iso.txt`.
-4. Run `PANG.py` with the option `-d XXX` in order to run Pang on the data `XXX`.
+The public procurement dataset contains graphs extracted from the FOPPA database:
+* `FOPPA` : dataset extracted from FOPPA, a database of French public procurement notices [[P'22](#references)].
+
+
+### Processing
+Then, run the appropriate script:
+
+3. Open the Python console.
+4. Run `EMCL.py`
+
+The script will compute the results of the experiments and save those associated with Tables 2, 5 and 6 of the paper in the `results` folder.
 
-For each value of the parameter `k`, Pang will create a file `KResults.txt` containing the results of the classification and a file `KPatterns.txt` containing the patterns.
 
-## Data Format
-We use the same format as SPMF for the graph input files. Each graph is defined as follows:
+## To Apply Pang to Other Data
+If you want to use Pang with your own data, you need to set up the data, then identify the patterns, and finally perform the classification.
+
+### Data Preparation
+Create an `XXX` folder in the `data` folder (where `XXX` is the name of your dataset), in order to host your data. This folder must contain the following files:
+
+* `XXX_graph.txt` : a file containing all the graphs.
+* `XXX_label.txt` : a file indicating the labels (classes) of these graphs.
+
+We use the same format as SPMF for the graph input files, i.e.:
 
 1. `t # N` (N: graph id)
 2. `v M L` (M: node id, L: node label)
 3. `e P Q L` (P: source node id, Q: destination node id, L: edge label)
 
-For the patterns output files, each pattern contains one more line than the graphs:
+For information, the files produced by our scripts to list the identified patterns are similar, except they contain an extra line:
 
 4. `x A B C` (A, B, C: graphs containing the pattern)
 
-## Datasets
-The datasets used in the paper are available in the `data` folder. The following datasets are available:
-* `MUTAG` : MUTAG dataset, representing chemical compounds and their mutagenic properties [[D'91](#references)],
-* `NCI1` : NCI1 dataset, representing molecules and classified according to carcinogenicity [[W'06](#references)],
-* `PTC` : PTC dataset, representing molecules and classified according to carcinogenicity  [[T'03](#references)], 
-* `DD` : DD dataset, representing amino acids and their interactions [[D'03](#references)],
+The format of the file containing the graph labels is as follows:
+
+**TODO**
+
+### Processing
+
+Once the data are ready, you need to run a script to identify the patterns, and produce the files required by Pang:
+
+1. Open the `Python` console.
+2. Run the script `Patterns.sh` in order to create the files `XXX_patterns.txt`.
+3. Run `ProcessingPattern.py` with the option `-d XXX` in order to create the files `XXX_mono.txt` and `XXX_iso.txt`.
+4. Run `PANG.py` with the option `-d XXX` in order to run Pang on the data `XXX`.
+
+For each value of the parameter `k` **TODO what is this `k`?**, Pang will create a file `KResults.txt` containing the results of the classification and a file `KPatterns.txt` containing the patterns.
+
 
-Each of these datasets can be found [here](https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php).
-* `FOPPA` : dataset extracted from FOPPA, a database of French public procurement notices [[P'22](#references)]. 
 # Dependencies
-Tested with `SPMF` version 2.54, and `python` version 3.6.13 with the following packages:
+Tested with `python` version 3.6.13 and the following packages:
 * [`pandas`](https://pypi.org/project/pandas/): version 1.1.5
 * [`numpy`](https://pypi.org/project/numpy/): version 1.19.5
 * [`networkx`](https://pypi.org/project/numpy/): version 2.5.1
@@ -90,27 +127,26 @@ Tested with `SPMF` version 2.54, and `python` version 3.6.13 with the following
 * [`karateclub`](https://pypi.org/project/numpy/): version 1.3.3
 * [`stellargraph`](https://pypi.org/project/numpy/): version 1.2.1
 
+The VF2 [[C'04](#references)] and ISMAGS [[H'14](#references)] algorithms are included in the [`Networkx` library](https://networkx.org/).
+
+Tested with `SPMF` version 2.54, which implements gSpan [[Y'02](#references)] (to mine frequent patterns) and cgSpan [[S'21](#references)] (closed frequent patterns).
 
-The VF2 and ISMAGS algortihms are included in the [`Networkx` library](https://networkx.org/)
+For the ECML PKDD assessment, we use the following algorithms for the sake of comparison:
 
-For the baselines:
-* The WL and WLOA algorithms are included in the Grakel library, documentation available [here](https://ysig.github.io/GraKeL/0.1a8/benchmarks.html)
-* Graph2Vec is included in the karateclub library, documentation available [here](https://karateclub.readthedocs.io/en/latest/)
-* DGCNN is included in the stellargraph library, documentation available [here](https://stellargraph.readthedocs.io/en/stable/).
-* We use the implementation of CORK from Marisa Thoma. This implementation is available in the `CORKcpp.zip` archive.
+* The `WL` and `WLOA` algorithms are included in the `Grakel` library, documentation available [here](https://ysig.github.io/GraKeL/0.1a8/benchmarks.html).
+* `Graph2Vec` is included in the `karateclub` library, documentation available [here](https://karateclub.readthedocs.io/en/latest/).
+* `DGCNN` is included in the `stellargraph` library, documentation available [here](https://stellargraph.readthedocs.io/en/stable/).
+* We use the implementation of `CORK` from Marisa Thoma. This implementation is available in the `CORKcpp.zip` archive.
 
 
 # References
+* **[D'91]** A. S. Debnath, R. L. Lopez, G. Debnath, A. Shusterman, C. Hansch. *Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity*, Journal of Medicinal Chemistry 34(2):786–797, 1991. DOI: [10.1021/jm00106a046](https://doi.org/10.1021/jm00106a046)
+* **[D'03]** P. D. Dobson, A. J. Doig. *Distinguishing enzyme structures from non-enzymes without alignments*, Journal of Molecular Biology 330(4):771–783, 2003. DOI: [10.1016/S0022-2836(03)00628-4](https://doi.org/10.1016/S0022-2836(03)00628-4)
+* **[H'14]** M. Houbraken, S. Demeyer, T. Michoel, P. Audenaert, D. Colle, M. Pickavet. *The Index-Based Subgraph Matching Algorithm with General Symmetries (ISMAGS): Exploiting Symmetry for Faster Subgraph Enumeration*, PLoS ONE 9(5):e97896, 2014. DOI: [10.1371/journal.pone.0097896](https://doi.org/10.1371/journal.pone.0097896)
 * **[P'22]** L. Potin, V. Labatut, R. Figueiredo, C. Largeron, P.-H. Morand. *FOPPA: A database of French Open Public Procurement Award notices*, Technical Report, Avignon University, 2022.  [⟨hal-03796734⟩](https://hal.archives-ouvertes.fr/hal-03796734)
-* **[D'91]** A.S. Debnath, R.L. Lopez, G. Debnath, A. Shusterman, C. Hansch. *Structure-
-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds.
-correlation with molecular orbital energies and hydrophobicity*, Journal of Medic-
-inal Chemistry 34(2), 786–797, 1991.
-* **[W'06]** N.Wale, G. Karypis. *Comparison of descriptor spaces for chemical compound
-retrieval and classification*, 6th International Conference on Data Mining, pp.
-678–689, 2006.
-* **[T'03]** H . Toivonen, A. Srinivasan, R.D. King, S. Kramer, C. Helma.*Statistical eval-
-uation of the predictive toxicology challenge 2000-2001*, Bioinformatics 19(10),
-1183–1193, 2003.
-* **[D'03]** P.D. Dobson, A.J. Doig. *Distinguishing enzyme structures from non-enzymes
-without alignments*, Journal of Molecular Biology 330(4), 771–783 ,2003.
+* **[S'21]** Z. Shaul, S. Naaz. *cgSpan: Closed Graph-Based Substructure Pattern Mining*, IEEE International Conference on Big Data, pp. 4989-4998, 2021. DOI: [10.1109/bigdata52589.2021.9671995](https://doi.org/10.1109/bigdata52589.2021.9671995)
+* **[T'03]** H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, C. Helma. *Statistical evaluation of the predictive toxicology challenge 2000-2001*, Bioinformatics 19(10):1183–1193, 2003. DOI: [10.1093/bioinformatics/btg130](https://doi.org/10.1093/bioinformatics/btg130)
+* **[W'06]** N. Wale, G. Karypis. *Comparison of descriptor spaces for chemical compound retrieval and classification*, 6th International Conference on Data Mining, pp. 678–689, 2006. DOI: [10.1007/s10115-007-0103-5](https://doi.org/10.1007/s10115-007-0103-5)
+* **[Y'02]** X. Yan, J. Han. *gSpan: Graph-based substructure pattern mining*, IEEE International Conference on Data Mining, pp. 721-724, 2002. DOI: [10.1109/ICDM.2002.1184038](https://doi.org/10.1109/ICDM.2002.1184038)
+* **[C'04]** L. P. Cordella, P. Foggia, C. Sansone, M. Vento. *A (sub)graph isomorphism algorithm for matching large graphs*, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367-1372, 2004. DOI: [10.1109/tpami.2004.75](https://doi.org/10.1109/tpami.2004.75)
+* 
diff --git a/results/.gitignore b/results/.gitignore
@@ -0,0 +1,2 @@
+*
+!.gitignore
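
Note on the graph file format described in the README hunk above (`t # N`, `v M L`, `e P Q L`, plus the extra `x A B C` line in pattern files): the following is a minimal sketch of how such a file could be read in Python. It is not taken from `PANG.py` or `ProcessingPattern.py`; the `Graph` class and the `parse_graphs` function are hypothetical names, and the sketch assumes that fields are whitespace-separated, that ids are integers, and that the `x` line lists graph ids separated by spaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Graph:
    """One graph (or pattern) read from an SPMF-style text file (hypothetical helper)."""
    graph_id: int
    nodes: Dict[int, str] = field(default_factory=dict)               # node id -> node label
    edges: List[Tuple[int, int, str]] = field(default_factory=list)   # (source, destination, edge label)
    containing_graphs: List[int] = field(default_factory=list)        # ids from an optional "x" line


def parse_graphs(text: str) -> List[Graph]:
    """Parse the content of a graph or pattern file into a list of Graph objects."""
    graphs: List[Graph] = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # "t # N": start of a new graph with id N
            graphs.append(Graph(graph_id=int(parts[2])))
            continue
        if kind not in ("v", "e", "x"):
            continue  # assumption: unknown line types are ignored
        if not graphs:
            raise ValueError("found a '%s' line before any 't' line" % kind)
        current = graphs[-1]
        if kind == "v":
            # "v M L": node M with label L
            current.nodes[int(parts[1])] = parts[2]
        elif kind == "e":
            # "e P Q L": edge from node P to node Q with label L
            current.edges.append((int(parts[1]), int(parts[2]), parts[3]))
        else:
            # "x A B C": ids of the graphs containing the pattern
            current.containing_graphs.extend(int(p) for p in parts[1:])
    return graphs
```

Labels are kept as strings so that the sketch does not assume they are numeric. To use it on a dataset `XXX`, one would read `data/XXX/XXX_graph.txt` and pass its content to `parse_graphs`.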