ChEB-AI
diff --git a/README.md b/README.md
Lines changed: 60 additions & 26 deletions
diff --git a/chebai_graph/preprocessing/bin/Aromaticity/indices.txt b/chebai_graph/preprocessing/bin/Aromaticity/indices.txt
diff --git a/chebai_graph/preprocessing/bin/AtomType/indices.txt b/chebai_graph/preprocessing/bin/AtomType/indices.txt
diff --git a/chebai_graph/preprocessing/bin/BondType/indices.txt b/chebai_graph/preprocessing/bin/BondType/indices.txt
diff --git a/chebai_graph/preprocessing/bin/FormalCharge/indices_one_hot.txt b/chebai_graph/preprocessing/bin/FormalCharge/indices_one_hot.txt
diff --git a/chebai_graph/preprocessing/bin/MoleculeNumRings/indices_one_hot.txt b/chebai_graph/preprocessing/bin/MoleculeNumRings/indices_one_hot.txt
diff --git a/chebai_graph/preprocessing/datasets/chebi.py b/chebai_graph/preprocessing/datasets/chebi.py
Lines changed: 7 additions & 4 deletions
diff --git a/chebai_graph/preprocessing/property_encoder.py b/chebai_graph/preprocessing/property_encoder.py
Lines changed: 27 additions & 11 deletions
diff --git a/chebai_graph/preprocessing/reader.py b/chebai_graph/preprocessing/reader.py
Lines changed: 1 addition & 1 deletion
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 1 addition & 4 deletions
@@ -1,43 +1,77 @@
 
+# ChEB-AI Graph
+
+Graph-based models for molecular property prediction and ontology classification, built on top of the [`python-chebai`](https://github.com/ChEB-AI/python-chebai) codebase.
+
+
 
 ## Installation
 
-Some requirements may not be installed successfully automatically.
-To install the `torch-` libraries, use
+To install this repository, download it and run
 
-`pip install torch-${lib} -f https://data.pyg.org/whl/torch-2.1.0+${CUDA}.html`
+```bash
+pip install .
+```
 
-where `${lib}` is either `scatter`, `geometric`, `sparse` or `cluster`, and
-`${CUDA}` is either `cpu`, `cu118` or `cu121` (depending on your system, see e.g.
-[torch-geometric docs](https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html))
+or install it directly with
+```bash
+pip install git+https://github.com/ChEB-AI/python-chebai-graph.git
+```
 
+The dependencies `torch`, `torch_scatter` and `torch_geometric` cannot be installed automatically.
 
-## Commands
+Use the following command:
 
-For training, config files from the `python-chebai` and `python-chebai-graph` repositories can be combined. This requires that you download the [source code of python-chebai](https://github.com/ChEB-AI/python-chebai). Make sure that you are in the right folder and know the relative path to the other repository.
+```bash
+pip install torch torch_scatter torch_geometric -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html
+```
 
-We recommend the following setup:
+Replace:
+- `${TORCH}` with a PyTorch version (e.g., `2.6.0`; for later versions, check first if they are compatible with torch_scatter and torch_geometric)
+- `${CUDA}` with e.g. `cpu`, `cu118`, or `cu121` depending on your system and CUDA version
 
-  my_projects
-    python-chebai
-      chebai
-      configs
-      data
-      ...
-    python-chebai-graph
-      chebai_graph
-      configs
-      ...
+If you already have `torch` installed, make sure that `torch_scatter` and `torch_geometric` are compatible with your
+PyTorch version and are installed with the same CUDA version.
 
-  If you run the command from the `python-chebai` directory, you can use the same data for both chebai- and chebai-graph-models (e.g., Transformers and GNNs).
-  Then you have to use `{path-to-chebai} -> .` and `{path-to-chebai-graph} -> ../python-chebai-graph`.
+For a full list of currently available PyTorch versions and their CUDA compatibility, refer to the libraries' official documentation:
+- [torch](https://pytorch.org/get-started/locally/)
+- [torch_geometric](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html#installation)
+- [torch-scatter](https://github.com/rusty1s/pytorch_scatter)
 
-Pretraining on a atom / bond masking task with PubChem data (feature-branch):
-```
-python3 -m chebai fit --model={path-to-chebai-graph}/configs/model/gnn_resgated_pretrain.yml --data={path-to-chebai-graph}/configs/data/pubchem_graph.yml --trainer={path-to-chebai}/configs/training/pretraining_trainer.yml
+_Note for developers_: If you want to install the package in editable mode, use the following command instead:
+
+```bash
+pip install -e .
 ```
 
-Training on the ontology prediction task (here for ChEBI50, v231, 200 epochs)
+## Recommended Folder Structure
+
+ChEB-AI Graph is not a standalone library. Instead, it provides additional models and datasets for [`python-chebai`](https://github.com/ChEB-AI/python-chebai).
+The training relies on config files that are located either in `python-chebai` or in this repository.
+
+Therefore, for training, we recommend cloning both repositories into a common parent directory. For instance, your project could look like this:
+
 ```
-python3 -m chebai fit --trainer={path-to-chebai}/configs/training/default_trainer.yml --trainer.callbacks={path-to-chebai}/configs/training/default_callbacks.yml --model={path-to-chebai-graph}/configs/model/gnn_res_gated.yml --model.train_metrics={path-to-chebai}/configs/metrics/micro-macro-f1.yml --model.test_metrics={path-to-chebai}/configs/metrics/micro-macro-f1.yml --model.val_metrics={path-to-chebai}/configs/metrics/micro-macro-f1.yml --data={path-to-chebai-graph}/configs/data/chebi50_graph_properties.yml --model.criterion=c{path-to-chebai}/onfigs/loss/bce.yml --data.init_args.batch_size=40 --trainer.logger.init_args.name=chebi50_bce_unweighted_resgatedgraph --data.init_args.num_workers=12 --model.pass_loss_kwargs=false --data.init_args.chebi_version=231 --trainer.min_epochs=200 --trainer.max_epochs=200
+my_projects/
+├── python-chebai/
+│   ├── chebai/
+│   ├── configs/
+│   └── ...
+└── python-chebai-graph/
+    ├── chebai_graph/
+    ├── configs/
+    └── ...
+```
+
+## Training & Pretraining
+
+### Ontology Prediction
+
+
+This example command trains a Residual Gated Graph Convolutional Network on the ChEBI50 dataset (see [wiki](https://github.com/ChEB-AI/python-chebai/wiki/Data-Management)).
+The dataset has a customizable list of properties for atoms, bonds and molecules that are added to the graph.
+The list can be found in the `configs/data/chebi50_graph_properties.yml` file.
+
+```bash
+python -m chebai fit --trainer=configs/training/default_trainer.yml --trainer.logger=configs/training/csv_logger.yml --model=../python-chebai-graph/configs/model/gnn_res_gated.yml --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --data=../python-chebai-graph/configs/data/chebi50_graph_properties.yml --data.init_args.batch_size=128 --trainer.accumulate_grad_batches=4 --data.init_args.num_workers=10 --model.pass_loss_kwargs=false --data.init_args.chebi_version=241 --trainer.min_epochs=200 --trainer.max_epochs=200 --model.criterion=configs/loss/bce.yml
 ```
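
The `${TORCH}` and `${CUDA}` placeholders in the new install instructions can also be filled in from a script. A minimal sketch, assuming two hypothetical helpers (`pyg_wheel_index` and `pip_command`, neither of which exists in the repositories) that only format the URL pattern and command quoted in the README:

```python
def pyg_wheel_index(torch_version: str, cuda: str) -> str:
    """Format the wheel index URL pattern quoted in the README.

    Hypothetical helper for illustration; it does not check that the
    requested version/CUDA combination actually exists on the server.
    """
    return f"https://data.pyg.org/whl/torch-{torch_version}+{cuda}.html"


def pip_command(torch_version: str, cuda: str) -> str:
    """Build the README install command for one version/CUDA pair."""
    index = pyg_wheel_index(torch_version, cuda)
    return f"pip install torch torch_scatter torch_geometric -f {index}"


# Example values taken from the README text (2.6.0, cpu).
print(pip_command("2.6.0", "cpu"))
```

The values passed in should still be checked against the compatibility tables linked in the README.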
@@ -139,11 +139,14 @@ def get_property_path(self, property: MolecularProperty):
             f"{property.name}_{property.encoder.name}.pt",
         )
 
-    def setup(self, **kwargs):
-        super().setup(keep_reader=True, **kwargs)
-        self._setup_properties()
+    def _after_setup(self, **kwargs):
+        """
+        Finalize the setup process after ensuring the processed data is available.
 
-        self.reader.on_finish()
+        This method performs post-setup tasks like finalizing the reader and setting internal properties.
+        """
+        self._setup_properties()
+        super()._after_setup(**kwargs)
 
     def _merge_props_into_base(self, row):
         geom_data = row["features"]
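
This hunk replaces a full `setup` override with a hook that the base class is expected to call. The sketch below illustrates the pattern with hypothetical stand-in classes (`BaseDataModule`, `GraphDataModule`). The real base class lives in `python-chebai` and its `setup` body is not part of this diff, so the base behaviour shown here is an assumption, not a copy.

```python
class BaseDataModule:
    """Hypothetical stand-in for the python-chebai base class."""

    def __init__(self):
        self.events = []

    def setup(self, **kwargs):
        # Assumed shape: make processed data available, then call the hook.
        self.events.append("base: data prepared")
        self._after_setup(**kwargs)

    def _after_setup(self, **kwargs):
        # Assumed base behaviour: finalize the reader.
        self.events.append("base: reader finalized")


class GraphDataModule(BaseDataModule):
    """Mirrors the override in this diff: properties first, then the base hook."""

    def _setup_properties(self):
        self.events.append("graph: properties set up")

    def _after_setup(self, **kwargs):
        self._setup_properties()
        super()._after_setup(**kwargs)


module = GraphDataModule()
module.setup()
print(module.events)
```

Under these assumptions the order of operations matches the removed code: properties are set up before the reader is finalized.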
 
@@ -37,11 +37,13 @@ class IndexEncoder(PropertyEncoder):
     def __init__(self, property, indices_dir=None, **kwargs):
         super().__init__(property, **kwargs)
         if indices_dir is None:
-            indices_dir = os.path.dirname(__file__)
+            indices_dir = os.path.dirname(inspect.getfile(self.__class__))
         self.dirname = indices_dir
         # load already existing cache
         with open(self.index_path, "r") as pk:
-            self.cache = [x.strip() for x in pk]
+            self.cache: dict[str, int] = {
+                token.strip(): idx for idx, token in enumerate(pk)
+            }
         self.index_length_start = len(self.cache)
         self.offset = 0
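
Resolving the directory through `inspect.getfile(self.__class__)` instead of `__file__` means that a subclass defined in another module looks for its index files next to its own source file, which is presumably the motivation for this change. A small sketch using a standard-library class as a stand-in; the helper name `class_dir` is hypothetical:

```python
import inspect
import json
import os


def class_dir(cls: type) -> str:
    """Return the directory of the source file in which `cls` is defined."""
    return os.path.dirname(inspect.getfile(cls))


# json.JSONDecoder is defined in json/decoder.py, so the result is the
# `json` package directory, not the directory of the calling script.
print(class_dir(json.JSONDecoder))
```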
 
@@ -65,19 +67,33 @@ def index_path(self):
 
     def on_finish(self):
         """Save cache"""
-        with open(self.index_path, "w") as pk:
-            new_length = len(self.cache) - self.index_length_start
-            pk.writelines([f"{c}\n" for c in self.cache])
-            print(
-                f"saved index of property {self.property.name} to {self.index_path}, "
-                f"index length: {len(self.cache)} (new: {new_length})"
-            )
+        total_tokens = len(self.cache)
+        if total_tokens > self.index_length_start:
+            print("New tokens were added to the cache; saving them to the index file...")
+
+            assert sys.version_info >= (
+                3,
+                7,
+            ), "This code requires Python 3.7 or higher."
+            # For python 3.7+, the standard dict type preserves insertion order, and is iterated over in same order
+            # https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights
+            # https://mail.python.org/pipermail/python-dev/2017-December/151283.html
+            new_tokens = list(islice(self.cache, self.index_length_start, total_tokens))
+
+            with open(self.index_path, "a") as pk:
+                pk.writelines([f"{c}\n" for c in new_tokens])
+                print(
+                    f"Appended {len(new_tokens)} new tokens to the index of property {self.property.name} at {self.index_path}"
+                )
+                print(
+                    f"Now, the total length of the index of property {self.property.name} is {total_tokens}"
+                )
 
     def encode(self, token):
         """Returns a unique number for each token, automatically adds new tokens to the cache."""
         if not str(token) in self.cache:
-            self.cache.append(str(token))
-        return torch.tensor([self.cache.index(str(token)) + self.offset])
+            self.cache[str(token)] = len(self.cache)
+        return torch.tensor([self.cache[str(token)] + self.offset])
 
 
 class OneHotEncoder(IndexEncoder):
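
The two `property_encoder.py` hunks together turn the cache from a list into a token-to-index dict and make `on_finish` append only the tokens added during the current run. The sketch below reproduces that logic in a simplified, torch-free class; `TinyIndex` is a hypothetical name, and it returns plain integers where the real encoder returns tensors.

```python
import os
import tempfile
from itertools import islice


class TinyIndex:
    """Simplified stand-in for the IndexEncoder cache logic."""

    def __init__(self, index_path: str):
        self.index_path = index_path
        with open(index_path, "r") as f:
            # token -> position; dicts keep insertion order (Python 3.7+)
            self.cache = {line.strip(): idx for idx, line in enumerate(f)}
        self.index_length_start = len(self.cache)

    def encode(self, token) -> int:
        key = str(token)
        if key not in self.cache:
            self.cache[key] = len(self.cache)  # next free index
        return self.cache[key]  # dict lookup instead of list.index

    def on_finish(self) -> None:
        # Only tokens added during this run are appended to the file.
        new_tokens = list(islice(self.cache, self.index_length_start, None))
        if new_tokens:
            with open(self.index_path, "a") as f:
                f.writelines(f"{t}\n" for t in new_tokens)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "indices.txt")
    with open(path, "w") as f:
        f.write("C\nN\n")

    index = TinyIndex(path)
    codes = [index.encode(t) for t in ["N", "O", "C", "O"]]
    index.on_finish()

    with open(path) as f:
        saved = f.read().splitlines()

print(codes)  # [1, 2, 0, 2]
print(saved)  # ['C', 'N', 'O']
```

One caveat of the dict form: duplicate lines in an index file collapse into a single key, so the cache length no longer equals the number of lines in the file.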
 
@@ -14,7 +14,7 @@
 from chebai_graph.preprocessing.collate import GraphCollator
 
 
-class GraphPropertyReader(dr.ChemDataReader):
+class GraphPropertyReader(dr.DataReader):
     COLLATOR = GraphCollator
 
     def __init__(
 
@@ -6,10 +6,7 @@ authors = [
     { name = "Martin Glauer", email = "[email protected]" }
 ]
 dependencies = [
-    "torch_geometric",
-    "torch-scatter",
-    "torch-sparse",
-    "torch-cluster",
+    "chebai",
     "descriptastorus"
 ]