This repository contains code and data described in detail in our paper (Engler Hart et al., 2025). DOI: doi.org/10.1093/gigascience/giaf033
If you have found our manuscript useful in your work, please consider citing it.
To reproduce the results, the Python virtual environment can be installed using Poetry.
Datasets are publicly available and can be downloaded directly from Zenodo. Unzip the files and place them in the data directory.
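As a rough sketch of the unpacking step, assuming the Zenodo download is a single zip archive, the Python standard library is enough. The archive name `zenodo_download.zip` and the helper `unpack_into_data` are placeholders for illustration, not names used by this repository.

```python
import zipfile
from pathlib import Path


def unpack_into_data(archive_path, data_dir="data"):
    """Extract a downloaded zip archive into the data directory."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # Return the extracted file paths so the caller can check what arrived
    return sorted(p for p in target.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Hypothetical file name; replace it with the archive you downloaded
    archive = Path("zenodo_download.zip")
    if archive.exists():
        for path in unpack_into_data(archive):
            print(path)
```

Run it from the repository root so that the files land in the data directory expected by the notebooks.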
Furthermore, the figures directory contains all the figures of the manuscript as well as the raw and intermediate files, all of which are generated by the notebooks.
Run the notebooks located in the notebooks directory corresponding to each analysis. The prefix of each notebook indicates the order in which the notebooks are run. The notebooks reproduce the figures in the manuscript and the supplementary material.
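To list the notebooks in run order, a small helper like the one below may be convenient. It is only a sketch: it assumes the prefix is a leading integer in the file name (for example `1_...ipynb`), and the function name `ordered_notebooks` is hypothetical.

```python
import re
from pathlib import Path


def ordered_notebooks(notebook_dir="notebooks"):
    """Return notebook paths sorted by their leading numeric prefix."""

    def prefix(path):
        match = re.match(r"(\d+)", path.name)
        # Notebooks without a numeric prefix are sorted last, by name
        return (int(match.group(1)) if match else float("inf"), path.name)

    return sorted(Path(notebook_dir).glob("*.ipynb"), key=prefix)


if __name__ == "__main__":
    for notebook in ordered_notebooks():
        print(notebook.name)
```

Sorting on the integer value rather than the raw string keeps a notebook prefixed with 10 after one prefixed with 2.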
- Clone the repository to get its contents locally:
git clone https://github.com/enveda/chemical-space-estimation.git
cd chemical-space-estimation
- Prepare the data directory as mentioned in the Data section.
- Initiate the Poetry environment with Python using the following commands:
poetry install
poetry init
NOTE: Please ensure you have Python version 3.10 installed on your system to run the code. If not, please install the version appropriate for your OS.
To run the Jupyter notebooks, select the virtual environment installed above.
To run Python files, use the following command in the terminal:
poetry run python REPLACE_WITH_FILE_NAME.py
- Once all dependencies are installed, you can run the notebooks for the following analyses:
- Data Statistics - Data collection and summary on ENPKG.
- Chemical class distribution - Chemical class distribution in the ENPKG dataset.
- Entropy score - Calculating the Spectral Entropy Scores for the dataset.
- Entropy score matrix - Generating the Entropy Score matrix.
- Spectral clustering - Generating MS2 spectra clusters in the dataset.
- Comparing the known space - Comparison of the known chemical space between literature curated databases and the ENPKG dataset.
- Estimating chemical space - Approximate calculations for the chemical space required for understanding the plant landscape.
- Diversity assessment - Assessing the diversity of ENPKG datasets within the plant kingdom.
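For readers unfamiliar with the entropy score, the sketch below illustrates the usual definition of spectral entropy: the Shannon entropy of a spectrum's peak intensities after normalizing them to sum to 1. This is an illustration only. The notebooks may use a dedicated library and additional preprocessing, and the function name `spectral_entropy` is hypothetical.

```python
import math


def spectral_entropy(intensities):
    """Shannon entropy (natural log) of normalized peak intensities."""
    positive = [i for i in intensities if i > 0]  # ignore empty or zero peaks
    total = sum(positive)
    if total == 0:
        return 0.0
    # Normalize each intensity to a probability and sum -p * ln(p)
    return -sum((i / total) * math.log(i / total) for i in positive)


if __name__ == "__main__":
    print(spectral_entropy([100.0]))                # single peak
    print(spectral_entropy([1.0, 1.0, 1.0, 1.0]))   # four equal peaks
```

A spectrum with a single peak has entropy 0, while four equally intense peaks give ln 4 (about 1.386), so higher values indicate signal spread across more fragments.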