Protein-Engineering-Framework
diff --git a/.github/imgs/mut_performance_DCA_ESM.png b/.github/imgs/mut_performance_DCA_ESM.png
5.26 MB
diff --git a/.github/imgs/mut_performance_violin_DCA_ESM.png b/.github/imgs/mut_performance_violin_DCA_ESM.png
767 KB
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
Lines changed: 32 additions & 4 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 25 additions & 0 deletions
diff --git a/.vscode/launch.json b/.vscode/launch.json
Lines changed: 94 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 18 additions & 13 deletions
@@ -9,13 +9,12 @@ permissions:
   contents: read
 
 jobs:
-  build:
-
+  ubuntu:
+    name: ubuntu
     runs-on: [ubuntu-latest]
     strategy:
       matrix:
-        python-version: ["3.9", "3.10", "3.11", "3.12"]
-        
+        python-version: ["3.10", "3.11", "3.12"]
     steps:
     - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}
@@ -37,3 +36,32 @@ jobs:
     - name: Export Pythonpath and run PyPEF API and CLI version test with pytest
       run: |
         export PYTHONPATH="${PYTHONPATH}:${PWD}" && python -m pytest tests/
+
+  windows:
+    name: windows
+    runs-on: [windows-latest]
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+    - uses: actions/checkout@v4
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v5
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Display Path and Python version
+      run: |
+        python -c "import sys, platform; print(sys.version, platform.system())"
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install flake8 pytest
+        pip install -r requirements.txt
+    - name: Lint with flake8
+      run: |
+        # stop the build if there are Python syntax errors or undefined names
+        flake8 .\pypef --count --select=E9,F63,F7,F82 --show-source --statistics
+    - name: Export Pythonpath and run PyPEF API and CLI version test with pytest
+      shell: pwsh
+      run: |
+        $env:PYTHONPATH = "${PWD};${env:PYTHONPATH}";python -m pytest .\tests\
@@ -27,6 +27,7 @@ scripts/ProteinGym_runs/single_point_mut_performance.png
 scripts/ProteinGym_runs/multi_point_mut_performance.png
 
 # Created test/output files
+model_saves/*
 scripts/Setup/windows/Miniconda3-latest-Windows-x86_64.exe
 scripts/Setup/windows/Miniconda3/*
 scripts/Encoding_low_N/apc.png
@@ -402,3 +403,27 @@ scripts/Runtime_tests/runtimes.png
 datasets/AVGFP/Recomb_Double_Split/Predictions_Hybrid_TopRecomb_Double_Split.txt
 scripts/ProteinGym_runs/single_point_mut_performance_violin.png
 scripts/ProteinGym_runs/multi_point_mut_performance_violin.png
+scripts/ESM_finetuning/DMS_msa_files/
+scripts/ESM_finetuning/DMS_ProteinGym_substitutions/
+scripts/ESM_finetuning/ProteinGym_AF2_structures/
+
+scripts/ESM_finetuning/higher_point_dms_mut_data.json
+scripts/ESM_finetuning/single_point_dms_mut_data.json
+scripts/ESM_finetuning/results/dca_esm_and_hybrid_opt_results_clean.csv
+scripts/ESM_finetuning/results/dca_esm_and_hybrid_opt_results.csv
+scripts/ESM_finetuning/mut_performance.png
+scripts/ESM_finetuning/_Description_DMS_substitutions_data.csv
+scripts/ESM_finetuning/mut_performance_violin.png
+datasets/ANEH/SSM_landscape.png
+datasets/ANEH/SSM_landscape.csv
+datasets/AVGFP/model_saves/*
+datasets/AVGFP/Pickles/*
+datasets/AVGFP/DCA_Hybrid_Model_Performance_ESM1v_no_ML.png
+datasets/AVGFP/DCA_Hybrid_Model_Performance_ProSST_no_ML.png
+
+# Large files // LFS in niklases/PyPEF
+datasets/ANEH/ANEH_72.6.params
+datasets/AVGFP/uref100_avgfp_jhmmer_119_plmc_42.6.params
+datasets/AVGFP/uref100_avgfp_jhmmer_119.sto
+datasets/GRB2/GRB2_HUMAN_full_11-26-2021_b05.a2m
+datasets/ANEH/ANEH_jhmmer.sto
@@ -123,6 +123,46 @@
         },
 
         {
+            "name": "Python: PyPEF hybrid LS-TS GREMLIN-DCA-ESM1v avGFP",
+            "type": "debugpy",
+            "request": "launch",
+            "env": {"PYTHONPATH": "${workspaceFolder}"},
+            "program": "${workspaceFolder}/pypef/main.py",
+            "console": "integratedTerminal",
+            "justMyCode": true,
+            "cwd": "${workspaceFolder}/datasets/AVGFP/",
+            "args": [
+                "hybrid", 
+                //"-m", "GREMLIN",   // optional, not required  
+                "--ls", "LS.fasl",
+                "--ts", "TS.fasl", 
+                "--params", "GREMLIN",
+                "--llm", "esm"
+            ]
+        },
+
+        {
+            "name": "Python: PyPEF hybrid LS-TS GREMLIN-DCA-ProSST avGFP",
+            "type": "debugpy",
+            "request": "launch",
+            "env": {"PYTHONPATH": "${workspaceFolder}"},
+            "program": "${workspaceFolder}/pypef/main.py",
+            "console": "integratedTerminal",
+            "justMyCode": true,
+            "cwd": "${workspaceFolder}/datasets/AVGFP/",
+            "args": [
+                "hybrid", 
+                //"-m", "GREMLIN",   // optional, not required  
+                "--ls", "LS.fasl",
+                "--ts", "TS.fasl", 
+                "--params", "GREMLIN",
+                "--llm", "prosst",
+                "--wt", "P42212_F64L.fasta",
+                "--pdb", "GFP_AEQVI.pdb"
+            ]
+        },
+
+        { // Test on test set
             "name": "Python: PyPEF hybrid/only-TS-zero-shot GREMLIN-DCA avGFP",
             "type": "debugpy",
             "request": "launch",
@@ -139,6 +179,24 @@
             ]
         },
 
+        { // Test on test set: Hybrid DCA-LLM ESM1v
+            "name": "Python: PyPEF hybrid/only-TS-zero-shot GREMLIN-DCA-ESM1v avGFP",
+            "type": "debugpy",
+            "request": "launch",
+            "env": {"PYTHONPATH": "${workspaceFolder}"},
+            "program": "${workspaceFolder}/pypef/main.py",
+            "console": "integratedTerminal",
+            "justMyCode": true,
+            "cwd": "${workspaceFolder}/datasets/AVGFP/",
+            "args": [
+                "hybrid", 
+                //"-m", "GREMLIN",   // optional, not required  
+                "--ts", "TS.fasl", 
+                "--params", "GREMLIN",
+                "--llm", "esm"
+            ]
+        },
+
         {
             "name": "Python: PyPEF hybrid/only-PS-zero-shot GREMLIN-DCA avGFP",
             "type": "debugpy",
@@ -156,6 +214,23 @@
             ]
         },
 
+        {
+            "name": "Python: PyPEF hybrid/only-PS-zero-shot GREMLIN-DCA avGFP drecomb",
+            "type": "debugpy",
+            "request": "launch",
+            "env": {"PYTHONPATH": "${workspaceFolder}"},
+            "program": "${workspaceFolder}/pypef/main.py",
+            "console": "integratedTerminal",
+            "justMyCode": true,
+            "cwd": "${workspaceFolder}/datasets/AVGFP/",
+            "args": [
+                "hybrid", 
+                "-m", "GREMLIN", 
+                "--pmult", "--drecomb", 
+                "--params", "GREMLIN"
+            ]
+        },
+
         {
             "name": "Python: PyPEF hybrid/only-PS-zero-shot GREMLIN-DCA avGFP drecomb II",
             "type": "debugpy",
@@ -174,7 +249,7 @@
         },
 
         {
-            "name": "Python: PyPEF hybrid/only-PS-zero-shot GREMLIN-DCA avGFP drecomb",
+            "name": "Python: PyPEF hybrid/only-PS-zero-shot GREMLIN-DCA avGFP drecomb III: ESM",
             "type": "debugpy",
             "request": "launch",
             "env": {"PYTHONPATH": "${workspaceFolder}"},
@@ -184,7 +259,24 @@
             "cwd": "${workspaceFolder}/datasets/AVGFP/",
             "args": [
                 "hybrid", 
-                "-m", "GREMLIN", 
+                "-m", "HYBRIDgremlinesm", 
+                "--pmult", "--drecomb", 
+                "--params", "GREMLIN"
+            ]
+        },
+
+        {
+            "name": "Python: PyPEF hybrid/only-PS-zero-shot GREMLIN-DCA avGFP drecomb IV: ProSST",
+            "type": "debugpy",
+            "request": "launch",
+            "env": {"PYTHONPATH": "${workspaceFolder}"},
+            "program": "${workspaceFolder}/pypef/main.py",
+            "console": "integratedTerminal",
+            "justMyCode": true,
+            "cwd": "${workspaceFolder}/datasets/AVGFP/",
+            "args": [
+                "hybrid", 
+                "-m", "HYBRIDgremlinprosst", 
                 "--pmult", "--drecomb", 
                 "--params", "GREMLIN"
             ]
 
@@ -22,7 +22,7 @@
 # PyPEF: Pythonic Protein Engineering Framework
 [![PyPI version](https://img.shields.io/pypi/v/PyPEF?color=blue)](https://pypi.org/project/pypef/)
 [![Python version](https://img.shields.io/pypi/pyversions/PyPEF)](https://www.python.org/downloads/)
-[![Build](https://github.com/Protein-Engineering-Framework/PyPEF/actions/workflows/build.yml/badge.svg)](https://github.com/Protein-Engineering-Framework/PyPEF/actions?query=workflow:build)
+[![Build](https://github.com/niklases/PyPEF/actions/workflows/build.yml/badge.svg)](https://github.com/niklases/PyPEF/actions/?query=workflow:build)
 [![PyPI Downloads](https://static.pepy.tech/badge/pypef)](https://pepy.tech/projects/pypef)
 
 a framework written in Python 3 for performing sequence-based machine learning-assisted protein engineering to predict a protein's fitness from its sequence using different forms of sequence encoding:
@@ -69,15 +69,15 @@ A rudimentary graphical user interface (GUI) can be installed using the gui_setu
 
 Windows (PowerShell)
 ```powershell
-Invoke-WebRequest https://raw.githubusercontent.com/Protein-Engineering-Framework/PyPEF/refs/heads/master/gui_setup.bat -OutFile gui_setup.bat
-Invoke-WebRequest https://raw.githubusercontent.com/Protein-Engineering-Framework/PyPEF/refs/heads/master/gui/qt_window.py -OutFile ( New-Item -Path ".\gui\qt_window.py" -Force )
+Invoke-WebRequest https://raw.githubusercontent.com/niklases/PyPEF/refs/heads/main/gui_setup.bat -OutFile gui_setup.bat
+Invoke-WebRequest https://raw.githubusercontent.com/niklases/PyPEF/refs/heads/main/gui/qt_window.py -OutFile ( New-Item -Path ".\gui\qt_window.py" -Force )
 .\gui_setup.bat
 ```
 
 Linux
 ```bash
-wget https://raw.githubusercontent.com/Protein-Engineering-Framework/PyPEF/refs/heads/master/gui_setup.sh -O gui_setup.sh
-mkdir -p ./gui/ && wget https://raw.githubusercontent.com/Protein-Engineering-Framework/PyPEF/refs/heads/master/gui/qt_window.py -O ./gui/qt_window.py
+wget https://raw.githubusercontent.com/niklases/PyPEF/refs/heads/main/gui_setup.sh -O gui_setup.sh
+mkdir -p ./gui/ && wget https://raw.githubusercontent.com/niklases/PyPEF/refs/heads/main/gui/qt_window.py -O ./gui/qt_window.py
 chmod a+x ./gui_setup.sh && ./gui_setup.sh
 ```
 
@@ -218,8 +218,8 @@ bash Anaconda3-2023.03-1-Linux-x86_64.sh
 ```
 
 After accepting all steps, the conda setup should also be written to your `~/.bashrc` file, so that you can call Anaconda by typing `conda`.
-Next, to download this repository click Code > Download ZIP and unzip the zipped file, e.g. with `unzip PyPEF-master.zip`, or just clone this repository using your bash shell to your local machine `git clone https://github.com/Protein-Engineering-Framework/PyPEF`.
-To set up a new environment with conda you can either create the conda environment from the provided YAML file inside the PyPEF directory (`cd PyPEF` or `cd PyPEF-master` dependent on the downloaded file name and chose YAML file for your operating system):
+Next, to download this repository, click Code > Download ZIP and unzip the file, e.g. with `unzip PyPEF-main.zip`, or clone the repository to your local machine from your bash shell with `git clone https://github.com/niklases/PyPEF`.
+To set up a new environment with conda, you can either create the conda environment from the provided YAML file inside the PyPEF directory (`cd PyPEF` or `cd PyPEF-main`, depending on the downloaded file name; choose the YAML file for your operating system):
 
 ```
 conda env create --file linux_env.yml
@@ -237,7 +237,7 @@ To activate the environment you can define:
 conda activate pypef
 ```
 
-After activating the environment you can install required packages after changing the directory to the PyPEF directory (`cd PyPEF` or `cd PyPEF-master`) and install required packages with pip if you did not use the YAML file for creating the environment (if using conda, packages will be installed in anaconda3/envs/pypef/lib/python3.10/site-packages):
+After activating the environment, change to the PyPEF directory (`cd PyPEF` or `cd PyPEF-main`) and, if you did not use the YAML file for creating the environment, install the required packages with pip (if using conda, packages will be installed in anaconda3/envs/pypef/lib/python3.10/site-packages):
 
 ```
 python3 -m pip install -r requirements.txt
@@ -327,23 +327,23 @@ The following model hyperparameter ranges are tested during (*k*-fold) cross-val
 PyPEF was developed to be run from a command-line interface while `python3 ./pypef/main.py` (when using the downloaded version of this repository and setting the `PYTHONPATH`) is equal to `pypef` when installed with pip. 
 Downloading/cloning the repository files (manually or with `wget`/`git clone`):<br>
 ```
-wget https://github.com/Protein-Engineering-Framework/PyPEF/archive/refs/heads/master.zip
+wget https://github.com/niklases/PyPEF/archive/main.zip
 ```
 
 Unzipping the zipped file (manually or e.g. with `unzip`):
 ```
-unzip master.zip
+unzip main.zip
 ```
 
 Setting the `PYTHONPATH` (so that no import errors occur stating that the package `pypef` and thus dependent absolute imports are unknown):<br>
 &nbsp;&nbsp;Windows (example path, PowerShell)
 ```
-$env:PYTHONPATH="C:\Users\name\path\to\PyPEF-master"
+$env:PYTHONPATH="C:\Users\name\path\to\PyPEF-main"
 ```
 
 &nbsp;&nbsp;Linux (example path)
 ```
-export PYTHONPATH="${PYTHONPATH}:/home/name/path/to/PyPEF-master"
+export PYTHONPATH="${PYTHONPATH}:/home/name/path/to/PyPEF-main"
 ```
 Installing the requirements:<br>
 &nbsp;&nbsp;Windows (PowerShell)
@@ -356,7 +356,7 @@ python -m pip install -r requirements.txt
 python3 -m pip install -r requirements.txt
 ```
 
-Running the main script (from PyPEF-master directory):<br>
+Running the main script (from PyPEF-main directory):<br>
 &nbsp;&nbsp;Windows (PowerShell)
 ```
 python .\pypef\main.py
@@ -485,6 +485,11 @@ The performance of the GREMLIN model used is shown in the following for predicti
 
 for ProteinGym datasets computed using the scripts located at [scripts/ProteinGym_runs](scripts/ProteinGym_runs).
 
+A hybrid GREMLIN-ESM1v low-N-tuned model achieved higher performance than the pure DCA-tuned model (script available at [scripts/ESM_finetuning](scripts/ESM_finetuning)):
+<p align="center">
+    <img src=".github/imgs/mut_performance_violin_DCA_ESM.png" alt="drawing" width="250"/>
+</p>
+
 <a name="api-usage"></a>
 ## API Usage for Sequence Encoding
 For script-based encoding of sequences using PyPEF and the available AAindex-, OneHot- or DCA-based techniques, the classes and corresponding functions can be imported, i.e. `OneHotEncoding`, `AAIndexEncoding`, `GREMLIN` (DCA),  `PLMC` (DCA), and `DCAHybridModel`. In addition, implemented functions for CV-based tuning of regression models can be used to train and validate models, eventually deriving them to obtain performances on retained data for testing. An exemplary script and a Jupyter notebook for CV-based (low-*N*) tuning of models and using them for testing is provided at [scripts/Encoding_low_N/api_encoding_train_test.py](scripts/Encoding_low_N/api_encoding_train_test.py) and [scripts/Encoding_low_N/api_encoding_train_test.ipynb](scripts/Encoding_low_N/api_encoding_train_test.ipynb), respectively.
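
As a rough illustration of what a one-hot sequence encoding produces, the following is a minimal, self-contained sketch in plain Python. It does not use PyPEF's `OneHotEncoding` class, whose interface is not shown here; the helper `one_hot_encode`, the residue ordering, and the example sequences are assumptions made purely for illustration.

```python
# Hypothetical sketch: NOT the PyPEF API, only plain Python.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues, assumed fixed order
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a flat 0/1 list of length len(sequence) * 20."""
    encoded = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1  # raises KeyError for non-canonical residues
        encoded.extend(row)
    return encoded


wild_type = "MSKGE"  # made-up example sequence
variant = "MSKGD"    # made-up single substitution at the last position
x_wt = one_hot_encode(wild_type)
x_var = one_hot_encode(variant)

# A single substitution flips exactly two entries of the flat vector:
# the old residue's slot goes to 0 and the new residue's slot goes to 1.
n_changed = sum(a != b for a, b in zip(x_wt, x_var))
```

Vectors of this kind (one per variant) are what a regression model would be fitted on; the scripts referenced above show how PyPEF itself does this.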