Skip to content

Commit ce95c87

Browse files
authored
Filter preprocessed assembly mmCIF files, fix various bugs in the filtering criterion functions, and optimize the clustering script (#69)
* Update README.md * Update biomolecule.py * Update mmcif_parsing.py * Update test_data_parsing.py * Update filter_pdb_mmcifs.py * Update README.md * Update cluster_pdb_mmcifs.py * Update README.md
1 parent 19de641 commit ce95c87

File tree

6 files changed

+275
-199
lines changed

6 files changed

+275
-199
lines changed

README.md

Lines changed: 21 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,8 @@ A fork with full Lightning + Hydra support is being maintained by <a href="https
2828

2929
- <a href="https://github.com/amorehead">Alex</a> for the PDB dataset preparation script!
3030

31+
- <a href="https://github.com/milot-mirdita">Milot</a> for optimizing the PDB dataset clustering script!
32+
3133
- <a href="https://github.com/patrick-kidger">Patrick</a> for <a href="https://docs.kidger.site/jaxtyping/">jaxtyping</a>, <a href="https://github.com/fferflo">Florian</a> for <a href="https://github.com/fferflo/einx">einx</a>, and of course, <a href="https://github.com/arogozhnikov">Alex</a> for <a href="https://einops.rocks/">einops</a>
3234

3335
## Install
@@ -198,18 +200,26 @@ assert sampled_atom_pos.shape == (1, (6 + 5), 3)
198200

199201
### PDB dataset curation
200202

201-
To acquire the AlphaFold 3 PDB dataset, first download all complexes in the Protein Data Bank (PDB), and then preprocess them with the script referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The Python script below (i.e., `filter_pdb_mmcifs.py`) assumes you have downloaded the PDB in the **mmCIF file format**, placing it at `data/pdb_data/unfiltered_mmcifs/`. On the RCSB website, navigate down to "Download Protocols", and follow the download instructions depending on your location.
203+
To acquire the AlphaFold 3 PDB dataset, first download all first-assembly (and asymmetric unit) complexes in the Protein Data Bank (PDB), and then preprocess them with the script referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The Python script below (i.e., `filter_pdb_mmcifs.py`) assumes you have downloaded the PDB in the **mmCIF file format**, placing it at `data/pdb_data/unfiltered_assembly_mmcifs/` (and `data/pdb_data/unfiltered_asym_mmcifs/`, respectively). On the RCSB website, navigate down to "Download Protocols", and follow the download instructions depending on your location.
202204

203-
For example, one can use the following command to download the PDB as a collection of mmCIF files:
205+
For example, one can use the following commands to download the PDB as a collection of mmCIF files:
204206
```bash
207+
# For `assembly1` complexes
208+
rsync -rlpt -v -z --delete --port=33444 \
209+
rsync.rcsb.org::ftp_data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs/
210+
# For asymmetric unit complexes
205211
rsync -rlpt -v -z --delete --port=33444 \
206-
rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_mmcifs/
212+
rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs/
207213
```
208214

209215
> WARNING: Downloading PDB can take up to 1TB of space.
210216
211-
After downloading, you should have a directory formatted like this:
212-
https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/
217+
> NOTE: PDB also hosts snapshots on AWS: https://pdbsnapshots.s3.us-west-2.amazonaws.com/index.html.
218+
219+
> TODO: Use a specific snapshot to make training reproducible.
220+
221+
After downloading, you should have two directories formatted like this:
222+
https://files.rcsb.org/pub/pdb/data/assemblies/mmCIF/divided/ & https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/
213223
```bash
214224
00/
215225
01/
@@ -218,9 +228,10 @@ https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/
218228
zz/
219229
```
220230

221-
In this directory, unzip all the files:
231+
For these directories, unzip all the files:
222232
```bash
223-
find . -type f -name "*.gz" -exec gzip -d {} \;
233+
find ./data/pdb_data/unfiltered_assembly_mmcifs/ -type f -name "*.gz" -exec gzip -d {} \;
234+
find ./data/pdb_data/unfiltered_asym_mmcifs/ -type f -name "*.gz" -exec gzip -d {} \;
224235
```
225236

226237
Next run the commands
@@ -235,12 +246,12 @@ find data/ccd_data/ -type f -name "*.gz" -exec gzip -d {} \;
235246

236247
### PDB dataset filtering
237248

238-
Then run the following with `pdb_dir`, `ccd_dir`, and `mmcif_output_dir` replaced with the locations of your local copies of the PDB, CCD, and your desired dataset output directory (i.e., `./data/pdb_data/unfiltered_mmcifs/`, `./data/ccd_data/`, and `./data/pdb_data/mmcifs/`).
249+
Then run the following with `pdb_assembly_dir`, `pdb_asym_dir`, `ccd_dir`, and `mmcif_output_dir` replaced with the locations of your local copies of the first-assembly PDB, asymmetric unit PDB, CCD, and your desired dataset output directory (i.e., `./data/pdb_data/unfiltered_assembly_mmcifs/`, `./data/pdb_data/unfiltered_asym_mmcifs/`, `./data/ccd_data/`, and `./data/pdb_data/mmcifs/`).
239250
```bash
240-
python scripts/filter_pdb_mmcifs.py --mmcif_dir <pdb_dir> --ccd_dir <ccd_dir> --output_dir <mmcif_output_dir>
251+
python scripts/filter_pdb_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --ccd_dir <ccd_dir> --output_dir <mmcif_output_dir>
241252
```
242253

243-
See the script for more options. Each mmCIF that successfully passes
254+
See the script for more options. Each first-assembly mmCIF that successfully passes
244255
all processing steps will be written to `mmcif_output_dir` within a subdirectory
245256
named according to the mmCIF's second and third PDB ID characters (e.g. `5c`).
246257

alphafold3_pytorch/common/biomolecule.py

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -721,10 +721,6 @@ def to_mmcif(
721721
mmcif_dict["_pdbx_struct_assembly.oligomeric_count"].append(
722722
str(pdbx_struct_assembly_oligomeric_count[assembly_id])
723723
)
724-
assert mmcif_dict[
725-
"_pdbx_struct_assembly_gen.assembly_id"
726-
], "No _pdbx_struct_assembly_gen.assembly_id entries found."
727-
assert mmcif_dict["_pdbx_struct_assembly.id"], "No _pdbx_struct_assembly.id entries found."
728724

729725
# Populate the _chem_comp table.
730726
for chem_comp in biomol.chem_comp_table:

alphafold3_pytorch/data/mmcif_parsing.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -594,7 +594,7 @@ def _get_header(parsed_info: MmCIFDict) -> PdbHeader:
594594
if "_pdbx_audit_revision_history.revision_date" in parsed_info:
595595
header["release_date"] = get_release_date(parsed_info)
596596
else:
597-
logging.warning("Could not determine release_date: %s", parsed_info["_entry.id"])
597+
logging.warning("Could not determine release_date for entry: %s", parsed_info["_entry.id"])
598598

599599
header["resolution"] = 0.00
600600
for res_key in (

scripts/cluster_pdb_mmcifs.py

Lines changed: 53 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@
2525
import os
2626
import subprocess
2727
from concurrent.futures import ProcessPoolExecutor, as_completed
28-
from typing import Dict, List, Literal, Optional, Set, Tuple
28+
from typing import Dict, List, Literal, Optional, Set, Tuple, Union
2929

3030
import numpy as np
3131
import pandas as pd
@@ -140,16 +140,20 @@ def convert_modified_residue_three_to_one(
140140

141141
if residue_mol_type == "protein":
142142
return (
143-
PROTEIN_LETTERS_3TO1[mapped_residue]
144-
if mapped_residue in PROTEIN_LETTERS_3TO1
145-
else "X",
143+
(
144+
PROTEIN_LETTERS_3TO1[mapped_residue]
145+
if mapped_residue in PROTEIN_LETTERS_3TO1
146+
else "X"
147+
),
146148
"protein",
147149
)
148150
elif residue_mol_type in {"rna", "dna"}:
149151
return (
150-
NUCLEIC_LETTERS_3TO1[mapped_residue]
151-
if mapped_residue in NUCLEIC_LETTERS_3TO1
152-
else "X",
152+
(
153+
NUCLEIC_LETTERS_3TO1[mapped_residue]
154+
if mapped_residue in NUCLEIC_LETTERS_3TO1
155+
else "X"
156+
),
153157
"nucleic_acid",
154158
)
155159
else:
@@ -366,8 +370,7 @@ def cluster_sequences_using_mmseqs2(
366370
min_seq_id: float = 0.5,
367371
coverage: float = 0.8,
368372
coverage_mode: Literal[0, 1, 2, 3] = 1,
369-
k_mer_length: Optional[int] = None,
370-
spaced_k_mer_pattern: Optional[str] = None,
373+
extra_parameters: Optional[Dict[str, Union[int, float, str]]] = None,
371374
) -> Dict[str, int]:
372375
"""Run MMseqs2 on the input FASTA file and write the resulting clusters to a local output directory."""
373376
assert input_filepath.endswith(".fasta"), "The input file must be a FASTA file."
@@ -396,10 +399,10 @@ def cluster_sequences_using_mmseqs2(
396399
"--cov-mode",
397400
str(coverage_mode),
398401
]
399-
if k_mer_length:
400-
mmseqs_command.extend(["-k", str(k_mer_length)])
401-
if spaced_k_mer_pattern:
402-
mmseqs_command.extend(["--spaced-kmer-pattern", spaced_k_mer_pattern])
402+
if extra_parameters:
403+
for key, value in extra_parameters.items():
404+
mmseqs_command.extend([key, str(value)])
405+
403406
subprocess.run(mmseqs_command)
404407
assert os.path.isfile(
405408
output_cluster_filepath
@@ -748,6 +751,11 @@ def cluster_interfaces(
748751
min_seq_id=0.4,
749752
coverage=0.8,
750753
coverage_mode=0,
754+
extra_parameters={
755+
# cluster reassign improves clusters by reassigning sequences to the best cluster
756+
# and fixes transitivity issues of the cascade clustering
757+
"--cluster-reassign": 1,
758+
},
751759
)
752760

753761
if not nucleic_acid_chain_cluster_mapping:
@@ -762,9 +770,14 @@ def cluster_interfaces(
762770
min_seq_id=1.0,
763771
coverage=0.8,
764772
coverage_mode=0,
765-
# NOTE: The following arguments were taken from: https://github.com/soedinglab/MMseqs2/issues/373#issuecomment-728166556
766-
k_mer_length=6,
767-
spaced_k_mer_pattern="11011101",
773+
extra_parameters={
774+
# 7 or 8 should work best, something to test
775+
"-k": 8,
776+
# there is currently an issue in mmseqs2 with nucleotide search and spaced k-mers
777+
"--spaced-kmer-mode": 0,
778+
# see above
779+
"--cluster-reassign": 1,
780+
},
768781
)
769782

770783
if not peptide_chain_cluster_mapping:
@@ -779,9 +792,30 @@ def cluster_interfaces(
779792
min_seq_id=1.0,
780793
coverage=0.8,
781794
coverage_mode=0,
782-
# NOTE: The following arguments were taken from: https://github.com/soedinglab/MMseqs2/issues/373#issuecomment-728166556
783-
k_mer_length=6,
784-
spaced_k_mer_pattern="11011101",
795+
# some of these parameters are from the spacepharer optimized parameters
796+
# these were for short CRISPR spacer recognition, so they should work well for arbitrary peptides
797+
# this is a adhoc solution, with some recent new introduction like the ungapped prefilter
798+
extra_parameters={
799+
# spacepharer optimized parameters
800+
"--gap-open": 16,
801+
"--gap-extend": 2,
802+
"--sub-mat": "VTML40.out",
803+
# we would like to try using ungapped prefilter mode to avoid
804+
# minimum consecutive k-mer match restrictions, but the cluster workflow doesn't expose this yet
805+
# let's use a real small k-mer size instead
806+
# "--prefilter-mode": 1,
807+
"-k": 5,
808+
"--spaced-kmer-mode": 0,
809+
# Don't try suppresing FP hits since the peptides are too short
810+
"--mask": 0,
811+
"--comp-bias-corr": 0,
812+
# let more things through the prefilter
813+
"--min-ungapped-score": 5,
814+
# Try disabling completely with "inf"?
815+
"-e": 1,
816+
# see above
817+
"--cluster-reassign": 1,
818+
},
785819
)
786820

787821
if not ligand_chain_cluster_mapping:

0 commit comments

Comments
 (0)