Commit ee68e6d

Add AF3 validation dataset curation code (#111)
* Update README.md
* Update filter_pdb_mmcifs.py
* Rename filter_pdb_mmcifs.py to filter_pdb_train_mmcifs.py
* Create filter_pdb_val_mmcifs.py
* Update cluster_pdb_mmcifs.py
* Rename cluster_pdb_mmcifs.py to cluster_pdb_train_mmcifs.py
* Create cluster_pdb_val_mmcifs.py
* Update test_data_parsing.py
* Update filter_pdb_train_mmcifs.py
* Update filter_pdb_val_mmcifs.py
1 parent 45d887e commit ee68e6d

6 files changed

+2280
-40
lines changed

README.md

Lines changed: 10 additions & 8 deletions
@@ -212,7 +212,7 @@ assert sampled_atom_pos.shape == (1, (5 + 4), 3)

 ### PDB dataset curation

-To acquire the AlphaFold 3 PDB dataset, first download all first-assembly (and asymmetric unit) complexes in the Protein Data Bank (PDB), and then preprocess them with the script referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The two Python scripts below (i.e., `filter_pdb_mmcifs.py` and `cluster_pdb_mmcifs.py`) assume you have downloaded the PDB in the **mmCIF file format**, placing its first-assembly and asymmetric unit mmCIF files at `data/pdb_data/unfiltered_assembly_mmcifs/` and `data/pdb_data/unfiltered_asym_mmcifs/`, respectively.
+To acquire the AlphaFold 3 PDB dataset, first download all first-assembly (and asymmetric unit) complexes in the Protein Data Bank (PDB), and then preprocess them with the script referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The two Python scripts below (i.e., `filter_pdb_{train,val}_mmcifs.py` and `cluster_pdb_{train,val}_mmcifs.py`) assume you have downloaded the PDB in the **mmCIF file format**, placing its first-assembly and asymmetric unit mmCIF files at `data/pdb_data/unfiltered_assembly_mmcifs/` and `data/pdb_data/unfiltered_asym_mmcifs/`, respectively.

 For reproducibility, we recommend downloading the PDB using AWS snapshots (e.g., `20240101`). To do so, refer to [AWS's documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html) to set up the AWS CLI locally. Alternatively, on the RCSB website, navigate down to "Download Protocols", and follow the download instructions depending on your location.

@@ -263,25 +263,27 @@ find data/ccd_data/ -type f -name "*.gz" -exec gzip -d {} \;

 ### PDB dataset filtering

-Then run the following with `pdb_assembly_dir`, `pdb_asym_dir`, `ccd_dir`, and `mmcif_output_dir` replaced with the locations of your local copies of the first-assembly PDB, asymmetric unit PDB, CCD, and your desired dataset output directory (i.e., `./data/pdb_data/unfiltered_assembly_mmcifs/`, `./data/pdb_data/unfiltered_asym_mmcifs/`, `./data/ccd_data/`, and `./data/pdb_data/mmcifs/`).
+Then run the following with `pdb_assembly_dir`, `pdb_asym_dir`, `ccd_dir`, and `mmcif_output_dir` replaced with the locations of your local copies of the first-assembly PDB, asymmetric unit PDB, CCD, and your desired dataset output directory (i.e., `./data/pdb_data/unfiltered_assembly_mmcifs/`, `./data/pdb_data/unfiltered_asym_mmcifs/`, `./data/ccd_data/`, and `./data/pdb_data/{train,val}_mmcifs/`).
 ```bash
-python scripts/filter_pdb_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --ccd_dir <ccd_dir> --output_dir <mmcif_output_dir>
+python scripts/filter_pdb_train_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --ccd_dir <ccd_dir> --output_dir <mmcif_output_dir>
+python scripts/filter_pdb_val_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>
 ```

-See the script for more options. Each first-assembly mmCIF that successfully passes
+See the scripts for more options. Each first-assembly mmCIF that successfully passes
 all processing steps will be written to `mmcif_output_dir` within a subdirectory
 named according to the mmCIF's second and third PDB ID characters (e.g. `5c`).

 ### PDB dataset clustering

-Next, run the following with `mmcif_dir` and `clustering_output_dir` replaced, respectively, with your local output directory created using the dataset filtering script above and with your desired clustering output directory (i.e., `./data/pdb_data/mmcifs/` and `./data/pdb_data/data_caches/clusterings/`):
+Next, run the following with `mmcif_dir` and `{train,val}_clustering_output_dir` replaced, respectively, with your local output directory created using the dataset filtering script above and with your desired clustering output directories (i.e., `./data/pdb_data/{train,val}_mmcifs/` and `./data/pdb_data/data_caches/{train,val}_clusterings/`):
 ```bash
-python scripts/cluster_pdb_mmcifs.py --mmcif_dir <mmcif_dir> --output_dir <clustering_output_dir> --clustering_filtered_pdb_dataset
+python scripts/cluster_pdb_train_mmcifs.py --mmcif_dir <mmcif_dir> --output_dir <train_clustering_output_dir> --clustering_filtered_pdb_dataset
+python scripts/cluster_pdb_val_mmcifs.py --mmcif_dir <mmcif_dir> --reference_clustering_dir <train_clustering_output_dir> --output_dir <val_clustering_output_dir> --clustering_filtered_pdb_dataset
 ```

-**Note**: The `--clustering_filtered_pdb_dataset` flag is recommended when clustering the filtered PDB dataset as curated using the script above, as this flag will enable faster runtimes in this context (since filtering leaves each chain's residue IDs 1-based). However, this flag must **not** be provided when clustering other (i.e., non-PDB) datasets of mmCIF files. Otherwise, interface clustering may be performed incorrectly, as these datasets' mmCIF files may not use strict 1-based residue indexing for each chain.
+**Note**: The `--clustering_filtered_pdb_dataset` flag is recommended when clustering the filtered PDB dataset as curated using the scripts above, as this flag will enable faster runtimes in this context (since filtering leaves each chain's residue IDs 1-based). However, this flag must **not** be provided when clustering other (i.e., non-PDB) datasets of mmCIF files. Otherwise, interface clustering may be performed incorrectly, as these datasets' mmCIF files may not use strict 1-based residue indexing for each chain.

-**Note**: One can also download preprocessed (i.e., filtered) mmCIF files (~20GB, comprising 148k complexes) and chain/interface clustering files (~1GB) for the PDB's `20240101` AWS snapshot via a [shared OneDrive folder](https://mailmissouri-my.sharepoint.com/:f:/g/personal/acmwhb_umsystem_edu/EqU8tjUmmKxJr-FAlq4tzaIBi2TIBtmw5Vl3k_kmgNlepA?e=mzlyv6).
+**Note**: One can instead download preprocessed (i.e., filtered) mmCIF (`train`/`val`) files (~20GB, comprising 148k complexes) and chain/interface clustering (`train`/`val`) files (~1GB) for the PDB's `20240101` AWS snapshot via a [shared OneDrive folder](https://mailmissouri-my.sharepoint.com/:f:/g/personal/acmwhb_umsystem_edu/EqU8tjUmmKxJr-FAlq4tzaIBi2TIBtmw5Vl3k_kmgNlepA?e=mzlyv6). Each of these `tar` archives should be uncompressed within the `data/pdb_data/` directory.

 ## Contributing

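The README text in this diff says each filtered mmCIF is written to a subdirectory named after the second and third characters of its PDB ID (e.g. `5c`). A minimal sketch of that naming rule follows. The helper names `pdb_id_subdir` and `filtered_mmcif_dir` are hypothetical and do not come from the scripts in this commit, and lowercasing the ID is an assumption based only on the lowercase `5c` example.

```python
from pathlib import Path


def pdb_id_subdir(pdb_id: str) -> str:
    """Return the second and third characters of a PDB ID.

    Hypothetical helper for illustration; lowercasing is an assumption.
    """
    if len(pdb_id) < 3:
        raise ValueError(f"PDB ID too short to derive a subdirectory: {pdb_id!r}")
    return pdb_id[1:3].lower()


def filtered_mmcif_dir(output_dir: str, pdb_id: str) -> Path:
    """Build the subdirectory of `output_dir` that would hold this entry."""
    return Path(output_dir) / pdb_id_subdir(pdb_id)


if __name__ == "__main__":
    # "15c8" is only an example ID whose middle characters are "5c".
    print(filtered_mmcif_dir("data/pdb_data/train_mmcifs", "15c8"))
```

Sharding by two ID characters keeps any single directory from holding the full set of roughly 148k files mentioned in the README note.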
scripts/cluster_pdb_mmcifs.py renamed to scripts/cluster_pdb_train_mmcifs.py

Lines changed: 7 additions & 9 deletions
@@ -1,7 +1,7 @@
 # %% [markdown]
-# # Clustering AlphaFold 3 PDB Dataset
+# # Clustering AlphaFold 3 PDB Training Dataset
 #
-# For clustering AlphaFold 3's PDB dataset, we follow the clustering procedure outlined in Abramson et al (2024).
+# For clustering AlphaFold 3's PDB training dataset, we follow the clustering procedure outlined in Abramson et al (2024).
 #
 # In order to reduce bias in the training and evaluation sets, clustering was performed on PDB chains and interfaces, as
 # follows.
@@ -147,7 +147,6 @@ def convert_modified_residue_three_to_one(
     return mapped_residue, "ligand"


-@typecheck
 def parse_chain_sequences_and_interfaces_from_mmcif(
     filepath: str,
     assume_one_based_residue_ids: bool = False,
@@ -265,7 +264,6 @@ def parse_chain_sequences_and_interfaces_from_mmcif(
     return sequences, interface_chain_ids


-@typecheck
 def parse_chain_sequences_and_interfaces_from_mmcif_file(
     cif_filepath: str, assume_one_based_residue_ids: bool = False
 ) -> Tuple[str, Dict[str, str], Set[str]]:
@@ -682,24 +680,24 @@ def cluster_interfaces(

 if __name__ == "__main__":
     parser = argparse.ArgumentParser(
-        description="Cluster chains and interfaces within the AlphaFold 3 PDB dataset's filtered mmCIF files."
+        description="Cluster chains and interfaces within the AlphaFold 3 PDB training dataset's filtered mmCIF files."
     )
     parser.add_argument(
         "--mmcif_dir",
         type=str,
-        default=os.path.join("data", "pdb_data", "mmcifs"),
+        default=os.path.join("data", "pdb_data", "train_mmcifs"),
         help="Path to the input directory containing (filtered) mmCIF files.",
     )
     parser.add_argument(
         "--output_dir",
         type=str,
-        default=os.path.join("data", "pdb_data", "data_caches", "clusterings"),
-        help="Path to the output FASTA file.",
+        default=os.path.join("data", "pdb_data", "data_caches", "train_clusterings"),
+        help="Path to the output clustering directory.",
     )
     parser.add_argument(
         "--clustering_filtered_pdb_dataset",
         action="store_true",
-        help="Whether the clustering is being performed on the filtered PDB dataset.",
+        help="Whether the clustering is being performed on a filtered PDB dataset.",
     )
     parser.add_argument(
         "-n",

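For reference, the three arguments fully visible in the hunk above can be assembled into a standalone parser to check the new defaults. This is a sketch built only from the lines shown in this diff. The `-n` option is cut off in the hunk and is deliberately omitted, and `build_parser` is a hypothetical wrapper name, not a function in the script.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Rebuild only the arguments visible in the diff (hypothetical wrapper)."""
    parser = argparse.ArgumentParser(
        description="Cluster chains and interfaces within the AlphaFold 3 PDB training dataset's filtered mmCIF files."
    )
    parser.add_argument(
        "--mmcif_dir",
        type=str,
        default=os.path.join("data", "pdb_data", "train_mmcifs"),
        help="Path to the input directory containing (filtered) mmCIF files.",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default=os.path.join("data", "pdb_data", "data_caches", "train_clusterings"),
        help="Path to the output clustering directory.",
    )
    parser.add_argument(
        "--clustering_filtered_pdb_dataset",
        action="store_true",
        help="Whether the clustering is being performed on a filtered PDB dataset.",
    )
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch does not read sys.argv.
    args = build_parser().parse_args(["--clustering_filtered_pdb_dataset"])
    print(args.mmcif_dir, args.output_dir, args.clustering_filtered_pdb_dataset)
```

Because `--clustering_filtered_pdb_dataset` uses `action="store_true"`, it defaults to `False`, which matches the README's guidance that the flag must be passed explicitly and only for the filtered PDB dataset.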