Commit 9380be0

Default to downloading the PDB from its 20240101 AWS snapshot (#70)
1 parent 7e35fcf commit 9380be0

1 file changed, with 13 additions and 8 deletions

README.md

Lines changed: 13 additions & 8 deletions
@@ -200,23 +200,28 @@ assert sampled_atom_pos.shape == (1, (6 + 5), 3)

 ### PDB dataset curation

-To acquire the AlphaFold 3 PDB dataset, first download all first-assembly (and asymmetric unit) complexes in the Protein Data Bank (PDB), and then preprocess them with the script referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The Python script below (i.e., `filter_pdb_mmcifs.py`) assumes you have downloaded the PDB in the **mmCIF file format**, placing it at `data/pdb_data/unfiltered_assembly_mmcifs/` (and `data/pdb_data/unfiltered_asym_mmcifs/`, respectively). On the RCSB website, navigate down to "Download Protocols", and follow the download instructions depending on your location.
+To acquire the AlphaFold 3 PDB dataset, first download all first-assembly (and asymmetric unit) complexes in the Protein Data Bank (PDB), and then preprocess them with the script referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The two Python scripts below (i.e., `filter_pdb_mmcifs.py` and `cluster_pdb_mmcifs.py`) assume you have downloaded the PDB in the **mmCIF file format**, placing its first-assembly and asymmetric unit mmCIF files at `data/pdb_data/unfiltered_assembly_mmcifs/` and `data/pdb_data/unfiltered_asym_mmcifs/`, respectively.

-For example, one can use the following commands to download the PDB as a collection of mmCIF files:
+For reproducibility, we recommend downloading the PDB using AWS snapshots (e.g., `20240101`). To do so, refer to [AWS's documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html) to set up the AWS CLI locally. Alternatively, on the RCSB website, navigate down to "Download Protocols", and follow the download instructions depending on your location.
+
+For example, one can use the following commands to download the PDB as two collections of mmCIF files:
 ```bash
-# For `assembly1` complexes
+# For `assembly1` complexes, use the PDB's `20240101` AWS snapshot:
+aws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs
+# Or as a fallback, use rsync:
 rsync -rlpt -v -z --delete --port=33444 \
 rsync.rcsb.org::ftp_data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs/
-# For asymmetric unit complexes
+
+# For asymmetric unit complexes, also use the PDB's `20240101` AWS snapshot:
+aws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs
+# Or as a fallback, use rsync:
 rsync -rlpt -v -z --delete --port=33444 \
 rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs/
 ```

-> WARNING: Downloading PDB can take up to 1TB of space.
-
-> NOTE: PDB also hosts snapshots on AWS: https://pdbsnapshots.s3.us-west-2.amazonaws.com/index.html.
+> WARNING: Downloading the PDB can take up to 700GB of space.

-> TODO: Use a specific snapshot to make training reproducible.
+> NOTE: The PDB hosts all available AWS snapshots here: https://pdbsnapshots.s3.us-west-2.amazonaws.com/index.html.

 After downloading, you should have two directories formatted like this:
 https://files.rcsb.org/pub/pdb/data/assemblies/mmCIF/divided/ & https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/
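
As a quick sanity check after either download route, the two directories can be inspected with a short script. This is a minimal sketch and not part of the repository: `summarize_mmcif_dir` is a hypothetical helper, and it assumes the `divided/` layout seen in the RCSB listings linked above, where each subdirectory of the download root holds `.cif` or `.cif.gz` files.

```python
from pathlib import Path


def summarize_mmcif_dir(root):
    """Count shard subdirectories and mmCIF files under a `divided/`-style tree.

    Hypothetical helper: assumes one level of subdirectories, each holding
    `.cif` or `.cif.gz` files.
    """
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"Missing directory: {root}")
    shard_dirs = [p for p in root.iterdir() if p.is_dir()]
    n_files = sum(
        1
        for shard in shard_dirs
        for f in shard.iterdir()
        if f.is_file() and f.name.endswith((".cif", ".cif.gz"))
    )
    return {"shards": len(shard_dirs), "mmcif_files": n_files}


if __name__ == "__main__":
    # Directory names taken from the README text in the diff above.
    for name in ("unfiltered_assembly_mmcifs", "unfiltered_asym_mmcifs"):
        path = Path("data/pdb_data") / name
        try:
            print(name, summarize_mmcif_dir(path))
        except FileNotFoundError as err:
            print(err)
```

A count of zero files, or a missing directory, would suggest that the `aws s3 sync` or `rsync` destination path did not match the paths the preprocessing scripts expect.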
