
Commit c7766ae

Add training data necessary for overfitting experiments (#182)
* Update README.md
* Create 209d-assembly1.cif
* Create 721p-assembly1.cif
* Create 209d-assembly1C_protein_pdball_230102_db.m8
* Create 209d-assembly1C_protein.a3m
* Add files via upload
* Add files via upload
* Create 7a4d-assembly1.cif
1 parent baca744 commit c7766ae

22 files changed: +172351 -1 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -307,7 +307,7 @@ python scripts/cluster_pdb_test_mmcifs.py --mmcif_dir <mmcif_dir> --reference_1_
 
 **Note**: The `--clustering_filtered_pdb_dataset` flag is recommended when clustering the filtered PDB dataset as curated using the scripts above, as this flag will enable faster runtimes in this context (since filtering leaves each chain's residue IDs 1-based). However, this flag must **not** be provided when clustering other (i.e., non-PDB) datasets of mmCIF files. Otherwise, interface clustering may be performed incorrectly, as these datasets' mmCIF files may not use strict 1-based residue indexing for each chain.
 
-**Note**: One can instead download preprocessed (i.e., filtered) mmCIF (`train`/`val`/`test`) files (~25GB, comprising 148k complexes) and chain/interface clustering (`train`/`val`/`test`) files (~3GB) for the PDB's `20240101` AWS snapshot via a [shared OneDrive folder](https://mailmissouri-my.sharepoint.com/:f:/g/personal/acmwhb_umsystem_edu/EqU8tjUmmKxJr-FAlq4tzaIBi2TIBtmw5Vl3k_kmgNlepA?e=mzlyv6). Each of these `tar` archives should be uncompressed within the `data/pdb_data/` directory.
+**Note**: One can instead download preprocessed (i.e., filtered) mmCIF (`train`/`val`/`test`) files (~25GB, comprising 148k complexes) and chain/interface clustering (`train`/`val`/`test`) files (~3GB) for the PDB's `20240101` AWS snapshot via a [shared OneDrive folder](https://mailmissouri-my.sharepoint.com/:f:/g/personal/acmwhb_umsystem_edu/EqU8tjUmmKxJr-FAlq4tzaIBi2TIBtmw5Vl3k_kmgNlepA?e=mzlyv6). Each of these `tar.gz` archives should be decompressed within the `data/pdb_data/` directory e.g., via `tar -xzf data_caches.tar.gz -C data/pdb_data/`.
 
 ## Contributing
 
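For readers who would rather script the decompression step from the updated README note than call `tar` directly, the following is a minimal Python sketch of the same operation. The helper name `extract_archive` is hypothetical and not part of the repository; the archive name `data_caches.tar.gz` and the destination `data/pdb_data/` are taken from the README line above.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str = "data/pdb_data") -> list[str]:
    """Extract a gzip-compressed tar archive into dest_dir.

    Roughly equivalent to `tar -xzf <archive_path> -C <dest_dir>`.
    Returns the member names stored in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Use the safer "data" extraction filter when this Python version provides it.
    extra_args = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=dest, **extra_args)
    return names


# Example call (assumes the archive has already been downloaded):
# extract_archive("data_caches.tar.gz", "data/pdb_data")
```

Opening with `mode="r:gz"` makes the sketch fail early if the file is not a gzip-compressed tar archive, which matches the `tar` to `tar.gz` correction made in this hunk.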

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+>101
+TXPGVTXPGV
+>101
+TXPGVTXPGV
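The four added lines above form two FASTA-style records, each a `>` header line followed by a sequence line. As a rough illustration of how such a file could be read, here is a small sketch of a record parser. The function name `parse_a3m` is hypothetical, and the sketch treats the file as plain header/sequence pairs without interpreting A3M-specific conventions such as lowercase insertion columns.

```python
def parse_a3m(text: str) -> list[tuple[str, str]]:
    """Split FASTA-style text into (header, sequence) pairs.

    Header lines start with '>'; all following non-empty lines up to the
    next header are concatenated into one sequence string.
    """
    records: list[tuple[str, str]] = []
    header = None
    chunks: list[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The content added by this hunk, copied from the diff above.
example = ">101\nTXPGVTXPGV\n>101\nTXPGVTXPGV\n"
records = parse_a3m(example)
```

For this file, `records` holds two entries that share the header `101` and the same 10-residue sequence.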

data/pdb_data/data_caches/msa/train_msas/721p-assembly1A_protein.a3m

Lines changed: 31906 additions & 0 deletions
Large diffs are not rendered by default.

data/pdb_data/data_caches/msa/train_msas/721p-assembly1B_protein.a3m

Lines changed: 31906 additions & 0 deletions
Large diffs are not rendered by default.

data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1A_protein.a3m

Lines changed: 24192 additions & 0 deletions
Large diffs are not rendered by default.

data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1B_protein.a3m

Lines changed: 22428 additions & 0 deletions
Large diffs are not rendered by default.

data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1C_protein.a3m

Lines changed: 23224 additions & 0 deletions
Large diffs are not rendered by default.

data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1D_protein.a3m

Lines changed: 23304 additions & 0 deletions
Large diffs are not rendered by default.
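
The MSA paths listed above appear to share one naming pattern: a four-character PDB ID, an assembly number, a chain ID, and a molecule type, followed by `.a3m`. That pattern is inferred only from the file names in this commit and is an assumption, not something the repository documents here. Under that assumption, a hypothetical helper for splitting such names might look like this:

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class MsaFileName(NamedTuple):
    pdb_id: str
    assembly: int
    chain_id: str
    molecule_type: str


# Assumed pattern, e.g. "721p-assembly1A_protein.a3m".
_MSA_NAME = re.compile(
    r"^(?P<pdb_id>[0-9a-z]{4})-assembly(?P<assembly>\d+)"
    r"(?P<chain_id>[A-Za-z]+)_(?P<molecule_type>[a-z]+)\.a3m$"
)


def parse_msa_file_name(path: str) -> Optional[MsaFileName]:
    """Return the parsed name components, or None if the name does not match."""
    match = _MSA_NAME.match(PurePosixPath(path).name)
    if match is None:
        return None
    return MsaFileName(
        pdb_id=match["pdb_id"],
        assembly=int(match["assembly"]),
        chain_id=match["chain_id"],
        molecule_type=match["molecule_type"],
    )


parsed = parse_msa_file_name(
    "data/pdb_data/data_caches/msa/train_msas/721p-assembly1A_protein.a3m"
)
```

Returning `None` for non-matching names keeps the helper usable on mixed file listings, such as the one in this commit, which also contains `.cif` and `.m8` files.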
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+>101
+LEEELKQLEEELQAIEEQLAQLQWKAQARKEKLAQLKEKL
+>101
+LEEELKQLEEELQAIEEQLAQLQWKAQARKEKLAQLKEKL
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+>101
+LEEELKQLEEELQAIEEQLAQLQWKAQARKEKLAQLKEKL
+>101
+LEEELKQLEEELQAIEEQLAQLQWKAQARKEKLAQLKEKL
