Description
Hi All!
Thank you for your effort in developing the open-source AF3.
Issue Description
I have encountered an issue with how the mmcif_parsing module handles unresolved residues. It appears that the protein sequence is parsed directly from the Biopython structure object, so unresolved residues (those that do not appear in the _atom_site coordinate section of the mmCIF file) are not included in the MmcifObject.
Impact
We need the unresolved residues for some computations, such as calculating the relative solvent accessible surface area (RASA) of unresolved regions.
Example
For instance, the full sequence of the protein with PDB ID 7a4d is:
QVQLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGRYPYDVPDYGSGRA
However, the sequence returned by mmcif_parsing is:
QLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGR
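To make the difference explicit, here is a small sketch that locates the parsed sequence inside the full one and reports what is missing at each end. The helper missing_termini is a hypothetical name used only for illustration (it is not part of mmcif_parsing), and it assumes the parsed sequence is a single contiguous stretch of the full sequence, which holds for this example but would not hold for chains with internal gaps.

```python
# Sequences copied from the 7a4d example above.
full_seq = (
    "QVQLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISR"
    "DNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGRYPYDVPDYGSGRA"
)
parsed_seq = (
    "QLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISR"
    "DNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGR"
)


def missing_termini(full: str, observed: str) -> tuple[str, str]:
    """Return the (N-terminal, C-terminal) residues of `full` absent from `observed`.

    Hypothetical helper: assumes `observed` is one contiguous substring of `full`.
    """
    start = full.find(observed)
    if start < 0:
        raise ValueError("observed sequence is not a contiguous part of the full sequence")
    end = start + len(observed)
    return full[:start], full[end:]


n_missing, c_missing = missing_termini(full_seq, parsed_seq)
print("missing at N-terminus:", n_missing)
print("missing at C-terminus:", c_missing)
```

For 7a4d this reports QV at the N-terminus and YPYDVPDYGSGRA at the C-terminus, i.e. the residues that are in the deposited sequence but absent from the parsed one.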
Additional Observations
We also noticed that the cached MSAs in data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1A_protein.a3m are computed based on the latter sequence, which excludes the unresolved residues.
Request for Assistance
Is there a way to include the unresolved residues in the parsed sequence? Any guidance or help with this issue would be greatly appreciated.
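One direction we have in mind, offered only as a rough sketch and not as how mmcif_parsing works or should work, is to read the sequence from the _entity_poly_seq category of the mmCIF file, which lists every residue of each polymer entity whether or not it has coordinates. The sketch below assumes the file has already been loaded into a dictionary of column lists (the shape that Biopython's MMCIF2Dict produces). The function name entity_sequences and the toy input are hypothetical, and mapping entities back to chains is left out.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def entity_sequences(mmcif_dict: dict) -> dict:
    """Build one-letter sequences per entity from _entity_poly_seq columns.

    Hypothetical helper. `mmcif_dict` maps mmCIF item names to lists of
    strings. Non-standard residues become "X"; if a position is listed
    more than once (heterogeneity), only the first entry is kept.
    """
    entity_ids = mmcif_dict["_entity_poly_seq.entity_id"]
    nums = mmcif_dict["_entity_poly_seq.num"]
    mon_ids = mmcif_dict["_entity_poly_seq.mon_id"]

    seen = set()
    residues = {}
    for entity_id, num, mon_id in zip(entity_ids, nums, mon_ids):
        if (entity_id, num) in seen:
            continue  # skip alternative residues at the same position
        seen.add((entity_id, num))
        residues.setdefault(entity_id, []).append(THREE_TO_ONE.get(mon_id, "X"))
    return {entity_id: "".join(seq) for entity_id, seq in residues.items()}


# Toy input for illustration only (not taken from 7a4d).
toy = {
    "_entity_poly_seq.entity_id": ["1", "1", "1", "2", "2"],
    "_entity_poly_seq.num": ["1", "2", "3", "1", "2"],
    "_entity_poly_seq.mon_id": ["GLN", "VAL", "GLN", "GLY", "MSE"],
}
toy_sequences = entity_sequences(toy)
print(toy_sequences)
```

If something along these lines is acceptable, the cached MSAs would presumably also need to be recomputed against the full sequences.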
Best regards,
Shaoning