Missing Unresolved Residues in mmcif_parsing #284

@v-shaoningli

Hi All!

Thank you for your effort in developing the open-source AF3.

Issue Description

I have encountered an issue with how the mmcif_parsing module handles unresolved residues. The protein sequence appears to be parsed directly from the Biopython structure object, so unresolved residues (those that do not appear in the coordinate section of the mmCIF file, _atom_site) are not included in the MmcifObject.
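
To illustrate the distinction, here is a minimal sketch (not the mmcif_parsing code path). It reads the _entity_poly_seq and _atom_site loops from a tiny hand-written mmCIF fragment, made up for illustration rather than copied from 7a4d, and reports which sequence positions have no coordinates. `read_loop`, `EXAMPLE_CIF` and `THREE_TO_ONE` are hypothetical names introduced here; the parser assumes whitespace-separated values with no quoting or multi-line fields, loops terminated by `#`, and a single entity.

```python
# Hand-written fragment: five residues in the deposited sequence,
# but coordinates for positions 3 and 4 only.
EXAMPLE_CIF = """
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
1 1 GLN
1 2 VAL
1 3 GLN
1 4 LEU
1 5 GLN
#
loop_
_atom_site.group_PDB
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_entity_id
_atom_site.label_seq_id
ATOM CA GLN 1 3
ATOM CA LEU 1 4
#
"""

THREE_TO_ONE = {"GLN": "Q", "VAL": "V", "LEU": "L"}


def read_loop(cif_text, category):
    """Return the rows of every loop whose columns all belong to `category`.

    Simplified on purpose: values are split on whitespace, so quoted or
    multi-line values are not supported.
    """
    rows, columns, state = [], [], "outside"
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line == "loop_":
            columns, state = [], "header"
        elif state == "header" and line.startswith("_"):
            columns.append(line)
        elif state in ("header", "body") and line and not line.startswith(("#", "_")):
            state = "body"
            if all(col.startswith(category + ".") for col in columns):
                keys = [col.split(".", 1)[1] for col in columns]
                rows.append(dict(zip(keys, line.split())))
        else:
            state = "outside"  # blank line, '#', or a new data item ends the loop
    return rows


seqres = read_loop(EXAMPLE_CIF, "_entity_poly_seq")
atoms = read_loop(EXAMPLE_CIF, "_atom_site")

# Full sequence as deposited, including residues without coordinates.
full_sequence = "".join(THREE_TO_ONE[row["mon_id"]] for row in seqres)

# Positions that have at least one atom record.
resolved_positions = {int(atom["label_seq_id"]) for atom in atoms}

unresolved = [int(row["num"]) for row in seqres
              if int(row["num"]) not in resolved_positions]
observed_sequence = "".join(THREE_TO_ONE[row["mon_id"]] for row in seqres
                            if int(row["num"]) in resolved_positions)

print(full_sequence, observed_sequence, unresolved)
```

A sequence built only from the atom records corresponds to `observed_sequence` here, while the residues we are asking about are the ones listed in `unresolved`.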

Impact

We need the unresolved residues for some computations, such as calculating the relative solvent accessible surface area (RASA) of unresolved residues.

Example

For instance, the actual sequence for the protein with PDB ID 7a4d is:

QVQLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGRYPYDVPDYGSGRA

However, the sequence returned by mmcif_parsing lacks the first 2 residues (QV) and the last 13 residues (YPYDVPDYGSGRA):

QLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGR
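
The difference between the two sequences can be located by matching the parsed sequence back onto the full one. The sketch below uses Python's standard difflib for this. `FULL`, `PARSED` and `unresolved_mask` are names introduced here for illustration, and difflib matching is a heuristic stand-in for a proper alignment; it is adequate in this case because the parsed sequence is one contiguous stretch of the full sequence.

```python
from difflib import SequenceMatcher

# The two sequences quoted in this issue for 7a4d.
FULL = ("QVQLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKG"
        "RFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGRYPYDVPDYGSGRA")
PARSED = ("QLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKG"
          "RFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGR")


def unresolved_mask(full, parsed):
    """Return one boolean per residue of `full`: True if it has no match in `parsed`."""
    mask = [True] * len(full)
    matcher = SequenceMatcher(None, full, parsed, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Each block says full[a:a+size] == parsed[b:b+size].
        for i in range(block.a, block.a + block.size):
            mask[i] = False
    return mask


mask = unresolved_mask(FULL, PARSED)

# Residues of FULL that are absent from PARSED, in order.
missing = "".join(aa for aa, gap in zip(FULL, mask) if gap)
print(len(FULL), len(PARSED), missing)
```

For sequences with internal gaps, matching on the residue numbering stored in the mmCIF file would be more reliable than matching on sequence identity alone.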

Additional Observations

We also noticed that the cached MSAs in data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1A_protein.a3m are computed based on the latter sequence, which excludes the unresolved residues.

Request for Assistance

Is there a solution to include the unresolved residues in the parsed sequence? Any guidance or help with this issue would be greatly appreciated.

Best regards,
Shaoning
