Skip to content

Conversation

@ricard1997
Copy link

@ricard1997 ricard1997 commented Nov 17, 2025

The current MDAnalysis atom guesser incorrectly assigns the element type for certain atoms. This misassignment leads to incorrect bond guessing, particularly in DSM lipids—which is how the issue was originally identified.

Fixes #5102

Changes made in this Pull Request:

Added two lines in default_guesser.py to the method guess_atom_element():

        if name in tables.atomelements:
            if name == "CS" and  isinstance(eval(atomname[1:-1]), int): #"Adding these two lines correctly assigns this C*S atoms as carbons"
                name = "AC"    # AC maps to C as per the MDAnalysis table
            return tables.atomelements[name]

A detailed explanation of the bug and the proposed solution can be found at https://github.com/ricard1997/MDAnalysis_issue_fix.git

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5149.org.readthedocs.build/en/5149/

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.

@codecov
Copy link

codecov bot commented Nov 17, 2025

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.71%. Comparing base (3189d48) to head (e636bac).

Files with missing lines Patch % Lines
package/MDAnalysis/guesser/default_guesser.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5149      +/-   ##
===========================================
- Coverage    92.72%   92.71%   -0.01%     
===========================================
  Files          180      180              
  Lines        22458    22460       +2     
  Branches      3186     3187       +1     
===========================================
  Hits         20824    20824              
- Misses        1177     1178       +1     
- Partials       457      458       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm blocking this for further discussion - generally I don't think it's a good idea to special case something like this.

If we really want to support "C*S" (I personally think this might be a case where the PDB should have elements defined to begin with), then we should change the behaviour to not strip symbols / numbers, and instead just split and try either the left or right entries.


# just in case
if name in tables.atomelements:
if name == "CS" and isinstance(eval(atomname[1:-1]), int): #"Adding these two lines correctly assigns this C*S atoms as carbons"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in understanding that "C*S" is a force field specific atom type rather than something that comes out of the PDB?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the comments about avoiding special-case handling. And yes, I agree that C*S is a force-field–specific atom name—the one CHARMM assigns to SM lipids.

I’ve noticed that MDAnalysis 2.7 is able to guess these atom types correctly. Were there any specific changes to the name-guessing logic between version 2.7 and the newer releases that could explain this difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the v2.7 code, it looks like it used to always attempt to use the first letter of the number/symbol stripped atomname, i.e. in this case it would have used 'C', instead of 'CS'.

That behaviour was also kinda problematic because you could have 'CS1' which I would guess as being 'CS' rather than 'C'. I suspect the change was a "fix" that attempted to deal with that case and it introduced a regression where "?" isn't split, where it probably more likely should be.

My suggestion would be to something closer to calling re.split over SYMBOLS and NUMBERs and then attempting to match over each subgroup going from left to right.

That being said, it's something I'd be keen to hear other @MDAnalysis/coredevs weigh in on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps I don't understand the issue deep enough, but I'd be very much for explicit lookup table mapping names to guesses, minimising regexp usage to maybe only number extraction (or none at all -- afairc doing simple str filter is faster on shorter strings). It seems that there aren't that many chemical elements to go over, and explicit mapping is better than implicit function that needs to be tested with the same explicit mapping anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why we implemented the "guesser" infrastructure -- the end goal was being able to create Contexts with specific behaviour for cases like these, e.g. a Charmm lipid context with a custom name-to-type guesser/mapper function. Continuously updating the global look-up table feels unsustainable and will likely inevitably run into a clash, e.g. "CA".

I agree with @IAlibay that we should probably replace the stripping logic with splitting.

I also don't think I understand the logic of changing to "AC" when you could just return "C".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd be happier if we were just failing. Then at least we could provide the user with a note on what to do to ensure that their data are correct. "Sorry, I cannot guess bonds because I cannot ascertain the elements from the atom types. Provide a full topology file or provide elements. Or work on a guesser..."

If @lilyminium as the expert on the guessers thinks that splitting will improve the default guesser then I am happy to support this direction. I just don't believe that it really is a solution because something else will likely break. The real solution is to write new guessers that codify the specific knowledge about the force field.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I have tested the guesser (with various CHARMM lipid systems), the only case where it consistently fails is with CS, because it is interpreted as the element cesium, which has a large vdW radius (3.43 Å) compared to carbon (2.31 Å for CA). This seems to be a very localized issue mainly affecting C*S atom names in SM lipids from CHARMM, since “CS” is not commonly used in a biological context. Following that reasoning, is it possible that the small fix I proposed might be sufficient?

If not, I’m happy to work on a solution using splitting instead. In that case—and if I understood correctly—should we issue a warning when an atom name is ambiguous?

@lilyminium The logic behind mapping to "AC" was that it directly maps to carbon in the table, but you’re completely right: returning "C" would be clearer.

I also agree that the long-term solution is to codify the force-field knowledge. Is the intention to do this for all force fields, or only for specific ones, depending on what the user specifies?

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that any fix here will be an improvement and the real solution is to start writing a guesser that knows how to work with CHARMM FF files.

(If @lilyminium thinks the default_guesser would be improved by changing the logic for dealing with names that include numbers and symbols then I'll support her decision.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

guess_TopologyAttrs guesses incorrect bonds for DMS lipids in MDAnalysis > 2.7.0

5 participants