Fix wrong element assignment for carbon of the style C*S #5149

ricard1997 · 2025-11-17T23:25:04Z

The current MDAnalysis atom guesser incorrectly assigns the element type for certain atoms. This misassignment leads to incorrect bond guessing, particularly in DSM lipids—which is how the issue was originally identified.

Fixes #5102

Changes made in this Pull Request:

Added two lines in default_guesser.py to the method guess_atom_element():

        if name in tables.atomelements:
            if name == "CS" and  isinstance(eval(atomname[1:-1]), int): #"Adding these two lines correctly assigns this C*S atoms as carbons"
                name = "AC"    # AC maps to C as per the MDAnalysis table
            return tables.atomelements[name]

A detailed explanation of the bug and the proposed solution can be found at https://github.com/ricard1997/MDAnalysis_issue_fix.git

PR Checklist

Issue raised/referenced?
Tests updated/added?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.

📚 Documentation preview 📚: https://mdanalysis--5149.org.readthedocs.build/en/5149/

…SM lipids)

github-actions

Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.

codecov · 2025-11-17T23:35:16Z

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.71%. Comparing base (3189d48) to head (e636bac).

Files with missing lines	Patch %	Lines
package/MDAnalysis/guesser/default_guesser.py	0.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #5149      +/-   ##
===========================================
- Coverage    92.72%   92.71%   -0.01%     
===========================================
  Files          180      180              
  Lines        22458    22460       +2     
  Branches      3186     3187       +1     
===========================================
  Hits         20824    20824              
- Misses        1177     1178       +1     
- Partials       457      458       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

IAlibay

I'm blocking this for further discussion - generally I don't think it's a good idea to special case something like this.

If we really want to support "C*S" (I personally think this might be a case where the PDB should have elements defined to begin with), then we should change the behaviour to not strip symbols / numbers, and instead just split and try either the left or right entries.

IAlibay · 2025-11-18T01:18:16Z

package/MDAnalysis/guesser/default_guesser.py


            # just in case
            if name in tables.atomelements:
+                if name == "CS" and  isinstance(eval(atomname[1:-1]), int): #"Adding these two lines correctly assigns this C*S atoms as carbons"


Am I correct in understanding that "C*S" is a force field specific atom type rather than something that comes out of the PDB?

I understand the comments about avoiding special-case handling. And yes, I agree that C*S is a force-field–specific atom name—the one CHARMM assigns to SM lipids.

I’ve noticed that MDAnalysis 2.7 is able to guess these atom types correctly. Were there any specific changes to the name-guessing logic between version 2.7 and the newer releases that could explain this difference?

Looking at the v2.7 code, it looks like it used to always attempt to use the first letter of the number/symbol stripped atomname, i.e. in this case it would have used 'C', instead of 'CS'.

That behaviour was also kinda problematic because you could have 'CS1' which I would guess as being 'CS' rather than 'C'. I suspect the change was a "fix" that attempted to deal with that case and it introduced a regression where "?" isn't split, where it probably more likely should be.

My suggestion would be to something closer to calling re.split over SYMBOLS and NUMBERs and then attempting to match over each subgroup going from left to right.

That being said, it's something I'd be keen to hear other @MDAnalysis/coredevs weigh in on.

perhaps I don't understand the issue deep enough, but I'd be very much for explicit lookup table mapping names to guesses, minimising regexp usage to maybe only number extraction (or none at all -- afairc doing simple str filter is faster on shorter strings). It seems that there aren't that many chemical elements to go over, and explicit mapping is better than implicit function that needs to be tested with the same explicit mapping anyway.

That's why we implemented the "guesser" infrastructure -- the end goal was being able to create Contexts with specific behaviour for cases like these, e.g. a Charmm lipid context with a custom name-to-type guesser/mapper function. Continuously updating the global look-up table feels unsustainable and will likely inevitably run into a clash, e.g. "CA".

I agree with @IAlibay that we should probably replace the stripping logic with splitting.

I also don't think I understand the logic of changing to "AC" when you could just return "C".

I think I'd be happier if we were just failing. Then at least we could provide the user with a note on what to do to ensure that their data are correct. "Sorry, I cannot guess bonds because I cannot ascertain the elements from the atom types. Provide a full topology file or provide elements. Or work on a guesser..."

If @lilyminium as the expert on the guessers thinks that splitting will improve the default guesser then I am happy to support this direction. I just don't believe that it really is a solution because something else will likely break. The real solution is to write new guessers that codify the specific knowledge about the force field.

As far as I have tested the guesser (with various CHARMM lipid systems), the only case where it consistently fails is with CS, because it is interpreted as the element cesium, which has a large vdW radius (3.43 Å) compared to carbon (2.31 Å for CA). This seems to be a very localized issue mainly affecting C*S atom names in SM lipids from CHARMM, since “CS” is not commonly used in a biological context. Following that reasoning, is it possible that the small fix I proposed might be sufficient?

If not, I’m happy to work on a solution using splitting instead. In that case—and if I understood correctly—should we issue a warning when an atom name is ambiguous?

@lilyminium The logic behind mapping to "AC" was that it directly maps to carbon in the table, but you’re completely right: returning "C" would be clearer.

I also agree that the long-term solution is to codify the force-field knowledge. Is the intention to do this for all force fields, or only for specific ones, depending on what the user specifies?

orbeckst

I don't think that any fix here will be an improvement and the real solution is to start writing a guesser that knows how to work with CHARMM FF files.

(If @lilyminium thinks the default_guesser would be improved by changing the logic for dealing with names that include numbers and symbols then I'll support her decision.)

Fix wrong element assignment for carbon of the style C*S (Common in D…

e636bac

…SM lipids)

github-actions bot reviewed Nov 17, 2025

View reviewed changes

IAlibay requested changes Nov 18, 2025

View reviewed changes

orbeckst requested changes Nov 20, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix wrong element assignment for carbon of the style C*S #5149

Fix wrong element assignment for carbon of the style C*S #5149

Uh oh!

ricard1997 commented Nov 17, 2025 •

edited

Loading

Uh oh!

github-actions bot left a comment

Uh oh!

codecov bot commented Nov 17, 2025 •

edited

Loading

Uh oh!

IAlibay left a comment

Uh oh!

IAlibay Nov 18, 2025

Uh oh!

ricard1997 Nov 19, 2025

Uh oh!

IAlibay Nov 19, 2025

Uh oh!

marinegor Nov 19, 2025

Uh oh!

lilyminium Nov 19, 2025

Uh oh!

orbeckst Nov 20, 2025

Uh oh!

ricard1997 Nov 20, 2025

Uh oh!

orbeckst left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Fix wrong element assignment for carbon of the style C*S #5149

Are you sure you want to change the base?

Fix wrong element assignment for carbon of the style C*S #5149

Uh oh!

Conversation

ricard1997 commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Checklist

Developers Certificate of Origin

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

IAlibay left a comment

Choose a reason for hiding this comment

Uh oh!

IAlibay Nov 18, 2025

Choose a reason for hiding this comment

Uh oh!

ricard1997 Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

IAlibay Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

marinegor Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

lilyminium Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

orbeckst Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

ricard1997 Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

orbeckst left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ricard1997 commented Nov 17, 2025 •

edited

Loading

codecov bot commented Nov 17, 2025 •

edited

Loading