-
Notifications
You must be signed in to change notification settings - Fork 747
Fix wrong element assignment for carbon of the style C*S #5149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5149 +/- ##
===========================================
- Coverage 92.72% 92.71% -0.01%
===========================================
Files 180 180
Lines 22458 22460 +2
Branches 3186 3187 +1
===========================================
Hits 20824 20824
- Misses 1177 1178 +1
- Partials 457 458 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
IAlibay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm blocking this for further discussion - generally I don't think it's a good idea to special case something like this.
If we really want to support "C*S" (I personally think this might be a case where the PDB should have elements defined to begin with), then we should change the behaviour to not strip symbols / numbers, and instead just split and try either the left or right entries.
|
|
||
| # just in case | ||
| if name in tables.atomelements: | ||
| if name == "CS" and isinstance(eval(atomname[1:-1]), int): #"Adding these two lines correctly assigns this C*S atoms as carbons" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I correct in understanding that "C*S" is a force field specific atom type rather than something that comes out of the PDB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the comments about avoiding special-case handling. And yes, I agree that C*S is a force-field–specific atom name—the one CHARMM assigns to SM lipids.
I’ve noticed that MDAnalysis 2.7 is able to guess these atom types correctly. Were there any specific changes to the name-guessing logic between version 2.7 and the newer releases that could explain this difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the v2.7 code, it looks like it used to always attempt to use the first letter of the number/symbol stripped atomname, i.e. in this case it would have used 'C', instead of 'CS'.
That behaviour was also kinda problematic because you could have 'CS1' which I would guess as being 'CS' rather than 'C'. I suspect the change was a "fix" that attempted to deal with that case and it introduced a regression where "?" isn't split, where it probably more likely should be.
My suggestion would be to something closer to calling re.split over SYMBOLS and NUMBERs and then attempting to match over each subgroup going from left to right.
That being said, it's something I'd be keen to hear other @MDAnalysis/coredevs weigh in on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps I don't understand the issue deep enough, but I'd be very much for explicit lookup table mapping names to guesses, minimising regexp usage to maybe only number extraction (or none at all -- afairc doing simple str filter is faster on shorter strings). It seems that there aren't that many chemical elements to go over, and explicit mapping is better than implicit function that needs to be tested with the same explicit mapping anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why we implemented the "guesser" infrastructure -- the end goal was being able to create Contexts with specific behaviour for cases like these, e.g. a Charmm lipid context with a custom name-to-type guesser/mapper function. Continuously updating the global look-up table feels unsustainable and will likely inevitably run into a clash, e.g. "CA".
I agree with @IAlibay that we should probably replace the stripping logic with splitting.
I also don't think I understand the logic of changing to "AC" when you could just return "C".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd be happier if we were just failing. Then at least we could provide the user with a note on what to do to ensure that their data are correct. "Sorry, I cannot guess bonds because I cannot ascertain the elements from the atom types. Provide a full topology file or provide elements. Or work on a guesser..."
If @lilyminium as the expert on the guessers thinks that splitting will improve the default guesser then I am happy to support this direction. I just don't believe that it really is a solution because something else will likely break. The real solution is to write new guessers that codify the specific knowledge about the force field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I have tested the guesser (with various CHARMM lipid systems), the only case where it consistently fails is with CS, because it is interpreted as the element cesium, which has a large vdW radius (3.43 Å) compared to carbon (2.31 Å for CA). This seems to be a very localized issue mainly affecting C*S atom names in SM lipids from CHARMM, since “CS” is not commonly used in a biological context. Following that reasoning, is it possible that the small fix I proposed might be sufficient?
If not, I’m happy to work on a solution using splitting instead. In that case—and if I understood correctly—should we issue a warning when an atom name is ambiguous?
@lilyminium The logic behind mapping to "AC" was that it directly maps to carbon in the table, but you’re completely right: returning "C" would be clearer.
I also agree that the long-term solution is to codify the force-field knowledge. Is the intention to do this for all force fields, or only for specific ones, depending on what the user specifies?
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that any fix here will be an improvement and the real solution is to start writing a guesser that knows how to work with CHARMM FF files.
(If @lilyminium thinks the default_guesser would be improved by changing the logic for dealing with names that include numbers and symbols then I'll support her decision.)
The current MDAnalysis atom guesser incorrectly assigns the element type for certain atoms. This misassignment leads to incorrect bond guessing, particularly in DSM lipids—which is how the issue was originally identified.
Fixes #5102
Changes made in this Pull Request:
Added two lines in default_guesser.py to the method guess_atom_element():
A detailed explanation of the bug and the proposed solution can be found at https://github.com/ricard1997/MDAnalysis_issue_fix.git
PR Checklist
Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5149.org.readthedocs.build/en/5149/