Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions package/MDAnalysis/guesser/default_guesser.py
Original file line number Diff line number Diff line change
Expand Up @@ -352,6 +352,8 @@ def guess_atom_element(self, atomname):

# just in case
if name in tables.atomelements:
if name == "CS" and isinstance(eval(atomname[1:-1]), int): #"Adding these two lines correctly assigns this C*S atoms as carbons"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in understanding that "C*S" is a force field specific atom type rather than something that comes out of the PDB?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the comments about avoiding special-case handling. And yes, I agree that C*S is a force-field–specific atom name—the one CHARMM assigns to SM lipids.

I’ve noticed that MDAnalysis 2.7 is able to guess these atom types correctly. Were there any specific changes to the name-guessing logic between version 2.7 and the newer releases that could explain this difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the v2.7 code, it looks like it used to always attempt to use the first letter of the number/symbol stripped atomname, i.e. in this case it would have used 'C', instead of 'CS'.

That behaviour was also kinda problematic because you could have 'CS1' which I would guess as being 'CS' rather than 'C'. I suspect the change was a "fix" that attempted to deal with that case and it introduced a regression where "?" isn't split, where it probably more likely should be.

My suggestion would be to something closer to calling re.split over SYMBOLS and NUMBERs and then attempting to match over each subgroup going from left to right.

That being said, it's something I'd be keen to hear other @MDAnalysis/coredevs weigh in on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps I don't understand the issue deep enough, but I'd be very much for explicit lookup table mapping names to guesses, minimising regexp usage to maybe only number extraction (or none at all -- afairc doing simple str filter is faster on shorter strings). It seems that there aren't that many chemical elements to go over, and explicit mapping is better than implicit function that needs to be tested with the same explicit mapping anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why we implemented the "guesser" infrastructure -- the end goal was being able to create Contexts with specific behaviour for cases like these, e.g. a Charmm lipid context with a custom name-to-type guesser/mapper function. Continuously updating the global look-up table feels unsustainable and will likely inevitably run into a clash, e.g. "CA".

I agree with @IAlibay that we should probably replace the stripping logic with splitting.

I also don't think I understand the logic of changing to "AC" when you could just return "C".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd be happier if we were just failing. Then at least we could provide the user with a note on what to do to ensure that their data are correct. "Sorry, I cannot guess bonds because I cannot ascertain the elements from the atom types. Provide a full topology file or provide elements. Or work on a guesser..."

If @lilyminium as the expert on the guessers thinks that splitting will improve the default guesser then I am happy to support this direction. I just don't believe that it really is a solution because something else will likely break. The real solution is to write new guessers that codify the specific knowledge about the force field.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I have tested the guesser (with various CHARMM lipid systems), the only case where it consistently fails is with CS, because it is interpreted as the element cesium, which has a large vdW radius (3.43 Å) compared to carbon (2.31 Å for CA). This seems to be a very localized issue mainly affecting C*S atom names in SM lipids from CHARMM, since “CS” is not commonly used in a biological context. Following that reasoning, is it possible that the small fix I proposed might be sufficient?

If not, I’m happy to work on a solution using splitting instead. In that case—and if I understood correctly—should we issue a warning when an atom name is ambiguous?

@lilyminium The logic behind mapping to "AC" was that it directly maps to carbon in the table, but you’re completely right: returning "C" would be clearer.

I also agree that the long-term solution is to codify the force-field knowledge. Is the intention to do this for all force fields, or only for specific ones, depending on what the user specifies?

name = "AC" # AC maps to C in the elements table
return tables.atomelements[name]

while name:
Expand Down
Loading