Skip to content

Commit 19e08df

Browse files
Ericwang6njzjz
andauthored
Add support for AmberTools sqm/out and BondOrderSystem (#157)
* add BondOrderSystem * add BondOrderSystem * add BondOrderSystem * add BondOrderSystem * add BondOrderSystem * Update BondOrderSystem * solve oveall dependency on rdkit * Add dpdata/rdkit package * Add information on functions * Spelling correction * Add bond order assignment * Add test cases for bond order assignment * Add support for AmberTools sqm/out * Add bond order assignment * Add support for AmberTools sqm/out * Creat __init__.py * Add rdkit & openbabel environment * Update README.md * Update test.yml * Remove non user-friendly warnings * Update dpdata/rdkit/sanitize.py Co-authored-by: Jinzhe Zeng <[email protected]> * Use obabel python interface Co-authored-by: Jinzhe Zeng <[email protected]>
1 parent 1766b56 commit 19e08df

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

58 files changed

+3579
-9
lines changed

.github/workflows/test.yml

Lines changed: 11 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -13,12 +13,18 @@ jobs:
1313

1414
steps:
1515
- uses: actions/checkout@v2
16+
# set up conda
1617
- name: Set up Python ${{ matrix.python-version }}
17-
uses: actions/setup-python@v2
18+
uses: conda-incubator/setup-miniconda@v2
1819
with:
19-
python-version: ${{ matrix.python-version }}
20+
auto-activate-base: true
21+
activate-environment: ""
22+
# install rdkit and openbabel
23+
- name: Install rdkit
24+
run: conda create -c conda-forge -n my-rdkit-env python=${{ matrix.python-version }} rdkit openbabel;
2025
- name: Install dependencies
21-
run: pip install .[amber] coverage codecov
26+
run: source $CONDA/bin/activate my-rdkit-env && pip install .[amber] coverage codecov
2227
- name: Test
23-
run: cd tests && coverage run --source=../dpdata -m unittest && cd .. && coverage combine tests/.coverage && coverage report
24-
- run: codecov
28+
run: source $CONDA/bin/activate my-rdkit-env && cd tests && coverage run --source=../dpdata -m unittest && cd .. && coverage combine tests/.coverage && coverage report
29+
- name: Run codecov
30+
run: source $CONDA/bin/activate my-rdkit-env && codecov

README.md

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,7 @@ The `System` or `LabeledSystem` can be constructed from the following file forma
7878
| PWmat | movement | True | True | LabeledSystem | 'pwmat/movement' |
7979
| PWmat | OUT.MLMD | True | True | LabeledSystem | 'pwmat/out.mlmd' |
8080
| Amber | multi | True | True | LabeledSystem | 'amber/md' |
81+
| Amber/sqm | sqm.out | False | False | System | 'sqm/out' |
8182
| Gromacs | gro | True | False | System | 'gromacs/gro' |
8283

8384

@@ -206,4 +207,48 @@ s.replace('Hf', 'Zr', 8)
206207
s.to_vasp_poscar('POSCAR.P42nmc.replace')
207208
```
208209

210+
# BondOrderSystem
211+
A new class `BondOrderSystem` which inherits from class `System` is introduced in dpdata. This new class contains information of chemical bonds and formal charges (stored in `BondOrderSystem.data['bonds']`, `BondOrderSystem.data['formal_charges']`). Now BondOrderSystem can only read from .mol/.sdf formats, because of its dependency on rdkit (which means rdkit must be installed if you want to use this function). Other formats, such as pdb, must be converted to .mol/.sdf format (maybe with software like open babel).
212+
```python
213+
import dpdata
214+
system_1 = dpdata.BondOrderSystem("tests/bond_order/CH3OH.mol", fmt="mol") # read from .mol file
215+
system_2 = dpdata.BondOrderSystem("tests/bond_order/methane.sdf", fmt="sdf") # read from .sdf file
216+
```
217+
In sdf file, all molecules must be of the same topology (i.e. conformers of the same molecular configuration).
218+
`BondOrderSystem` also supports initialize from a `rdkit.Chem.rdchem.Mol` object directly.
219+
```python
220+
from rdkit import Chem
221+
from rdkit.Chem import AllChem
222+
import dpdata
223+
224+
mol = Chem.MolFromSmiles("CC")
225+
mol = Chem.AddHs(mol)
226+
AllChem.EmbedMultipleConfs(mol, 10)
227+
system = dpdata.BondOrderSystem(rdkit_mol=mol)
228+
```
229+
230+
## Bond Order Assignment
231+
The `BondOrderSystem` implements a more robust sanitize procedure for rdkit Mol, as defined in `dpdata.rdkit.santizie.Sanitizer`. This class defines 3 level of sanitization process by: low, medium and high. (default is medium).
232+
+ low: use `rdkit.Chem.SanitizeMol()` function to sanitize molecule.
233+
+ medium: before using rdkit, the programm will first assign formal charge of each atom to avoid inappropriate valence exceptions. However, this mode requires the rightness of the bond order information in the given molecule.
234+
+ high: the program will try to fix inappropriate bond orders in aromatic hetreocycles, phosphate, sulfate, carboxyl, nitro, nitrine, guanidine groups. If this procedure fails to sanitize the given molecule, the program will then try to call `obabel` to pre-process the mol and repeat the sanitization procedure. **That is to say, if you wan't to use this level of sanitization, please ensure `obabel` is installed in the environment.**
235+
According to our test, our sanitization procedure can successfully read 4852 small molecules in the PDBBind-refined-set. It is necessary to point out that the in the molecule file (mol/sdf), the number of explicit hydrogens has to be correct. Thus, we recommend to use
236+
`obabel xxx -O xxx -h` to pre-process the file. The reason why we do not implement this hydrogen-adding procedure in dpdata is that we can not ensure its correctness.
237+
238+
```python
239+
import dpdata
240+
241+
for sdf_file in glob.glob("bond_order/refined-set-ligands/obabel/*sdf"):
242+
syst = dpdata.BondOrderSystem(sdf_file, sanitize_level='high', verbose=False)
243+
```
244+
## Formal Charge Assignment
245+
BondOrderSystem implement a method to assign formal charge for each atom based on the 8-electron rule (see below). Note that it only supports common elements in bio-system: B,C,N,O,P,S,As
246+
```python
247+
import dpdata
248+
249+
syst = dpdata.BondOrderSystem("tests/bond_order/CH3NH3+.mol", fmt='mol')
250+
print(syst.get_formal_charges()) # return the formal charge on each atom
251+
print(syst.get_charge()) # return the total charge of the system
252+
```
209253

254+
If a valence of 3 is detected on carbon, the formal charge will be assigned to -1. Because for most cases (in alkynyl anion, isonitrile, cyclopentadienyl anion), the formal charge on 3-valence carbon is -1, and this is also consisent with the 8-electron rule.

dpdata/__init__.py

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,3 +9,15 @@
99
from ._version import version as __version__
1010
except ImportError:
1111
from .__about__ import __version__
12+
13+
# BondOrder System has dependency on rdkit
14+
try:
15+
import rdkit
16+
USE_RDKIT = True
17+
except ModuleNotFoundError:
18+
USE_RDKIT = False
19+
20+
if USE_RDKIT:
21+
from .bond_order_system import BondOrderSystem
22+
23+

dpdata/amber/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+

dpdata/amber/sqm.py

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
import numpy as np
2+
3+
4+
START = 0
5+
READ_CHARGE = 1
6+
READ_CHARGE_SUCCESS = 2
7+
READ_COORDS_START = 3
8+
READ_COORDS = 6
9+
10+
def parse_sqm_out(fname):
11+
'''
12+
Read atom symbols, charges and coordinates from ambertools sqm.out file
13+
'''
14+
atom_symbols = []
15+
coords = []
16+
charges = []
17+
with open(fname) as f:
18+
flag = 0
19+
for line in f:
20+
if line.startswith(" Atom Element Mulliken Charge"):
21+
flag = READ_CHARGE
22+
elif line.startswith(" Total Mulliken Charge"):
23+
flag = READ_CHARGE_SUCCESS
24+
elif line.startswith(" Final Structure"):
25+
flag = READ_COORDS_START
26+
elif flag == READ_CHARGE:
27+
ls = line.strip().split()
28+
atom_symbols.append(ls[-2])
29+
charges.append(float(ls[-1]))
30+
elif READ_COORDS_START <= flag < READ_COORDS:
31+
flag += 1
32+
elif flag == READ_COORDS:
33+
ls = line.strip()
34+
if not ls:
35+
break
36+
else:
37+
symbol = line.strip().split()[-4]
38+
coord = list(map(float, line.strip().split()[-3:]))
39+
coords.append(coord)
40+
return atom_symbols, charges, np.array(coords)
41+
42+
43+
def to_system_data(fname):
44+
data = {}
45+
atom_symbols, charges, coords = parse_sqm_out(fname)
46+
atom_names, data['atom_types'], atom_numbs = np.unique(atom_symbols, return_inverse=True, return_counts=True)
47+
data['atom_names'] = list(atom_names)
48+
data['atom_numbs'] = list(atom_numbs)
49+
data['charges'] = np.array([charges])
50+
data['coords'] = np.array([coords])
51+
data['orig'] = np.array([0, 0, 0])
52+
data['cells'] = np.array([[[100., 0., 0.], [0., 100., 0.], [0., 0., 100.]]])
53+
data['nopbc'] = True
54+
return data

dpdata/bond_order_system.py

Lines changed: 199 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,199 @@
1+
#%%
2+
# Bond Order System
3+
from dpdata.system import Register, System, LabeledSystem, check_System
4+
import rdkit.Chem
5+
import dpdata.rdkit.utils
6+
from dpdata.rdkit.sanitize import Sanitizer, SanitizeError
7+
from copy import deepcopy
8+
# import dpdata.rdkit.mol2
9+
10+
def check_BondOrderSystem(data):
11+
check_System(data)
12+
assert ('bonds' in data.keys())
13+
14+
class BondOrderSystem(System):
15+
'''
16+
The system with chemical bond and formal charges information
17+
18+
For example, a labeled methane system named `d_example` has one molecule (5 atoms, 4 bonds) and `n_frames` frames. The bond order and formal charge information can be accessed by
19+
- `d_example['bonds']` : a numpy array of size 4 x 3, and
20+
the first column represents the index of begin atom,
21+
the second column represents the index of end atom,
22+
the third columen represents the bond order:
23+
1 - single bond, 2 - double bond, 3 - triple bond, 1.5 - aromatic bond
24+
- `d_example['formal_charges']` : a numpy array of size 5 x 1
25+
'''
26+
def __init__(self,
27+
file_name = None,
28+
fmt = 'auto',
29+
type_map = None,
30+
begin = 0,
31+
step = 1,
32+
data = None,
33+
rdkit_mol = None,
34+
sanitize_level = "medium",
35+
raise_errors = True,
36+
verbose = False,
37+
**kwargs):
38+
"""
39+
Constructor
40+
41+
Parameters
42+
----------
43+
file_name : str
44+
The file to load the system
45+
fmt : str
46+
Format of the file, supported formats are
47+
- ``auto`` : inferred from `file_name`'s extention
48+
- ``mol`` : .mol file
49+
- ``sdf`` : .sdf file
50+
type_map : list of str
51+
Needed by formats deepmd/raw and deepmd/npy. Maps atom type to name. The atom with type `ii` is mapped to `type_map[ii]`.
52+
If not provided the atom names are assigned to `'Type_1'`, `'Type_2'`, `'Type_3'`...
53+
begin : int
54+
The beginning frame when loading MD trajectory.
55+
step : int
56+
The number of skipped frames when loading MD trajectory.
57+
data : dict
58+
System data dict.
59+
rdkit_mol : rdkit.Chem.rdchem.Mol
60+
If `file_name` is None, you must init with a rdkit Mol type.
61+
sanitize_level : str
62+
The level of sanitizer, 'low', 'medium' or 'high'.
63+
raise_errors : bool
64+
whether to raise an Exception if sanitization procedure fails.
65+
verbose : bool
66+
whether to print information in the sanitization procedure.
67+
"""
68+
69+
System.__init__(self)
70+
self.sanitizer = Sanitizer(sanitize_level, raise_errors, verbose)
71+
72+
if data:
73+
mol = dpdata.rdkit.utils.system_data_to_mol(data)
74+
self.from_rdkit_mol(mol)
75+
if file_name:
76+
self.from_fmt(file_name,
77+
fmt,
78+
type_map=type_map,
79+
begin=begin,
80+
step=step,
81+
**kwargs)
82+
elif rdkit_mol:
83+
self.from_rdkit_mol(rdkit_mol)
84+
else:
85+
raise ValueError("Please specify a mol/sdf file or a rdkit Mol object")
86+
87+
if type_map:
88+
self.apply_type_map(type_map)
89+
90+
register_from_funcs = Register()
91+
register_to_funcs = System.register_to_funcs + Register()
92+
93+
def __repr__(self):
94+
return self.__str__()
95+
96+
def __str__(self):
97+
'''
98+
A brief summary of the system
99+
'''
100+
ret = "Data Summary"
101+
ret += "\nBondOrder System"
102+
ret += "\n-------------------"
103+
ret += f"\nFrame Numbers : {self.get_nframes()}"
104+
ret += f"\nAtom Numbers : {self.get_natoms()}"
105+
ret += f"\nBond Numbers : {self.get_nbonds()}"
106+
ret += "\nElement List :"
107+
ret += "\n-------------------"
108+
ret += "\n"+" ".join(map(str,self.get_atom_names()))
109+
ret += "\n"+" ".join(map(str,self.get_atom_numbs()))
110+
return ret
111+
112+
def get_nbonds(self):
113+
'''
114+
Return the number of bonds
115+
'''
116+
return len(self.data['bonds'])
117+
118+
def get_charge(self):
119+
'''
120+
Return the total formal charge of the moleclue
121+
'''
122+
return sum(self.data['formal_charges'])
123+
124+
def get_mol(self):
125+
'''
126+
Return the rdkit.Mol object
127+
'''
128+
return self.rdkit_mol
129+
130+
def get_bond_order(self, begin_atom_idx, end_atom_idx):
131+
'''
132+
Return the bond order between given atoms
133+
'''
134+
return self.data['bond_dict'][f'{int(begin_atom_idx)}-{int(end_atom_idx)}']
135+
136+
def get_formal_charges(self):
137+
'''
138+
Return the formal charges on each atom
139+
'''
140+
return self.data['formal_charges']
141+
142+
def copy(self):
143+
new_mol = deepcopy(self.rdkit_mol)
144+
self.__class__(data=deepcopy(self.data),
145+
rdkit_mol=new_mol)
146+
147+
# def __add__(self, other):
148+
# '''
149+
# magic method "+" operation
150+
# '''
151+
# if isinstance(other, BondOrderSystem):
152+
# if dpdata.rdkit.utils.check_same_molecule(self.rdkit_mol, other.rdkit_mol):
153+
# self.__class__(self, data=other.data)
154+
# else:
155+
# raise RuntimeError("The two systems are not of the same topology.")
156+
# else:
157+
# raise RuntimeError(f"Unsupported data structure: {type(other)}")
158+
159+
def from_rdkit_mol(self, rdkit_mol):
160+
'''
161+
Initialize from a rdkit.Chem.rdchem.Mol object
162+
'''
163+
rdkit_mol = self.sanitizer.sanitize(rdkit_mol)
164+
self.data = dpdata.rdkit.utils.mol_to_system_data(rdkit_mol)
165+
self.data['bond_dict'] = dict([(f'{int(bond[0])}-{int(bond[1])}', bond[2]) for bond in self.data['bonds']])
166+
self.rdkit_mol = rdkit_mol
167+
168+
@register_from_funcs.register_funcs('mol')
169+
def from_mol_file(self, file_name):
170+
mol = rdkit.Chem.MolFromMolFile(file_name, sanitize=False, removeHs=False)
171+
self.from_rdkit_mol(mol)
172+
173+
@register_to_funcs.register_funcs("mol")
174+
def to_mol_file(self, file_name, frame_idx=0):
175+
assert (frame_idx < self.get_nframes())
176+
rdkit.Chem.MolToMolFile(self.rdkit_mol, file_name, confId=frame_idx)
177+
178+
@register_from_funcs.register_funcs("sdf")
179+
def from_sdf_file(self, file_name):
180+
'''
181+
Note that it requires all molecules in .sdf file must be of the same topology
182+
'''
183+
mols = [m for m in rdkit.Chem.SDMolSupplier(file_name, sanitize=False, removeHs=False)]
184+
if len(mols) > 1:
185+
mol = dpdata.rdkit.utils.combine_molecules(mols)
186+
else:
187+
mol = mols[0]
188+
self.from_rdkit_mol(mol)
189+
190+
@register_to_funcs.register_funcs("sdf")
191+
def to_sdf_file(self, file_name, frame_idx=-1):
192+
sdf_writer = rdkit.Chem.SDWriter(file_name)
193+
if frame_idx == -1:
194+
for ii in self.get_nframes():
195+
sdf_writer.write(self.rdkit_mol, confId=ii)
196+
else:
197+
assert (frame_idx < self.get_nframes())
198+
sdf_writer.write(self.rdkit_mol, confId=frame_idx)
199+
sdf_writer.close()

dpdata/deepmd/comp.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -71,6 +71,11 @@ def dump(folder,
7171
# dump raw
7272
np.savetxt(os.path.join(folder, 'type.raw'), data['atom_types'], fmt = '%d')
7373
np.savetxt(os.path.join(folder, 'type_map.raw'), data['atom_names'], fmt = '%s')
74+
# BondOrder System
75+
if "bonds" in data:
76+
np.savetxt(os.path.join(folder, "bonds.raw"), data['bonds'], header="begin_atom, end_atom, bond_order")
77+
if "formal_charges" in data:
78+
np.savetxt(os.path.join(folder, "formal_charges.raw"), data['formal_charges'])
7479
# reshape frame properties and convert prec
7580
nframes = data['cells'].shape[0]
7681
cells = np.reshape(data['cells'], [nframes, 9]).astype(comp_prec)

dpdata/deepmd/raw.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,6 +63,12 @@ def dump (folder, data) :
6363
np.savetxt(os.path.join(folder, 'type_map.raw'), data['atom_names'], fmt = '%s')
6464
np.savetxt(os.path.join(folder, 'box.raw'), np.reshape(data['cells'], [nframes, 9]))
6565
np.savetxt(os.path.join(folder, 'coord.raw'), np.reshape(data['coords'], [nframes, -1]))
66+
# BondOrder System
67+
if "bonds" in data:
68+
np.savetxt(os.path.join(folder, "bonds.raw"), data['bonds'], header="begin_atom, end_atom, bond_order")
69+
if "formal_charges" in data:
70+
np.savetxt(os.path.join(folder, "formal_charges.raw"), data['formal_charges'])
71+
# Labeled System
6672
if 'energies' in data :
6773
np.savetxt(os.path.join(folder, 'energy.raw'), np.reshape(data['energies'], [nframes, 1]))
6874
if 'forces' in data :

dpdata/pwmat/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+

dpdata/rdkit/__init__.py

Whitespace-only changes.

0 commit comments

Comments
 (0)