Skip to content

Commit 59dfe7d

Browse files
authored
Merge pull request #163 from amcadmus/master
Merge devel into master
2 parents b6247d5 + d044d90 commit 59dfe7d

File tree

136 files changed

+59935
-108
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

136 files changed

+59935
-108
lines changed

.github/workflows/test.yml

Lines changed: 11 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -13,12 +13,18 @@ jobs:
1313

1414
steps:
1515
- uses: actions/checkout@v2
16+
# set up conda
1617
- name: Set up Python ${{ matrix.python-version }}
17-
uses: actions/setup-python@v2
18+
uses: conda-incubator/setup-miniconda@v2
1819
with:
19-
python-version: ${{ matrix.python-version }}
20+
auto-activate-base: true
21+
activate-environment: ""
22+
# install rdkit and openbabel
23+
- name: Install rdkit
24+
run: conda create -c conda-forge -n my-rdkit-env python=${{ matrix.python-version }} rdkit openbabel;
2025
- name: Install dependencies
21-
run: pip install . coverage codecov
26+
run: source $CONDA/bin/activate my-rdkit-env && pip install .[amber] coverage codecov
2227
- name: Test
23-
run: cd tests && coverage run --source=../dpdata -m unittest && cd .. && coverage combine tests/.coverage && coverage report
24-
- run: codecov
28+
run: source $CONDA/bin/activate my-rdkit-env && cd tests && coverage run --source=../dpdata -m unittest && cd .. && coverage combine tests/.coverage && coverage report
29+
- name: Run codecov
30+
run: source $CONDA/bin/activate my-rdkit-env && codecov

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,4 +20,5 @@ dist
2020
dpdata.egg-info
2121
_version.py
2222
!tests/cp2k/aimd/cp2k.log
23+
!tests/cp2k/restart_aimd/ch4.log
2324
__pycache__

README.md

Lines changed: 46 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,8 @@ The `System` or `LabeledSystem` can be constructed from the following file forma
7878
| PWmat | movement | True | True | LabeledSystem | 'pwmat/movement' |
7979
| PWmat | OUT.MLMD | True | True | LabeledSystem | 'pwmat/out.mlmd' |
8080
| Amber | multi | True | True | LabeledSystem | 'amber/md' |
81-
| Gromacs | gro | False | False | System | 'gromacs/gro' |
81+
| Amber/sqm | sqm.out | False | False | System | 'sqm/out' |
82+
| Gromacs | gro | True | False | System | 'gromacs/gro' |
8283

8384

8485
The Class `dpdata.MultiSystems` can read data from a dir which may contains many files of different systems, or from single xyz file which contains different systems.
@@ -206,4 +207,48 @@ s.replace('Hf', 'Zr', 8)
206207
s.to_vasp_poscar('POSCAR.P42nmc.replace')
207208
```
208209

210+
# BondOrderSystem
211+
A new class `BondOrderSystem` which inherits from class `System` is introduced in dpdata. This new class contains information of chemical bonds and formal charges (stored in `BondOrderSystem.data['bonds']`, `BondOrderSystem.data['formal_charges']`). Now BondOrderSystem can only read from .mol/.sdf formats, because of its dependency on rdkit (which means rdkit must be installed if you want to use this function). Other formats, such as pdb, must be converted to .mol/.sdf format (maybe with software like open babel).
212+
```python
213+
import dpdata
214+
system_1 = dpdata.BondOrderSystem("tests/bond_order/CH3OH.mol", fmt="mol") # read from .mol file
215+
system_2 = dpdata.BondOrderSystem("tests/bond_order/methane.sdf", fmt="sdf") # read from .sdf file
216+
```
217+
In sdf file, all molecules must be of the same topology (i.e. conformers of the same molecular configuration).
218+
`BondOrderSystem` also supports initialize from a `rdkit.Chem.rdchem.Mol` object directly.
219+
```python
220+
from rdkit import Chem
221+
from rdkit.Chem import AllChem
222+
import dpdata
223+
224+
mol = Chem.MolFromSmiles("CC")
225+
mol = Chem.AddHs(mol)
226+
AllChem.EmbedMultipleConfs(mol, 10)
227+
system = dpdata.BondOrderSystem(rdkit_mol=mol)
228+
```
229+
230+
## Bond Order Assignment
231+
The `BondOrderSystem` implements a more robust sanitize procedure for rdkit Mol, as defined in `dpdata.rdkit.santizie.Sanitizer`. This class defines 3 level of sanitization process by: low, medium and high. (default is medium).
232+
+ low: use `rdkit.Chem.SanitizeMol()` function to sanitize molecule.
233+
+ medium: before using rdkit, the programm will first assign formal charge of each atom to avoid inappropriate valence exceptions. However, this mode requires the rightness of the bond order information in the given molecule.
234+
+ high: the program will try to fix inappropriate bond orders in aromatic hetreocycles, phosphate, sulfate, carboxyl, nitro, nitrine, guanidine groups. If this procedure fails to sanitize the given molecule, the program will then try to call `obabel` to pre-process the mol and repeat the sanitization procedure. **That is to say, if you wan't to use this level of sanitization, please ensure `obabel` is installed in the environment.**
235+
According to our test, our sanitization procedure can successfully read 4852 small molecules in the PDBBind-refined-set. It is necessary to point out that the in the molecule file (mol/sdf), the number of explicit hydrogens has to be correct. Thus, we recommend to use
236+
`obabel xxx -O xxx -h` to pre-process the file. The reason why we do not implement this hydrogen-adding procedure in dpdata is that we can not ensure its correctness.
237+
238+
```python
239+
import dpdata
240+
241+
for sdf_file in glob.glob("bond_order/refined-set-ligands/obabel/*sdf"):
242+
syst = dpdata.BondOrderSystem(sdf_file, sanitize_level='high', verbose=False)
243+
```
244+
## Formal Charge Assignment
245+
BondOrderSystem implement a method to assign formal charge for each atom based on the 8-electron rule (see below). Note that it only supports common elements in bio-system: B,C,N,O,P,S,As
246+
```python
247+
import dpdata
248+
249+
syst = dpdata.BondOrderSystem("tests/bond_order/CH3NH3+.mol", fmt='mol')
250+
print(syst.get_formal_charges()) # return the formal charge on each atom
251+
print(syst.get_charge()) # return the total charge of the system
252+
```
209253

254+
If a valence of 3 is detected on carbon, the formal charge will be assigned to -1. Because for most cases (in alkynyl anion, isonitrile, cyclopentadienyl anion), the formal charge on 3-valence carbon is -1, and this is also consisent with the 8-electron rule.

dpdata/__init__.py

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,3 +9,15 @@
99
from ._version import version as __version__
1010
except ImportError:
1111
from .__about__ import __version__
12+
13+
# BondOrder System has dependency on rdkit
14+
try:
15+
import rdkit
16+
USE_RDKIT = True
17+
except ModuleNotFoundError:
18+
USE_RDKIT = False
19+
20+
if USE_RDKIT:
21+
from .bond_order_system import BondOrderSystem
22+
23+

dpdata/abacus/scf.py

Lines changed: 180 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,180 @@
1+
import os,sys
2+
import numpy as np
3+
4+
bohr2ang = 0.5291770
5+
ry2ev = 13.605698
6+
kbar2evperang3 = 1e3 / 1.6021892e6
7+
# The consts are cited from $ABACUS_ROOT/source/src_global/constant.h
8+
9+
10+
def get_block (lines, keyword, skip = 0, nlines = None):
11+
ret = []
12+
found = False
13+
if not nlines:
14+
nlines = 1e6
15+
for idx,ii in enumerate(lines) :
16+
if keyword in ii :
17+
found = True
18+
blk_idx = idx + 1 + skip
19+
line_idx = 0
20+
while len(lines[blk_idx]) == 0:
21+
blk_idx += 1
22+
while len(lines[blk_idx]) != 0 and line_idx < nlines and blk_idx != len(lines):
23+
ret.append(lines[blk_idx])
24+
blk_idx += 1
25+
line_idx += 1
26+
break
27+
if not found:
28+
raise RuntimeError("The keyword %s is not found in the script." %keyword)
29+
return ret
30+
31+
def get_geometry_in(fname, inlines):
32+
geometry_path_in = os.path.join(fname, "STRU")
33+
for line in inlines:
34+
if "atom_file" in line and "atom_file"==line.split()[0]:
35+
atom_file = line.split()[1]
36+
geometry_path_in = os.path.join(fname, atom_file)
37+
break
38+
return geometry_path_in
39+
40+
def get_path_out(fname, inlines):
41+
path_out = os.path.join(fname, "OUT.ABACUS/running_scf.log")
42+
for line in inlines:
43+
if "suffix" in line and "suffix"==line.split()[0]:
44+
suffix = line.split()[1]
45+
path_out = os.path.join(fname, "OUT.%s/running_scf.log" % suffix)
46+
break
47+
return path_out
48+
49+
def get_cell(geometry_inlines):
50+
cell_lines = get_block(geometry_inlines, "LATTICE_VECTORS", skip = 0, nlines = 3)
51+
celldm_lines = get_block(geometry_inlines, "LATTICE_CONSTANT", skip=0, nlines=1)
52+
53+
celldm = float(celldm_lines[0].split()[0]) * bohr2ang # lattice const is in Bohr
54+
cell = []
55+
for ii in range(3):
56+
cell.append([float(jj) for jj in cell_lines[ii].split()[0:3]])
57+
cell = celldm*np.array(cell)
58+
return celldm, cell
59+
60+
def get_coords(celldm, cell, geometry_inlines, inlines):
61+
coords_lines = get_block(geometry_inlines, "ATOMIC_POSITIONS", skip=0)
62+
# assuming that ATOMIC_POSITIONS is at the bottom of the STRU file
63+
coord_type = coords_lines[0].split()[0].lower() # cartisan or direct
64+
atom_names = [] # element abbr in periodic table
65+
atom_types = [] # index of atom_names of each atom in the geometry
66+
atom_numbs = [] # of atoms for each element
67+
coords = [] # coordinations of atoms
68+
ntype = 0
69+
for line in inlines:
70+
if "ntype" in line and "ntype"==line.split()[0]:
71+
ntype = int(line.split()[1])
72+
break
73+
if ntype <= 0:
74+
raise RuntimeError('ntype cannot be found in INPUT file.')
75+
line_idx = 1 # starting line of first element
76+
for it in range(ntype):
77+
atom_names.append(coords_lines[line_idx].split()[0])
78+
line_idx+=2
79+
atom_numbs.append(int(coords_lines[line_idx].split()[0]))
80+
line_idx+=1
81+
for iline in range(atom_numbs[it]):
82+
xyz = np.array([float(xx) for xx in coords_lines[line_idx].split()[0:3]])
83+
if coord_type == "cartesian":
84+
xyz = xyz*celldm
85+
elif coord_type == "direct":
86+
tmp = np.matmul(xyz, cell)
87+
xyz = tmp
88+
else:
89+
print("coord_type = %s" % coord_type)
90+
raise RuntimeError("Input coordination type is invalid.\n Only direct and cartesian are accepted.")
91+
coords.append(xyz)
92+
atom_types.append(it)
93+
line_idx += 1
94+
coords = np.array(coords) # need transformation!!!
95+
atom_types = np.array(atom_types)
96+
return atom_names, atom_numbs, atom_types, coords
97+
98+
def get_energy(outlines):
99+
Etot = None
100+
for line in outlines:
101+
if "!FINAL_ETOT_IS" in line:
102+
Etot = float(line.split()[1]) # in eV
103+
break
104+
if not Etot:
105+
not_converge = False
106+
for line in outlines:
107+
if "convergence has NOT been achieved!" in line:
108+
not_converge = True
109+
raise RuntimeError("convergence has NOT been achieved in scf!")
110+
break
111+
if not not_converge:
112+
raise RuntimeError("Final total energy cannot be found in output. Unknown problem.")
113+
return Etot
114+
115+
def get_force (outlines):
116+
force = []
117+
force_inlines = get_block (outlines, "TOTAL-FORCE (eV/Angstrom)", skip = 4)
118+
for line in force_inlines:
119+
force.append([float(f) for f in line.split()[1:4]])
120+
force = np.array(force)
121+
return force
122+
123+
def get_stress(outlines):
124+
stress = []
125+
stress_inlines = get_block(outlines, "TOTAL-STRESS (KBAR)", skip = 3)
126+
for line in stress_inlines:
127+
stress.append([float(f) for f in line.split()])
128+
stress = np.array(stress) * kbar2evperang3
129+
return stress
130+
131+
132+
133+
def get_frame (fname):
134+
if type(fname) == str:
135+
# if the input parameter is only one string, it is assumed that it is the
136+
# base directory containing INPUT file;
137+
path_in = os.path.join(fname, "INPUT")
138+
else:
139+
raise RuntimeError('invalid input')
140+
with open(path_in, 'r') as fp:
141+
inlines = fp.read().split('\n')
142+
143+
geometry_path_in = get_geometry_in(fname, inlines)
144+
path_out = get_path_out(fname, inlines)
145+
146+
with open(geometry_path_in, 'r') as fp:
147+
geometry_inlines = fp.read().split('\n')
148+
with open(path_out, 'r') as fp:
149+
outlines = fp.read().split('\n')
150+
151+
celldm, cell = get_cell(geometry_inlines)
152+
atom_names, natoms, types, coords = get_coords(celldm, cell, geometry_inlines, inlines)
153+
154+
energy = get_energy(outlines)
155+
force = get_force (outlines)
156+
stress = get_stress(outlines) * np.linalg.det(cell)
157+
158+
data = {}
159+
data['atom_names'] = atom_names
160+
data['atom_numbs'] = natoms
161+
data['atom_types'] = types
162+
data['cells'] = cell[np.newaxis, :, :]
163+
data['coords'] = coords[np.newaxis, :, :]
164+
data['energies'] = np.array(energy)[np.newaxis]
165+
data['forces'] = force[np.newaxis, :, :]
166+
data['virials'] = stress[np.newaxis, :, :]
167+
data['orig'] = np.zeros(3)
168+
# print("atom_names = ", data['atom_names'])
169+
# print("natoms = ", data['atom_numbs'])
170+
# print("types = ", data['atom_types'])
171+
# print("cells = ", data['cells'])
172+
# print("coords = ", data['coords'])
173+
# print("energy = ", data['energies'])
174+
# print("force = ", data['forces'])
175+
# print("virial = ", data['virials'])
176+
return data
177+
178+
if __name__ == "__main__":
179+
path = "/home/lrx/work/12_ABACUS_dpgen_interface/dpdata/dpdata/tests/abacus.scf"
180+
data = get_frame(path)

dpdata/amber/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+

dpdata/amber/mask.py

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
"""Amber mask"""
2+
try:
3+
import parmed
4+
except ImportError:
5+
pass
6+
7+
def pick_by_amber_mask(param, maskstr, coords=None):
8+
"""Pick atoms by amber masks
9+
10+
Parameters
11+
----------
12+
param: str or parmed.Structure
13+
filename of Amber param file or parmed.Structure
14+
maskstr: str
15+
Amber masks
16+
coords: np.ndarray (optional)
17+
frame coordinates, shape: N*3
18+
"""
19+
parm = load_param_file(param)
20+
if coords is not None:
21+
parm.initialize_topology(xyz=coords)
22+
sele = []
23+
if len(maskstr) > 0:
24+
newmaskstr = maskstr.replace("@0", "!@*")
25+
sele = [parm.atoms[i].idx for i in parmed.amber.mask.AmberMask(
26+
parm, newmaskstr).Selected()]
27+
return sele
28+
29+
def load_param_file(param_file):
30+
if isinstance(param_file, str):
31+
return parmed.load_file(param_file)
32+
elif isinstance(param_file, parmed.Structure):
33+
return param_file
34+
else:
35+
raise RuntimeError("Unsupported structure")

0 commit comments

Comments
 (0)