Commit 045a996

YfB1125, Dubhe-Chang, and HydrogenSulfate authored
[Example] Add Chem Suzuki-Miyaura cross-coupling reaction yield prediction (#1175)
* establish example chem
* make the code consistent with the upstream
* make the code consistent with the upstream
* make the code consistent with the upstream
* delete unnecessary file
* fix: code format via pre-commit hooks
* reset gitmodule
* fix: restore accidentally deleted paddle_sparse submodule
* fix: update XPINN_2D_PoissonsEqn.py file permissions to 755
* renamed:examples/chem/Chem.py -> examples/chem/chem.py
* update examples/chem/chem.py
* update chem.py
* update chem.yaml
* update chem.yaml
* update ./ppsci/arch/chem.py
* update ./ppsci/arch/chem.py
* update chem.md
* rename files of the example
* rename files in ppsci
* chem.md -> smc_reac.md
* update mkdocs.yml
* rename eval -> evaluate in smc_reac.py
* add dataset download links in smc_reac.md
* update smc_reac
* delete data_set.xlsx
* update smc_reac.md
* update smc_reac.md
* update smc_reac.yaml
* fix(smc_reac): correct evaluate unpacking logic
* update smc_reac.md
* merge code of upstream
* delete unnecessary files
* update smc_reac.py
* fix(smc_reac:delete unnecessary information & correct evaluation instruction)
* correct evaluation instruction
* Update docs/zh/examples/smc_reac.md
* Update docs/zh/examples/smc_reac.md
* Update docs/zh/examples/smc_reac.md
* Update docs/zh/examples/smc_reac.md

---------

Co-authored-by: Dubhe-Chang <[email protected]>
Co-authored-by: HydrogenSulfate <[email protected]>
1 parent efa43c0 commit 045a996

7 files changed: +503 -0 lines changed
docs/zh/examples/smc_reac.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
# Suzuki-Miyaura Cross-Coupling Reaction Yield Prediction

!!! note

    1. Before training or evaluation, download the data file [data_set.xlsx](https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx) and set `data_dir` in the yaml config file to the actual path of `data_set.xlsx`.
    2. To evaluate with the pretrained model, first download [smc_reac_model.pdparams](https://paddle-org.bj.bcebos.com/paddlescience/models/smc_reac/smc_reac_model.pdparams) and set `EVAL.pretrained_model_path` in the yaml config file to the path of the model parameters.
    3. Before the first training or evaluation run, execute `pip install -r requirements.txt` to install `rdkit` and the other dependencies.

=== "Model training command"

    ``` sh
    # linux
    wget -nc https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx
    # windows
    curl https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx -o data_set.xlsx
    python smc_reac.py
    ```

=== "Model evaluation command"

    ``` sh
    # linux
    wget -nc https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx
    # windows
    curl https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx -o data_set.xlsx
    python smc_reac.py mode=eval EVAL.pretrained_model_path=https://paddle-org.bj.bcebos.com/paddlescience/models/smc_reac/smc_reac_model.pdparams
    ```

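## 1. Background

The Suzuki-Miyaura cross-coupling reaction can be written as follows.

$$
\mathrm{Ar{-}X} + \mathrm{Ar'{-}B(OH)_2} \xrightarrow[\text{Base}]{\mathrm{Pd}^0} \mathrm{Ar{-}Ar'} + \mathrm{HX}
$$

Catalyzed by a zero-valent palladium complex, aryl or alkenyl boronic acids or boronate esters undergo cross-coupling with aryl or alkenyl chlorides, bromides, or iodides. The reaction offers mild conditions and high conversion, and it plays an important role in materials synthesis and drug development, but developing a new reaction takes a long time and trial and error is costly. This example uses high-throughput experimental data to analyze how the substrates (electrophile and nucleophile), catalyst ligand, base, and solvent affect the coupling yield, and builds a predictive model on that basis.

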
## 2. Implementing the Suzuki-Miyaura cross-coupling yield prediction model

This section explains how to build, train, test, and evaluate a Suzuki-Miyaura cross-coupling yield prediction model with PaddleScience. The example directory is laid out as follows.
``` log
smc_reac/
├── config/
│   └── smc_reac.yaml
├── data_set.xlsx
├── requirements.txt
└── smc_reac.py
```

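### 2.1 Dataset construction and loading

The data used in this example comes from the open dataset published with reference [1]. Only the effect of the reagents themselves on the outcome is considered, so the example keeps the subset of reactions in which every component has a reagent present, saved in `./data_set.xlsx`. That work developed an automated platform based on flow chemistry. The platform is assembled in an argon-filled glovebox and uses a modified high-performance liquid chromatography (HPLC) system with an automated sampler to draw reaction components (electrophile, nucleophile, catalyst, ligand, and base) from 192 reservoir vials according to a programmed sequence and inject them into a flowing carrier solvent. Each reaction segment reacts in a temperature-controlled coil at a set flow rate, pressure, and time, and the reaction mixture is analyzed in real time by UPLC-MS. By varying the combinations of electrophiles, nucleophiles, 11 ligands, 7 bases, and 4 solvents, 5760 reaction conditions were screened systematically. The following takes one record as an example and, together with the code, shows how the dataset is built and loaded.

```
ClC=1C=C2C=CC=NC2=CC1 | CC=1C(=C2C=NN(C2=CC1)C1OCCCC1)B(O)O | C(C)(C)(C)P(C(C)(C)C)C(C)(C)C | [OH-].[Na+] | C(C)#N | 4.76
```
The fields are, in order, the electrophile, nucleophile, catalyst ligand, base, and solvent written as SMILES strings, followed by the experimental yield.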
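
To make the field order concrete, the following minimal sketch splits the display form of the record above into named fields. The `ReactionRecord` class and `parse_record` helper are hypothetical names introduced only for illustration; the example itself reads the columns directly from the spreadsheet with pandas.

```python
from dataclasses import dataclass


@dataclass
class ReactionRecord:
    """One reaction: five SMILES strings plus the measured yield."""

    electrophile: str
    nucleophile: str
    ligand: str
    base: str
    solvent: str
    yield_value: float


def parse_record(line: str) -> ReactionRecord:
    """Split a '|'-separated display line into its six fields (hypothetical helper)."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    *smiles, yield_text = parts
    return ReactionRecord(*smiles, float(yield_text))


record = parse_record(
    "ClC=1C=C2C=CC=NC2=CC1 | CC=1C(=C2C=NN(C2=CC1)C1OCCCC1)B(O)O | "
    "C(C)(C)(C)P(C(C)(C)C)C(C)(C)C | [OH-].[Na+] | C(C)#N | 4.76"
)
print(record.base, record.solvent, record.yield_value)
```
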

First, the reagent information and reaction yields are imported from the spreadsheet and split into a training set and a test set:

``` py linenums="26" title="examples/smc_reac/smc_reac.py"
--8<--
examples/smc_reac/smc_reac.py:26:34
--8<--
```
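
The idea behind a seeded 80/20 split can be sketched with the standard library alone. This is an illustration under stated assumptions: `split_indices` is a hypothetical helper, and its shuffle will not reproduce the exact row assignment of `train_test_split(..., test_size=0.2, random_state=42)` used in the example.

```python
import math
import random


def split_indices(n_rows: int, test_size: float = 0.2, seed: int = 42):
    """Shuffle row indices with a fixed seed and return (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same permutation
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the test rows identical between the training run and a later evaluation run, which is why the example hard-codes `random_state`.
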
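
`rdkit.Chem.rdFingerprintGenerator` is used to convert the SMILES strings of the electrophile, nucleophile, catalyst ligand, base, and solvent into Morgan fingerprints. A Morgan fingerprint is a vectorized description of molecular structure: local topological environments are encoded as hash values and mapped onto a 2048-bit fingerprint. The PaddleScience code is as follows:

``` py linenums="37" title="examples/smc_reac/smc_reac.py"
--8<--
examples/smc_reac/smc_reac.py:37:65
--8<--
```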
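
The "hash local fragments into a fixed-length bit vector" idea can be illustrated without RDKit. The sketch below is a toy stand-in, not the Morgan algorithm: it hashes character n-grams of the SMILES string instead of atom environments, so the bits carry no chemical meaning. `toy_hashed_fingerprint` is a hypothetical name. The final concatenation mirrors `build_dataset`, which joins the five per-reagent vectors along the feature axis.

```python
import hashlib


def toy_hashed_fingerprint(smiles: str, n_bits: int = 2048, max_len: int = 3) -> list:
    """Set one bit per character n-gram of the SMILES string.

    Character n-grams stand in for the atom environments that a real
    Morgan fingerprint enumerates; this is an illustration only.
    """
    bits = [0.0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start : start + size]
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            # Fold the hash value onto one of the n_bits positions.
            bits[int.from_bytes(digest[:8], "big") % n_bits] = 1.0
    return bits


reaction = [
    "ClC=1C=C2C=CC=NC2=CC1",
    "CC=1C(=C2C=NN(C2=CC1)C1OCCCC1)B(O)O",
    "C(C)(C)(C)P(C(C)(C)C)C(C)(C)C",
    "[OH-].[Na+]",
    "C(C)#N",
]
# One model input row: five 2048-bit vectors joined end to end.
row = [bit for smi in reaction for bit in toy_hashed_fingerprint(smi)]
print(len(row))
```
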
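
### 2.2 Constraint construction

This example uses supervised learning. Following the PaddleScience API structure, the built-in `SupervisedConstraint` is used to build the supervised constraint. The PaddleScience code is as follows:

``` py linenums="73" title="examples/smc_reac/smc_reac.py"
--8<--
examples/smc_reac/smc_reac.py:73:88
--8<--
```
The second argument of `SupervisedConstraint` selects mean squared error (`MSELoss`) as the loss function, and the third argument is the name of the constraint, which makes it easy to look up later.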
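
For reference, mean squared error with "mean" reduction averages the squared differences between predictions and labels. The following plain-Python sketch shows the arithmetic on made-up yield values; it is not the `ppsci.loss.MSELoss` implementation.

```python
def mse_loss(pred, label):
    """Mean of squared differences between predictions and labels."""
    if len(pred) != len(label):
        raise ValueError("pred and label must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, label)) / len(pred)


# Made-up predicted and measured yields, for illustration only.
loss = mse_loss([10.0, 50.0, 90.0], [12.0, 50.0, 84.0])
print(loss)
```
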
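
### 2.3 Model construction

The example uses five independent sub-networks (fully connected layers with ReLU activation), each extracting features for one chemical component. The five feature vectors are then combined by a weighted average with trainable weights, so the model learns adaptively how much each component influences the predicted yield. Finally, the fused features pass through a fully connected layer that maps them to the predicted yield. The architecture extracts information from each reaction component independently and fuses it with learned weights, which matches the mechanistic character of the reaction. The PaddleScience code is as follows:

``` py linenums="7" title="ppsci/arch/smc_reac.py"
--8<--
ppsci/arch/smc_reac.py:7:107
--8<--
```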
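
The data flow described above (five branches, weighted fusion, output layer) can be sketched in plain Python. This is not the `SuzukiMiyauraModel` code: each branch here is a single layer with tiny dimensions, the mixing weights are fixed rather than trained, and normalizing by the sum of the weights is an assumption about how the weighted average is formed.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(rng, in_dim, out_dim):
    """Random weights and zero bias for one fully connected layer."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def forward(inputs, branches, mix_weights, head):
    """inputs: five vectors; branches: five (W, b) pairs; head: one (W, b) pair."""
    # Each reagent gets its own branch.
    features = [relu(linear(x, w, b)) for x, (w, b) in zip(inputs, branches)]
    # Weighted average of the five feature vectors (normalization is assumed).
    total = sum(mix_weights)
    fused = [
        sum(a * f[i] for a, f in zip(mix_weights, features)) / total
        for i in range(len(features[0]))
    ]
    return linear(fused, *head), features, fused


rng = random.Random(0)
in_dim, hidden_dim = 6, 4  # tiny stand-ins for the 2048-bit inputs and real widths
branches = [init_linear(rng, in_dim, hidden_dim) for _ in range(5)]
head = init_linear(rng, hidden_dim, 1)
inputs = [[float(rng.randint(0, 1)) for _ in range(in_dim)] for _ in range(5)]
mix_weights = [1.0] * 5  # trainable in the real model; fixed here

prediction, features, fused = forward(inputs, branches, mix_weights, head)
print(prediction)
```

With equal mixing weights the fusion reduces to a plain mean of the five branch outputs; training would shift the weights toward the components that matter most for the yield.
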

The model is instantiated from the configuration file:

``` py linenums="90" title="examples/smc_reac/smc_reac.py"
--8<--
examples/smc_reac/smc_reac.py:90:90
--8<--
```

The parameters are set in the configuration file as follows:

``` yaml linenums="35" title="examples/smc_reac/config/smc_reac.yaml"
--8<--
examples/smc_reac/config/smc_reac.yaml:35:41
--8<--
```

### 2.4 Optimizer construction

Training uses the Adam optimizer, with the learning rate taken from the configuration file. The PaddleScience code is as follows:

``` py linenums="92" title="examples/smc_reac/smc_reac.py"
--8<--
examples/smc_reac/smc_reac.py:92:92
--8<--
```
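
As a reminder of what one Adam update does, here is a scalar sketch of the standard update rule. The learning rate matches `TRAIN.learning_rate` in the config; the `beta1`, `beta2`, and `eps` values are the commonly used defaults and are assumed here, and `adam_step` is a hypothetical helper rather than the Paddle implementation.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad  # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1**t)  # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=0.5, m=m, v=v, t=1)
print(param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.
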

### 2.5 Model training

With the setup above complete, pass the instantiated objects to `ppsci.solver.Solver` in order and start training. The PaddleScience code is as follows:

``` py linenums="95" title="examples/smc_reac/smc_reac.py"
--8<--
examples/smc_reac/smc_reac.py:95:104
--8<--
```

## 3. Complete code

``` py linenums="1" title="examples/smc_reac/smc_reac.py"
--8<--
examples/smc_reac/smc_reac.py
--8<--
```

## 4. Results

The figure below shows the model's predictions of Suzuki-Miyaura cross-coupling reaction yields.

<figure markdown>
  ![chem.png](https://paddle-org.bj.bcebos.com/paddlescience/docs/SMCReac/chem.png){ loading=lazy }
  <figcaption> Model predictions of Suzuki-Miyaura cross-coupling reaction yields</figcaption>
</figure>
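
The evaluation script reports MAE, RMSE, and R2 for the test set alongside the scatter plot. The sketch below computes the textbook definitions of these metrics in plain Python on made-up values; it is assumed, not confirmed by the source, that `ppsci.metric.MAE`, `RMSE`, and `R2Score` follow the same definitions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and the coefficient of determination R2."""
    n = len(y_true)
    errors = [p - t for p, t in zip(y_pred, y_true)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {"MAE": mae, "RMSE": rmse, "R2": 1.0 - ss_res / ss_tot}


# Made-up values, for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```
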

## 5. References

[1] Perera D, Tucker J W, Brahmbhatt S, et al. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow[J]. Science, 2018, 359(6374): 429-434.

examples/smc_reac/config/smc_reac.yaml

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
defaults:
  - ppsci_default
  - TRAIN: train_default
  - TRAIN/ema: ema_default
  - TRAIN/swa: swa_default
  - EVAL: eval_default
  - INFER: infer_default
  - hydra/job/config/override_dirname/exclude_keys: exclude_keys_default
  - _self_

hydra:
  run:
    # dynamic output directory according to running time and override name
    dir: ./outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}/${hydra.job.override_dirname} #
  job:
    name: ${mode} # name of logfile
    chdir: false # keep current working directory unchanged
  callbacks:
    init_callback: #
      _target_: ppsci.utils.callbacks.InitCallback #
  sweep:
    # output directory for multirun
    dir: ${hydra.run.dir}
    subdir: ./

# general settings
mode: train # running mode: train/eval #
seed: 42 #
output_dir: ${hydra:run.dir} #
log_freq: 20 #
use_tbd: false #
data_dir: "./data_set.xlsx" #

# model settings
MODEL: #
  input_dim : 2048 # Assuming x_train is your DataFrame
  output_dim : 1
  hidden_dim : 512
  hidden_dim2 : 1024
  hidden_dim3 : 2048
  hidden_dim4 : 1024

# training settings
TRAIN: #
  epochs: 1500 #
  iters_per_epoch: 20 #
  batch_size: 8 #
  learning_rate: 0.0001
  pretrained_model_path: null
  checkpoint_path: null

# evaluation settings
EVAL:
  pretrained_model_path: null
  batch_size: 8 #
  seed: 20

examples/smc_reac/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
openpyxl
rdkit
scikit-learn

examples/smc_reac/smc_reac.py

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
import os

import hydra
import matplotlib.pyplot as plt
import numpy as np
import paddle
import pandas as pd
import rdkit.Chem as Chem
from omegaconf import DictConfig
from rdkit.Chem import rdFingerprintGenerator
from sklearn.model_selection import train_test_split

import ppsci

os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["KMP_DUPLICATE_LIB_OK"] = "True"
plt.rcParams["axes.unicode_minus"] = False
plt.rcParams["font.sans-serif"] = ["DejaVu Sans"]

x_train = None
x_test = None
y_train = None
y_test = None


def load_data(cfg: DictConfig):
    data_dir = cfg.data_dir
    dataset = pd.read_excel(data_dir, skiprows=1)
    x = dataset.iloc[:, 1:6]
    y = dataset.iloc[:, 6]
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42
    )
    return x_train, x_test, y_train, y_test


def data_processed(x, y):
    x = build_dataset(x)
    y = paddle.to_tensor(y.to_numpy(dtype=np.float32))
    y = paddle.unsqueeze(y, axis=1)
    return x, y


def build_dataset(data):
    r1 = paddle.to_tensor(np.array(cal_print(data.iloc[:, 0])), dtype=paddle.float32)
    r2 = paddle.to_tensor(np.array(cal_print(data.iloc[:, 1])), dtype=paddle.float32)
    ligand = paddle.to_tensor(
        np.array(cal_print(data.iloc[:, 2])), dtype=paddle.float32
    )
    base = paddle.to_tensor(np.array(cal_print(data.iloc[:, 3])), dtype=paddle.float32)
    solvent = paddle.to_tensor(
        np.array(cal_print(data.iloc[:, 4])), dtype=paddle.float32
    )
    return paddle.concat([r1, r2, ligand, base, solvent], axis=1)


def cal_print(smiles):
    vectors = []
    for smi in smiles:
        mol = Chem.MolFromSmiles(smi)
        generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
        fp = generator.GetFingerprint(mol)
        _input = np.array(list(map(float, fp.ToBitString())))
        vectors.append(_input)
    return vectors


def train(cfg: DictConfig):
    global x_train, y_train
    x_train, y_train = data_processed(x_train, y_train)

    # build supervised constraint
    sup = ppsci.constraint.SupervisedConstraint(
        dataloader_cfg={
            "dataset": {
                "input": {"v": x_train},
                "label": {"u": y_train},
                # "weight": {"W": param},
                "name": "IterableNamedArrayDataset",
            },
            "batch_size": cfg.TRAIN.batch_size,
        },
        loss=ppsci.loss.MSELoss("mean"),
        name="sup",
    )
    constraint = {
        "sup": sup,
    }

    model = ppsci.arch.SuzukiMiyauraModel(**cfg.MODEL)

    optimizer = ppsci.optimizer.optimizer.Adam(cfg.TRAIN.learning_rate)(model)

    # Build solver
    solver = ppsci.solver.Solver(
        model,
        constraint=constraint,
        optimizer=optimizer,
        epochs=cfg.TRAIN.epochs,
        eval_during_train=False,
        iters_per_epoch=cfg.TRAIN.iters_per_epoch,
        cfg=cfg,
    )
    solver.train()


def evaluate(cfg: DictConfig):
    global x_test, y_test

    x_test, y_test = data_processed(x_test, y_test)

    test_validator = ppsci.validate.SupervisedValidator(
        dataloader_cfg={
            "dataset": {
                "input": {"v": x_test},
                "label": {"u": y_test},
                "name": "IterableNamedArrayDataset",
            },
            "batch_size": cfg.EVAL.batch_size,
            "shuffle": False,
        },
        loss=ppsci.loss.MSELoss("mean"),
        metric={
            "MAE": ppsci.metric.MAE(),
            "RMSE": ppsci.metric.RMSE(),
            "R2": ppsci.metric.R2Score(),
        },
        name="test_eval",
    )
    validators = {"test_eval": test_validator}

    model = ppsci.arch.SuzukiMiyauraModel(**cfg.MODEL)
    solver = ppsci.solver.Solver(
        model,
        validator=validators,
        cfg=cfg,
    )

    loss_val, metric_dict = solver.eval()

    ypred = model({"v": x_test})["u"].numpy()
    ytrue = y_test.numpy()

    mae = metric_dict["MAE"]["u"]
    rmse = metric_dict["RMSE"]["u"]
    r2 = metric_dict["R2"]["u"]

    plt.figure()
    plt.scatter(ytrue, ypred, s=15, color="royalblue", marker="s", linewidth=1)
    plt.plot([ytrue.min(), ytrue.max()], [ytrue.min(), ytrue.max()], "r-", lw=1)
    plt.legend(title="R²={:.3f}\n\nMAE={:.3f}".format(r2, mae))
    plt.xlabel("Test Yield(%)")
    plt.ylabel("Predicted Yield(%)")
    save_path = "smc_reac.png"
    plt.savefig(save_path)
    print(f"Image saved to: {save_path}")
    plt.show()

    print("Evaluation metrics:")
    print(f"Loss: {loss_val:.4f}")
    print(f"MAE : {mae:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"R2 : {r2:.4f}")


@hydra.main(version_base=None, config_path="./config", config_name="smc_reac.yaml")
def main(cfg: DictConfig):
    global x_train, x_test, y_train, y_test

    x_train, x_test, y_train, y_test = load_data(cfg)

    if cfg.mode == "train":
        train(cfg)
    elif cfg.mode == "eval":
        evaluate(cfg)
    else:
        raise ValueError(f"cfg.mode should in ['train', 'eval'], but got '{cfg.mode}'")


if __name__ == "__main__":
    main()
