Commit ade45c8

feat: add ml2ddb (#1182)

* feat: add ml2ddb
* move ml2ddb to examples
* move *.png to bce
* feat: add ml2ddb
1 parent 3790628 commit ade45c8

5 files changed: +121 -1 lines changed

5 files changed

+121
-1
lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@ PaddleScience 是一个基于深度学习框架 PaddlePaddle 开发的科学计
 | Crystal material property prediction | [CGCNN](https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/examples/cgcnn/) | Data-driven | GNN | Supervised learning | [MP](https://next-gen.materialsproject.org/) / [Perovskite](https://cmr.fysik.dtu.dk/cubic_perovskites/cubic_perovskites.html) / [C2DB](https://cmr.fysik.dtu.dk/c2db/c2db.html) / [test](https://paddle-org.bj.bcebos.com/paddlescience%2Fdatasets%2Fcgcnn%2Fcgcnn-test.zip) | [Paper](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.120.145301) |
 | Molecule generation | [MoFlow](https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/examples/moflow/) | Data-driven | Flow Model | Supervised learning | [qm9/ zink250k](https://aistudio.baidu.com/datasetdetail/282687) | [Paper](https://arxiv.org/abs/2006.10137v1) |
 | Molecular property prediction | [IFM](https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/examples/ifm/) | Data-driven | MLP | Supervised learning | [tox21/sider/hiv/bace/bbbp](https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/examples/ifm/#:~:text=molecules%20%E6%95%B0%E6%8D%AE%E9%9B%86-,dataset.zip,-%EF%BC%8C%E6%88%96Google%20Drive) | [Paper](https://openreview.net/pdf?id=NLFqlDeuzt) |
-
+| 2D material generation and database | [ML2DDB](./en/examples/ml2ddb.md) | Data-driven | GNN/Diffusion | Supervised learning | Coming Soon | [Paper](https://arxiv.org/pdf/2507.00584) |

 <br>
 <p align="center"><b>Earth Science (AI for Earth Science)</b></p>

docs/en/examples/ml2ddb.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# ML2DDB

[Monolayer Two-dimensional Materials Database (ML2DDB) and Applications](https://arxiv.org/pdf/2507.00584)

Zhongwei Liu<sup>a, b, #</sup>,
Zhimin Zhang<sup>c, #</sup>,
Xuwei Liu<sup>c, #</sup>,
Mingjia Yao<sup>b</sup>,
Xin He<sup>a</sup>,
Yuanhui Sun<sup>b, *</sup>,
Xin Chen<sup>b, *</sup>,
Lijun Zhang<sup>a, b, *</sup>

<sup>a</sup>
State Key Laboratory of Integrated Optoelectronics, Key Laboratory of Automobile Materials of MOE and College of Materials Science and Engineering, Jilin University, Changchun 130012, China

<sup>b</sup> Suzhou Laboratory, Suzhou, 215123, China

<sup>c</sup> Baidu Inc., Beijing, P.R. China.

<sup>#</sup> These authors contributed equally to this work.

## Abstract

The discovery of two-dimensional (2D) materials with tailored properties is critical to meeting the increasing demands of high-performance applications across flexible electronics, optoelectronics, catalysis, and energy storage. However, current 2D material databases are constrained by limited scale and compositional diversity. In this study, we introduce a scalable active learning workflow that integrates deep neural networks with density functional theory (DFT) calculations to efficiently explore a vast set of candidate structures. These structures are generated through physics-informed elemental substitution strategies, enabling broad and systematic discovery of stable 2D materials. Through six iterative screening cycles, we built the Monolayer 2D Materials Database (ML2DDB), which contains 242,546 DFT-validated stable structures, an order-of-magnitude increase over the largest known 2D materials databases. In particular, the numbers of ternary and quaternary compounds showed the most significant increases. Combining this database with a generative diffusion model, we demonstrated effective structure generation under specified chemistry and symmetry constraints. This work closes the loop between 2D material data expansion and application, providing a new paradigm for the discovery of new materials.

![ML2DDB](https://paddle-org.bj.bcebos.com/paddlescience/docs/ML2DDB/ml2ddb.png)
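
The screening loop described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: `active_learning`, `train`, and `oracle` are hypothetical names, the oracle is a stand-in for a DFT calculation, and reusing the 200 meV/atom and 50 meV/atom cut-offs as per-cycle thresholds is an assumption made only for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

# A surrogate model maps a structure to a predicted E_hull in eV/atom.
Model = Callable[[Hashable], float]


def active_learning(
    pool: Iterable[Hashable],
    train: Callable[[Dict[Hashable, float]], Model],
    oracle: Callable[[Hashable], float],
    n_cycles: int = 6,
    screen_threshold: float = 0.200,  # assumed surrogate cut-off, eV/atom
    stable_threshold: float = 0.050,  # assumed reference cut-off, eV/atom
) -> Tuple[List[Hashable], Dict[Hashable, float]]:
    """Toy loop: the surrogate proposes candidates, the oracle labels them."""
    remaining = list(pool)
    labeled: Dict[Hashable, float] = {}
    stable: List[Hashable] = []
    for _ in range(n_cycles):
        # Refit the surrogate on every label gathered so far.
        model = train(labeled)
        selected = [s for s in remaining if model(s) < screen_threshold]
        if not selected:
            break  # the surrogate finds nothing else promising
        for s in selected:
            energy = oracle(s)  # expensive reference calculation
            labeled[s] = energy
            if energy < stable_threshold:
                stable.append(s)
        # Labeled structures leave the candidate pool.
        remaining = [s for s in remaining if s not in labeled]
    return stable, labeled
```

Passing the trainer and the oracle as callables keeps the loop independent of any particular network or DFT code, so the same skeleton can be exercised with cheap stand-ins.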

## Dataset of 2D materials

We developed ML2DDB, a large-scale 2D material database containing more than 242k DFT-validated monolayer structures (𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 50 meV/atom), a 10× increase over existing datasets. Key features:

- Broad elemental coverage: 81 elements across the periodic table (excluding radioactive elements and noble gases).
- Enhanced diversity: significantly more compounds with 3–4 distinct elements than in prior work.
- Structural richness: diverse prototypes and cation-anion combinations.
- Extended resource: >1M candidate structures (𝐸<sub>hull</sub><sup>MLIP</sup> < 200 meV/atom, where MLIP denotes the machine-learning interatomic potential) for future studies.
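
The two energy cut-offs above amount to simple filters over a table of entries. The sketch below assumes a hypothetical record layout (`Entry` with a formula string and an energy above hull in meV/atom); the dataset is listed as "Coming Soon", so its real schema is unknown and none of these names come from it.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Entry:
    """Hypothetical record; not the actual ML2DDB schema."""
    formula: str
    e_hull_mev: float  # energy above the convex hull, meV/atom


def n_elements(formula: str) -> int:
    """Count distinct element symbols in a simple formula such as 'MoS2'."""
    return len(set(re.findall(r"[A-Z][a-z]?", formula)))


def select(entries: Iterable[Entry], max_e_hull_mev: float) -> List[Entry]:
    """Keep entries strictly below the given energy-above-hull cut-off."""
    return [e for e in entries if e.e_hull_mev < max_e_hull_mev]


def arity_histogram(entries: Iterable[Entry]) -> Counter:
    """Histogram of binary (2), ternary (3), quaternary (4), ... compounds."""
    return Counter(n_elements(e.formula) for e in entries)
```

Calling `select(entries, 50.0)` mirrors the DFT-validated subset and `select(entries, 200.0)` the extended candidate set, while `arity_histogram` gives the kind of ternary and quaternary counts the diversity bullet refers to.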

![dataset](https://paddle-org.bj.bcebos.com/paddlescience/docs/ML2DDB/ml2ddb_dataset.png)

## Diffusion model generation of S.U.N. materials

The ability to generate S.U.N. (stable, unique, new) 2D materials is a prerequisite for diffusion models. We consider a generated structure stable if 𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 100 meV/atom with respect to ML2DDB, unique if it matches no other structure generated in the same batch, and new if it matches no structure in ML2DDB. As shown in Figure 5b, we performed DFT structure optimization on 1024 generated structures to evaluate stability. Of these, 74.8% are stable (𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 100 meV/atom), which is comparable to MatterGen's success rate for stable 3D structure generation. Under the stricter constraint 𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 0 meV/atom, our method achieves a success rate of 59.6%, significantly higher than that of MatterGen (~13%). In addition, the root-mean-square displacement (RMSD) between the generated structures and their DFT-relaxed counterparts is below 0.26 Å, which is less than the radius of the hydrogen atom (0.53 Å). The uniqueness rate is 100% when generating one thousand structures and decreases by only 4.4% when generating ten thousand. The novelty rate decreases from 100% to 73.5% as the number of generated structures grows from one thousand to two thousand. These results indicate that our model has a strong ability to generate new, stable structures.
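
The three S.U.N. criteria can be turned into rates with a few lines of Python. The sketch below is an illustration under stated assumptions: each structure is reduced to a hashable `fingerprint`, which stands in for a real structure-matching step, and `Generated` and `sun_rates` are hypothetical names. Uniqueness follows the definition in the text literally, so every member of a duplicated group counts as non-unique.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, NamedTuple, Set


class Generated(NamedTuple):
    fingerprint: Hashable  # assumed stand-in for a structure-matching result
    e_hull: float          # energy above hull, eV/atom


def sun_rates(
    batch: Iterable[Generated],
    reference: Set[Hashable],
    stable_threshold: float = 0.100,  # 100 meV/atom, as in the text
) -> Dict[str, float]:
    """Fractions of a batch that are stable, unique, new, and all three."""
    items = list(batch)
    if not items:
        raise ValueError("batch must not be empty")
    counts = Counter(g.fingerprint for g in items)
    stable = [g.e_hull < stable_threshold for g in items]
    # Unique: matches no other structure generated in the same batch.
    unique = [counts[g.fingerprint] == 1 for g in items]
    # New: matches no structure in the reference database.
    new = [g.fingerprint not in reference for g in items]
    n = len(items)
    return {
        "stable": sum(stable) / n,
        "unique": sum(unique) / n,
        "new": sum(new) / n,
        "sun": sum(s and u and w for s, u, w in zip(stable, unique, new)) / n,
    }
```

Reporting the combined `sun` fraction alongside the individual rates makes it visible when a model scores well on each criterion separately but rarely satisfies all three at once.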

![dataset](https://paddle-org.bj.bcebos.com/paddlescience/docs/ML2DDB/gen_2d.png)
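
For reference, an RMSD of the kind quoted above can be computed as follows. This is a simplified sketch: it assumes both structures list the same atoms in the same order in Cartesian coordinates (Å) and ignores periodic images and rigid-body alignment, which a real comparison of crystal structures would need to handle. The function name `rmsd` is illustrative.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rmsd(generated: Sequence[Vec3], relaxed: Sequence[Vec3]) -> float:
    """Root-mean-square displacement between two equally ordered atom lists."""
    if len(generated) != len(relaxed) or not generated:
        raise ValueError("structures must be non-empty and equally sized")
    # Sum of squared per-atom displacements.
    total = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(generated, relaxed)
    )
    return math.sqrt(total / len(generated))
```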

## Conclusion

This study establishes a novel framework that integrates active learning workflows with conditional diffusion-based structure generation, achieving an order-of-magnitude expansion of 2D materials databases. Key contributions include:

1. **Dataset Advancement**
    - Created ML2DDB, containing 242,546 thermodynamically stable 2D materials (𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 50 meV/atom), exceeding existing databases by ≥10×
    - Achieved 1100% and 960% growth in ternary and quaternary compounds, respectively
    - Generated >1 million candidate structures (𝐸<sub>hull</sub><sup>MLIP</sup> < 200 meV/atom)
2. **Methodological Innovation**
    - Developed an MLIP model with 92.36% accuracy in stability classification
    - Enabled phase diagram generation and space-group-specific design through diffusion model integration
    - Demonstrated applicability to the discovery of nonlinear optical and ferroelectric materials
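
A stability-classification accuracy like the one listed above can be computed by thresholding predicted and reference energies and counting agreements. The README does not state which threshold or test set underlies the 92.36% figure, so the 50 meV/atom default and the name `stability_accuracy` below are assumptions for illustration only.

```python
from typing import Sequence


def stability_accuracy(
    predicted_mev: Sequence[float],
    reference_mev: Sequence[float],
    threshold_mev: float = 50.0,  # assumed cut-off, meV/atom
) -> float:
    """Fraction of structures whose stable/unstable label matches the reference."""
    if len(predicted_mev) != len(reference_mev) or not predicted_mev:
        raise ValueError("inputs must be non-empty and equally sized")
    # A prediction is correct when both energies fall on the same side of the cut-off.
    hits = sum(
        (p < threshold_mev) == (r < threshold_mev)
        for p, r in zip(predicted_mev, reference_mev)
    )
    return hits / len(predicted_mev)
```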

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -141,6 +141,7 @@
 |-----|---------|-----|---------|----|---------|---------|
 | Material design | [Scattering plate design (inverse problem)](./zh/examples/hpinns.md) | Mechanism-driven | Transformer | Unsupervised learning | [Train Data](https://paddle-org.bj.bcebos.com/paddlescience/datasets/hPINNs/hpinns_holo_train.mat)<br>[Eval Data](https://paddle-org.bj.bcebos.com/paddlescience/datasets/hPINNs/hpinns_holo_valid.mat) | [Paper](https://arxiv.org/pdf/2102.04626.pdf) |
 | Crystal material property prediction | [CGCNN](./zh/examples/cgcnn.md) | Data-driven | GNN | Supervised learning | [MP](https://next-gen.materialsproject.org/) / [Perovskite](https://cmr.fysik.dtu.dk/cubic_perovskites/cubic_perovskites.html) / [C2DB](https://cmr.fysik.dtu.dk/c2db/c2db.html) / [test](https://paddle-org.bj.bcebos.com/paddlescience%2Fdatasets%2Fcgcnn%2Fcgcnn-test.zip) | [Paper](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.120.145301) |
+| 2D material generation and database | [ML2DDB](./en/examples/ml2ddb.md) | Data-driven | GNN/Diffusion | Supervised learning | Coming Soon | [Paper](https://arxiv.org/pdf/2507.00584) |

 <br>
 <p align="center"><b>Earth Science (AI for Earth Science)</b></p>

examples/ML2DDB/readme.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# ML2DDB

[Monolayer Two-dimensional Materials Database (ML2DDB) and Applications](https://arxiv.org/pdf/2507.00584)

Zhongwei Liu<sup>a, b, #</sup>,
Zhimin Zhang<sup>c, #</sup>,
Xuwei Liu<sup>c, #</sup>,
Mingjia Yao<sup>b</sup>,
Xin He<sup>a</sup>,
Yuanhui Sun<sup>b, *</sup>,
Xin Chen<sup>b, *</sup>,
Lijun Zhang<sup>a, b, *</sup>

<sup>a</sup>
State Key Laboratory of Integrated Optoelectronics, Key Laboratory of Automobile Materials of MOE and College of Materials Science and Engineering, Jilin University, Changchun 130012, China

<sup>b</sup> Suzhou Laboratory, Suzhou, 215123, China

<sup>c</sup> Baidu Inc., Beijing, P.R. China.

<sup>#</sup> These authors contributed equally to this work.

## Abstract

The discovery of two-dimensional (2D) materials with tailored properties is critical to meeting the increasing demands of high-performance applications across flexible electronics, optoelectronics, catalysis, and energy storage. However, current 2D material databases are constrained by limited scale and compositional diversity. In this study, we introduce a scalable active learning workflow that integrates deep neural networks with density functional theory (DFT) calculations to efficiently explore a vast set of candidate structures. These structures are generated through physics-informed elemental substitution strategies, enabling broad and systematic discovery of stable 2D materials. Through six iterative screening cycles, we built the Monolayer 2D Materials Database (ML2DDB), which contains 242,546 DFT-validated stable structures, an order-of-magnitude increase over the largest known 2D materials databases. In particular, the numbers of ternary and quaternary compounds showed the most significant increases. Combining this database with a generative diffusion model, we demonstrated effective structure generation under specified chemistry and symmetry constraints. This work closes the loop between 2D material data expansion and application, providing a new paradigm for the discovery of new materials.

![ML2DDB](https://paddle-org.bj.bcebos.com/paddlescience/docs/ML2DDB/ml2ddb.png)

## Dataset of 2D materials

We developed ML2DDB, a large-scale 2D material database containing more than 242k DFT-validated monolayer structures (𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 50 meV/atom), a 10× increase over existing datasets. Key features:

- Broad elemental coverage: 81 elements across the periodic table (excluding radioactive elements and noble gases).
- Enhanced diversity: significantly more compounds with 3–4 distinct elements than in prior work.
- Structural richness: diverse prototypes and cation-anion combinations.
- Extended resource: >1M candidate structures (𝐸<sub>hull</sub><sup>MLIP</sup> < 200 meV/atom, where MLIP denotes the machine-learning interatomic potential) for future studies.

![dataset](https://paddle-org.bj.bcebos.com/paddlescience/docs/ML2DDB/ml2ddb_dataset.png)

## Diffusion model generation of S.U.N. materials

The ability to generate S.U.N. (stable, unique, new) 2D materials is a prerequisite for diffusion models. We consider a generated structure stable if 𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 100 meV/atom with respect to ML2DDB, unique if it matches no other structure generated in the same batch, and new if it matches no structure in ML2DDB. As shown in Figure 5b, we performed DFT structure optimization on 1024 generated structures to evaluate stability. Of these, 74.8% are stable (𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 100 meV/atom), which is comparable to MatterGen's success rate for stable 3D structure generation. Under the stricter constraint 𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 0 meV/atom, our method achieves a success rate of 59.6%, significantly higher than that of MatterGen (~13%). In addition, the root-mean-square displacement (RMSD) between the generated structures and their DFT-relaxed counterparts is below 0.26 Å, which is less than the radius of the hydrogen atom (0.53 Å). The uniqueness rate is 100% when generating one thousand structures and decreases by only 4.4% when generating ten thousand. The novelty rate decreases from 100% to 73.5% as the number of generated structures grows from one thousand to two thousand. These results indicate that our model has a strong ability to generate new, stable structures.

![dataset](https://paddle-org.bj.bcebos.com/paddlescience/docs/ML2DDB/gen_2d.png)

## Conclusion

This study establishes a novel framework that integrates active learning workflows with conditional diffusion-based structure generation, achieving an order-of-magnitude expansion of 2D materials databases. Key contributions include:

1. **Dataset Advancement**
    - Created ML2DDB, containing 242,546 thermodynamically stable 2D materials (𝐸<sub>hull</sub><sup>𝐷𝐹𝑇</sup> < 50 meV/atom), exceeding existing databases by ≥10×
    - Achieved 1100% and 960% growth in ternary and quaternary compounds, respectively
    - Generated >1 million candidate structures (𝐸<sub>hull</sub><sup>MLIP</sup> < 200 meV/atom)
2. **Methodological Innovation**
    - Developed an MLIP model with 92.36% accuracy in stability classification
    - Enabled phase diagram generation and space-group-specific design through diffusion model integration
    - Demonstrated applicability to the discovery of nonlinear optical and ferroelectric materials

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -95,6 +95,7 @@ nav:
     - Materials Science (AI for Material):
       - hPINNs: zh/examples/hpinns.md
       - CGCNN: zh/examples/cgcnn.md
+      - ML2DDB: en/examples/ml2ddb.md
     - Earth Science (AI for Earth Science):
       - Extformer-MoE: zh/examples/extformer_moe.md
       - FourCastNet: zh/examples/fourcastnet.md
