Commit 3457832

Add files via upload (#138)
1 parent 843318f commit 3457832

3 files changed: +653, -0 lines changed

source/_posts/dp_v3_tutorial1.md

Lines changed: 241 additions & 0 deletions
---
title: "DeePMD-kit v3 Tutorial 1 | Multi-Backend Framework"
date: 2024-11-24
categories:
- DeePMD-kit
mathjax: true
---

One of the highlights of DeePMD-kit v3 is its multi-backend framework. This article introduces it from three angles: background and principles, a usage tutorial, and a development tutorial.

<!-- more -->
## Background and Basics

Through a pluggable mechanism, **DeePMD-kit** currently supports four major backends: **TensorFlow**, **PyTorch**, **DP**, and **JAX**, with plans to support the **Paddle** backend in the next version.

### TensorFlow Backend:
- **Implementation**: Based on the TensorFlow v1 API.
- **Usage**:
  - A static graph is first constructed using Python APIs.
  - During training or inference, the input and output nodes of the static graph must be specified. Data is fed to all input nodes and used to infer results at all output nodes.
- **Freezing**:
  - The static graph is serialized into a `.pb` model file.
  - The TensorFlow C++ library reads the static graph from this file for inference.

---
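The feed/fetch workflow above can be pictured with a toy static graph. This is a conceptual sketch only, not TensorFlow or DeePMD-kit code; the `Graph` class and node names are invented for illustration:

```python
# Toy static graph: build once, then run by feeding inputs and fetching outputs
# (conceptual sketch of the TF v1-style workflow, not real TensorFlow code).
class Graph:
    def __init__(self):
        self.ops = []  # list of (output_name, fn, input_names) in build order

    def add_op(self, out, fn, *ins):
        self.ops.append((out, fn, ins))

    def run(self, fetches, feed):
        values = dict(feed)  # the feed must cover all input nodes
        for out, fn, ins in self.ops:
            values[out] = fn(*(values[i] for i in ins))
        return [values[f] for f in fetches]

g = Graph()
g.add_op("energy", lambda coord: sum(c * c for c in coord), "coord")
(e,) = g.run(fetches=["energy"], feed={"coord": [1.0, 2.0]})
print(e)  # 5.0
```

The key point is that the graph is constructed once and executed many times; only the feed/fetch dictionaries change between calls.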
### PyTorch Backend:
- **Implementation**: Designed around dynamic computation graphs.
- **Usage**:
  - During each training or inference step, all Python code is re-executed.
  - PyTorch separates Python execution and GPU scheduling into different threads, allowing asynchronous computation on the GPU and reducing the performance loss caused by Python execution.
- **Freezing**:
  - All code is stored in a `.pth` file using TorchScript.
  - The PyTorch C++ library (libtorch) reads this file and runs inference the same way.

---
### DP Backend:
- **Implementation**: Built on **NumPy** and the **Array API** to serve as a reference backend.
- **Characteristics**:
  - The Array API standardizes array operations across different libraries.
  - Supported by NumPy (v2.0.0+) and JAX (v0.4.33+).
  - Does not support gradient computation or training, since NumPy lacks these capabilities.
  - Cannot run directly in C++.
- **Freezing**:
  - Models are serialized and stored in HDF5 files.

---
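Because the DP backend is written against the Array API rather than against NumPy directly, the same model code can run under any conforming array library. A minimal illustration of this coding style (not DeePMD-kit code; the function is made up):

```python
import numpy as np

def softplus(xp, x):
    # `xp` is any Array-API-compatible namespace (numpy, jax.numpy, ...),
    # so the same function body runs unchanged under every conforming backend.
    return xp.log(1.0 + xp.exp(x))

y = softplus(np, np.array([0.0, 1.0]))
print(y)  # [0.69314718 1.31326169]
```

Passing `jax.numpy` as `xp` instead of `np` would evaluate the same expression under JAX, which is how the JAX backend reuses the DP backend's code.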
### JAX Backend:
- **Implementation**: Based on the Array API and reuses much of the DP backend's code, making it the most compact backend in terms of codebase.
- **Training**: Training will be introduced in **DeePMD-kit v3.1.0**; for now, models can be converted and frozen from other backends using the `dp convert-backend` tool.
- **Freezing**:
  - Uses `jax2tf` to convert the static-shape parts of the model into **StableHLO** (the serialized XLA JIT format).
  - Dynamic-shape components (e.g., neighbor-list construction) are implemented with the TensorFlow v2 API.
  - Both parts are stored in the TensorFlow **SavedModel** format.
  - TensorFlow's C library reads this format for inference.

---
### Paddle Backend:
- Currently under active development; support will be introduced in **DeePMD-kit v3.1.0**.

---
### Ensuring Compatibility Across Backends:
1. **Serialization and Deserialization**:
   - All modules provide serialization and deserialization methods.
   - The serialized output of the same module (with the same parameters) is identical across backends, ensuring identical inference results.
2. **Unified Parameters**:
   - All backends share the same input parameters (though not every parameter is supported by every backend); the documentation details these differences.
3. **Consistent Inference Interfaces**:
   - Identical Python/C++/command-line interfaces are used for inference across backends.
   - The model's file extension (e.g., `.pb`, `.pth`) determines which backend runs it.
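Dispatching on the file extension can be sketched as follows. This is illustrative only: the mapping mirrors the extensions mentioned in this article, not DeePMD-kit's actual dispatch code:

```python
from pathlib import Path

# Extension-to-backend mapping as described in this article (illustrative).
BACKEND_BY_SUFFIX = {
    ".pb": "tensorflow",
    ".pth": "pytorch",
    ".hdf5": "dp",
    ".savedmodel": "jax",
}

def pick_backend(model_path: str) -> str:
    suffix = Path(model_path).suffix
    try:
        return BACKEND_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no backend registered for '{suffix}' models")

print(pick_backend("frozen_model.pth"))  # pytorch
```

Because dispatch happens at the file level, the same `dp test` or LAMMPS input works with any backend's model file.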
## Tutorial

This tutorial can be run directly in a Notebook on Bohrium:
https://bohrium.dp.tech/notebooks/38388452597/

### **Multi-Backend Training/Freezing/Compression**

This section demonstrates training, freezing, and compressing models using the `se_atten_compressible` example from **DeePMD-kit**, with the number of training steps set to **1000**.

⚠️ **Note**: 1000 training steps make the model completely unsuitable for production; the results are unreliable and for demonstration purposes only.

```bash
git clone https://github.com/deepmodeling/deepmd-kit
cd deepmd-kit/examples/water/se_atten_compressible
sed -i "s/1000000/1000/g" input.json
```
---

Training/Freezing/Compression Process:
1. **Backend Specification**:
   - Use `dp --tf` or `dp --pt` to select the TensorFlow or PyTorch backend.
```bash
dp --tf train input.json
dp --tf freeze
dp --tf compress
dp --pt train input.json
dp --pt freeze
dp --pt compress
```
2. **Generated Models**:
   After training, freezing, and compression, the following files are obtained:
   - `frozen_model.pb` (TensorFlow backend)
   - `frozen_model_compressed.pb` (compressed TensorFlow model)
   - `frozen_model.pth` (PyTorch backend)
   - `frozen_model_compressed.pth` (compressed PyTorch model)

---
### Model Conversion:
Since the **JAX backend** does not currently support training, use `dp convert-backend` to convert a PyTorch model file into a JAX-compatible format:

```bash
dp convert-backend frozen_model.pth frozen_model.savedmodel
```
---
### Model Testing:
- The `dp test` command determines the backend automatically from the model file extension, so flags like `--tf` or `--pt` are not needed.

```bash
dp test -m frozen_model_compressed.pb -s ../data
dp test -m frozen_model.pth -s ../data
dp test -m frozen_model.savedmodel -s ../data
```
---
### LAMMPS Dynamics Simulation Performance Testing:
- **Objective**: Although the quickly trained models are unsuitable for production, they can still be used to compare inference speed across backends.
- **Test System**: A water system with **12,000 atoms**.
- **Notes**:
  - No time integration (e.g., an NVE or NVT fix) is applied, so the coordinates are identical at every step.
  - Each model is run for:
    - **100 steps** (warm-up)
    - **500 steps** (actual speed testing)

```bash
cat <<EOF > water.in
units metal
boundary p p p
atom_style atomic

neighbor 0.0 bin
neigh_modify every 50 delay 0 check no

read_data water.lmp
mass 1 16
mass 2 2

replicate 4 4 4

pair_style deepmd ../se_atten_compressible/frozen_model.pb
pair_coeff * *

velocity all create 330.0 23456789

timestep 0.0005
thermo_style custom step pe ke etotal temp press vol
thermo 20
run 100
run 500
EOF

lmp -in water.in
```
- Replace the model name `frozen_model.pb` with the other models (`frozen_model_compressed.pb`, `frozen_model.pth`, `frozen_model_compressed.pth`, `frozen_model.savedmodel`) to test each file.

---
### Results on V100 and H100 GPUs:
- For a given model type (compressed or uncompressed), performance is broadly similar across backends; compression itself gives a substantial speedup (see the table).

| Model Name | V100 Time/s | H100 Time/s |
|------------------------------|-------------|-------------|
| `frozen_model.pb` | 72.6601 | 16.3344 |
| `frozen_model_compressed.pb` | 19.6433 | 6.5113 |
| `frozen_model.pth` | 70.5811 | 19.7320 |
| `frozen_model_compressed.pth`| 23.7640 | 10.1136 |
| `frozen_model.savedmodel` | 69.3794 | 28.4590 |
- **Important Notes**:
  - Results vary significantly across models, systems, and hardware.
  - The observations here may not apply universally.

Key Observations:
- On the tested system, model compression substantially accelerates inference, while the choice of backend has a comparatively small effect.
- The speed results show that the multi-backend functionality of **DeePMD-kit** delivers broadly consistent performance across frameworks.
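As a quick sanity check, the compression speedups follow directly from the V100 timings in the table (simple arithmetic on the reported values):

```python
# Speedup factors computed from the V100 column of the timing table above.
timings_v100 = {
    "tensorflow": (72.6601, 19.6433),  # (uncompressed, compressed) seconds
    "pytorch": (70.5811, 23.7640),
}

for backend, (plain, compressed) in timings_v100.items():
    print(f"{backend}: {plain / compressed:.1f}x faster when compressed")
# tensorflow: 3.7x faster when compressed
# pytorch: 3.0x faster when compressed
```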
## Development Tutorial: Integrating a New Backend

Adding a new backend to **DeePMD-kit** involves work on both the **Python side** and the **C++ side**. Below is an overview of the key tasks and requirements:

### **Python-Side Development**
1. **Create and Register the Backend**:
   - Inherit from the abstract class `deepmd.backend.backend.Backend`.
   - Create a subclass that implements the required hooks.
   - Add the new backend implementation to the `deepmd/backend` directory.
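A stripped-down sketch of the subclass-and-register pattern (illustrative only; the real base class is `deepmd.backend.backend.Backend`, which defines more hooks and a different registration mechanism than shown here):

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Toy stand-in for the real Backend base class (illustrative)."""
    registry = {}

    @classmethod
    def register(cls, name):
        # class decorator that records the backend subclass under `name`
        def decorator(subclass):
            cls.registry[name] = subclass
            return subclass
        return decorator

    @abstractmethod
    def entry_point_hook(self, args):
        """Handle a command-line invocation."""

@Backend.register("toy")
class ToyBackend(Backend):
    def entry_point_hook(self, args):
        return f"toy backend handling: {args}"

print(Backend.registry["toy"]().entry_point_hook("train input.json"))
```

Registration makes the new backend discoverable by name, so the CLI can route `dp --<backend> ...` commands to the right subclass.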
2. **Implement Required Hooks**:
   The `Backend` class defines several hooks that must be implemented:
   - **`entry_point_hook`**:
     - Handles user commands provided via the command-line interface.
   - **`deep_eval`**:
     - Implements inference functionality in Python.
     - Should inherit from `deepmd.infer.deep_eval.DeepEvalBackend`.
   - **`neighbor_stat`**:
     - Handles neighbor-atom statistics.
   - **`serialize_hook`** and **`deserialize_hook`**:
     - Implement serialization and deserialization of modules for saving to or reading from model files.
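The serialize/deserialize contract can be illustrated with a round-trip check. This is a sketch under assumed names; real DeePMD-kit modules serialize to a richer nested structure:

```python
import numpy as np

# Hypothetical module state: a plain dict of named arrays (illustrative).
def serialize(module):
    # backend A dumps its parameters to backend-neutral Python data
    return {name: np.asarray(w).tolist() for name, w in module.items()}

def deserialize(data):
    # backend B rebuilds its own parameter arrays from that data
    return {name: np.asarray(w) for name, w in data.items()}

original = {"embedding.w": np.array([[0.1, 0.2], [0.3, 0.4]])}
restored = deserialize(serialize(original))

# identical serialized output implies identical inference across backends
assert np.allclose(original["embedding.w"], restored["embedding.w"])
print("round-trip OK")
```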
3. **Consistency Testing**:
   - Add tests for the new backend in `source/tests/consistent`.
   - These tests should validate both:
     - the internal consistency of the new backend, and
     - its consistency with the reference backend.
### **C++-Side Development**
1. **Implement Abstract Classes**:
   - Add a concrete implementation of `deepmd::DeepPotBackend` and other relevant abstract classes in the `source/api_cc` directory.

2. **Register the Backend**:
   - Integrate the new backend into the backend-selection code.
---

### **Examples and References**
- Example Pull Requests:
  - [#4302](https://github.com/deepmodeling/deepmd-kit/pull/4302)
  - [#4307](https://github.com/deepmodeling/deepmd-kit/pull/4307)

These PRs provide practical examples of adding new backends and implementing the required features in both Python and C++.

---

### **Additional Notes**
- When implementing a new backend, ensure comprehensive testing and alignment with the existing backend framework to maintain the consistency and reliability of DeePMD-kit across backends.
