
Commit c346332

feat: Performance Optimization: Data Loading and Statistics Acceleration (#5040)
## Overview

This PR introduces performance optimizations for data loading and statistics computation in deepmd-kit. The changes focus on multi-threaded parallelization, memory-mapped I/O, and efficient filesystem operations.

## Changes Summary

### 1. Multi-threaded Statistics Computation (`deepmd/pt/utils/stat.py`)

- Introduced `ThreadPoolExecutor` for parallel processing of multiple datasets
- Refactored `make_stat_input` to use a thread pool with 256 workers (a sketch of the pattern follows this description)
- Created a `_process_one_dataset` helper function for individual dataset processing
- Significantly accelerates statistics computation for multi-system datasets

### 2. Efficient System Path Lookup (`deepmd/common.py`)

- Optimized `expand_sys_str` to use `rglob("type.raw")` instead of `rglob("*")` plus filtering (see the second sketch below)
- Added a `parent` property to the `DPOSPath` and `DPH5Path` classes in `deepmd/utils/path.py`
- **Performance**: 10x speedup for system discovery (as noted in the commit message)

### 3. Memory-mapped Data Loading (`deepmd/utils/data.py`)

- Added a `_get_nframes` method that reads numpy file headers without loading the data
- Modified `get_numb_batch` to use the new method instead of loading the entire dataset
- Uses `np.lib.format.read_magic` and `read_array_header_*` to extract the shape information
- Reduces memory consumption for large datasets

### 4. Parallel Statistics File Loading (`deepmd/utils/env_mat_stat.py`)

- Implemented `ThreadPoolExecutor` for parallel loading of stat files (see the third sketch below)
- Added a `_load_stat_file` static method with error handling
- Uses 128 worker threads for I/O-bound operations
- Enhanced file format validation and malformed-file handling

## Performance Impact

| Component | Before | After | Improvement |
|-----------|--------|-------|-------------|
| System path lookup | O(n) file traversal | O(k) direct match | 10x faster |
| Statistics computation | Sequential processing | 256-thread parallel | Significant |
| Data loading | Full dataset load | Header-only read | Memory efficient |
| Statistics loading | Sequential file I/O | 128-thread parallel | Significant |

## Compatibility

- ✅ **Backward Compatible**: All API interfaces remain unchanged
- ✅ **Data Format**: No changes to data file formats
- ✅ **Functionality**: All existing features work normally

## Summary by CodeRabbit

* **Performance Improvements**
  * Optimized frame detection to avoid loading complete datasets during initialization, enhancing startup performance for large data files.
  * Improved support for multiple data format variants with more efficient metadata reading.
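Only `deepmd/utils/data.py` is actually touched by this commit, so the thread-pool code from change 1 does not appear in the diff below. The following is a minimal, self-contained sketch of the described pattern; `_process_one_dataset` and `make_stat_input` here are illustrative stand-ins with assumed signatures, not the committed implementations:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def _process_one_dataset(dataset: np.ndarray) -> dict:
    # Illustrative per-system statistics; the real helper in
    # deepmd/pt/utils/stat.py operates on deepmd dataset objects.
    return {"mean": dataset.mean(axis=0), "std": dataset.std(axis=0)}


def make_stat_input(datasets: list[np.ndarray]) -> list[dict]:
    # One task per system; 256 workers as described in the PR body.
    with ThreadPoolExecutor(max_workers=256) as pool:
        return list(pool.map(_process_one_dataset, datasets))


# Usage: three toy "systems", each with shape (nframes, ncols)
stats = make_stat_input([np.random.rand(10, 9) for _ in range(3)])
```

Threads (rather than processes) are a reasonable fit here because the per-dataset work is dominated by file I/O and NumPy calls, both of which release the GIL.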
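Change 2 is likewise not in this diff. Below is a minimal sketch of the `rglob("type.raw")` approach for a plain-filesystem layout, assuming (per deepmd data conventions) that every system directory contains a `type.raw` file; the `parent` property added to `DPOSPath` and `DPH5Path` would play the role that `pathlib.Path.parent` plays here:

```python
from pathlib import Path


def expand_sys_str(root_dir: str) -> list[str]:
    # Matching "type.raw" directly lets the glob machinery skip unrelated
    # files, instead of enumerating everything with rglob("*") and filtering.
    return sorted({str(p.parent) for p in Path(root_dir).rglob("type.raw")})
```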
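Change 4 reuses the same executor pattern with 128 workers plus per-file error handling. Again a hypothetical sketch, since `deepmd/utils/env_mat_stat.py` is not part of this commit; the `.npy`-per-statistic layout and the caught exception types are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import numpy as np


def _load_stat_file(path: Path) -> tuple[str, np.ndarray | None]:
    try:
        return path.name, np.load(path)
    except (OSError, ValueError):
        # Malformed or unreadable stat files are skipped rather than fatal,
        # mirroring the "malformed file handling" described above.
        return path.name, None


def load_stat_files(paths: list[Path]) -> dict[str, np.ndarray]:
    # 128 worker threads for the I/O-bound loads.
    with ThreadPoolExecutor(max_workers=128) as pool:
        return {
            name: arr
            for name, arr in pool.map(_load_stat_file, paths)
            if arr is not None
        }
```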
1 parent 1ccc57d commit c346332

File tree

1 file changed (+23 −12 lines)

deepmd/utils/data.py

Lines changed: 23 additions & 12 deletions
```diff
@@ -14,6 +14,7 @@
 from typing import (
     Any,
     Optional,
+    Union,
 )
 
 import numpy as np
@@ -135,8 +136,7 @@ def __init__(
         self.shuffle_test = shuffle_test
         # set modifier
         self.modifier = modifier
-        # calculate prefix sum for get_item method
-        frames_list = [self._get_nframes(item) for item in self.dirs]
+        frames_list = [self._get_nframes(set_name) for set_name in self.dirs]
         self.nframes = np.sum(frames_list)
         # The prefix sum stores the range of indices contained in each directory, which is needed by get_item method
         self.prefix_sum = np.cumsum(frames_list).tolist()
@@ -338,8 +338,10 @@ def get_numb_set(self) -> int:
 
     def get_numb_batch(self, batch_size: int, set_idx: int) -> int:
         """Get the number of batches in a set."""
-        data = self._load_set(self.dirs[set_idx])
-        ret = data["coord"].shape[0] // batch_size
+        set_name = self.dirs[set_idx]
+        # Directly obtain the number of frames to avoid loading the entire dataset
+        nframes = self._get_nframes(set_name)
+        ret = nframes // batch_size
         if ret == 0:
             ret = 1
         return ret
@@ -578,18 +580,27 @@ def _shuffle_data(self, data: dict[str, Any]) -> dict[str, Any]:
             ret[kk] = data[kk]
         return ret, idx
 
-    def _get_nframes(self, set_name: DPPath) -> int:
-        # get nframes
+    def _get_nframes(self, set_name: Union[DPPath, str]) -> int:
         if not isinstance(set_name, DPPath):
             set_name = DPPath(set_name)
         path = set_name / "coord.npy"
-        if self.data_dict["coord"]["high_prec"]:
-            coord = path.load_numpy().astype(GLOBAL_ENER_FLOAT_PRECISION)
+        if isinstance(set_name, DPH5Path):
+            nframes = path.root[path._name].shape[0]
         else:
-            coord = path.load_numpy().astype(GLOBAL_NP_FLOAT_PRECISION)
-            if coord.ndim == 1:
-                coord = coord.reshape([1, -1])
-            nframes = coord.shape[0]
+            # Read only the header to get shape
+            with open(str(path), "rb") as f:
+                version = np.lib.format.read_magic(f)
+                if version[0] == 1:
+                    shape, _fortran_order, _dtype = np.lib.format.read_array_header_1_0(
+                        f
+                    )
+                elif version[0] in [2, 3]:
+                    shape, _fortran_order, _dtype = np.lib.format.read_array_header_2_0(
+                        f
+                    )
+                else:
+                    raise ValueError(f"Unsupported .npy file version: {version}")
+            nframes = shape[0] if len(shape) > 1 else 1
         return nframes
 
     def reformat_data_torch(self, data: dict[str, Any]) -> dict[str, Any]:
```
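As a standalone illustration of the header-only read that the new `_get_nframes` performs, the same `np.lib.format` calls can be exercised outside the class; the file name and shape in this demo are made up:

```python
import numpy as np


def npy_nframes(filename: str) -> int:
    # Read only the .npy header to get the array shape, never the data.
    with open(filename, "rb") as f:
        version = np.lib.format.read_magic(f)
        if version[0] == 1:
            shape, _fortran_order, _dtype = np.lib.format.read_array_header_1_0(f)
        elif version[0] in (2, 3):
            shape, _fortran_order, _dtype = np.lib.format.read_array_header_2_0(f)
        else:
            raise ValueError(f"Unsupported .npy file version: {version}")
    # Mirror the committed logic: a 1-D coord array means a single frame.
    return shape[0] if len(shape) > 1 else 1


# Demo: 100 frames of 3 atoms (9 flattened coordinates per frame)
np.save("coord.npy", np.zeros((100, 9)))
assert npy_nframes("coord.npy") == 100
```

An equivalent shortcut is `np.load(filename, mmap_mode="r").shape[0]`, which also avoids reading the array data, though the explicit version handling above matches the committed code.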
