LAMBench adopts a highly modularized design, with automation integral to task calculation, result aggregation, analysis, and visualization. Using LAMBench, we benchmarked 8 state-of-the-art LAMs. These models are compared based on their generalizability errors on force field prediction tasks (M<sup>-m</sup><sub>FF</sub>) and on property calculation tasks (M<sup>-m</sup><sub>PC</sub>). These error metrics are designed such that a dummy model has an error of 1, while a perfect model that aligns with DFT labels achieves a metric of 0. Among the LAMs tested, DPA-2.4-7M demonstrated superior generalizability, owing to its multi-task, multi-fidelity training strategy. For applicability, we define the efficiency metric (M<sup>m</sup><sub>E</sub>) and the instability metric (M<sup>m</sup><sub>IS</sub>). A larger efficiency metric indicates higher efficiency, and a lower instability metric signifies greater stability. While achieving superior generalizability, DPA-2.4-7M also demonstrates decent stability and excellent efficiency among conservative models.
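A minimal sketch of how such a normalized error metric could be computed is shown below. The exact normalization used by LAMBench is not spelled out here; the function name, the choice of mean absolute error, and the dummy-baseline division are illustrative assumptions.

```python
import numpy as np

def normalized_error(pred, label, dummy_pred):
    """Hypothetical normalized error metric.

    Returns 0 when predictions match the reference (DFT) labels exactly,
    and 1 when the model performs no better than a dummy baseline.
    Uses mean absolute error as the underlying error measure (an assumption).
    """
    model_err = np.mean(np.abs(np.asarray(pred) - np.asarray(label)))
    dummy_err = np.mean(np.abs(np.asarray(dummy_pred) - np.asarray(label)))
    return model_err / dummy_err

# Example: a perfect model scores 0, the dummy baseline scores 1.
labels = np.array([1.0, 2.0, 3.0])
dummy = np.zeros_like(labels)
print(normalized_error(labels, labels, dummy))  # 0.0
print(normalized_error(dummy, labels, dummy))   # 1.0
```

With this scaling, scores between 0 and 1 are directly comparable across tasks with different units, which is what makes aggregation over force-field and property-calculation tasks meaningful.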