Commit 5d571de

Merge pull request #390 from anyangml2nd/chore/update-metadata
Chore: update metadata and readme for new tasks
2 parents 5be6879 + 318c77a commit 5d571de

4 files changed: +247 -47 lines changed

lambench/metrics/results/README.md

Lines changed: 16 additions & 8 deletions
@@ -15,6 +15,14 @@ Large atomistic models (LAM), also known as machine learning interatomic potenti
1515
- **Extensible**: Easily add new benchmarks and metrics.
1616
- **Detailed Reports**: Generates detailed performance reports and visualizations.
1717

18+
## Updates
19+
The following changes have been made compared to the previously released version v0.3.1:
20+
- Added new models: MACE-MH-1, DPA-3.2-5M
21+
- Updated `Force Field Prediction` tasks; for the `Molecules` domain, two sets of labels are provided to support OMol25-trained models.
22+
- Added new `Property Calculation` tasks: oxygen vacancy formation energy prediction, protein-ligand binding energy prediction, and reaction energy barrier prediction.
23+
24+
<span style="color:red">⚠️ Note: To assess full LAM capacity, we use OMat24-trained task heads for *Force Field Prediction* in Inorganic Materials and Catalysis, and OMol25-trained task heads for Molecules, when available. As for *Property Calculation*, we follow a similar approach, but use OC20-trained task heads for Catalysis when available, as this tends to yield better performance.</span>
25+
1826
# LAMBench Leaderboard
1927

2028
The LAMBench Leaderboard.
@@ -34,9 +42,10 @@ Figure 2: Accuracy-Efficiency Trade-off, $\bar{M}^m_{FF}$ vs $M_E^m$.
3442

3543
We categorize all force-field prediction tasks into 3 domains:
3644

37-
- **Inorganic Materials**: `Torres2019Analysis`, `Batzner2022equivariant`, `Sours2023Applications`, `Lopanitsyna2023Modeling`, `Mazitov2024Surface`, `Gao2025Spontaneous`
38-
- **Molecules**: `ANI-1x`, `MD22`, `AIMD-Chig`
39-
- **Catalysis**: `Vandermause2022Active`, `Zhang2019Bridging`, `Villanueva2024Water`
45+
- **Inorganic Materials**: `Torres2019Analysis`, `Batzner2022equivariant`, `Sours2023Applications`, `Lopanitsyna2023Modeling`, `Mazitov2024Surface`, `Gao2025Spontaneous`, `Gao2025Mechanism`
46+
- **Molecules**: `Sandonas2024Dataset`, `Guan2022Benchmark`, `AIMD-Chig`
47+
- **Catalysis**: `Vandermause2022Active`, `Zhang2019Bridging`, `Villanueva2024Water`,
48+
`Schaaf2023Accurate`, `Liu2025Generalized`
4049

4150
To assess model performance across these domains, we use zero-shot inference with energy-bias term adjustments based on test dataset statistics. Performance metrics are aggregated as follows:
4251

@@ -46,7 +55,7 @@ To assess model performance across these domains, we use zero-shot inference wit
4655

4756
where $M^m_{k,p,i}$ is the original error metric, $m$ indicates the model, $k$ denotes the domain index, $p$ signifies the prediction index, and $i$ represents the test set index. For a model with worse accuracy than a dummy model, the error metric is set to 1.
4857
For instance, in force field tasks, the domains include Molecules, Inorganic Materials, and Catalysis, such that $k \in \{\text{Molecules, Inorganic Materials, Catalysis}\}$. The prediction types are categorized as energy ($E$), force ($F$), or virial ($V$), with $p \in \{E, F, V\}$.
49-
For the specific domain of Molecules, the test sets are indexed as $i \in \{\text{ANI-1x, MD22, AIMD-Chig}\}$. This baseline model predicts energy based solely on the chemical formula, disregarding any structural details, thereby providing a reference point for evaluating the improvement offered by more sophisticated models.
58+
For the specific domain of Molecules, the test sets are indexed as $i \in \{\text{Sandonas2024Dataset, Guan2022Benchmark, AIMD-Chig}\}$. The dummy model serves as the baseline: it predicts energy based solely on the chemical formula, disregarding any structural details, and so provides a reference point for evaluating the improvement offered by more sophisticated models.
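
The capping rule described above can be illustrated with a minimal sketch. It assumes the normalized metric is the ratio of the model error to the dummy-model error, capped at 1; the normalization formula itself is not shown in this hunk, so the form `min(model_error / dummy_error, 1)`, the function name `normalized_error`, and the numbers are all hypothetical.

```python
def normalized_error(model_error: float, dummy_error: float) -> float:
    """Scale an error by the dummy-model error and cap the result at 1.

    Assumed form (not taken from the LAMBench source):
    min(model_error / dummy_error, 1).
    """
    if dummy_error <= 0:
        raise ValueError("dummy_error must be positive")
    return min(model_error / dummy_error, 1.0)


# Hypothetical energy MAEs on a single test set (units cancel in the ratio).
dummy_mae = 0.80
print(normalized_error(0.20, dummy_mae))  # better than the dummy model
print(normalized_error(1.50, dummy_mae))  # worse than the dummy model, capped at 1
```

Under this assumption, a value near 0 means the model is far more accurate than the composition-only baseline, and a value of 1 means it is no better.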
5059

5160
2. For each domain, we compute the log-average of normalized metrics across all datasets within this domain by
5261
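
The formula for this step falls outside the hunk shown here. As a sketch only, assuming "log-average" means the exponential of the mean logarithm (the geometric mean), the aggregation over the Molecules test sets might look as follows. The function name `log_average` and the metric values are hypothetical.

```python
import math


def log_average(normalized_metrics):
    """Return exp(mean(log(x))), i.e. the geometric mean of positive values.

    Assumed reading of "log-average"; the exact LAMBench formula is not
    visible in this diff.
    """
    values = list(normalized_metrics)
    if not values:
        raise ValueError("need at least one metric")
    if any(v <= 0 for v in values):
        raise ValueError("log-average requires strictly positive metrics")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical normalized energy metrics for the Molecules test sets.
molecules_energy = {
    "Sandonas2024Dataset": 0.10,
    "Guan2022Benchmark": 0.40,
    "AIMD-Chig": 1.00,
}
print(log_average(molecules_energy.values()))
```

Averaging in log space keeps one poorly predicted dataset from dominating the domain score the way it would in an arithmetic mean.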

@@ -83,12 +92,11 @@ In contrast, an ideal model that perfectly matches Density Functional Theory (DF
8392

8493
For the domain-specific property calculation tasks, we adopt the MAE as the primary error metric.
8594

86-
In the Inorganic Materials domain, the MDR phonon benchmark predicts the maximum phonon frequency, entropy, free energy, and heat capacity at constant volume, while the elasticity benchmark evaluates the shear and bulk moduli. Each prediction type
87-
is assigned an equal weight of $\frac{1}{6}$.
95+
In the Inorganic Materials domain, the MDR phonon benchmark predicts maximum phonon frequency, entropy, free energy, and constant-volume heat capacity; the elasticity benchmark evaluates shear and bulk moduli; and the oxygen vacancy benchmark evaluates oxygen vacancy formation energies. Each prediction type is equally weighted.
8896

89-
In the Molecules domain, the TorsionNet500 benchmark evaluates the torsion profile energy, torsional barrier height, and the number of molecules for which the predicted torsional barrier height error exceeds 1 kcal/mol. The Wiggle150 benchmark assesses the relative conformer energy profile. Each prediction type in this domain is assigned a weight of 0.25.
97+
In the Molecules domain, the TorsionNet500 benchmark evaluates torsion profile energy, torsional barrier height, and the number of molecules with barrier height errors exceeding 1 kcal/mol. The Wiggle150 benchmark assesses relative conformer energy profiles. The protein–ligand binding benchmark evaluates binding energies across multiple sites for a given protein. The reaction barrier benchmark assesses forward and reverse barriers for nine reaction types common in organic chemistry and biochemistry. Each prediction type is equally weighted.
9098

91-
In the Catalysis domain, the OC20NEB-OOD benchmark evaluates the energy barrier, reaction energy change (delta energy), and the percentage of reactions with predicted energy barrier errors exceeding 0.1 eV for three reaction types: transfer, dissociation, and desorption. Each prediction type in this domain is assigned a weight of 0.2.
99+
In the Catalysis domain, the OC20NEB-OOD benchmark evaluates the energy barrier, reaction energy change (delta energy), and the percentage of reactions with predicted energy barrier errors exceeding 0.1 eV for three reaction types: transfer, dissociation, and desorption. Each prediction type is equally weighted.
92100

93101
The resulting error metric after averaging over all domains is denoted as $\bar M^{m}_{PC}$.
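
The equal weighting described above can be sketched as follows. The sketch assumes each prediction type's error has already been normalized to a dimensionless value and that domains are combined by an arithmetic mean; neither detail is confirmed by this hunk. The dictionary keys, the numbers, and the helper `equal_weight_average` are hypothetical.

```python
def equal_weight_average(errors):
    """Arithmetic mean of the values: each of n entries gets weight 1/n."""
    if not errors:
        raise ValueError("need at least one entry")
    return sum(errors.values()) / len(errors)


# Hypothetical normalized errors for the Inorganic Materials prediction types
# named in the text: four phonon quantities, two elastic moduli, and the
# oxygen vacancy formation energy.
inorganic = {
    "max_phonon_frequency": 0.30,
    "entropy": 0.20,
    "free_energy": 0.25,
    "heat_capacity": 0.15,
    "shear_modulus": 0.40,
    "bulk_modulus": 0.35,
    "oxygen_vacancy_formation_energy": 0.45,
}

domain_scores = {
    "Inorganic Materials": equal_weight_average(inorganic),
    "Molecules": 0.35,  # placeholder domain score
    "Catalysis": 0.50,  # placeholder domain score
}

# Assumed: the overall property-calculation metric is the mean over domains.
m_pc = equal_weight_average(domain_scores)
print(domain_scores["Inorganic Materials"], m_pc)
```

Computing the weight as `1 / len(errors)` means a newly added prediction type is picked up without updating a hard-coded fraction.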
94102
