PET training tips #762
lucasdekam started this conversation in General
Hi @lucasdekam, I will let @abmazitov and @frostedoyster follow up, but the short story is that we have done quite an extensive study of all this, resulting in an improved PET architecture and training parameters that are slowly being prepared for merging. Another (independent) thing that helps a lot is to set up a "non-conservative pre-training" step, cf. https://atomistic-cookbook.org/examples/pet-finetuning/pet-ft-nc.html, which cuts the training time down dramatically.
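The cookbook recipe linked above sets this up within metatrain itself. Purely as an illustration of why the non-conservative stage is cheaper, here is a toy PyTorch sketch (the module, heads, and data below are invented placeholders, not PET or metatrain code): directly predicted forces need only a forward pass, while conservative forces require differentiating the energy with respect to the positions, so each training step costs roughly twice as much. In practice one pre-trains with the direct force head and then switches to a comparatively short conservative fine-tuning stage.

```python
# Toy sketch of non-conservative pre-training followed by conservative
# fine-tuning. Everything here is a placeholder for illustration only;
# it is NOT the PET architecture or the metatrain training loop.
import torch
import torch.nn as nn

class ToyPotential(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(3, hidden), nn.SiLU())
        self.energy_head = nn.Linear(hidden, 1)  # per-atom energy contributions
        self.force_head = nn.Linear(hidden, 3)   # direct (non-conservative) forces

    def forward(self, positions, conservative: bool):
        feats = self.backbone(positions)
        energy = self.energy_head(feats).sum()
        if conservative:
            # Forces as -dE/dR: needs an extra autograd pass, so each step is slower.
            forces = -torch.autograd.grad(energy, positions, create_graph=True)[0]
        else:
            # Direct prediction: a single forward pass, much cheaper per epoch.
            forces = self.force_head(feats)
        return energy, forces

model = ToyPotential()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
positions = torch.randn(400, 3, requires_grad=True)  # one 400-atom configuration
target_forces = torch.zeros(400, 3)                  # placeholder force labels

for stage, conservative in [("non-conservative pre-training", False),
                            ("conservative fine-tuning", True)]:
    for _ in range(10):  # placeholder number of epochs per stage
        opt.zero_grad()
        energy, forces = model(positions, conservative=conservative)
        loss = ((forces - target_forces) ** 2).mean()
        loss.backward()
        opt.step()
```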
Hello, I like the idea of the metatensor framework, so I wanted to try training a model. Since the published results suggest that PET is quite competitive with MACE/NequIP, I decided to start with PET. I'm training on a VASP RPBE-D3 dataset of 500 configurations with 400 atoms each (metal-water interfaces).
I'm using these options:
I find training to be quite slow; I'm now at about 1500 epochs and the error is only just starting to approach the MACE force RMSE (PET train: 21 meV/Å, PET validation: 23 meV/Å; MACE train: 16 meV/Å, MACE validation: 20 meV/Å). For MACE, a few hundred epochs were plenty. It also seems that the learning rate needs to be very low for the error to decrease at all. When training from scratch (rather than fine-tuning), it is even more difficult to get the error down.
Are these observations a consequence of PET's architecture (a large number of parameters and little a priori structure compared to, for example, the ACE basis used by MACE), or am I doing something wrong or suboptimal in the training?
I also quite like the feature available in the gracemaker package where the learning rate is only decreased once the validation error has stopped improving for a set number of epochs; I feel that would help for training here too, but maybe there is a particular reason why you opted for another (more effective?) strategy? If anyone has experience with a similar kind of dataset and/or has any ideas for things I can try, let me know.
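The schedule described here is essentially what PyTorch's built-in ReduceLROnPlateau scheduler provides. A minimal sketch, assuming a generic training loop (the model, learning rate, patience, and validation metric below are placeholders, not metatrain's actual internals):

```python
# Minimal sketch of plateau-based learning-rate decay with PyTorch.
# The model, hyperparameters, and validation metric are placeholders.
import torch

model = torch.nn.Linear(3, 3)  # stand-in for an interatomic potential
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="min",    # the monitored metric should decrease
    factor=0.5,    # halve the learning rate on a plateau
    patience=50,   # wait 50 epochs without improvement before decaying
)

for epoch in range(1000):
    # ... one training epoch over the dataset would go here ...
    validation_force_rmse = 0.025        # placeholder validation metric
    scheduler.step(validation_force_rmse)  # decay the LR only when the metric stalls
```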
Thanks and keep up the good work,
Lucas