mlff_select

This program enables fast and fully controlled selection of basis sets for VASP machine-learning force fields. In addition, the training sets of several on-the-fly learning runs can be combined (a more detailed description with example applications will follow in a paper currently in preparation).

It takes one or several ML_AB files and writes a new file ML_AB_sel, which can be used for subsequent generations of ML-FFs (see here).

The objective of the program is to select, as basis functions, a subset of the atoms in the given configurations that is as diverse as possible. To accelerate the process, three selections of increasing complexity are performed:

The atoms are sorted into bins according to the number of neighbors in their environments within the cutoff. All atoms with 20 neighbors are sorted into one bin, all atoms with 21 neighbors into the next bin, and so forth. In this way, e.g., atoms at the surface of a slab or in different regions of a gas phase are separated from those of bulk systems.
Each neighbor-number bin is subdivided into bins based on the neighborhood diversity. This diversity is calculated by multiplying the numbers of atoms of each element within the environment. If, e.g., 16 Ga atoms, 4 Pt atoms and 3 H atoms are within the environment of an atom, its neighborhood diversity is 16 × 4 × 3 = 192. The number of these bins is set by the keyword -neigh_classes, and the atoms are sorted into them by rising diversity. The chosen total number of basis functions per element is then divided among the resulting bins.
Each neighborhood-diversity bin is analyzed by hierarchical cluster analysis of radial and angular distribution functions. The overlap integral matrix of both functions is obtained by calculating the integrals for all pairs of atoms within the bin. The hierarchy layer of the clustering in which the number of clusters equals the number of basis functions allocated to this bin is then taken, and one atom of each cluster in that layer is chosen for the final basis set.
Additionally, a predefined fraction of atoms is chosen from gradient norm outliers and from rare neighborhoods.
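The first two selection steps can be sketched in a few lines of Python. This is only an illustration of the description above, not the program's actual code: the function names, the input format (one list of neighbor element symbols per atom) and the splitting of each bin into near-equally populated classes are assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import prod


def neighborhood_diversity(neighbor_elements):
    """Product of the per-element neighbor counts of one environment."""
    counts = Counter(neighbor_elements)
    return prod(counts.values())


def bin_by_neighbor_count(environments):
    """Group atom indices by their number of neighbors within the cutoff."""
    bins = defaultdict(list)
    for atom_index, neighbor_elements in enumerate(environments):
        bins[len(neighbor_elements)].append(atom_index)
    return dict(bins)


def split_into_classes(atom_indices, diversities, n_classes):
    """Sort atoms by rising diversity and cut them into at most n_classes chunks.

    Assumption: chunks of near-equal population; the real program may
    subdivide differently.
    """
    ordered = sorted(atom_indices, key=lambda i: diversities[i])
    n_chunks = max(1, min(n_classes, len(ordered)))
    base, extra = divmod(len(ordered), n_chunks)
    chunks, start = [], 0
    for c in range(n_chunks):
        size = base + (1 if c < extra else 0)
        chunks.append(ordered[start:start + size])
        start += size
    return chunks


# Hypothetical environments: element symbols of all neighbors of each atom.
environments = [
    ["Ga"] * 16 + ["Pt"] * 4 + ["H"] * 3,  # 23 neighbors, diversity 16*4*3
    ["Ga"] * 20,                           # 20 neighbors, diversity 20
    ["Ga"] * 18 + ["Pt"] * 2,              # 20 neighbors, diversity 18*2
]
diversities = [neighborhood_diversity(env) for env in environments]
neighbor_bins = bin_by_neighbor_count(environments)
classes_20 = split_into_classes(neighbor_bins[20], diversities, n_classes=2)
```

Here the first environment reproduces the value 192 from the example above, and the two atoms with 20 neighbors end up in separate diversity classes, ordered by rising diversity.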

The program offers a number of keywords, which need to be given as command line arguments, e.g.:

mlff_select -keyword1 -keyword2 -keyword3

If an overview of all important keywords and the general functionality of the program is needed, type:

mlff_select -help

The following keywords (either required or optional) are available:

-ml_ab=[file1],[file2],... The list of ML_AB files that shall be analyzed and processed by the program. Up to 20 different files can be used. Example: -ml_ab=ML_AB_liquid,ML_AB_solid1,ML_AB_solid2
-nbasis=[number] The desired total number of basis functions to be selected by the program. With this option, the same number is used for all elements involved. In principle, arbitrary numbers can be chosen, but there are constraints. The number must not be larger than the total number of atoms of the least abundant element (e.g., with 1000 configurations containing 3 Pt atoms each, nbasis must not exceed 3000). Further, very large numbers might lead to memory problems in the VASP refit calculation. Numbers larger than 10,000 are thus currently not recommended.
-nbasis_el=[el1:number1],[el2:number2],... Number of basis functions, resolved by element. This option should be used if one element is much less abundant than the others. If, e.g., a system with 200 Ga and 5 Pt atoms has been trained, a useful choice might be -nbasis_el=Ga:8000,Pt:4000. In general, however, the number for scarce elements should not be too low: the total memory needed is determined by the element with the largest number of basis functions, so reducing the number for the other elements saves little.
-cutoff=[value] The radial (and angular) cutoff (in Angstroms) to be used in the subsequent refit calculation. This keyword is analogous to the ML_RCUT1 and ML_RCUT2 keywords of VASP (see here). Currently, separate radial and angular cutoffs cannot be specified. With this keyword, useful basis function selections can be provided for refits with different cutoffs.
-grad_frac=[value] (optional) The fraction of basis functions allocated to atoms with the largest gradients. The mlff_select program calculates the gradient norm of each atom in each configuration from the gradient data in the ML_AB file(s). From the global list of atoms, the N atoms with the largest gradient norms are taken as basis functions, where N is the desired total number of basis functions times this fraction (evaluated for each element separately if nbasis_el is given). The idea is that atoms with large gradients might be part of extreme situations such as almost merging atoms or unwanted dissociations. Including a disproportionately high fraction of these outliers in the training set increases the overall stability of the force field (extrapolation problems become less likely). Default: 0.1
-train_div=[value] (optional) Increases the diversity of the selected basis set. All atoms within the given ML_AB file(s) are sorted by their number of neighbors. The fraction of the total basis functions given by this keyword is chosen by picking atoms uniformly from all neighbor-number bins, independent of the number of atoms in each bin. If this value is raised, more atoms from less populated bins are taken, thus increasing the number of outliers in the training set. Default: 0.1
-rdf2adf=[value] (optional) Relative weight of the radial distribution function (RDF) and the angular distribution function (ADF) in the final clustering of atoms within each neighborhood-diversity bin. Example: -rdf2adf=2.0 gives the RDF 67% weight and the ADF 33% weight. The effect of this keyword has not been well tested so far; the default (equal weight of both) should therefore be a good choice. Default: 1.0
-neigh_classes=[number] (optional) Number of diversity-based neighborhood classes per total number of neighbor atoms within the cutoff. This keyword has a huge impact on the performance of the program! It should be set to rather large values in order to accelerate the calculation (the final hierarchical clustering is then done within smaller subclasses). Too large values, however, will lead to stability problems, since fewer than one basis function might be allocated per neighborhood subclass. Reasonable values are 50-150, depending on the total number of atoms in the ML_AB files. Try smaller values first and check if the calculation is still too slow. (Note: this option will be changed or removed in the near future; a more elegant way of choosing the number of neighborhood subclasses will be published soon.) Default: 50
-bas_scale=[word] (optional) word can be linear or root. Determines the algorithm by which the number of basis functions for each neighborhood subclass is chosen. If a certain number of neighbors is by far the most frequent (most atoms within the given ML_AB file(s) fall into this category), more basis functions will be chosen for this number. With linear, the number of basis functions is a linear function of the number of atoms in the subclass (a prefactor, depending on the total number of available basis functions and the total number of atoms in the ML_AB file(s), times the atom number in the subclass). With root, a square root function is used instead (prefactor times the square root of the atom number in the subclass). The root option again gives more weight to outliers (or less abundant neighborhoods), since fewer basis functions are allocated to the more abundant neighborhoods. Default: linear
-max_environ=[number] (optional) The maximum number of atoms within the cutoff region of each atom, needed for the initial array allocations. If a larger cutoff is selected with the cutoff keyword, this number should be raised as well, which will increase the memory requirements. Default: 100
-s_grid=[number] (optional) The number of grid points for the numerical representation of the radial and angular distribution functions within the final hierarchical clustering algorithm for neighborhood subclasses. A larger value gives a better representation of the functions but also requires much more computational effort (calculation of numerical overlap integrals). The default value should be fine for most purposes. Default: 50
-rdf_exp=[value] (optional) The width of the Gaussians for the smoothed representation of the environments in the radial distribution function calculation. The default value should be fine for most purposes. Default: 20.0
-adf_exp=[value] (optional) The width of the Gaussians for the smoothed representation of the environments in the angular distribution function calculation. The default value should be fine for most purposes. Default: 0.5
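To make the arithmetic behind grad_frac, train_div, rdf2adf and bas_scale more concrete, the following Python sketch shows one possible reading of the keyword descriptions above. It is not taken from the mlff_select source: the function names are hypothetical, and it is assumed that the grad_frac and train_div shares are subtracted from the total before the remaining basis functions are distributed, and that the root prefactor is normalized in the same way as the linear one.

```python
import math


def outlier_budget(nbasis, grad_frac=0.1, train_div=0.1):
    """Split the total into gradient outliers, uniformly picked atoms and the rest.

    Assumption: both fractions are taken out of the same total nbasis.
    """
    n_grad = round(nbasis * grad_frac)
    n_div = round(nbasis * train_div)
    return n_grad, n_div, nbasis - n_grad - n_div


def rdf_adf_weights(rdf2adf=1.0):
    """Turn the RDF:ADF ratio into two weights that sum to one."""
    w_rdf = rdf2adf / (1.0 + rdf2adf)
    return w_rdf, 1.0 - w_rdf


def allocate_per_bin(bin_sizes, n_total, bas_scale="linear"):
    """Distribute n_total basis functions over bins with the given atom counts."""
    if bas_scale == "linear":
        weights = [float(n) for n in bin_sizes]
    elif bas_scale == "root":
        weights = [math.sqrt(n) for n in bin_sizes]
    else:
        raise ValueError("bas_scale must be 'linear' or 'root'")
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("all bins are empty")
    # Prefactor chosen so that the unrounded allocations sum to n_total.
    prefactor = n_total / total_weight
    # Rounding means the sum may differ slightly from n_total; a bin can
    # never provide more basis functions than it has atoms.
    return [min(n, round(prefactor * w)) for n, w in zip(bin_sizes, weights)]


n_grad, n_div, n_rest = outlier_budget(1000)
w_rdf, w_adf = rdf_adf_weights(2.0)
linear = allocate_per_bin([900, 90, 10], 100, "linear")
root = allocate_per_bin([900, 90, 10], 100, "root")
```

With three hypothetical bins of 900, 90 and 10 atoms, the linear option follows the populations directly, whereas the root option moves basis functions from the largest bin to the two smaller ones, which is the stronger weighting of less abundant neighborhoods described for bas_scale.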
