Why this is cooler than Grid

Shortcomings of Grid

Grid is using a "simple" approach to Expression Template, in which a syntactic tree is constructed. Each node performs statically a concrete operation on its subnodes when executed. This "naive" approach reduces tremendously the possibility to smartly treat complicated expressions, and can lead even to very poorly-performing code.

Example

As a simple example, consider the N-size matrix product a*b*c. With a dumb expression template tree this evaluate to 4 nested loops, whereas a plain c++ code overloading of the product operator, using a temporary storage for b*c this can be computed with 2 un-nested set of 3 loops.

Smart expression template

I adopted a "smart" expression template approach (...the name is not mine, it has been forged by a group of researcher of Nuremberg university, see arXiv:1104.1729 where the previous example is illustrated with great details).

This means that during the building of the tree, you are aware of the meaning and content of subnodes, which allows to shuffle/cut/optimize/execute-to-temporary the branch of the tree, on the basis of the nature of the nodes and to a cost model of the various options.

Other shortcoming of Grid

In Grid the fields must all carry the same tensor components, so that you have the funny situation that you have to define a gauge configuration as a type of lattice fields carrying "spin" indices, of size 1.
The components must be ordered in the same way, which means something like spacetime,spin,spin,color,color,lorentz,simd (I go by memory). I think this limits substantially the possibility to write generic code, and forces a specific memory layout across the whole code, which is quite limiting. Matrix-matrix and matrix-vector operations are explicitly unrolled for the specific case.
The innermost component is statically assumed to span a set of "virtual nodes", which on one hand ensure automatic SIMD-vectorization of every expression, on the other hand gives a very specific way to use SIMD operations, restricting the partitioning of lattices (which is certainly a problem when you are developing a multigrid solver) and neglecting the possibility to run vector operations on things other than "virtual nodes" index.
Thread loops are started and stopped in the most trivial way you could imagine, and the loop runs always only on lattice sites, without any thread pool or thread collapse feature

The point of the authors

I had several discussion with the main authors of Grid, who seemingly do not see any shortcoming on this, but I think this design, though innovative under many aspects, is too naive and limits the reach of the framework. It is a fact that to achieve ultimate optimization, the authors have a hand-written the critical section (from which the sentence: "But then even my carpenter would have been able to do that!!!").

SUNphi approach

The approach I wanted to pursue is:

To enable a node to expose a set of tensor components, adding/removing/shuffling/changing the nature of the nested nodes. No assumption is made on the ordering of the component, and to large extent on the presence of components other than that of interest for the node itself. This allows to have really generic code.
I implemented different kind of nodes, some of them not-performing operations, but changing the nature of the nested nodes, such as "transposer" node, which gives access to the transpose of the expression, or "binder" nodes, which set the value of a specific component, say the spin, and gives a subview on all other components of the expression.
free indices can be fused, thread loop started on the most convenient compontent
vectorization (will) occur if the size of the innermost component (possibly after splitting or merging of the component) is compatible with the SIMD size
I have implemented specialized builder which can detect specific subnode patterns, e.g. transpose(transpose(x)) instead of returning a Transposer node nesting a Transposer node holding x, simply returns x

Other shortcoming

Serialization in Grid relies on cpp magic, which is not recommended for any production code from the author themselves, and is one of the source of the long compilation times
HMC and measurement package uses two very similar API, incompatible among themselves
HMC part is very difficult to be modified/understood even having its main author sitting aside you for a whole week

Provide feedback

Saved searches

Use saved searches to filter your results more quickly