-
Notifications
You must be signed in to change notification settings - Fork 1
Why this is cooler than Grid
Grid is using a "simple" approach to Expression Template, in which a syntactic tree is constructed. Each node performs statically a concrete operation on its subnodes when executed. This "naive" approach reduces tremendously the possibility to smartly treat complicated expressions, and can lead even to very poorly-performing code.
As a simple example, consider the N-size matrix product a*b*c. With a dumb expression template tree this evaluate to 4 nested loops, whereas a plain c++ code overloading of the product operator, using a temporary storage for b*c this can be computed with 2 un-nested set of 3 loops.
I adopted a "smart" expression template approach (...the name is not mine, it has been forged by a group of researcher of Nuremberg university, see arXiv:1104.1729 where the previous example is illustrated with great details).
This means that during the building of the tree, you are aware of the meaning and content of subnodes, which allows to shuffle/cut/optimize/execute-to-temporary the branch of the tree, on the basis of the nature of the nodes and to a cost model of the various options.
-
In Grid the fields must all carry the same tensor components, so that you have the funny situation that you have to define a gauge configuration as a type of lattice fields carrying "spin" indices, of size 1.
-
The components must be ordered in the same way, which means something like
spacetime,spin,spin,color,color,lorentz,simd(I go by memory). I think this limits substantially the possibility to write generic code, and forces a specific memory layout across the whole code, which is quite limiting. Matrix-matrix and matrix-vector operations are explicitly unrolled for the specific case. -
The innermost component is statically assumed to span a set of "virtual nodes", which on one hand ensure automatic SIMD-vectorization of every expression, on the other hand gives a very specific way to use SIMD operations, restricting the partitioning of lattices (which is certainly a problem when you are developing a multigrid solver) and neglecting the possibility to run vector operations on things other than "virtual nodes" index.
-
Thread loops are started and stopped in the most trivial way you could imagine, and the loop runs always only on lattice sites, without any thread pool or thread collapse feature
I had several discussion with the main authors of Grid, who seemingly do not see any shortcoming on this, but I think this design, though innovative under many aspects, is too naive and limits the reach of the framework. It is a fact that to achieve ultimate optimization, the authors have a hand-written the critical section (from which the sentence: "But then even my carpenter would have been able to do that!!!").
The approach I wanted to pursue is:
-
To enable a node to expose a set of tensor components, adding/removing/shuffling/changing the nature of the nested nodes. No assumption is made on the ordering of the component, and to large extent on the presence of components other than that of interest for the node itself. This allows to have really generic code.
-
I implemented different kind of nodes, some of them not-performing operations, but changing the nature of the nested nodes, such as "transposer" node, which gives access to the transpose of the expression, or "binder" nodes, which set the value of a specific component, say the spin, and gives a subview on all other components of the expression.
-
free indices can be fused, thread loop started on the most convenient compontent
-
vectorization (will) occur if the size of the innermost component (possibly after splitting or merging of the component) is compatible with the SIMD size
-
I have implemented specialized builder which can detect specific subnode patterns, e.g. transpose(transpose(x)) instead of returning a Transposer node nesting a Transposer node holding x, simply returns x
-
Serialization in Grid relies on cpp magic, which is not recommended for any production code from the author themselves, and is one of the source of the long compilation times
-
HMC and measurement package uses two very similar API, incompatible among themselves
-
HMC part is very difficult to be modified/understood even having its main author sitting aside you for a whole week