Understood: apparent suboptimal SIMD scaling of FPTYPE=m color sum

While doing more detailed profiling of color sums in very complex processes such as 2->6 and 2->7, I observed that FPTYPE=m color sums clearly do not reach the SIMD speedup expected for single precision. The table below shows the results for gg->ttggggg from my upcoming https://arxiv.org/abs/2510.05392 v2: I would expect a x16 speedup for color sums, but only get around x8 (Feynman diagrams have different issues for such complex processes).

<img width="852" height="276" alt="Image" src="https://github.com/user-attachments/assets/794e5edc-813e-4c33-adcb-5fb86084cc69" />

Looking at the code, I suspect the cause is the following part of my implementation at the time:
```
    fptype_sv deltaMEs = { 0 };
#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
    fptype_sv deltaMEs_next = { 0 };
...
#endif

...
    // Loop over icol
    for( int icol = 0; icol < ncolor; icol++ )
    {
      // Diagonal terms
#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
      fptype2_sv& jampRi_sv = jampR_sv[icol];
      fptype2_sv& jampIi_sv = jampI_sv[icol];
#else
      fptype2_sv jampRi_sv = (fptype2_sv)( cxreal( jamp_sv[icol] ) );
      fptype2_sv jampIi_sv = (fptype2_sv)( cximag( jamp_sv[icol] ) );
#endif
      fptype2_sv ztempR_sv = cf2.value[icol][icol] * jampRi_sv;
      fptype2_sv ztempI_sv = cf2.value[icol][icol] * jampIi_sv;
      // Loop over jcol
      for( int jcol = icol + 1; jcol < ncolor; jcol++ )
      {
        // Off-diagonal terms
#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
        fptype2_sv& jampRj_sv = jampR_sv[jcol];
        fptype2_sv& jampIj_sv = jampI_sv[jcol];
#else
        fptype2_sv jampRj_sv = (fptype2_sv)( cxreal( jamp_sv[jcol] ) );
        fptype2_sv jampIj_sv = (fptype2_sv)( cximag( jamp_sv[jcol] ) );
#endif
        ztempR_sv += cf2.value[icol][jcol] * jampRj_sv;
        ztempI_sv += cf2.value[icol][jcol] * jampIj_sv;
      }
      fptype2_sv deltaMEs2 = ( jampRi_sv * ztempR_sv + jampIi_sv * ztempI_sv ); // may underflow #831
#if defined MGONGPU_CPPSIMD and defined MGONGPU_FPTYPE_DOUBLE and defined MGONGPU_FPTYPE2_FLOAT
      deltaMEs += fpvsplit0( deltaMEs2 );
      deltaMEs_next += fpvsplit1( deltaMEs2 );
#else
      deltaMEs += deltaMEs2;
#endif
    }
```

While the loop on jcol is probably all in single precision, the accumulation into `deltaMEs` within the loop on icol is probably done in double precision, at least in the following two lines:
```
      deltaMEs += fpvsplit0( deltaMEs2 );
      deltaMEs_next += fpvsplit1( deltaMEs2 );
```
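To make the lane arithmetic behind the x16 versus x8 expectation explicit, here is a minimal standalone sketch. It assumes 512-bit vectors and uses plain `std::array` in place of the real compiler vector types. The helpers `split0` and `split1` are hypothetical stand-ins for `fpvsplit0` and `fpvsplit1`, and the lane ordering (lower half first) is an assumption, not taken from the actual headers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumption: 512-bit SIMD registers (e.g. AVX-512).
constexpr std::size_t vectorBits = 512;
constexpr std::size_t neppV = vectorBits / ( 8 * sizeof( double ) ); // 8 double lanes
constexpr std::size_t neppV2 = vectorBits / ( 8 * sizeof( float ) ); // 16 float lanes
static_assert( neppV2 == 2 * neppV, "one float vector covers two double vectors" );

using DoubleV = std::array<double, neppV>; // stand-in for fptype_sv
using FloatV = std::array<float, neppV2>;  // stand-in for fptype2_sv

// Hypothetical stand-in for fpvsplit0: widen the lower half of the float lanes.
DoubleV split0( const FloatV& v )
{
  DoubleV out{};
  for( std::size_t l = 0; l < neppV; l++ ) out[l] = v[l];
  return out;
}

// Hypothetical stand-in for fpvsplit1: widen the upper half of the float lanes.
DoubleV split1( const FloatV& v )
{
  DoubleV out{};
  for( std::size_t l = 0; l < neppV; l++ ) out[l] = v[l + neppV];
  return out;
}
```

In this picture every icol iteration pays for two float-to-double conversions plus two additions on 8-lane double vectors, which is work that cannot benefit from the 16 float lanes.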

I imagine that my rationale at the time was that the accumulation should be in double precision, but I guess that it would be enough to do the full computation in single precision and convert the result to a double-precision ME only at the end.

Probably it would be enough to use `fptype2_sv deltaMEs, deltaMEs_next` instead of `fptype_sv deltaMEs, deltaMEs_next`. This requires some minor reshuffling; I will take a look when I have time.
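As a rough illustration of the reshuffling, the sketch below keeps the accumulator in single precision across the whole icol loop and widens to double only once at the end. It is a standalone toy, not the actual implementation: `colorSumMixed`, `FloatV` and `DoubleV` are hypothetical names, plain `std::array` replaces the real vector types, 512-bit vectors are assumed, and whether single-precision accumulation is accurate enough (see the underflow note, #831) would still need to be checked.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t neppV = 8;          // double lanes (assumes 512-bit vectors)
constexpr std::size_t neppV2 = 2 * neppV; // float lanes

using DoubleV = std::array<double, neppV>; // stand-in for fptype_sv
using FloatV = std::array<float, neppV2>;  // stand-in for fptype2_sv

// Color sum over the upper triangle of cf2, accumulated entirely in float.
// mes and mesNext receive the lower and upper halves of the float lanes.
void colorSumMixed( const std::vector<std::vector<float>>& cf2,
                    const std::vector<FloatV>& jampR,
                    const std::vector<FloatV>& jampI,
                    DoubleV& mes,
                    DoubleV& mesNext )
{
  const std::size_t ncolor = cf2.size();
  assert( jampR.size() == ncolor && jampI.size() == ncolor );
  FloatV deltaMEs2{}; // single-precision accumulator, replaces the two double ones
  for( std::size_t icol = 0; icol < ncolor; icol++ )
  {
    FloatV ztempR{}, ztempI{};
    // Diagonal terms
    for( std::size_t l = 0; l < neppV2; l++ )
    {
      ztempR[l] = cf2[icol][icol] * jampR[icol][l];
      ztempI[l] = cf2[icol][icol] * jampI[icol][l];
    }
    // Off-diagonal terms
    for( std::size_t jcol = icol + 1; jcol < ncolor; jcol++ )
      for( std::size_t l = 0; l < neppV2; l++ )
      {
        ztempR[l] += cf2[icol][jcol] * jampR[jcol][l];
        ztempI[l] += cf2[icol][jcol] * jampI[jcol][l];
      }
    // Accumulate in float: no widening inside the icol loop
    for( std::size_t l = 0; l < neppV2; l++ )
      deltaMEs2[l] += jampR[icol][l] * ztempR[l] + jampI[icol][l] * ztempI[l];
  }
  // Widen to double once, after the loop
  for( std::size_t l = 0; l < neppV; l++ )
  {
    mes[l] += deltaMEs2[l];
    mesNext[l] += deltaMEs2[l + neppV];
  }
}
```

The only change with respect to the structure quoted above is where the float-to-double conversion happens: once per color sum instead of once per icol iteration.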
