jialiangqu

The A/B test was showing empty results for metrics such as gbps that return tuples instead of simple values. The comparison now covers only compatible metric types (floats and objects with a p50 attribute) and skips the rest. This change also removes some duplicate code.
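
As a rough illustration of the filtering described above (not the actual code in this PR), the check could look something like the sketch below. The names `to_scalar`, `percent_diff`, and the `Latency` class are hypothetical, and accepting plain ints alongside floats is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Latency:
    """Hypothetical stand-in for a metric object that exposes a p50 attribute."""
    p50: float


def to_scalar(value: Any) -> Optional[float]:
    """Return a comparable float, or None if the metric type is not supported."""
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(value, bool):
        return None
    # Plain numbers are compared directly (accepting int here is an assumption).
    if isinstance(value, (int, float)):
        return float(value)
    # Objects with a numeric p50 attribute are compared by their p50.
    p50 = getattr(value, "p50", None)
    if isinstance(p50, (int, float)) and not isinstance(p50, bool):
        return float(p50)
    # Tuples and anything else are skipped.
    return None


def percent_diff(a: Any, b: Any) -> Optional[float]:
    """Percentage change from side A to side B, or None if not comparable."""
    x, y = to_scalar(a), to_scalar(b)
    if x is None or y is None or x == 0:
        return None
    return (y - x) / x * 100.0
```

With this shape, a tuple-valued metric yields `None` on either side and can be skipped, rather than producing an empty comparison table.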

The output below shows that the tbps metric comparison is populated after the change.

before

$ python run.py --op flex_attention --side-a="" --side-b="--max-autotune"


======================================================================
A/B Test Results: flex_attention
======================================================================
Configuration Differences:
  max-autotune   : default         → True

Test Scope: 8 input shapes, 3 backends
Metrics: latency, tflops, tbps

----------------------------------------------------------------------
Performance Summary
----------------------------------------------------------------------

compiled:
  latency     : +4628.4% avg [-9.2% to +32086.7%]
  tflops      : -58.7% avg [-99.7% to +10.1%]

eager:
  latency     : +18.3% avg [-25.3% to +94.3%]
  tflops      :  -6.5% avg [-48.5% to +33.9%]

sdpa_cudnn:
  latency     : +61.6% avg [-0.2% to +402.3%]
  tflops      : -18.7% avg [-80.1% to +0.2%]

----------------------------------------------------------------------
Detailed Comparison
----------------------------------------------------------------------

Metric: latency
Backend        (B, Hq, M, Hkv, N, D) | Mask TypeConfig A    Config B    Difference  
-----------------------------------------------------------------------
compiled           (8, 16, 128, 16, 128, 128) |            noop0.042       13.524      +32086.7%
                   (8, 16, 256, 16, 256, 128) |            noop0.367       6.677       +1720.1%
                   (8, 16, 512, 16, 512, 128) |            noop0.151       3.218       +2030.4%
                 (8, 16, 1024, 16, 1024, 128) |            noop0.467       4.900       +948.9%
                 (8, 16, 2048, 16, 2048, 128) |            noop1.618       5.009       +209.6%
                 (8, 16, 4096, 16, 4096, 128) |            noop10.211      13.731      +34.5%
                 (8, 16, 8192, 16, 8192, 128) |            noop48.173      51.291       +6.5%
               (8, 16, 16384, 16, 16384, 128) |            noop197.567     179.391      -9.2%

eager              (8, 16, 128, 16, 128, 128) |            noop12.882      17.358      +34.7%
                   (8, 16, 256, 16, 256, 128) |            noop19.619      38.112      +94.3%
                   (8, 16, 512, 16, 512, 128) |            noop18.895      20.836      +10.3%
                 (8, 16, 1024, 16, 1024, 128) |            noop26.071      20.973      -19.6%
                 (8, 16, 2048, 16, 2048, 128) |            noop67.131      50.127      -25.3%
                 (8, 16, 4096, 16, 4096, 128) |            noop229.971     265.439     +15.4%

sdpa_cudnn         (8, 16, 128, 16, 128, 128) |            noop0.029       0.146       +402.3%
                   (8, 16, 256, 16, 256, 128) |            noop0.052       0.054        +2.3%
                   (8, 16, 512, 16, 512, 128) |            noop0.120       0.120        +0.4%
                 (8, 16, 1024, 16, 1024, 128) |            noop0.389       0.388        -0.2%
                 (8, 16, 2048, 16, 2048, 128) |            noop1.469       1.840       +25.2%
                 (8, 16, 4096, 16, 4096, 128) |            noop8.710       12.442      +42.8%
                 (8, 16, 8192, 16, 8192, 128) |            noop39.267      46.642      +18.8%
               (8, 16, 16384, 16, 16384, 128) |            noop170.709     172.274      +0.9%


Metric: tflops
Backend        (B, Hq, M, Hkv, N, D) | Mask TypeConfig A    Config B    Difference  
-----------------------------------------------------------------------
compiled           (8, 16, 128, 16, 128, 128) |            noop25.655      0.080       -99.7%
                   (8, 16, 256, 16, 256, 128) |            noop11.753      0.646       -94.5%
                   (8, 16, 512, 16, 512, 128) |            noop114.164     5.359       -95.3%
                 (8, 16, 1024, 16, 1024, 128) |            noop147.693     14.081      -90.5%
                 (8, 16, 2048, 16, 2048, 128) |            noop170.549     55.092      -67.7%
                 (8, 16, 4096, 16, 4096, 128) |            noop108.104     80.390      -25.6%
                 (8, 16, 8192, 16, 8192, 128) |            noop91.654      86.081       -6.1%
               (8, 16, 16384, 16, 16384, 128) |            noop89.392      98.449      +10.1%

eager              (8, 16, 128, 16, 128, 128) |            noop0.083       0.062       -25.8%
                   (8, 16, 256, 16, 256, 128) |            noop0.219       0.113       -48.5%
                   (8, 16, 512, 16, 512, 128) |            noop0.909       0.825        -9.3%
                 (8, 16, 1024, 16, 1024, 128) |            noop2.636       3.277       +24.3%
                 (8, 16, 2048, 16, 2048, 128) |            noop4.095       5.484       +33.9%
                 (8, 16, 4096, 16, 4096, 128) |            noop4.781       4.142       -13.4%

sdpa_cudnn         (8, 16, 128, 16, 128, 128) |            noop37.036      7.373       -80.1%
                   (8, 16, 256, 16, 256, 128) |            noop82.090      80.226       -2.3%
                   (8, 16, 512, 16, 512, 128) |            noop143.127     142.595      -0.4%
                 (8, 16, 1024, 16, 1024, 128) |            noop176.660     177.054      +0.2%
                 (8, 16, 2048, 16, 2048, 128) |            noop187.063     149.364     -20.2%
                 (8, 16, 4096, 16, 4096, 128) |            noop126.234     88.373      -30.0%
                 (8, 16, 8192, 16, 8192, 128) |            noop112.004     94.293      -15.8%
               (8, 16, 16384, 16, 16384, 128) |            noop103.054     102.117      -0.9%


Metric: tbps
Backend        (B, Hq, M, Hkv, N, D) | Mask TypeConfig A    Config B    Difference  
-----------------------------------------------------------------------

after

======================================================================
A/B Test Results: flex_attention
======================================================================
Configuration Differences:
  max-autotune   : default         → True

Test Scope: 8 input shapes, 3 backends
Metrics: latency, tflops, tbps

----------------------------------------------------------------------
Performance Summary
----------------------------------------------------------------------

compiled:
  latency     : +676.5% avg [-0.0% to +1697.3%]
  tflops      : -50.8% avg [-94.4% to +0.0%]
  tbps        : -50.8% avg [-94.4% to +0.0%]

eager:
  latency     :  -0.2% avg [-1.6% to +0.7%]
  tflops      :  +0.2% avg [-0.7% to +1.6%]
  tbps        :  +0.2% avg [-0.7% to +1.6%]

sdpa_cudnn:
  latency     :  +0.5% avg [-3.2% to +3.7%]
  tflops      :  -0.5% avg [-3.6% to +3.3%]
  tbps        :  -0.5% avg [-3.6% to +3.3%]

----------------------------------------------------------------------
Detailed Comparison
----------------------------------------------------------------------

Metric: latency
Backend        (B, Hq, M, Hkv, N, D) | Mask TypeConfig A    Config B    Difference  
-----------------------------------------------------------------------
compiled           (8, 16, 128, 16, 128, 128) |            noop0.153       2.674       +1652.3%
                   (8, 16, 256, 16, 256, 128) |            noop0.150       2.687       +1697.3%
                   (8, 16, 512, 16, 512, 128) |            noop0.166       2.675       +1512.3%
                 (8, 16, 1024, 16, 1024, 128) |            noop0.464       2.691       +480.1%
                 (8, 16, 2048, 16, 2048, 128) |            noop1.590       2.699       +69.7%
                 (8, 16, 4096, 16, 4096, 128) |            noop6.082       6.079        -0.0%
                 (8, 16, 8192, 16, 8192, 128) |            noop23.893      23.922       +0.1%
               (8, 16, 16384, 16, 16384, 128) |            noop94.829      94.844       +0.0%

eager              (8, 16, 128, 16, 128, 128) |            noop7.632       7.585        -0.6%
                   (8, 16, 256, 16, 256, 128) |            noop7.565       7.613        +0.6%
                   (8, 16, 512, 16, 512, 128) |            noop7.559       7.613        +0.7%
                 (8, 16, 1024, 16, 1024, 128) |            noop9.188       9.043        -1.6%
                 (8, 16, 2048, 16, 2048, 128) |            noop34.731      34.755       +0.1%

sdpa_cudnn         (8, 16, 128, 16, 128, 128) |            noop0.032       0.031        -3.2%
                   (8, 16, 256, 16, 256, 128) |            noop0.055       0.057        +3.7%
                   (8, 16, 512, 16, 512, 128) |            noop0.119       0.119        +0.0%
                 (8, 16, 1024, 16, 1024, 128) |            noop0.377       0.376        -0.3%
                 (8, 16, 2048, 16, 2048, 128) |            noop1.419       1.435        +1.1%
                 (8, 16, 4096, 16, 4096, 128) |            noop5.564       5.667        +1.8%
                 (8, 16, 8192, 16, 8192, 128) |            noop21.943      22.170       +1.0%
               (8, 16, 16384, 16, 16384, 128) |            noop88.103      88.137       +0.0%


Metric: tflops
Backend        (B, Hq, M, Hkv, N, D) | Mask TypeConfig A    Config B    Difference  
-----------------------------------------------------------------------
compiled           (8, 16, 128, 16, 128, 128) |            noop7.065       0.403       -94.3%
                   (8, 16, 256, 16, 256, 128) |            noop28.840      1.605       -94.4%
                   (8, 16, 512, 16, 512, 128) |            noop103.968     6.448       -93.8%
                 (8, 16, 1024, 16, 1024, 128) |            noop148.722     25.636      -82.8%
                 (8, 16, 2048, 16, 2048, 128) |            noop173.525     102.232     -41.1%
                 (8, 16, 4096, 16, 4096, 128) |            noop181.501     181.562      +0.0%
                 (8, 16, 8192, 16, 8192, 128) |            noop184.792     184.570      -0.1%
               (8, 16, 16384, 16, 16384, 128) |            noop186.240     186.210      -0.0%

eager              (8, 16, 128, 16, 128, 128) |            noop0.141       0.142        +0.6%
                   (8, 16, 256, 16, 256, 128) |            noop0.568       0.564        -0.6%
                   (8, 16, 512, 16, 512, 128) |            noop2.273       2.257        -0.7%
                 (8, 16, 1024, 16, 1024, 128) |            noop7.479       7.599        +1.6%
                 (8, 16, 2048, 16, 2048, 128) |            noop7.914       7.909        -0.1%

sdpa_cudnn         (8, 16, 128, 16, 128, 128) |            noop33.825      34.953       +3.3%
                   (8, 16, 256, 16, 256, 128) |            noop77.672      74.898       -3.6%
                   (8, 16, 512, 16, 512, 128) |            noop144.631     144.631      +0.0%
                 (8, 16, 1024, 16, 1024, 128) |            noop182.361     182.858      +0.3%
                 (8, 16, 2048, 16, 2048, 128) |            noop193.676     191.603      -1.1%
                 (8, 16, 4096, 16, 4096, 128) |            noop197.597     194.026      -1.8%
                 (8, 16, 8192, 16, 8192, 128) |            noop200.428     198.382      -1.0%
               (8, 16, 16384, 16, 16384, 128) |            noop199.678     199.601      -0.0%


Metric: tbps
Backend        (B, Hq, M, Hkv, N, D) | Mask TypeConfig A    Config B    Difference  
-----------------------------------------------------------------------
compiled           (8, 16, 128, 16, 128, 128) |            noop0.110       0.006       -94.3%
                   (8, 16, 256, 16, 256, 128) |            noop0.224       0.012       -94.4%
                   (8, 16, 512, 16, 512, 128) |            noop0.405       0.025       -93.8%
                 (8, 16, 1024, 16, 1024, 128) |            noop0.289       0.050       -82.8%
                 (8, 16, 2048, 16, 2048, 128) |            noop0.169       0.099       -41.1%
                 (8, 16, 4096, 16, 4096, 128) |            noop0.088       0.088        +0.0%
                 (8, 16, 8192, 16, 8192, 128) |            noop0.045       0.045        -0.1%
               (8, 16, 16384, 16, 16384, 128) |            noop0.023       0.023        -0.0%

eager              (8, 16, 128, 16, 128, 128) |            noop0.002       0.002        +0.6%
                   (8, 16, 256, 16, 256, 128) |            noop0.004       0.004        -0.6%
                   (8, 16, 512, 16, 512, 128) |            noop0.009       0.009        -0.7%
                 (8, 16, 1024, 16, 1024, 128) |            noop0.015       0.015        +1.6%
                 (8, 16, 2048, 16, 2048, 128) |            noop0.008       0.008        -0.1%

sdpa_cudnn         (8, 16, 128, 16, 128, 128) |            noop0.529       0.546        +3.3%
                   (8, 16, 256, 16, 256, 128) |            noop0.607       0.585        -3.6%
                   (8, 16, 512, 16, 512, 128) |            noop0.565       0.565        +0.0%
                 (8, 16, 1024, 16, 1024, 128) |            noop0.356       0.357        +0.3%
                 (8, 16, 2048, 16, 2048, 128) |            noop0.189       0.187        -1.1%
                 (8, 16, 4096, 16, 4096, 128) |            noop0.096       0.095        -1.8%
                 (8, 16, 8192, 16, 8192, 128) |            noop0.049       0.048        -1.0%
               (8, 16, 16384, 16, 16384, 128) |            noop0.024       0.024        -0.0%
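
For reading the Performance Summary: the summary lines look consistent with the mean, minimum, and maximum of the per-shape Difference values for each backend. The sketch below checks this for the compiled latency rows from the "after" run. The `summarize` helper is hypothetical and only illustrates the arithmetic, not how run.py computes it.

```python
from statistics import mean


def summarize(diffs):
    """Return (average, minimum, maximum) of a list of percentage differences."""
    return mean(diffs), min(diffs), max(diffs)


# Difference column for the compiled backend, latency metric, "after" run.
compiled_latency = [1652.3, 1697.3, 1512.3, 480.1, 69.7, -0.0, 0.1, 0.0]

avg, lo, hi = summarize(compiled_latency)
print(f"{avg:+.1f}% avg [{lo:+.1f}% to {hi:+.1f}%]")
```

The average comes out to about 676.5 with a maximum of 1697.3, matching the reported summary line for compiled latency. Recomputing a difference from the displayed Config A and Config B values will not reproduce the Difference column exactly, since those values are rounded to three decimals.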

jialiangqu force-pushed the fix-ab-test-complex-metrics branch from c9910a8 to a4c99bd on September 2, 2025 23:59
jialiangqu marked this pull request as draft on September 3, 2025 00:09