# A/B Testing Documentation

## Overview

The A/B testing feature allows you to compare two different configurations in a single run, helping you quickly evaluate the performance impact of different parameter settings.

## Basic Usage

### Command Format
```bash
python run.py --op <operator> --side-a="<configuration A>" --side-b="<configuration B>"
```

### Parameters
- `--op`: Name of the operator to test (single operator only)
- `--side-a`: Parameter string for configuration A
- `--side-b`: Parameter string for configuration B

## Configuration Types

### 1. Global Parameter Testing
Global parameters are tritonbench-level settings that affect the entire benchmark behavior:

```bash
# Test different warmup parameters
python run.py --op vector_add --side-a="--warmup 25" --side-b="--warmup 100"

# Test different precision settings
python run.py --op flash_attention --side-a="--precision fp16" --side-b="--precision fp32"

# Test different device settings
python run.py --op gemm --side-a="--device cuda" --side-b="--device cpu"
```

### 2. Operator-Specific Parameter Testing
Each operator has its own specific parameters:

```bash
# Test different head counts for flex_attention
python run.py --op flex_attention --side-a="--n-heads-q 8" --side-b="--n-heads-q 16"

# Test different matrix sizes for gemm
python run.py --op gemm --side-a="--m 1024 --n 1024 --k 1024" --side-b="--m 2048 --n 2048 --k 2048"
```

### 3. Mixed Parameter Testing
You can test both global and operator-specific parameters simultaneously:

```bash
# Test both warmup and data type
python run.py --op flash_attention --side-a="--warmup 50 --dtype fp16" --side-b="--warmup 100 --dtype bf16"

# Global precision + operator-specific parameters
python run.py --op vector_add --side-a="--precision fp16 --n 1000000" --side-b="--precision fp32 --n 5000000"
```

## Parameter Formats

### Equal-Sign Format After the `--side` Flags
You must use the equal sign after the `--side-a` or `--side-b` flag; because each configuration string itself begins with `--`, a space-separated form could be misread as a separate flag:
```bash
python run.py --op flex_attention --side-a="--warmup 25" --side-b="--warmup 100"
```

### Default Configuration
If you provide an empty string `""`, it represents the default configuration:
```bash
# Compare custom configuration against default
python run.py --op vector_add --side-a="--warmup 100 --precision fp16" --side-b=""

# Compare default against custom configuration
python run.py --op flash_attention --side-a="" --side-b="--dtype bf16 --batch-size 16"
```

### Multiple Parameters
```bash
python run.py --op flash_attention --side-a="--warmup 50 --dtype fp16 --batch-size 8" --side-b="--warmup 100 --dtype bf16 --batch-size 16"
```
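Each side string is ultimately a flat list of flags. As an illustration of the general idea (this is a sketch, not tritonbench's actual parsing code), Python's standard `shlex` shows how such a string breaks into tokens:

```python
import shlex

side_a = "--warmup 50 --dtype fp16 --batch-size 8"

# shlex.split follows shell quoting rules, so the side string
# tokenizes exactly as it would on a command line.
tokens = shlex.split(side_a)
print(tokens)
# ['--warmup', '50', '--dtype', 'fp16', '--batch-size', '8']

# Pairing alternating flag/value tokens gives a config mapping.
config = dict(zip(tokens[0::2], tokens[1::2]))
print(config["--dtype"])  # fp16
```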

## Output Format

A/B test output consists of three sections:

### 1. Configuration Analysis
Shows differences between the two configurations:
```
Configuration Differences:
  warmup    : 25 → 100
  precision : fp16 → fp32
```
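The analysis step amounts to diffing the two parsed configurations key by key. A minimal sketch with plain dictionaries (`diff_configs` is a hypothetical helper, not tritonbench's API):

```python
def diff_configs(side_a: dict, side_b: dict) -> list[str]:
    """List keys whose values differ between the two sides."""
    lines = []
    for key in sorted(set(side_a) | set(side_b)):
        a = side_a.get(key, "<default>")
        b = side_b.get(key, "<default>")
        if a != b:
            lines.append(f"  {key:<10}: {a} → {b}")
    return lines

print("Configuration Differences:")
print("\n".join(diff_configs(
    {"warmup": "25", "precision": "fp16"},
    {"warmup": "100", "precision": "fp32"},
)))
```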

### 2. Performance Summary
Shows average performance changes for each backend and metric:
```
Performance Summary
----------------------------------------------------------------------

torch_add:
  latency : +37.8% avg [-22.2% to +96.4%]
  gbps    : -27.4% avg [-49.1% to +28.6%]

triton_add:
  latency : +41.5% avg [-12.5% to +96.9%]
  gbps    : -29.3% avg [-49.2% to +14.3%]
```
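Each summary line is an aggregate of the per-input percent changes for one backend and metric. A sketch of the arithmetic with made-up latencies (illustrative numbers only, not tritonbench's code):

```python
def pct_changes(side_a, side_b):
    """Percent change of config B relative to config A, per input size."""
    return [(b - a) / a * 100.0 for a, b in zip(side_a, side_b)]

# Hypothetical per-input latencies (ms) for one backend.
changes = pct_changes([0.009, 0.007, 0.008], [0.007, 0.007, 0.007])

avg = sum(changes) / len(changes)
print(f"  latency : {avg:+.1f}% avg "
      f"[{min(changes):+.1f}% to {max(changes):+.1f}%]")
# prints: "  latency : -11.6% avg [-22.2% to +0.0%]"
```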

### 3. Detailed Comparison
Shows specific numerical comparisons for each metric across different input sizes and backends:
```
Metric: latency
Backend      x_val    Config A    Config B    Difference
-----------------------------------------------------------------------
torch_add    4096     0.009       0.007       -22.2%
             8192     0.007       0.007       +0.0%
             16384    0.008       0.007       -12.5%
...
```
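Each `Difference` cell is the relative change from Config A to Config B at a single input size, and only (backend, x_val) pairs present in both runs are compared (see Limitations below). A sketch with hypothetical result dictionaries:

```python
# Per-side results keyed by (backend, x_val) -> latency in ms.
results_a = {("torch_add", 4096): 0.009, ("torch_add", 8192): 0.007}
results_b = {("torch_add", 4096): 0.007, ("torch_add", 8192): 0.007}

# Only inputs that both runs produced are comparable.
common = sorted(set(results_a) & set(results_b))
for backend, x_val in common:
    a, b = results_a[(backend, x_val)], results_b[(backend, x_val)]
    print(f"{backend:<12}{x_val:<8}{a:<10.3f}{b:<10.3f}{(b - a) / a:+.1%}")
```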

## Error Handling

The system automatically handles the following error conditions:
- Configuration parsing failures: Provides clear error messages
- Benchmark execution failures: Shows specific error reasons
- Empty results: Detects and reports empty result issues
- Parameter parsing errors: Issues warnings and uses default values
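The last bullet, warn and fall back to defaults, can be pictured with Python's standard `shlex` and `warnings` modules (a sketch of the pattern, not the actual tritonbench code):

```python
import shlex
import warnings

def parse_side(side: str) -> list[str]:
    """Tokenize a side string; fall back to defaults if it is malformed."""
    try:
        # shlex raises ValueError on malformed input such as an unbalanced quote.
        return shlex.split(side)
    except ValueError as err:
        warnings.warn(f"Could not parse {side!r} ({err}); using defaults")
        return []  # an empty token list means the default configuration

print(parse_side("--warmup 25"))    # ['--warmup', '25']
print(parse_side('--dtype "fp16'))  # warns, then returns []
```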

## Limitations

1. **Single Operator Restriction**: A/B testing only supports single operators, not multi-operator comparisons
2. **Common Inputs**: Both configurations must have overlapping input sizes for comparison
3. **Common Backends**: Only backends that exist in both configurations will be compared
4. **Sequential Execution**: The two configurations run back to back; it is still under investigation how, and by how much, running side A first affects side B's measured performance

## Troubleshooting

### Configuration Parsing Failures
Ensure the parameter string format is correct, especially proper use of quotes:
```bash
# Correct
python run.py --op vector_add --side-a="--warmup 25" --side-b="--warmup 100"

# Wrong: missing quotes
python run.py --op vector_add --side-a=--warmup 25 --side-b=--warmup 100
```

### No Common Input Sizes or Backends
Check that both configurations run successfully and produce results for at least some shared input sizes and backends.