Commit cc3dfc3

Add A/B testing documentation
1 parent 5dcd266 commit cc3dfc3

File tree

1 file changed: +147 −0 lines

docs/ab_testing.md

Lines changed: 147 additions & 0 deletions
# A/B Testing Documentation

## Overview

The A/B testing feature compares two configurations of a single operator in one run, making it easy to evaluate the performance impact of different parameter settings.
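For example, the following run compares `fp16` precision against the `vector_add` operator's default configuration (both flags are documented below):

```bash
python run.py --op vector_add --side-a="--precision fp16" --side-b=""
```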
## Basic Usage

### Command Format

```bash
python run.py --op <operator> --side-a="<configuration A>" --side-b="<configuration B>"
```
### Parameters

- `--op`: Name of the operator to test (a single operator only)
- `--side-a`: Parameter string for configuration A
- `--side-b`: Parameter string for configuration B
## Configuration Types

### 1. Global Parameter Testing

Global parameters are tritonbench-level settings that affect overall benchmark behavior:

```bash
# Test different warmup parameters
python run.py --op vector_add --side-a="--warmup 25" --side-b="--warmup 100"

# Test different precision settings
python run.py --op flash_attention --side-a="--precision fp16" --side-b="--precision fp32"

# Test different device settings
python run.py --op gemm --side-a="--device cuda" --side-b="--device cpu"
```
### 2. Operator-Specific Parameter Testing

Each operator also accepts its own specific parameters:

```bash
# Test different head counts for flex_attention
python run.py --op flex_attention --side-a="--n-heads-q 8" --side-b="--n-heads-q 16"

# Test different matrix sizes for gemm
python run.py --op gemm --side-a="--m 1024 --n 1024 --k 1024" --side-b="--m 2048 --n 2048 --k 2048"
```
### 3. Mixed Parameter Testing

You can test both global and operator-specific parameters simultaneously:

```bash
# Test both warmup and data type
python run.py --op flash_attention --side-a="--warmup 50 --dtype fp16" --side-b="--warmup 100 --dtype bf16"

# Global precision + operator-specific parameters
python run.py --op vector_add --side-a="--precision fp16 --n 1000000" --side-b="--precision fp32 --n 5000000"
```
## Parameter Formats

### Equal Sign Format After the --side Flags

You must use the equal sign after the `--side-a` or `--side-b` flag; because the value itself begins with `--`, a space-separated value would otherwise be misread as a new flag:

```bash
python run.py --op flex_attention --side-a="--warmup 25" --side-b="--warmup 100"
```
### Default Configuration

If you provide an empty string `""`, it represents the default configuration:

```bash
# Compare a custom configuration against the default
python run.py --op vector_add --side-a="--warmup 100 --precision fp16" --side-b=""

# Compare the default against a custom configuration
python run.py --op flash_attention --side-a="" --side-b="--dtype bf16 --batch-size 16"
```
### Multiple Parameters

```bash
python run.py --op flash_attention --side-a="--warmup 50 --dtype fp16 --batch-size 8" --side-b="--warmup 100 --dtype bf16 --batch-size 16"
```
## Output Format

A/B test output consists of three sections:

### 1. Configuration Analysis

Shows the differences between the two configurations:

```
Configuration Differences:
  warmup    : 25 → 100
  precision : fp16 → fp32
```
### 2. Performance Summary

Shows the average performance change for each backend and metric, with the observed range in brackets:

```
Performance Summary
----------------------------------------------------------------------

torch_add:
  latency : +37.8% avg [-22.2% to +96.4%]
  gbps    : -27.4% avg [-49.1% to +28.6%]

triton_add:
  latency : +41.5% avg [-12.5% to +96.9%]
  gbps    : -29.3% avg [-49.2% to +14.3%]
```
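Reading these numbers: the bracketed values appear to be the minimum and maximum per-input-size differences, and the average is taken over the same differences. A quick sketch using just the three torch_add latency rows shown in the detailed comparison below (a real run covers more input sizes, so this average intentionally does not match the +37.8% above):

```bash
python -c "
diffs = [-22.2, 0.0, -12.5]   # torch_add latency %-differences at 4096/8192/16384
print(f'{sum(diffs)/len(diffs):+.1f}% avg [{min(diffs):+.1f}% to {max(diffs):+.1f}%]')
"
```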
### 3. Detailed Comparison

Shows specific numerical comparisons for each metric across input sizes and backends:

```
Metric: latency
Backend      x_val    Config A    Config B    Difference
-----------------------------------------------------------------------
torch_add     4096       0.009       0.007        -22.2%
              8192       0.007       0.007         +0.0%
             16384       0.008       0.007        -12.5%
...
```
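The Difference column is consistent with the relative change (Config B − Config A) / Config A; for example, the first torch_add row:

```bash
python -c "print(f'{(0.007 - 0.009) / 0.009:+.1%}')"   # prints -22.2%
```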
## Error Handling

The system automatically handles the following error conditions:

- Configuration parsing failures: reports a clear error message
- Benchmark execution failures: shows the specific reason for the failure
- Empty results: detects and reports runs that produce no results
- Parameter parsing errors: issues a warning and falls back to default values
## Limitations

1. **Single Operator Restriction**: A/B testing supports only a single operator, not multi-operator comparisons
2. **Common Inputs**: Both configurations must produce overlapping input sizes for comparison
3. **Common Backends**: Only backends present in both configurations are compared
4. **Sequential Execution**: The two sides run one after the other; how much running side A first affects side B's measured performance is still under investigation
## Troubleshooting

### Configuration Parsing Failures

Ensure the parameter string is formatted correctly, in particular that it is quoted:

```bash
# Correct
python run.py --op vector_add --side-a="--warmup 25" --side-b="--warmup 100"

# Wrong: without quotes, the shell splits each value at the space,
# so 25 and 100 arrive as separate arguments
python run.py --op vector_add --side-a=--warmup 25 --side-b=--warmup 100
```
### No Common Input Sizes or Backends

Check that both configurations can run successfully and produce comparable results.
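One way to isolate a failing side is to run each configuration as a regular (non-A/B) benchmark first; this assumes the same flags are accepted outside A/B mode:

```bash
# Run each side's flags directly (hypothetical isolation step)
python run.py --op flash_attention --dtype bf16 --batch-size 16
python run.py --op flash_attention
```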
