
Commit e6a94ef

Use export_llm in CI

ghstack-source-id: bd75cc2
ghstack-comment-id: 2993075817
Pull-Request: #11836

Parent: 6cff4ec

17 files changed: +512, -177 lines

.ci/configs/README.md (new file, 42 lines)

# CI Configuration Files for LLM Export

This directory contains YAML configuration files used by CI tests for exporting LLM models with the new `extension.llm.export.export_llm` command.

## Usage

These config files can be used with the export command like this:

```bash
python -m extension.llm.export.export_llm --config path/to/config.yaml
```

Or you can override specific parameters:

```bash
python -m extension.llm.export.export_llm --config ci_stories110m_xnnpack_quantized.yaml base.checkpoint=my_checkpoint.pt
```

## Configuration Files

### CI Test Configurations

- `ci_stories110m_xnnpack_quantized.yaml` - Stories110M with XNNPACK quantization (used in test_llama.sh)
- `ci_stories110m_mps.yaml` - Stories110M with MPS backend
- `ci_stories110m_coreml.yaml` - Stories110M with CoreML backend
- `ci_stories110m_qnn.yaml` - Stories110M with QNN backend

### Performance Test Configurations

- `llama3_spinquant.yaml` - Llama3 with SpinQuant (used in apple-perf.yml and android-perf.yml)
- `llama3_qlora.yaml` - Llama3 with QLoRA (QAT + LoRA)
- `llama3_coreml_ane.yaml` - Llama3 with CoreML ANE
- `xnnpack_8da4w_basic.yaml` - Basic XNNPACK 8da4w quantization
- `qwen3_xnnpack_8da4w.yaml` - Qwen3 with XNNPACK 8da4w quantization

### Specialized Configurations

- `stories110m_torchao_lowbit.yaml` - Stories110M with TorchAO lowbit quantization
- `xnnpack_custom_quantized.yaml` - XNNPACK with custom ops and quantization

## Background

These configuration files were created as part of migrating CI tests from the old `examples.models.llama.export_llama` command to the new `extension.llm.export.export_llm` command with hydra configuration support.

The config files reduce duplication in CI scripts and make it easier to maintain consistent export settings across different test scenarios.
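Several settings can also be combined in a single invocation (an illustrative sketch using the same hydra dotted-key syntax shown above; the checkpoint path and override values are placeholders):

```bash
python -m extension.llm.export.export_llm \
  --config ci_stories110m_xnnpack_quantized.yaml \
  base.checkpoint=my_checkpoint.pt \
  export.max_seq_length=256 \
  debug.verbose=true
```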
.ci/configs/ci_stories110m_coreml.yaml (new file, 20 lines)

```yaml
# Configuration for CI test_llama.sh - stories110M with CoreML backend

base:
  model_class: "stories110m"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  enable_dynamic_shape: false

export:
  max_seq_length: 128
  max_context_length: 128

backend:
  coreml:
    enabled: true

debug:
  verbose: true
```
.ci/configs/ci_stories110m_mps.yaml (new file, 20 lines)

```yaml
# Configuration for CI test_llama.sh - stories110M with MPS backend

base:
  model_class: "stories110m"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  enable_dynamic_shape: false

export:
  max_seq_length: 128
  max_context_length: 128

backend:
  mps:
    enabled: true

debug:
  verbose: true
```
.ci/configs/ci_stories110m_qnn.yaml (new file, 28 lines)

```yaml
# Configuration for CI test_llama.sh - stories110M with QNN backend

base:
  model_class: "stories110m"
  tokenizer_path: "tokenizer.model"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  enable_dynamic_shape: false

export:
  max_seq_length: 128
  max_context_length: 128

quantization:
  pt2e_quantize: "qnn_16a16w"
  calibration_tasks: ["wikitext"]
  calibration_limit: 1
  calibration_seq_length: 128
  calibration_data: "Once"

backend:
  qnn:
    enabled: true

debug:
  verbose: true
```
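If a CI script needs to vary the calibration settings without editing the file, list- and scalar-valued keys can be overridden on the command line as well (a sketch assuming standard hydra override syntax; the override values are illustrative):

```bash
python -m extension.llm.export.export_llm \
  --config .ci/configs/ci_stories110m_qnn.yaml \
  quantization.calibration_limit=2 \
  quantization.calibration_seq_length=64 \
  'quantization.calibration_tasks=[wikitext]'
```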
.ci/configs/ci_stories110m_xnnpack_quantized.yaml (new file, 27 lines)

```yaml
# Configuration for CI test_llama.sh - stories110M with XNNPACK quantization
# Used when XNNPACK=ON, CUSTOM=ON, QE=ON modes are enabled

base:
  model_class: "stories110m"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  use_sdpa_with_kv_cache: true

export:
  max_seq_length: 128
  max_context_length: 128

quantization:
  qmode: "8da4w"
  group_size: 128
  embedding_quantize: "8,1024"

backend:
  xnnpack:
    enabled: true
    extended_ops: true

debug:
  verbose: false
```
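For reference, this config corresponds to passing the same settings as dotted overrides without a config file (a sketch based on the override syntax shown in the README; the actual CI invocation may differ):

```bash
python -m extension.llm.export.export_llm \
  base.model_class=stories110m \
  model.dtype_override=fp32 \
  model.use_kv_cache=true \
  model.use_sdpa_with_kv_cache=true \
  export.max_seq_length=128 \
  export.max_context_length=128 \
  quantization.qmode=8da4w \
  quantization.group_size=128 \
  'quantization.embedding_quantize="8,1024"' \
  backend.xnnpack.enabled=true \
  backend.xnnpack.extended_ops=true
```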

.ci/configs/llama3_coreml_ane.yaml (new file, 27 lines)

```yaml
# Configuration for Llama3 with CoreML ANE
# Used in apple-perf.yml

base:
  model_class: "llama3_2"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  enable_dynamic_shape: false

export:
  max_seq_length: 128
  max_context_length: 128

quantization:
  embedding_quantize: "4,32"

backend:
  coreml:
    enabled: true
    ios: 18
    quantize: "c4w"
    compute_units: "cpu_and_ne"

debug:
  verbose: false
```

.ci/configs/llama3_qlora.yaml (new file, 30 lines)

```yaml
# Configuration for Llama3 with QLoRA (QAT + LoRA)
# Used in apple-perf.yml and android-perf.yml

base:
  model_class: "llama3_2"
  use_lora: 16
  preq_mode: "8da4w_output_8da8w"
  preq_group_size: 32
  preq_embedding_quantize: "8,0"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  use_sdpa_with_kv_cache: true
  enable_dynamic_shape: false

export:
  max_seq_length: 2048
  max_context_length: 2048

quantization:
  use_qat: true

backend:
  xnnpack:
    enabled: true
    extended_ops: true

debug:
  verbose: false
```

.ci/configs/llama3_spinquant.yaml (new file, 29 lines)

```yaml
# Configuration for Llama3 with SpinQuant
# Used in apple-perf.yml and android-perf.yml

base:
  model_class: "llama3_2"
  preq_mode: "8da4w_output_8da8w"
  preq_group_size: 32
  preq_embedding_quantize: "8,0"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  use_sdpa_with_kv_cache: true
  enable_dynamic_shape: false

export:
  max_seq_length: 2048
  max_context_length: 2048

quantization:
  use_spin_quant: "native"

backend:
  xnnpack:
    enabled: true
    extended_ops: true

debug:
  verbose: false
```
.ci/configs/qwen3_xnnpack_8da4w.yaml (new file, 28 lines)

```yaml
# Configuration for Qwen3-0.6B with XNNPACK 8da4w quantization
# Used in apple-perf.yml and android-perf.yml

base:
  model_class: "qwen3-0_6b"
  params: "examples/models/qwen3/0_6b_config.json"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  use_sdpa_with_kv_cache: true

export:
  max_seq_length: 128
  max_context_length: 128

quantization:
  qmode: "8da4w"
  group_size: 32
  embedding_quantize: "8,0"

backend:
  xnnpack:
    enabled: true
    extended_ops: true

debug:
  verbose: false
```
.ci/configs/stories110m_torchao_lowbit.yaml (new file, 26 lines)

```yaml
# Configuration for stories110M with TorchAO lowbit quantization
# Used in CI test_llama_torchao_lowbit.sh

base:
  model_class: "stories110m"

model:
  dtype_override: "fp32"
  use_kv_cache: true
  use_sdpa_with_kv_cache: true

export:
  max_seq_length: 128
  max_context_length: 128

quantization:
  qmode: "torchao:8da3w"      # QLINEAR_BITWIDTH=3
  group_size: 128             # QLINEAR_GROUP_SIZE=128
  embedding_quantize: "4,32"  # QEMBEDDING_BITWIDTH=4, QEMBEDDING_GROUP_SIZE=32

backend:
  xnnpack:
    enabled: false

debug:
  verbose: false
```
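The inline comments mirror the shell variables the old script passed explicitly. If the bitwidths ever need to vary again, a script could keep this config and override just those keys (a hypothetical sketch; the variable names follow the comments above and the override syntax from the README):

```bash
QLINEAR_BITWIDTH=3
QLINEAR_GROUP_SIZE=128
python -m extension.llm.export.export_llm \
  --config .ci/configs/stories110m_torchao_lowbit.yaml \
  "quantization.qmode=torchao:8da${QLINEAR_BITWIDTH}w" \
  "quantization.group_size=${QLINEAR_GROUP_SIZE}"
```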
