2 changes: 1 addition & 1 deletion README.md
@@ -21,7 +21,7 @@ Below are list of available recipes grouped by different criteria. Click the lin
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [google-bert-bert-base-multilingual-cased](google-bert-bert-base-multilingual-cased/aitk) | [laion-CLIP-ViT-B-32-laion2B-s34B-b79K](laion-CLIP-ViT-B-32-laion2B-s34B-b79K/aitk) | [deepseek-ai-DeepSeek-R1-Distill-Qwen-1.5B](deepseek-ai-DeepSeek-R1-Distill-Qwen-1.5B/aitk) | [meta-llama-Llama-3.2-1B-Instruct](meta-llama-Llama-3.2-1B-Instruct/NvTensorRtRtx) | [mistralai-Mistral-7B-Instruct-v0.3](mistralai-Mistral-7B-Instruct-v0.3/aitk) | [microsoft-Phi-3.5-mini-instruct](microsoft-Phi-3.5-mini-instruct/aitk) | [microsoft-Phi-3.5-mini-instruct](microsoft-Phi-3.5-mini-instruct/NvTensorRtRtx) | [Qwen-Qwen2.5-1.5B-Instruct](Qwen-Qwen2.5-1.5B-Instruct/NvTensorRtRtx) | [microsoft-resnet-50](microsoft-resnet-50/aitk) | [google-vit-base-patch16-224](google-vit-base-patch16-224/aitk) |
| [intel-bert-base-uncased-mrpc](intel-bert-base-uncased-mrpc/aitk) | [openai-clip-vit-base-patch16](openai-clip-vit-base-patch16/aitk) | | [meta-llama-Llama-3.2-1B-Instruct](meta-llama-Llama-3.2-1B-Instruct/aitk) | | [microsoft-Phi-4-mini-reasoning](microsoft-Phi-4-mini-reasoning/aitk) | | [Qwen-Qwen2.5-1.5B-Instruct](Qwen-Qwen2.5-1.5B-Instruct/aitk) | | |
| | [openai-clip-vit-base-patch32](openai-clip-vit-base-patch32/aitk) | | | | | | [deepseek-ai-DeepSeek-R1-Distill-Qwen-1.5B](deepseek-ai-DeepSeek-R1-Distill-Qwen-1.5B/NvTensorRtRtx) | | |
| [BAAI/bge-small-en-v1.5](baai-bge-small-en-v1.5/aitk) | [openai-clip-vit-base-patch32](openai-clip-vit-base-patch32/aitk) | | | | | | [deepseek-ai-DeepSeek-R1-Distill-Qwen-1.5B](deepseek-ai-DeepSeek-R1-Distill-Qwen-1.5B/NvTensorRtRtx) | | |
<!-- end_arch_models -->
</details>

6 changes: 6 additions & 0 deletions baai-bge-small-en-v1.5/aitk/.gitignore
@@ -0,0 +1,6 @@
__pycache__
/cache
/history/*/*
!/history/*/history.config
!/history/*/olive_config.json
.DS_Store
165 changes: 165 additions & 0 deletions baai-bge-small-en-v1.5/aitk/README.md
@@ -0,0 +1,165 @@
# BGE-Small-EN-v1.5 Optimization

This folder contains examples of BGE-Small-EN-v1.5 optimization using different workflows for various hardware accelerators.

## Model Overview

BGE-Small-EN-v1.5 is a lightweight English text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence). The model is optimized for sentence and text embedding tasks, providing high-quality vector representations for downstream applications such as semantic search, text classification, and similarity matching.
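
As a concrete illustration of the task (not part of this recipe), the sketch below computes sentence embeddings with the unoptimized HuggingFace model. BGE models use the L2-normalized `[CLS]` token embedding as the sentence vector:

```python
# Illustrative only: baseline embeddings from the unoptimized model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
model.eval()

sentences = ["How do I reset my card PIN?", "I need to change the PIN on my card."]
inputs = tokenizer(sentences, padding="max_length", truncation=True,
                   max_length=128, return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state
# BGE takes the [CLS] token (position 0), L2-normalized, as the sentence embedding.
embeddings = torch.nn.functional.normalize(last_hidden[:, 0], p=2, dim=1)
print(embeddings[0] @ embeddings[1])  # cosine similarity, since both are unit vectors
```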

## Optimization Workflows

This directory provides three different optimization workflows targeting specific hardware accelerators:

- **QDQ for Qualcomm NPU**: Post-training QDQ (Quantize/DeQuantize) quantization for Qualcomm Neural Processing Units
- **QDQ for AMD NPU**: Post-training QDQ quantization for AMD Neural Processing Units
- **OpenVINO for Intel NPU**: OpenVINO optimization for Intel Neural Processing Units

## Workflow Details

### QDQ for Qualcomm NPU

This workflow performs post-training QDQ quantization for Qualcomm NPU acceleration. It follows the optimization pipeline below; a standalone conversion sketch appears after the feature list.

- *HuggingFace Model → ONNX Model → Quantized ONNX Model*

**Configuration File**: `bge-small-en-v1.5_qdq_qnn.json`

**Key Features**:
- Uses QNN (Qualcomm Neural Network) execution provider
- Applies static post-training quantization calibrated on sample data
- Optimized for Qualcomm NPU hardware architecture
- Supports both activation and weight quantization
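
The first pipeline step (HuggingFace → ONNX) is handled by Olive's conversion pass; for orientation only, a roughly equivalent standalone export with Optimum (which the requirements list includes) might look like this — the output directory is hypothetical:

```python
# Illustrative HuggingFace -> ONNX export; Olive performs this step internally.
from optimum.onnxruntime import ORTModelForFeatureExtraction

ort_model = ORTModelForFeatureExtraction.from_pretrained(
    "BAAI/bge-small-en-v1.5",
    export=True,  # convert the PyTorch checkpoint to ONNX on load
)
ort_model.save_pretrained("bge-small-en-v1.5-onnx")  # hypothetical output dir
```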

### QDQ for AMD NPU

This workflow performs post-training QDQ quantization for AMD NPU acceleration. It follows the optimization pipeline:

- *HuggingFace Model → ONNX Model → Quantized ONNX Model*

**Configuration File**: `bge-small-en-v1.5_qdq_amd.json`

**Key Features**:
- Optimized for the AMD NPU hardware architecture
- Applies static post-training quantization calibrated on sample data
- Supports both activation and weight quantization

### OpenVINO for Intel NPU

This workflow performs OpenVINO optimization for Intel NPU acceleration. It follows the optimization pipeline:

- *HuggingFace Model → OpenVINO IR Model*

**Configuration File**: `bge-small-en-v1.5_context_ov_static.json`

**Key Features**:
- Uses OpenVINO execution provider for Intel NPU
- Implements static quantization for optimal performance
- Custom user script for specialized data processing
- Enhanced accuracy evaluation using MTEB benchmarks
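
For orientation, loading an OpenVINO IR on an Intel NPU with the `openvino` runtime looks roughly like the sketch below (paths are hypothetical; this recipe additionally wraps the IR via the encapsulation pass so it can run through ONNX Runtime):

```python
# Illustrative only: run an OpenVINO IR directly on the NPU device.
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("openvino_model.xml", device_name="NPU")
outputs = compiled({
    "input_ids": np.zeros((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
    "token_type_ids": np.zeros((1, 128), dtype=np.int64),
})
```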

## Dataset Information

### Quantization Datasets
- **QNN/AMD NPU**: Uses MTEB Banking77 test split for quantization calibration
- **Intel NPU**: Uses Wikipedia train split (300 samples) with custom preprocessing
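
The actual `user_script.py` ships with the recipe; purely as a hypothetical sketch, a custom loader like the `bge_small_en_dataset` type referenced in the OpenVINO config could be registered through Olive's dataset registry roughly as follows (the function body and signature are assumptions):

```python
# Hypothetical sketch; the recipe's real user_script.py may differ.
from datasets import load_dataset
from olive.data.registry import Registry

@Registry.register_dataset()
def bge_small_en_dataset(data_name: str, split: str, max_samples: int, **kwargs):
    # Take the first `max_samples` rows of the requested split for calibration.
    return load_dataset(data_name, split=f"{split}[:{max_samples}]")
```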

### Evaluation Datasets
- **Primary**: MTEB Banking77 classification task
- **Evaluation Metric**: Custom embedding accuracy for semantic similarity
- **Benchmark**: MTEB (Massive Text Embedding Benchmark) for standardized evaluation
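
For a standardized baseline outside of Olive, the MTEB harness can score the original model on Banking77; a sketch assuming the classic `mteb` API and `sentence-transformers` (not in the requirements list below):

```python
# Illustrative baseline: score the unoptimized model on MTEB Banking77.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="mteb_results")
```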

## Performance Evaluation Results

The following results were measured with standard embedding benchmarks and the performance settings in each configuration. All evaluations use the MTEB Banking77 dataset for consistency.

### Qualcomm NPU (QNN) Performance

| Metric | Value |
|--------|-------|
| **Accuracy** | 85.57% |
| **Latency (avg)** | 14.83 ms |
| **Latency (min)** | 13.66 ms |
| **Latency (max)** | 17.92 ms |
| **Latency (p90)** | 15.52 ms |
| **Throughput (avg)** | 70.97 inferences/sec |
| **Throughput (max)** | 72.83 inferences/sec |
| **Throughput (min)** | 68.47 inferences/sec |

### AMD NPU Performance

| Metric | Value |
|--------|-------|
| **Accuracy** | 83.66% |
| **Latency (avg)** | 8.58 ms |
| **Latency (min)** | 7.54 ms |
| **Latency (max)** | 9.43 ms |
| **Latency (p90)** | 9.13 ms |
| **Throughput (avg)** | 107.26 inferences/sec |
| **Throughput (max)** | 130.15 inferences/sec |
| **Throughput (min)** | 88.90 inferences/sec |

### Intel NPU Performance

| Metric | Value |
|--------|-------|
| **Accuracy** | 85.42% |
| **Latency (avg)** | 3.33 ms |
| **Latency (min)** | 2.30 ms |
| **Latency (max)** | 6.39 ms |
| **Latency (p90)** | 4.01 ms |
| **Throughput (avg)** | 312.15 inferences/sec |
| **Throughput (max)** | 421.12 inferences/sec |
| **Throughput (min)** | 199.13 inferences/sec |

## Optimization Techniques

### Quantization Strategies
- **QDQ Quantization**: Static post-training quantization used for the QNN and AMD NPU workflows
- **Static Quantization**: Used for the Intel NPU workflow with OpenVINO
- **Mixed Precision**: Activation and weight quantization types can be set independently in each workflow
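
Olive drives quantization through the pass settings in each config; conceptually, the QDQ workflows correspond to ONNX Runtime static quantization in QDQ format, sketched standalone below (file paths, quantization types, and the calibration reader are illustrative assumptions):

```python
# Conceptual sketch of static QDQ quantization with ONNX Runtime;
# the recipes configure this through Olive passes instead.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class EmbeddingCalibrationReader(CalibrationDataReader):
    """Feeds pre-tokenized calibration batches to the quantizer."""
    def __init__(self, batches):
        self._iter = iter(batches)
    def get_next(self):
        return next(self._iter, None)  # None signals the end of calibration

# A single dummy batch; real calibration would tokenize e.g. MTEB Banking77.
batches = [{
    "input_ids": np.zeros((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
    "token_type_ids": np.zeros((1, 128), dtype=np.int64),
}]

quantize_static(
    model_input="bge-small-en-v1.5.onnx",       # hypothetical path
    model_output="bge-small-en-v1.5.qdq.onnx",  # hypothetical path
    calibration_data_reader=EmbeddingCalibrationReader(batches),
    quant_format=QuantFormat.QDQ,  # insert QuantizeLinear/DequantizeLinear pairs
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```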

### Model Optimization Features
- **Input Optimization**: Fixed input shapes for better inference performance
- **Memory Optimization**: Efficient memory usage through quantization
- **Hardware-Specific Tuning**: Custom optimizations for each NPU architecture
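
Fixing the input shapes can also be done directly with ONNX Runtime's model utilities; a sketch under the assumption that `onnxruntime.tools.onnx_model_utils` is available (paths hypothetical):

```python
# Sketch: pin the dynamic sequence dimensions to the fixed (1, 128) shape
# these recipes use, mirroring what the Olive passes configure.
import onnx
from onnxruntime.tools.onnx_model_utils import make_input_shape_fixed

model = onnx.load("bge-small-en-v1.5.onnx")  # hypothetical path
for name in ("input_ids", "attention_mask", "token_type_ids"):
    make_input_shape_fixed(model.graph, name, [1, 128])
onnx.save(model, "bge-small-en-v1.5.fixed.onnx")
```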

## Requirements

The following dependencies are required for running the optimization workflows:

```
olive-ai
datasets
optimum
mteb
polars-lts-cpu (QNN only)
```

## Usage

1. **Select Workflow**: Choose the appropriate configuration file based on your target hardware:
- For Qualcomm NPU: `bge-small-en-v1.5_qdq_qnn.json`
- For AMD NPU: `bge-small-en-v1.5_qdq_amd.json`
- For Intel NPU: `bge-small-en-v1.5_context_ov_static.json`

2. **Configure Parameters**: Adjust quantization parameters such as activation type, weight type, and quantization dataset according to your specific requirements.

3. **Run Optimization**: Execute the optimization pipeline using the selected configuration (see the sketch after these steps).

4. **Evaluate Results**: Use the provided evaluation scripts to assess model performance on your target hardware.
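
Assuming the `olive-ai` package is installed, a workflow can be launched from Python (or with the equivalent `olive run` CLI); a minimal sketch:

```python
# Minimal sketch: run an Olive workflow from Python.
# Roughly equivalent to: olive run --config <config file>
from olive.workflows import run as olive_run

olive_run("bge-small-en-v1.5_qdq_qnn.json")  # pick the config for your target NPU
```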

## Performance Notes

- **Accuracy**: Measured with a custom embedding-accuracy metric on the MTEB Banking77 task
- **Latency**: Measured in milliseconds per inference
- **Throughput**: Measured in inferences per second

## Model Information

- **Model ID**: `BAAI/bge-small-en-v1.5`
- **Model Type**: Text Embedding Model
- **Framework**: HuggingFace Transformers
- **Optimization Target**: Hardware-specific acceleration for embedding generation

*Note: Performance metrics may vary depending on hardware specifications, system environment, and workload characteristics. The values provided here are for reference and may not reflect performance on all devices or configurations.*
203 changes: 203 additions & 0 deletions baai-bge-small-en-v1.5/aitk/bge-small-en-v1.5_context_ov_static.json
@@ -0,0 +1,203 @@
{
"input_model": {
"type": "HfModel",
"model_path": "BAAI/bge-small-en-v1.5",
"task": "feature-extraction",
"io_config": {
"input_names": [
"input_ids",
"attention_mask",
"token_type_ids"
],
"input_shapes": [
[
1,
128
],
[
1,
128
],
[
1,
128
]
],
"input_types": [
"int64",
"int64",
"int64"
],
"output_names": [
"last_hidden_state",
"state"
]
}
},
"systems": {
"local_system": {
"type": "LocalSystem",
"accelerators": [
{
"device": "npu",
"execution_providers": [
"OpenVINOExecutionProvider"
]
}
]
}
},
"data_configs": [
{
"name": "quantize_data_config",
"user_script": "user_script.py",
"load_dataset_config": {
"type": "bge_small_en_dataset",
"data_name": "wikipedia",
"split": "train",
"max_samples": 300
},
"dataloader_config": {
"batch_size": 1,
"drop_last": true
}
},
{
"name": "accuracy_data_config",
"type": "HuggingfaceContainer",
"load_dataset_config": {
"data_name": "mteb/banking77",
"split": "test"
},
"pre_process_data_config": {
"max_length": 128,
"padding": "max_length",
"input_cols": ["text"]
},
"dataloader_config": {
"batch_size": 1
}
},
{
"name": "evaluation_data_config",
"type": "HuggingfaceContainer",
"load_dataset_config": {
"data_name": "mteb/banking77",
"split": "test"
},
"pre_process_data_config": {
"max_length": 128,
"padding": "max_length",
"input_cols": ["text"]
},
"dataloader_config": {
"batch_size": 1
}
}
],
"evaluators": {
"common_evaluator": {
"metrics": [
{
"name": "accuracy",
"type": "custom",
"sub_types": [
{
"name": "embedding_accuracy",
"priority": 1,
"higher_is_better": true,
"goal": { "type": "max-degradation", "value": 0.05 }
}
],
"user_config": {
"user_script": "user_script.py",
"evaluate_func": "eval_accuracy"
}
},
{
"name": "latency",
"type": "latency",
"data_config": "evaluation_data_config",
"sub_types": [
{ "name": "avg", "priority": 2, "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p50", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p75", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p90", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p95", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p99", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "min", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "max", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } }
]
},
{
"name": "throughput",
"type": "throughput",
"data_config": "evaluation_data_config",
"sub_types": [
{ "name": "avg", "priority": 3, "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p50", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p75", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p90", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p95", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "p99", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "min", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } },
{ "name": "max", "metric_config": { "warmup_num": 20, "repeat_test_num": 100 } }
]
}
]
}
},
"passes": {
"optimum_convert": {
"type": "OpenVINOOptimumConversion",
"extra_args": {
"device": "npu",
"task": "feature-extraction"
}
},
"io_update": {
"type": "OpenVINOIoUpdate",
"input_shapes": [
[
1,
128
],
[
1,
128
],
[
1,
128
]
],
"static": true
},
"ov_quantize": {
"type": "OpenVINOQuantization",
"target_device": "npu",
"data_config": "quantize_data_config",
"model_type": "TRANSFORMER",
"user_script": "user_script.py",
"transform_fn": "custom_transform_func",
"extra_configs": [
{
"advanced_quantization_parameters": {
"smooth_quant_alpha": 0.6
}
}
]
},
"encapsulation": {
"type": "OpenVINOEncapsulation",
"target_device": "npu",
"ov_version": "2025.1"
}
},
"cache_dir": "cache",
"evaluate_input_model": false,
"evaluator": "common_evaluator",
"host": "local_system",
"output_dir": "models/bge-small-en-v1.5/openvino",
"target": "local_system"
}