
Commit bd1adbd
to support temporal_ids in resampler of minicpm4.5
2 parents: 39223a4 + b9500dc

26 files changed: +914 -546 lines

.github/workflows/build_documentation.yml
1 addition, 1 deletion

@@ -51,7 +51,7 @@ jobs:
       run: |
         pip install --upgrade pip uv
         uv pip install git+https://github.com/huggingface/doc-builder
-        uv pip install .[quality] nncf openvino neural-compressor[pt]>3.4 diffusers accelerate
+        uv pip install .[quality] nncf openvino neural-compressor[pt]>3.4 diffusers accelerate datasets

     - name: Make documentation
       shell: bash

.github/workflows/build_pr_documentation.yml
4 additions, 3 deletions

@@ -38,17 +38,18 @@ jobs:
       run: |
         pip install --upgrade pip uv
         uv pip install git+https://github.com/huggingface/doc-builder
-        uv pip install .[quality] nncf openvino neural-compressor[pt]>3.4 diffusers accelerate
+        uv pip install .[quality] nncf openvino neural-compressor[pt]>3.4 diffusers accelerate datasets

     - name: Make documentation
       shell: bash
       run: |
         make doc BUILD_DIR=./doc-build VERSION=pr_${{ env.PR_NUMBER }}
-        mv ./doc-build/optimum.intel optimum-intel
+        cd doc-build
+        mv optimum.intel optimum-intel
         echo ${{ env.COMMIT_SHA }} > ./commit_sha
         echo ${{ env.PR_NUMBER }} > ./pr_number

     - uses: actions/upload-artifact@v4
       with:
         name: doc-build-artifact
-        path: optimum-intel
+        path: doc-build

.github/workflows/test_openvino.yml
1 addition, 1 deletion

@@ -48,7 +48,7 @@ jobs:
     - name: Setup Python
       uses: actions/setup-python@v5
       with:
-        python-version: 3.9
+        python-version: "3.10"

     - name: Install dependencies
       run: |

.github/workflows/test_openvino_nightly.yml
1 addition, 1 deletion

@@ -70,7 +70,7 @@ jobs:
     - name: Setup Python
       uses: actions/setup-python@v5
      with:
-        python-version: 3.9
+        python-version: "3.10"

     - name: Install dependencies
       run: |

.github/workflows/test_openvino_notebooks.yml
1 addition, 1 deletion

@@ -24,7 +24,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.9]
+        python-version: "3.10"
         test_file: [
           "optimum_openvino_inference.ipynb",
           "question_answering_quantization.ipynb",

.github/workflows/test_openvino_slow.yml
1 addition, 1 deletion

@@ -54,7 +54,7 @@ jobs:
     - name: Setup Python
       uses: actions/setup-python@v5
       with:
-        python-version: 3.9
+        python-version: "3.10"

     - name: Install dependencies
       run: |

docs/source/openvino/optimization.mdx
7 additions, 4 deletions

@@ -889,23 +889,26 @@ If quantization_config is not provided, model will be exported in 8 bits by default
 4-bit weight quantization can be achieved in a similar way:

 ```python
-from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
+from optimum.intel import OVModelForCausalLM

-quantization_config = OVWeightQuantizationConfig(bits=4)
-model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
+model = OVModelForCausalLM.from_pretrained(model_id, quantization_config={"bits": 4})
 ```

+For some models, we provide preconfigured 4-bit weight-only quantization [configurations](https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/openvino/configuration.py) that offer a good trade-off between quality and speed. This default 4-bit configuration is applied automatically when you specify `quantization_config={"bits": 4}`.
+
 Or for vision-language pipelines:
 ```python
 model = OVModelForVisualCausalLM.from_pretrained(
     "llava-hf/llava-v1.6-mistral-7b-hf",
-    quantization_config=quantization_config
+    quantization_config={"bits": 4}
 )
 ```

 You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:

 ```python
+from optimum.intel import OVWeightQuantizationConfig
+
 quantization_config = OVWeightQuantizationConfig(
     bits=4,
     sym=False,
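Taken together, the two styles documented in this diff can be sketched as follows. This is a minimal illustration, not part of the commit: the model id is a placeholder, and no parameters beyond `bits` and `sym` (both shown in the diff) are assumed.

```python
# Minimal sketch of the documented usage above (illustration only, not from the commit).
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "HuggingFaceTB/SmolLM2-135M"  # placeholder checkpoint, any causal LM works

# Shorthand: passing a dict applies the preconfigured default 4-bit weight-only scheme
# (or a model-specific preset, when optimum-intel ships one for this architecture).
model = OVModelForCausalLM.from_pretrained(model_id, quantization_config={"bits": 4})

# Explicit form: spell out the parameters when tuning the quality/speed trade-off.
quantization_config = OVWeightQuantizationConfig(bits=4, sym=False)
model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
```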

notebooks/openvino/vision_language_quantization.ipynb
1 addition, 1 deletion

@@ -44,7 +44,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "! pip install \"optimum-intel[openvino]\" datasets num2words torchvision\n",
+    "! pip install \"optimum-intel[openvino]\" datasets num2words torchvision transformers==4.52.*\n",
     "! pip install git+https://github.com/huggingface/optimum-benchmark.git"
    ]
   },

optimum/exporters/ipex/modeling_utils.py
24 additions, 5 deletions

@@ -1030,7 +1030,17 @@ def forward(
 class _IPEXLlamaAttention(_IPEXAttention):
     def __init__(self, module, device, config) -> None:
         super().__init__(module, device, config)
-        if getattr(config, "quantization_config", None) is None:
+        # Skip concat_qkv creation for TP mode (when using DTensor)
+        is_tp_mode = (
+            hasattr(self.q_proj, "weight")
+            and type(self.q_proj.weight).__name__ == "DTensor"
+            or hasattr(self.k_proj, "weight")
+            and type(self.k_proj.weight).__name__ == "DTensor"
+            or hasattr(self.v_proj, "weight")
+            and type(self.v_proj.weight).__name__ == "DTensor"
+        )
+
+        if getattr(config, "quantization_config", None) is None and not is_tp_mode:
             concat_weight = torch.concat([self.q_proj.weight, self.k_proj.weight, self.v_proj.weight]).contiguous()
             bias_list = [bias for bias in [self.q_proj.bias, self.k_proj.bias, self.v_proj.bias] if bias is not None]
             use_bias = bias_list != []
@@ -1131,11 +1141,20 @@ def __init__(self, module, device, config) -> None:
         self.module_device = device

         if not config.compile and getattr(config, "quantization_config", None) is None:
-            # LinearAllreduce cannot use fused op LinearAdd
-            if module.down_proj.__class__.__name__ not in ["LinearAllreduce"]:
+            # Check if in TP mode (using DTensor)
+            is_tp_mode = (
+                hasattr(module.down_proj, "weight")
+                and type(module.down_proj.weight).__name__ == "DTensor"
+                or hasattr(module.gate_proj, "weight")
+                and type(module.gate_proj.weight).__name__ == "DTensor"
+                or hasattr(module.up_proj, "weight")
+                and type(module.up_proj.weight).__name__ == "DTensor"
+            )
+
+            if not is_tp_mode:
                 self.mlp_linear_add = LinearAdd(module.down_proj)
-            if isinstance(self.act_fn, nn.SiLU):
-                self.linear_silu_mul = Linear2SiluMul(module.gate_proj, module.up_proj)
+                if isinstance(self.act_fn, nn.SiLU):
+                    self.linear_silu_mul = Linear2SiluMul(module.gate_proj, module.up_proj)

     def forward(self, hidden_states: torch.Tensor, residual: torch.Tensor = None, **kwargs):
         if hasattr(self, "linear_silu_mul"):
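The same three-way DTensor check now appears in both the attention and the MLP patching paths. As a reading aid only, the pattern could be factored into a helper along these lines; `_weights_are_dtensor` is a hypothetical name and is not part of the commit, which keeps the expression inline.

```python
# Hypothetical helper (not in the commit): mirrors the inline DTensor checks above.
def _weights_are_dtensor(*linears) -> bool:
    """Return True if any of the given linear modules carries a DTensor weight,
    i.e. the model is sharded for tensor parallelism and the fused or concatenated
    ops (concat QKV, LinearAdd, Linear2SiluMul) must be skipped."""
    return any(
        hasattr(linear, "weight") and type(linear.weight).__name__ == "DTensor"
        for linear in linears
    )

# Usage matching the attention path:
#   is_tp_mode = _weights_are_dtensor(self.q_proj, self.k_proj, self.v_proj)
# and the MLP path:
#   is_tp_mode = _weights_are_dtensor(module.down_proj, module.gate_proj, module.up_proj)
```

Comparing the class name string rather than using isinstance presumably avoids importing torch.distributed machinery at patch time; the diff itself only states that the fused ops are skipped in TP mode.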

optimum/exporters/openvino/__main__.py
10 additions, 1 deletion

@@ -520,9 +520,18 @@ class StoreAttr(object):
                 "Quantization of the weights requires nncf, please install it with `pip install nncf`"
             )

+        from optimum.intel.openvino.configuration import _GPTOSSQuantizationConfig
         from optimum.intel.openvino.quantization import _weight_only_quantization

-        _weight_only_quantization(submodel, quantization_config)
+        if isinstance(quantization_config, _GPTOSSQuantizationConfig):
+            # A workaround for GPT-OSS model is required to run quantization twice, this way it is possible to
+            # selectively quantize some weights to 4 bits and some to 8 bits.
+            _weight_only_quantization(submodel, quantization_config.quantization_config1)
+            _weight_only_quantization(
+                submodel, quantization_config.quantization_config2, verify_not_optimized=False
+            )
+        else:
+            _weight_only_quantization(submodel, quantization_config)
         compressed_submodel_path = submodel_path.parent / f"{submodel_path.stem}_compressed.xml"
         save_model(submodel, compressed_submodel_path, compress_to_fp16=False)
         del submodel
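For readers following along, the new branch boils down to the dispatch sketched below. This is an illustrative restatement, not the library's actual function layout: `apply_weight_only_quantization` is a made-up name, while `_GPTOSSQuantizationConfig`, its `quantization_config1`/`quantization_config2` attributes, and `_weight_only_quantization` all come from the diff above.

```python
# Illustration only: restates the dispatch added above as a standalone helper.
def apply_weight_only_quantization(submodel, quantization_config):
    from optimum.intel.openvino.configuration import _GPTOSSQuantizationConfig
    from optimum.intel.openvino.quantization import _weight_only_quantization

    if isinstance(quantization_config, _GPTOSSQuantizationConfig):
        # GPT-OSS carries two weight-only configs applied back to back, so one subset of
        # weights ends up in 4-bit precision and the rest in 8-bit. The second pass skips
        # the "not already optimized" verification so it can run on the same submodel.
        _weight_only_quantization(submodel, quantization_config.quantization_config1)
        _weight_only_quantization(
            submodel, quantization_config.quantization_config2, verify_not_optimized=False
        )
    else:
        # Single-config case: one weight-only quantization pass, as before this commit.
        _weight_only_quantization(submodel, quantization_config)
```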
