Commit 12c9995
Author: Nikita Savelyev

[WC] Introduce flexible group size value search (#3556)
### Changes

Introduce flexible group size search logic as a part of the mixed precision algorithm. When enabled, each weight whose channel size is not divisible by the general group size value will be compressed with a newly calculated group size. The new group size value is the maximal power of two (i.e., 2^k) such that:

- the channel size is divisible by it;
- it is less than the originally specified group size value;
- it is greater than or equal to `min_flexible_group_size` (16 by default).

If no value satisfies these requirements, the weight is compressed to the backup precision. If ratio < 1.0 and some weights have to be compressed to the backup precision because of group size issues, these weights do not contribute to the ratio of the backup mode group. This method is disabled by default.

### Reason for changes

Some models have channel sizes that are not divisible by the default group size. In such cases a user can now provide the `nncf.AdvancedCompressionParameters(enable_flexible_group_size=True)` advanced parameter instead of an ignored scope. Example models:

- `microsoft/Phi-4-multimodal-instruct` (lm_model and vision_embeddings_model)
- `HuggingFaceH4/Qwen2.5-Math-1.5B-Instruct-PRM-0.2`

### Metrics

Results for phi4-multimodal are below.

| Language Model Precision | Vision Embed. Model Precision | WWB Similarity | Time of image-to-text request (sec.) | Time of audio-to-text request (sec.) |
|---|---|---|---|---|
| FP16 | FP16 | 99.19% | 31.21 | 17.76 |
| Mixed precision: int4 or bf16 | Mixed precision: int4 or bf16 | 77.51% | 22.37 | 10.93 |
| Mixed precision: int4 or int8 | Mixed precision: int4 or int8 | 79.03% | 19.95 | 9.47 |
| int4 with mixed group size: 128 or 64 | int4 with mixed group size: 128 or 16 | 81.36% | 19.89 | 9.16 |

The last row corresponds to `nncf.AdvancedCompressionParameters(enable_flexible_group_size=True)`. The third row corresponds to `nncf.AdvancedCompressionParameters(enable_flexible_group_size=True, min_flexible_group_size=128)`. The second row corresponds to `nncf.AdvancedCompressionParameters(enable_flexible_group_size=True, min_flexible_group_size=128)` with `backup_mode="none"`. The inference time results are as expected; the similarity results less so, but there is still no degradation for the group size 16 case.

### Related tickets

167337

### Tests

Added test cases which assert that the expected log messages are printed.

https://github.com/openvinotoolkit/nncf/actions/runs/15852358755
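The group size selection rule described above can be sketched as a small standalone function. This is an illustration only (the hypothetical `find_flexible_group_size` name and signature are not part of NNCF; the real logic lives inside the weight compression algorithm), assuming the specified group size is itself a power of two, as in the examples above:

```python
from typing import Optional


def find_flexible_group_size(
    channel_size: int, group_size: int, min_flexible_group_size: int = 16
) -> Optional[int]:
    """Return a usable group size for the weight, or None if it must fall back to backup precision."""
    if channel_size % group_size == 0:
        # The originally specified group size already fits.
        return group_size
    # Maximal power of two dividing channel_size (lowest-set-bit trick).
    flexible = channel_size & -channel_size
    # When group_size is a power of two and does not divide channel_size, this
    # divisor is necessarily smaller than group_size; reject it only if it is
    # below the allowed minimum.
    if flexible < min_flexible_group_size:
        return None
    return flexible


print(find_flexible_group_size(3072, 128))  # divisible: keeps 128
print(find_flexible_group_size(2880, 128))  # 2880 = 2^6 * 45 -> 64
print(find_flexible_group_size(3000, 128))  # 3000 = 2^3 * 375 -> 8 < 16 -> None
```

The third call shows the backup-precision case: the only power-of-two divisor of 3000 below 128 is 8, which is under the default `min_flexible_group_size` of 16.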
1 parent 71ae2c1 commit 12c9995

20 files changed: +408 −178 lines changed

src/nncf/quantization/advanced_parameters.py — 19 additions & 3 deletions

```diff
@@ -369,6 +369,22 @@ class AdvancedCompressionParameters:
     :param statistics_path: Directory path to dump statistics.
     :type statistics_path: str
+    :param lora_adapter_rank: Rank of lora adapters for FQ_LORA format. Defaults to 256.
+    :type lora_adapter_rank: int
+    :param enable_flexible_group_size: Whether to enable flexible group size searching. When enabled, each weight
+        for which the channel size is not divisible by the general group size value will be compressed to a newly
+        calculated group size. The new group size value is the maximal power of two (i.e., 2^k) such that:
+        - channel size is divisible by it;
+        - it is less than the originally specified group size value;
+        - it is greater than or equal to `min_flexible_group_size`.
+
+        If it's not possible to find a value satisfying these requirements, such weight is compressed to the backup
+        precision. If ratio < 1.0 and some weights have to be compressed to the backup precision because of group size
+        issues, then these weights won't contribute to the ratio of backup mode group.
+    :type enable_flexible_group_size: bool
+    :param min_flexible_group_size: Minimum group size for flexible group size searching. Defaults to 16. The reason
+        behind this argument is to avoid too small group size values, which may lead to performance issues.
+    :type min_flexible_group_size: int
     :param awq_params: Advanced parameters for AWQ algorithm.
     :type awq_params: AdvancedAWQParameters
     :param scale_estimation_params: Advanced parameters for Scale Estimation algorithm.
@@ -377,8 +393,6 @@
     :type gptq_params: AdvancedGPTQParameters
     :param lora_correction_params: Advanced parameters for Lora Correction algorithm.
     :type lora_correction_params: AdvancedLoraCorrectionParameters
-    :param lora_adapter_rank: Rank of lora adapters for FQ_LORA format. Defaults to 256.
-    :type lora_adapter_rank: int
     :param backend_params: Backend-specific parameters.
     :type backend_params: dict[str, Any]
     :param codebook: The codebook (LUT) for the weight compression.
@@ -387,13 +401,15 @@
     """

     statistics_path: Optional[str] = None
+    lora_adapter_rank: int = 256
+    enable_flexible_group_size: bool = False
+    min_flexible_group_size: int = 16
     awq_params: AdvancedAWQParameters = field(default_factory=AdvancedAWQParameters)
     scale_estimation_params: AdvancedScaleEstimationParameters = field(
         default_factory=AdvancedScaleEstimationParameters
     )
     gptq_params: AdvancedGPTQParameters = field(default_factory=AdvancedGPTQParameters)
     lora_correction_params: AdvancedLoraCorrectionParameters = field(default_factory=AdvancedLoraCorrectionParameters)
-    lora_adapter_rank: int = 256
     backend_params: dict[str, Any] = field(default_factory=dict)
     codebook: Optional[TTensor] = None
```
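With these new dataclass fields, opting in looks roughly like the following sketch (not runnable as-is: the model loading is omitted, and the mode/group size values are placeholders; the parameter names follow this commit and the public `nncf.compress_weights` API):

```python
import nncf

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=128,
    advanced_parameters=nncf.AdvancedCompressionParameters(
        enable_flexible_group_size=True,
        min_flexible_group_size=16,
    ),
)
```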

src/nncf/quantization/algorithms/weight_compression/algorithm.py — 102 additions & 12 deletions

```diff
@@ -45,6 +45,7 @@
 from nncf.quantization.algorithms.weight_compression.mixed_precision import MIXED_PRECISION_CRITERIA
 from nncf.quantization.algorithms.weight_compression.scale_estimation import ScaleEstimation
 from nncf.quantization.algorithms.weight_compression.weight_lowering import WeightCompressionConfig
+from nncf.quantization.algorithms.weight_compression.weight_lowering import get_reduction_channel_size
 from nncf.scopes import IgnoredScope
 from nncf.scopes import get_ignored_node_names_from_ignored_scope
 from nncf.tensor import Tensor
@@ -318,11 +319,13 @@ def __init__(
             advanced_parameters if advanced_parameters is not None else AdvancedCompressionParameters()
         )

-        primary_config = self._get_primary_config()
         criterion_cls = MIXED_PRECISION_CRITERIA.get(self._sensitivity_metric)
-        self._mixed_precision_algo = criterion_cls(primary_config, self._ratio, self._subset_size)
+        self._mixed_precision_algo = criterion_cls(self._ratio, self._subset_size)
         self._statistics_path = self._advanced_parameters.statistics_path

+        self._enable_flexible_group_size = self._advanced_parameters.enable_flexible_group_size
+        self._min_flexible_group_size = self._advanced_parameters.min_flexible_group_size
+
         if self._awq:
             awq_params = self._advanced_parameters.awq_params
             self.awq_algo = AWQ(
@@ -454,7 +457,7 @@ def _get_ratio_defining_params(

         return ratio_defining_params

-    def _get_primary_config(self):
+    def _get_primary_config(self, group_size: int) -> WeightCompressionConfig:
         codebook_values = None

         if self._mode == CompressWeightsMode.CB4_F8E4M3:
@@ -464,7 +467,7 @@

         return WeightCompressionConfig(
             mode=self._mode,
-            group_size=self._group_size,
+            group_size=group_size,
             codebook_values=codebook_values,
         )

@@ -474,6 +477,7 @@ def _set_weight_compression_config(
         model: TModel,
         graph: NNCFGraph,
         statistics_points: StatisticPointsContainer,
+        group_size_values: dict[str, int],
     ) -> None:
         """
         Sets the appropriate compression configuration for weights based on some criteria.
@@ -483,13 +487,92 @@
         :param model: The model.
         :param graph: The model graph associated with the model.
         :param statistics_points: Statistics points.
+        :param group_size_values: A dictionary mapping weight names to their group size values.
         """
-        primary_config = self._get_primary_config()
-        if self._ratio == 1:
-            for weight_param in ratio_defining_params:
-                weight_param.compression_config = primary_config
+        if self._ratio < 1 and len(ratio_defining_params) > 0:
+            primary_precision_weight_params = self._mixed_precision_algo.apply(
+                model, graph, statistics_points, weight_params=ratio_defining_params
+            )
         else:
-            self._mixed_precision_algo.apply(model, graph, statistics_points, weight_params=ratio_defining_params)
+            primary_precision_weight_params = ratio_defining_params
+
+        for weight_param in primary_precision_weight_params:
+            weight_param.compression_config = self._get_primary_config(group_size_values[weight_param.weight_name])
+
+        # Check if group size is valid for each weight in ratio_defining_params
+        failed_nodes = []
+        for w_params in ratio_defining_params:
+            if w_params.compression_config is None or w_params.compression_config.group_size == -1:
+                continue
+            reduction_channel_size, _ = get_reduction_channel_size(w_params.weight_shape, w_params.reduction_axes)
+            if reduction_channel_size % w_params.compression_config.group_size != 0:
+                failed_nodes.append((w_params.node_with_weight.node_name, reduction_channel_size))
+        if len(failed_nodes) > 0:
+            names = ",".join(f'"{name}"' for name, _ in failed_nodes)
+            msg = (
+                "Failed to apply group-wise quantization with "
+                f"group size value {self._group_size} and channel size value {failed_nodes[0][1]}.\n"
+                "Ensure that the channel size is divisible by the group size, "
+                "or include this node and others with similar issues in the ignored scope:\n"
+                f"nncf.compress_weight(\n\t..., \n\tignored_scope=IgnoredScope(names=[{names}]\n\t)\n)"
+            )
+            raise nncf.InvalidGroupSizeError(msg)
+
+    def _get_flexible_group_size_data(
+        self, weight_params: list[WeightCompressionParameters]
+    ) -> list[tuple[WeightCompressionParameters, int]]:
+        """
+        Compute flexible group size values.
+
+        :param weight_params: Weight parameters for which to compute flexible group size.
+        :return: A list of tuples, where each pair contains a WeightCompressionParameters object and the
+            group size value associated with it. If a group size can't be assigned to some weight parameter,
+            it won't be included in the result.
+        """
+        flexible_group_size_not_found_weight_params = []
+        group_size_data = []
+        for w_params in weight_params:
+            reduction_channel_size, _ = get_reduction_channel_size(w_params.weight_shape, w_params.reduction_axes)
+            if reduction_channel_size % self._group_size == 0:
+                # The weight can be compressed with the given group size, nothing else to do
+                group_size_data.append((w_params, self._group_size))
+                continue
+
+            # Find the maximal power of two that divides reduction_channel_size
+            flexible_group_size = reduction_channel_size & (~reduction_channel_size + 1)
+
+            if flexible_group_size < self._min_flexible_group_size:
+                flexible_group_size_not_found_weight_params.append(w_params)
+            else:
+                group_size_data.append((w_params, flexible_group_size))
+
+        node_strings = []
+        for w_params, new_group_size in group_size_data:
+            if new_group_size == self._group_size:
+                continue
+            node_strings.append(
+                f"{w_params.node_with_weight.node_name} "
+                f"(weight shape: {w_params.weight_shape}, adjusted group size: {new_group_size})"
+            )
+        if len(node_strings) > 0:
+            nncf_logger.info(
+                f"Wasn't able to set the specified group size value ({self._group_size}) to some nodes. These nodes "
+                f"will have an adjusted group size value:\n\t" + "\n\t".join(node_strings)
+            )
+
+        if len(flexible_group_size_not_found_weight_params) > 0:
+            node_strings = [
+                f"{w_params.node_with_weight.node_name} (weight shape: {w_params.weight_shape})"
+                for w_params in flexible_group_size_not_found_weight_params
+            ]
+            nncf_logger.warning(
+                "A large enough flexible group size value cannot be found for some nodes. They will be compressed "
+                "according to the backup mode. Nodes:\n\t" + "\n\t".join(node_strings)
+            )
+
+        return group_size_data

     @staticmethod
     def _proportion_str(num_weights_list: list[int], total_num_weights: int, total_num_params: int) -> str:
@@ -625,7 +708,6 @@ def apply(
                 if weight_dtype not in SUPPORTED_DATA_TYPES:
                     continue
                 weight_shape = self._backend_entity.get_weight_shape(node, weight_port_id, graph)
-                weight_size = reduce(operator.mul, weight_shape, 1)
                 reduction_axes = self._backend_entity.get_reduction_axes(node, weight_port_id, graph)
                 if (
                     self._group_size != -1
@@ -654,13 +736,21 @@
                     )
                     wc_config = WeightCompressionConfig(mode=mode)
                 weight_params = WeightCompressionParameters(
-                    weight_name, node, weight_port_id, weight_size, reduction_axes, wc_config
+                    weight_name, node, weight_port_id, weight_shape, reduction_axes, wc_config
                 )
                 all_weight_params.append(weight_params)
                 weight_names.add(weight_name)

         ratio_defining_params = self._get_ratio_defining_params(all_weight_params, is_last_layer_shared)
-        self._set_weight_compression_config(ratio_defining_params, model, graph, statistic_points)
+        if self._enable_flexible_group_size and self._group_size != -1:
+            # Compute flexible group size values if enabled
+            flexible_group_size_data = self._get_flexible_group_size_data(ratio_defining_params)
+            group_size_values = {w_param.weight_name: group_size for w_param, group_size in flexible_group_size_data}
+            # Select the subset of ratio_defining_params that can be compressed with some group size
+            ratio_defining_params = [w_param for w_param, _ in flexible_group_size_data]
+        else:
+            group_size_values = {w_param.weight_name: self._group_size for w_param in ratio_defining_params}
+        self._set_weight_compression_config(ratio_defining_params, model, graph, statistic_points, group_size_values)
         ignored_scope_weight_statistics = self._get_ignored_scope_weight_statistics(model, graph)
         nncf_logger.info(
             self._get_bitwidth_distribution_str(
```
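The expression `reduction_channel_size & (~reduction_channel_size + 1)` in `_get_flexible_group_size_data` is the classic two's-complement trick for isolating the lowest set bit of an integer, which equals the largest power of two dividing it. A quick standalone check (illustration only, not NNCF code):

```python
def largest_power_of_two_divisor(n: int) -> int:
    # ~n + 1 equals -n in two's complement, so n & (~n + 1) == n & -n:
    # all bits below the lowest set bit cancel out, leaving exactly that bit.
    return n & (~n + 1)


for n in (2880, 3000, 4096, 7):
    d = largest_power_of_two_divisor(n)
    # d divides n, and the quotient is odd, so no larger power of two divides n.
    assert n % d == 0 and (n // d) % 2 == 1
    print(n, "->", d)
```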

src/nncf/quantization/algorithms/weight_compression/config.py — 9 additions & 6 deletions

```diff
@@ -8,8 +8,10 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import operator
 from dataclasses import dataclass
 from dataclasses import field
+from functools import reduce
 from typing import Optional, TypeVar

 import numpy as np
@@ -86,19 +88,20 @@ class WeightCompressionParameters:
     :param weight_name: Unique weight name.
     :param node_with_weight: Node with weight in the NNCF graph.
     :param weight_port_id: Port ID of the weight in the node.
-    :param num_weights: Number of elements in the weight array.
+    :param weight_shape: Shape of the weight array.
     :param reduction_axes: Axes, along which to reduce (collect) different statistics (e.g. min, max).
     :param compression_config: Configuration of weight compression for the weight node.
     """

     weight_name: str
     node_with_weight: NNCFNode
     weight_port_id: int
-    num_weights: np.uint64
+    weight_shape: tuple[int, ...]
     reduction_axes: tuple[int, ...]
     compression_config: Optional[WeightCompressionConfig] = field(default_factory=WeightCompressionConfig)

-    def __post_init__(self):
-        # Explicitly cast num_weights to avoid overflow on finding total number of weights.
-        # The issue happens on Windows, because np.ndarray.size() returns np.int32 and sum of weights is more than 2^32.
-        self.num_weights = np.uint64(self.num_weights)
+    @property
+    def num_weights(self) -> np.uint64:
+        if not hasattr(self, "_num_weights"):
+            self._num_weights = np.uint64(reduce(operator.mul, self.weight_shape, 1))
+        return self._num_weights
```
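The refactor above replaces the stored `num_weights` field with a lazily computed property derived from `weight_shape`. A minimal standalone sketch of the same pattern (a simplified hypothetical dataclass, not the NNCF class; plain Python ints are arbitrary-precision, so the sketch skips the `np.uint64` cast the NNCF version uses against Windows `np.int32` overflow):

```python
import operator
from dataclasses import dataclass
from functools import reduce


@dataclass
class WeightParams:
    weight_shape: tuple[int, ...]

    @property
    def num_weights(self) -> int:
        # Computed from the shape on first access, then cached on the instance.
        if not hasattr(self, "_num_weights"):
            self._num_weights = reduce(operator.mul, self.weight_shape, 1)
        return self._num_weights


p = WeightParams((4096, 11008))
print(p.num_weights)  # 45088768
```

Since only the shape is stored, constructing the parameters no longer needs the caller to precompute the element count, and the shape stays available for the group size divisibility checks.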

src/nncf/quantization/algorithms/weight_compression/handle_errors.py — 0 additions & 32 deletions

This file was deleted.

src/nncf/quantization/algorithms/weight_compression/mixed_precision.py — 8 additions & 7 deletions

```diff
@@ -41,18 +41,16 @@

 class MixedPrecisionCriterion(Algorithm):
     """
-    Assigns mixed quantization scheme (e.g. uniform int8 or uniform int4/non-uniform fp4)
+    Computes mixed quantization scheme (e.g. uniform int8 or uniform int4/non-uniform fp4)
     for weights based on some criteria.
     """

-    def __init__(self, primary_config: WeightCompressionConfig, ratio: float, subset_size: Optional[int] = None):
+    def __init__(self, ratio: float, subset_size: Optional[int] = None):
         """
-        :param primary_config: Configuration on how to compress (quantize) weights to primary precision.
         :param ratio: The ratio between primary and backup precisions (e.g. 0.9 means 90% of layers quantized to NF4
             and the rest to INT8_ASYM).
         :param subset_size: Size of dataset subset for statistics.
         """
-        self._primary_config = primary_config
         self._ratio = ratio
         self._subset_size = subset_size
         self._algorithm_key = f"MPC_{hash(self)}"
@@ -79,15 +77,17 @@ def apply(
         statistic_points: Optional[StatisticPointsContainer] = None,
         dataset: Optional[Dataset] = None,
         weight_params: list[WeightCompressionParameters] = None,
-    ) -> None:
+    ) -> list[WeightCompressionParameters]:
         """
-        Assigns quantization precision based on computed layers' sensitivities, ratio of parameters.
+        Selects which weights should be compressed to a primary (4 bit) precision based on computed layers'
+        sensitivities, ratio of parameters.
         """
         self._set_backend_entity(model)

         scores = self._calc_sensitivity(model, graph, weight_params, statistic_points)
         num_all_weights = sum(wp.num_weights for wp in weight_params)

+        primary_precision_weight_params = []
         indexes_of_layers_in_ascending_order_of_scores = [
             i[0] for i in sorted(enumerate(scores), reverse=False, key=lambda x: x[1])
         ]
@@ -97,8 +97,9 @@ def apply(
             current_ratio = (num_weights_in_4bit + weight_param.num_weights) / num_all_weights
             if current_ratio >= self._ratio:
                 break
-            weight_param.compression_config = self._primary_config
+            primary_precision_weight_params.append(weight_param)
             num_weights_in_4bit += weight_param.num_weights
+        return primary_precision_weight_params

     @abstractmethod
     def _set_backend_entity(self, model: TModel) -> None:
```
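The changed `apply` loop selects primary-precision weights by walking layers in ascending order of sensitivity and accumulating parameters until the requested ratio is reached. A toy standalone sketch of that selection (hypothetical names, heavily simplified from the diff):

```python
def select_primary_precision(weights, ratio):
    """weights: list of (name, num_weights, sensitivity_score) tuples.

    Returns the names kept in primary (4-bit) precision."""
    num_all_weights = sum(n for _, n, _ in weights)
    selected = []
    num_weights_in_4bit = 0
    # Least sensitive layers are moved to the primary precision first.
    for name, num_weights, _ in sorted(weights, key=lambda w: w[2]):
        current_ratio = (num_weights_in_4bit + num_weights) / num_all_weights
        if current_ratio >= ratio:
            break
        selected.append(name)
        num_weights_in_4bit += num_weights
    return selected


layers = [("a", 100, 0.1), ("b", 100, 0.9), ("c", 100, 0.2), ("d", 100, 0.5)]
print(select_primary_precision(layers, 0.6))  # ['a', 'c']
```

Note the strict `>=` break mirrors the diff: at ratio 1.0 the loop would select nothing, which is why `algorithm.py` only calls the mixed precision criterion when `self._ratio < 1` and assigns all weights to the primary precision otherwise.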
