This `pipeline_quant_config` can now be passed to [`~DiffusionPipeline.from_pretrained`].
In this case, `quant_kwargs` will be used to initialize the quantization specifications
of the respective quantization configuration class of `quant_backend`. `components_to_quantize`
is used to denote the components that will be quantized. For most pipelines, you would want to
keep `transformer` in the list as that is often the most compute- and memory-intensive component.

The config below will work for most diffusion pipelines that have a `transformer` component.

```py
import torch
from diffusers.quantizers import PipelineQuantizationConfig

# Quantize the `transformer` component to 4-bit with bitsandbytes (NF4).
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer"],
)
```
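To see how this plugs in, the config can then be passed to [`~DiffusionPipeline.from_pretrained`] when loading a pipeline. A minimal sketch, assuming a Flux checkpoint (the checkpoint name is only an illustration; any compatible pipeline works):

```py
import torch
from diffusers import DiffusionPipeline

# Components listed in `components_to_quantize` are quantized while loading;
# the remaining components are loaded in `torch_dtype`.
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # example checkpoint
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)
```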
Diffusion pipelines can have multiple text encoders ([`FluxPipeline`] has two, for example). It's
recommended to quantize the memory-intensive text encoders; some examples include T5, Llama, and
Gemma. For [`FluxPipeline`], that means adding `text_encoder_2` (the T5 model) to
`components_to_quantize` while keeping the CLIP model (accessible through `text_encoder`) intact,
as shown below.
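A sketch of such a config, reusing the same backend and `quant_kwargs` as above; the component names follow [`FluxPipeline`]'s layout:

```py
import torch
from diffusers.quantizers import PipelineQuantizationConfig

# Quantize the transformer and the T5 text encoder (`text_encoder_2`);
# the CLIP model (`text_encoder`) keeps its original precision.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
```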