docs/source/en/quantization/modelopt.md (1 addition, 2 deletions)
@@ -9,7 +9,7 @@ Unless required by applicable law or agreed to in writing, software distributed
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License. -->
 
-# Nvidia ModelOpt
+# NVIDIA ModelOpt
 
 [nvidia_modelopt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
 
@@ -19,7 +19,6 @@ Before you begin, make sure you have nvidia_modelopt installed.
 pip install -U "nvidia_modelopt[hf]"
 ```
 
-
 Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
 
 The example below only quantizes the weights to FP8.
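
The FP8 example referenced on the last line is not part of this diff. As a minimal sketch of what such a weight-only FP8 load could look like with `NVIDIAModelOptConfig` and `from_pretrained`, assuming a Flux transformer checkpoint and a `quant_type="FP8"` setting (the checkpoint, model class, and `quant_type` value are illustrative assumptions, not taken from the diff):

```python
# Minimal sketch (illustrative, not part of the diff): load a model with FP8
# weight quantization via NVIDIAModelOptConfig. The checkpoint, model class,
# and quant_type value below are assumptions for demonstration purposes.
import torch
from diffusers import FluxTransformer2DModel, NVIDIAModelOptConfig

quant_config = NVIDIAModelOptConfig(quant_type="FP8")  # assumed FP8 weight-only setting

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```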