**Endpoint and Port** By default, models will be served via port `8080`. To change this, you can specify the port via the `--port` option of the `lmql serve-model` command. On the client side, to connect to a model server running on a different port, you can specify the port when constructing an [`lmql.model`](../lib/generations.md#lmql-llm-objects) object:
```python
lmql.model("gpt2", endpoint="localhost:9999")
```
If you want more control over model loading and configuration, you can pass additional arguments:
```python
lmql.model("local:gpt2", cuda=True)
```
## Quantization
Quantization reduces the precision of model parameters to shrink model size and boost inference speed with minimal accuracy loss. LMQL supports two quantization formats: AWQ (using AutoAWQ) and GPTQ (using AutoGPTQ).
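Conceptually, both formats store low-precision integer weights plus scaling factors. As a toy sketch (not LMQL's or either library's actual implementation), round-to-nearest 4-bit quantization maps each weight to one of a few integer levels and keeps a single scale for reconstruction:

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization to the signed 4-bit range [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.91, -0.07]
q, scale = quantize_4bit(weights)
approx = dequantize(q, scale)

# Each reconstructed weight is within half a quantization step of the original.
assert all(abs(w - a) <= scale / 2 for w, a in zip(weights, approx))
```

Real quantizers operate per group or per channel and use calibration data, but the storage trade-off is the same: integers plus a scale instead of full-precision floats.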
### AutoAWQ
AWQ minimizes quantization error by protecting the most important weights, improving model efficiency without sacrificing accuracy. It's ideal for scenarios requiring both compression and acceleration of LLMs.
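The core idea can be sketched in a few lines (a toy illustration with made-up numbers, not the AutoAWQ algorithm): channels that see large activations are scaled up before quantization so they retain precision, and the inverse scale is divided back out (in practice, folded into the preceding activations):

```python
def rtn_quant(w, bits=4):
    # Round-to-nearest quantization over the signed range of `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax
    return [round(x / scale) * scale for x in w]

# Per-channel weights and typical activation magnitudes (made-up numbers).
weights = [0.02, 0.03, 0.9, 0.025]  # channel 2 dominates the weight range
acts = [5.0, 4.0, 0.1, 6.0]         # but the other channels see large activations

# Plain quantization: small weights on high-activation channels collapse to zero.
plain = rtn_quant(weights)

# AWQ-style: scale salient (high-activation) channels up before quantizing,
# then divide the scale back out.
s = [10.0 if a > 1.0 else 1.0 for a in acts]
scaled = rtn_quant([w * si for w, si in zip(weights, s)])
awq = [q / si for q, si in zip(scaled, s)]

def output_error(q):
    # Activation-weighted error: how much the layer output is perturbed.
    return sum(abs(w - qi) * a for w, qi, a in zip(weights, q, acts))

assert output_error(awq) < output_error(plain)
```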
To use AWQ-quantized models, first install [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the instructions in the repo.
### AutoGPTQ

AutoGPTQ reduces model size while retaining performance by lowering the precision of model weights to 4 or 3 bits. It's suitable for efficient deployment and operation of LLMs on consumer-grade hardware.
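A rough intuition for GPTQ-style quantization (a toy sketch with made-up numbers, not the AutoGPTQ algorithm): weights are quantized one at a time, and each weight's rounding error is pushed onto the not-yet-quantized weights so that the layer output `w · x` is approximately preserved:

```python
def rtn(v, step=0.25):
    # Round a value to the nearest multiple of the quantization step.
    return round(v / step) * step

def quantize_with_compensation(w, x, step=0.25):
    """Quantize w entry by entry, folding each entry's rounding error into
    the next unquantized entry so the dot product w . x moves as little
    as possible. Toy version of GPTQ's error compensation."""
    w, q = list(w), []
    for i in range(len(w)):
        q.append(rtn(w[i], step))
        err = w[i] - q[-1]
        if i + 1 < len(w) and x[i + 1] != 0:
            w[i + 1] += err * x[i] / x[i + 1]
    return q

w, x = [0.3, 0.3, 0.3], [1.0, 1.0, 1.0]
dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))

naive = [rtn(wi) for wi in w]
comp = quantize_with_compensation(w, x)

# Compensation keeps the layer output closer to the full-precision result.
assert abs(dot(comp, x) - dot(w, x)) < abs(dot(naive, x) - dot(w, x))
```

The real algorithm compensates across all remaining weights using second-order (Hessian) information from calibration data, but the quantize-then-compensate loop is the same shape.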
To use GPTQ-quantized models, first install [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) using the instructions in the repo.