- [Single socket version (32 cores)](#single-socket-version-32-cores)
- [Dual socket version (64 cores)](#dual-socket-version-64-cores)
the output quality doesn't change, but the speed of decoding and prefill is improved, which is inspiring. So our showcase makes use of this finding.*
## How to Run
### V0.2.2 longer context & FP8 kernel

#### longer context

To use this feature, [install flashinfer](https://github.com/flashinfer-ai/flashinfer) first.
If you want to use a long context (longer than 20K tokens) for prefill, enable matrix-absorption MLA during the prefill phase, which will significantly reduce the size of the KV cache. Modify the YAML file like this:
```
- match:
  ...
    prefill_device: "cuda"
    absorb_for_prefill: True # change this to True to enable long context (prefill may be slower).
```
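The saving comes from caching only MLA's small compressed latent per token instead of full per-head keys and values. Here is a rough back-of-the-envelope sketch; the layer, head, and rank numbers are DeepSeek-V3-like defaults chosen for illustration, not values read from any config:

```python
# Back-of-the-envelope KV-cache sizing: why MLA matrix absorption helps.
# All dimensions below are illustrative, DeepSeek-V3-like assumptions.

def naive_kv_bytes(tokens, layers=61, heads=128, head_dim=128, dtype_bytes=2):
    # Standard attention cache: full K and V per head, per layer, per token.
    return tokens * layers * heads * head_dim * 2 * dtype_bytes

def absorbed_kv_bytes(tokens, layers=61, kv_lora_rank=512, rope_dim=64, dtype_bytes=2):
    # With matrix absorption, only the compressed latent (plus the small
    # decoupled RoPE key) is cached per token, independent of head count.
    return tokens * layers * (kv_lora_rank + rope_dim) * dtype_bytes

tokens = 20_000  # a "long context" prefill
print(f"naive:    {naive_kv_bytes(tokens) / 2**30:.1f} GiB")
print(f"absorbed: {absorbed_kv_bytes(tokens) / 2**30:.1f} GiB")
```

Under these assumptions the absorbed cache is over 50x smaller, which is why long-context prefill becomes feasible even though the extra matrix multiplications can make prefill itself slower.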
#### FP8 kernel
The DeepSeek-AI team provides FP8 safetensors for the DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
- **FP8 GPU Kernel Integration**: FP8 linear-layer acceleration kernels are integrated into KTransformers
- **Hybrid Quantization Architecture**:
  - Attention and Shared-Expert modules use FP8 precision (enhances computational accuracy)
  - Experts modules retain GGML quantization (GGUF format, residing on the CPU to save GPU memory)
So those who are pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.
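To get a feel for what the FP8 linear kernels trade in precision, here is a minimal NumPy simulation of a per-tensor scaled e4m3 round-trip. This is an illustrative sketch only, not KTransformers' actual kernel code; `quantize_e4m3_sim` is a hypothetical helper:

```python
import numpy as np

# Illustrative simulation of per-tensor scaled FP8 (e4m3) quantization.
# Not the KTransformers kernel: it just shows the round-trip error profile.
E4M3_MAX = 448.0      # largest finite e4m3 value
MIN_NORMAL_EXP = -6   # smallest normal e4m3 exponent

def quantize_e4m3_sim(x):
    """Quantize to simulated e4m3 with one per-tensor scale; return (dequantized, scale)."""
    scale = np.abs(x).max() / E4M3_MAX
    y = x / scale
    mag = np.maximum(np.abs(y), 2.0 ** MIN_NORMAL_EXP)
    exp = np.floor(np.log2(mag))   # per-element binary exponent
    step = 2.0 ** (exp - 3)        # e4m3 keeps a 3-bit mantissa
    q = np.clip(np.round(y / step) * step, -E4M3_MAX, E4M3_MAX)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
w_fp8, scale = quantize_e4m3_sim(w)
rel_err = np.abs(w_fp8 - w) / (np.abs(w) + 1e-6)
print(f"median relative error: {np.median(rel_err):.4f}")
```

The typical relative error of a few percent is small enough for attention and shared-expert weights, which is consistent with keeping those modules in FP8 while the large expert weights stay in coarser GGML quantization on the CPU.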