Commit 04a5198
Merge pull request kvcache-ai#667 from Azure-Tang/update-readme
[update] Update doc.
2 parents 911084b + c6fbf6a commit 04a5198

File tree

2 files changed: +24, -6 lines

doc/en/DeepseekR1_V3_tutorial.md

Lines changed: 20 additions & 2 deletions
@@ -16,7 +16,9 @@
 - [Memory consumptions:](#memory-consumptions)
 - [Benchmark results](#benchmark-results-2)
 - [How to Run](#how-to-run)
-- [V0.2.2 longer context](#v022-longer-context)
+- [V0.2.2 longer context \& FP8 kernel](#v022-longer-context--fp8-kernel)
+- [longer context](#longer-context)
+- [FP8 kernel](#fp8-kernel)
 - [V0.2 \& V0.2.1 Showcase](#v02--v021-showcase)
 - [Single socket version (32 cores)](#single-socket-version-32-cores)
 - [Dual socket version (64 cores)](#dual-socket-version-64-cores)
@@ -155,7 +157,11 @@ the output quality doesn't change. But the speed of decoding and prefill
 is sped up, which is inspiring. So our showcase makes use of this finding*

 ## How to Run
-### V0.2.2 longer context
+### V0.2.2 longer context & FP8 kernel
+#### longer context
+To use this feature, [install flashinfer](https://github.com/flashinfer-ai/flashinfer) first.
+
 If you want to use a long context (longer than 20K) for prefill, enable matrix absorption MLA during the prefill phase, which will significantly reduce the size of the KV cache. Modify the yaml file like this:
 ```
 - match:
@@ -167,6 +173,18 @@ If you want to use a long context (longer than 20K) for prefill, enable the matrix
     prefill_device: "cuda"
     absorb_for_prefill: True # change this to True to enable long context (prefill may be slower).
 ```
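For reference, a complete rule of this shape in the optimize yaml might look like the sketch below. Only `prefill_device` and `absorb_for_prefill` appear in the diff above; the match pattern and operator class name are assumptions modeled on KTransformers' DeepSeek optimize-rule files and may differ in your version:

```
# Sketch of a full attention rule (match pattern and class name are assumptions):
- match:
    name: "^model\\.layers\\..*\\.self_attn$"   # regex over module names (assumed pattern)
  replace:
    class: ktransformers.operators.attention.KDeepseekV2Attention  # assumed operator class
    kwargs:
      generate_device: "cuda"
      prefill_device: "cuda"
      absorb_for_prefill: True  # enable matrix-absorbed MLA for long-context prefill
```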
+#### FP8 kernel
+
+The DeepSeek-AI team provides FP8 safetensors for DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
+- **FP8 GPU Kernel Integration**: FP8 linear-layer acceleration kernels integrated in KTransformers
+- **Hybrid Quantization Architecture**:
+  - Attention and Shared-Expert modules use FP8 precision (enhances computational accuracy)
+  - Expert modules retain GGML quantization (GGUF format, residing on the CPU to save GPU memory)
+
+So those who are pursuing the best performance can use the FP8 linear kernel for DeepSeek-V3/R1.
+
+The detailed guide is [here](./fp8_kernel.md).
 ### V0.2 & V0.2.1 Showcase
 #### Single socket version (32 cores)
 Our local_chat test command is:

doc/en/fp8_kernel.md

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-# FP8 Linear Kernel for DeepSeek-V3
+# FP8 Linear Kernel for DeepSeek-V3/R1

 ## Overview
 The DeepSeek-AI team provides FP8 safetensors for DeepSeek-R1/V3 models. We achieve performance optimization through the following work:
@@ -17,8 +17,8 @@ So those who are pursuing the best performance can use the FP8 linear kernel for
 ### Using Pre-Merged Weights

 Pre-merged weights are available on Hugging Face:
-[KVCache-ai/DeepSeek-V3](https://huggingface.co/KVCache-ai/DeepSeek-V3)
-[KVCache-ai/DeepSeek-R1](https://huggingface.co/KVCache-ai/DeepSeek-R1)
+[KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid](https://huggingface.co/KVCache-ai/DeepSeek-V3)
+[KVCache-ai/DeepSeek-R1-GGML-FP8-Hybrid](https://huggingface.co/KVCache-ai/DeepSeek-R1)
 > Please confirm the weights are fully uploaded before downloading. The large file size may extend Hugging Face upload time.

@@ -29,7 +29,7 @@ pip install -U huggingface_hub
 # Optional: Use HF Mirror for faster downloads in certain regions.
 # export HF_ENDPOINT=https://hf-mirror.com

-huggingface-cli download --resume-download KVCache-ai/DeepSeek-V3 --local-dir <local_dir>
+huggingface-cli download --resume-download KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid --local-dir <local_dir>
 ```
 ### Using merge scripts
 If you have local DeepSeek-R1/V3 FP8 safetensors and q4km GGUF weights, you can merge them using the following scripts.
