
Commit 538e8c9

fix hfoptions blocks in unit 1 8.mdx
1 parent 896777d · commit 538e8c9

File tree

1 file changed: +41 -5 lines changed
  • chapters/en/chapter2


chapters/en/chapter2/8.mdx

Lines changed: 41 additions & 5 deletions
@@ -15,16 +15,19 @@ TGI, vLLM, and llama.cpp serve similar purposes but have distinct characteristic
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png" alt="Flash Attention" />

 <Tip title="How Flash Attention Works">
+
 Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [Chapter 1.8](/course/chapter1/8), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.

 The key innovation is in how it manages memory transfers between High Bandwidth Memory (HBM) and faster SRAM cache. Traditional attention repeatedly transfers data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data once into SRAM and performs all calculations there, minimizing expensive memory transfers.

 While the benefits are most significant during training, Flash Attention's reduced VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
+
 </Tip>

 **vLLM** takes a different approach by using PagedAttention. Just like how a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever system means it can handle different-sized requests more flexibly and doesn't waste memory space. It's particularly good at sharing memory between different requests and reduces memory fragmentation, which makes the whole system more efficient.

 <Tip title="How PagedAttention Works">
+
 PagedAttention is a technique that addresses another critical bottleneck in LLM inference: KV cache memory management. As discussed in [Chapter 1.8](/course/chapter1/8), during text generation, the model stores attention keys and values (KV cache) for each generated token to reduce redundant computations. The KV cache can become enormous, especially with long sequences or multiple concurrent requests.

 vLLM's key innovation lies in how it manages this cache:
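As a point of reference for the Flash Attention tip in the hunk above: in Transformers, Flash Attention 2 can be switched on when loading a model. The snippet below is an illustrative sketch, not part of the changed file; it assumes a CUDA GPU, an installed `flash-attn` package, and an arbitrary instruct model id.

```python
# Illustrative sketch (not part of this diff): enabling Flash Attention 2 in Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed model; any FA2-compatible model works

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FA2 kernels require fp16/bf16 weights
    attn_implementation="flash_attention_2",  # use the fused, SRAM-friendly attention kernel
    device_map="auto",
)

inputs = tokenizer("Flash Attention keeps attention tiles in fast SRAM", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```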
@@ -35,11 +38,13 @@ vLLM's key innovation lies in how it manages this cache:
 4. **Memory Sharing**: For operations like parallel sampling, pages storing the KV cache for the prompt can be shared across multiple sequences.

 The PagedAttention approach can lead to up to 24x higher throughput compared to traditional methods, making it a game-changer for production LLM deployments. If you want to go really deep into how PagedAttention works, you can read [the guide from the vLLM documentation](https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html).
+
 </Tip>

 **llama.cpp** is a highly optimized C/C++ implementation originally designed for running LLaMA models on consumer hardware. It focuses on CPU efficiency with optional GPU acceleration and is ideal for resource-constrained environments. llama.cpp uses quantization techniques to reduce model size and memory requirements while maintaining good performance. It implements optimized kernels for various CPU architectures and supports basic KV cache management for efficient token generation.

 <Tip title="How llama.cpp Quantization Works">
+
 Quantization in llama.cpp reduces the precision of model weights from 32-bit or 16-bit floating point to lower precision formats like 8-bit integers (INT8), 4-bit, or even lower. This significantly reduces memory usage and improves inference speed with minimal quality loss.

 Key quantization features in llama.cpp include:
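To make the PagedAttention description above concrete, the sketch below (again, not part of the changed file) starts vLLM's offline engine; paging of the KV cache happens automatically, and `gpu_memory_utilization` bounds the VRAM pool those pages come from. The model id and parameter values are assumptions.

```python
# Illustrative sketch (not part of this diff): vLLM allocates the KV cache in fixed-size
# pages; gpu_memory_utilization caps the VRAM pool that backs those pages.
from vllm import LLM, SamplingParams

llm = LLM(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # assumed model id
    gpu_memory_utilization=0.85,  # fraction of GPU memory reserved, mostly for the paged KV cache
    max_model_len=4096,           # caps how many KV-cache pages a single sequence can claim
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```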
@@ -49,9 +54,8 @@ Key quantization features in llama.cpp include:
 4. **Hardware-Specific Optimizations**: Includes optimized code paths for various CPU architectures (AVX2, AVX-512, NEON)

 This approach enables running billion-parameter models on consumer hardware with limited memory, making it perfect for local deployments and edge devices.
-</Tip>
-

+</Tip>

 ### Deployment and Integration

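For the quantization discussion above, a minimal llama-cpp-python example (not part of the changed file) loads an already-quantized GGUF file; the 4-bit weights are what keep the memory footprint small. The model path is a placeholder.

```python
# Illustrative sketch (not part of this diff): running a pre-quantized 4-bit (Q4_K_M) GGUF
# model with llama-cpp-python on CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF file
    n_ctx=2048,                             # context window
    n_gpu_layers=0,                         # pure CPU; raise this to offload layers to a GPU
)

output = llm("Why does 4-bit quantization save memory?", max_tokens=64)
print(output["choices"][0]["text"])
```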
@@ -63,8 +67,6 @@ Let's move on to the deployment and integration differences between the framewor

 **llama.cpp** prioritizes simplicity and portability. Its server implementation is lightweight and can run on a wide range of hardware, from powerful servers to consumer laptops and even some high-end mobile devices. With minimal dependencies and a simple C/C++ core, it's easy to deploy in environments where installing Python frameworks would be challenging. The server provides an OpenAI-compatible API while maintaining a much smaller resource footprint than other solutions.

-
-
 ## Getting Started

 Let's explore how to use these frameworks for deploying LLMs, starting with installation and basic setup.
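Since the paragraph above points out that llama.cpp's server speaks an OpenAI-compatible API, the standard `openai` client can be used against it. The sketch below is illustrative only; the host, port, and model name are assumptions.

```python
# Illustrative sketch (not part of this diff): querying a local llama.cpp server through
# its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed address of the llama.cpp server
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="local-model",  # the server answers for whichever model it was started with
    messages=[{"role": "user", "content": "Say hello from a laptop deployment."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```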
@@ -146,7 +148,9 @@ response = client.chat.completions.create(
 )
 print(response.choices[0].message.content)
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 llama.cpp is easy to install and use, requiring minimal dependencies and supporting both CPU and GPU inference.
@@ -235,7 +239,9 @@ response = client.chat.completions.create(
 )
 print(response.choices[0].message.content)
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 vLLM is easy to install and use, with both OpenAI API compatibility and a native Python interface.
@@ -306,6 +312,7 @@ response = client.chat.completions.create(
 )
 print(response.choices[0].message.content)
 ```
+
 </hfoption>

 </hfoptions>
@@ -333,6 +340,7 @@ docker run --gpus all \
 ```

 Use the InferenceClient for flexible text generation:
+
 ```python
 from huggingface_hub import InferenceClient

@@ -380,7 +388,9 @@ response = client.chat.completions.create(
 )
 print(response.choices[0].message.content)
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 For llama.cpp, you can set advanced parameters when launching the server:
@@ -487,7 +497,9 @@ output = llm(

 print(output["choices"][0]["text"])
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 For advanced usage with vLLM, you can use the InferenceClient:
@@ -579,6 +591,7 @@ formatted_prompt = llm.get_chat_template()(chat_prompt) # Uses model's chat tem
 outputs = llm.generate(formatted_prompt, sampling_params)
 print(outputs[0].outputs[0].text)
 ```
+
 </hfoption>

 </hfoptions>
@@ -610,7 +623,9 @@ client.generate(
     repetition_penalty=1.1, # Reduce repetition
 )
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 ```python
@@ -635,7 +650,9 @@ output = llm(
     repeat_penalty=1.1,
 )
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 ```python
@@ -648,6 +665,7 @@ params = SamplingParams(
 )
 llm.generate("Write a creative story", sampling_params=params)
 ```
+
 </hfoption>

 </hfoptions>
@@ -659,14 +677,17 @@ Both frameworks provide ways to prevent repetitive text generation:
 <hfoptions id="inference-frameworks" >

 <hfoption value="tgi" label="TGI">
+
 ```python
 client.generate(
     "Write a varied text",
     repetition_penalty=1.1, # Penalize repeated tokens
     no_repeat_ngram_size=3, # Prevent 3-gram repetition
 )
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 ```python
@@ -686,7 +707,9 @@ output = llm(
     presence_penalty=0.5, # Additional presence penalty
 )
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 ```python
@@ -695,6 +718,7 @@ params = SamplingParams(
     frequency_penalty=0.1, # Penalize token frequency
 )
 ```
+
 </hfoption>

 </hfoptions>
@@ -706,6 +730,7 @@ You can control generation length and specify when to stop:
 <hfoptions id="inference-frameworks" >

 <hfoption value="tgi" label="TGI">
+
 ```python
 client.generate(
     "Generate a short paragraph",
@@ -714,7 +739,9 @@ client.generate(
     stop_sequences=["\n\n", "###"],
 )
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 ```python
@@ -729,7 +756,9 @@ response = client.completions.create(
 # Via direct library
 output = llm("Generate a short paragraph", max_tokens=100, stop=["\n\n", "###"])
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 ```python
@@ -741,6 +770,7 @@ params = SamplingParams(
     skip_special_tokens=True,
 )
 ```
+
 </hfoption>

 </hfoptions>
@@ -752,6 +782,7 @@ Both frameworks implement advanced memory management techniques for efficient in
 <hfoptions id="inference-frameworks" >

 <hfoption value="tgi" label="TGI">
+
 TGI uses Flash Attention 2 and continuous batching:

 ```sh
@@ -763,7 +794,9 @@ docker run --gpus all -p 8080:80 \
     --max-batch-total-tokens 8192 \
     --max-input-length 4096
 ```
+
 </hfoption>
+
 <hfoption value="llama.cpp" label="llama.cpp">

 llama.cpp uses quantization and optimized memory layout:
@@ -789,7 +822,9 @@ For models too large for your GPU, you can use CPU offloading:
     --n-gpu-layers 20 \ # Keep first 20 layers on GPU
     --threads 8 # Use more CPU threads for CPU layers
 ```
+
 </hfoption>
+
 <hfoption value="vllm" label="vLLM">

 vLLM uses PagedAttention for optimal memory management:
@@ -806,6 +841,7 @@ engine_args = AsyncEngineArgs(

 llm = LLM(engine_args=engine_args)
 ```
+
 </hfoption>

 </hfoptions>
@@ -818,4 +854,4 @@ llm = LLM(engine_args=engine_args)
 - [vLLM GitHub Repository](https://github.com/vllm-project/vllm)
 - [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
 - [llama.cpp GitHub Repository](https://github.com/ggerganov/llama.cpp)
-- [llama-cpp-python Repository](https://github.com/abetlen/llama-cpp-python)
+- [llama-cpp-python Repository](https://github.com/abetlen/llama-cpp-python)
